Running RAIL with a different dataset

Authors: Sam Schmidt

Last run successfully: Feb 9, 2026

This notebook gives a quick example of running a RAIL algorithm with a different dataset and overriding configuration parameters.

Most of our other demo notebooks use small datasets included with the RAIL demo package, all with the same input names. These datasets are named consistently with many of the default parameter values used in RAIL, e.g. hdf5_groupname="photometry" and ugrizy photometry named in a pattern "mag_{band}_lsst", often specified in SHARED_PARAMS.

This notebook will just show a quick run with an alternate dataset, showing the values that users will likely need to change in order to get things running.

Note: If you’re interested in running this in pipeline mode, see 16_Running_with_different_data.ipynb in the pipeline_examples/estimation_examples/ folder.

import os
from pathlib import Path

import matplotlib.pyplot as plt
import tables_io

from rail import interactive as ri

DOWNLOADS_DIR = Path("../examples_data")
DOWNLOADS_DIR.mkdir(exist_ok=True)

Install FSPS with the following commands:
pip uninstall fsps
git clone --recursive https://github.com/dfm/python-fsps.git
cd python-fsps
python -m pip install .
export SPS_HOME=$(pwd)/src/fsps/libfsps

LEPHAREDIR is being set to the default cache directory:
/home/runner/.cache/lephare/data
More than 1Gb may be written there.
LEPHAREWORK is being set to the default cache directory:
/home/runner/.cache/lephare/work
Default work cache is already linked.
This is linked to the run directory:
/home/runner/.cache/lephare/runs/20260504T123336

First, we’ll grab a small tar file from NERSC containing data drawn from the Roman-Rubin simulation:

training_file = DOWNLOADS_DIR / "romanrubin_demo_data.tar"

if not os.path.exists(training_file):
    os.system(
        f"curl -O https://portal.nersc.gov/cfs/lsst/PZ/romanrubin_demo_data.tar --create-dirs --output-dir {DOWNLOADS_DIR}"
    )
os.system(f"tar -xvf {training_file} --directory {DOWNLOADS_DIR}")

romanrubin_train_data.hdf5
romanrubin_test_data.hdf5
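
The cell above shells out to curl and tar. As a hedged alternative for systems where those tools are unavailable, the sketch below does the same download-if-missing and unpack steps with the Python standard library only. The helper name fetch_and_extract is hypothetical and is not part of RAIL.

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_and_extract(url, dest_dir):
    """Download url into dest_dir unless it is already there, then unpack it.

    Hypothetical helper, not part of RAIL. Returns the tar path and the
    names of the archive members.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last component of the URL.
    tar_path = dest_dir / url.rsplit("/", 1)[-1]
    if not tar_path.exists():
        urllib.request.urlretrieve(url, tar_path)
    with tarfile.open(tar_path) as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return tar_path, names
```

In this notebook it could be called as fetch_and_extract("https://portal.nersc.gov/cfs/lsst/PZ/romanrubin_demo_data.tar", DOWNLOADS_DIR).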

Let’s load one of the files and look at the contents:

trainFile = DOWNLOADS_DIR / "romanrubin_train_data.hdf5"
training_data = tables_io.read(trainFile)
training_data.keys()

odict_keys(['H', 'H_err', 'J', 'J_err', 'g', 'g_err', 'i', 'i_err', 'r', 'r_err', 'redshift', 'u', 'u_err', 'y', 'y_err', 'z', 'z_err'])

We can see that, unlike the demo data in other notebooks, there is no top-level hdf5_groupname of “photometry”; the data sits directly at the top level of the HDF5 file. As such, we will need to specify hdf5_groupname="" to override the default value of "photometry" in RAIL.
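
To make the role of hdf5_groupname concrete, here is a minimal sketch in which plain dictionaries stand in for the contents of an HDF5 file. The helper select_table is hypothetical and only illustrates the idea of descending into a named group when one is given; it is not RAIL's internal code.

```python
def select_table(file_contents, hdf5_groupname):
    """Return the column table, descending into a group only if one is named.

    Hypothetical helper; plain dicts stand in for an HDF5 file.
    """
    if hdf5_groupname:
        return file_contents[hdf5_groupname]
    return file_contents


# Layout of the usual demo files: columns live under a "photometry" group.
grouped = {"photometry": {"mag_u_lsst": [24.1], "redshift": [0.5]}}
# Layout of the Roman-Rubin files: columns sit at the top level.
flat = {"u": [24.1], "u_err": [0.05], "redshift": [0.5]}

print(sorted(select_table(grouped, "photometry")))
print(sorted(select_table(flat, "")))
```

Asking for the "photometry" group on the flat layout raises a KeyError in this sketch, which is why the default group name has to be overridden for this dataset.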

We also see that the magnitudes and errors are simply named with the band name, e.g. "u" rather than "mag_u_lsst". Again, we will need to specify the band and error names in order to override the defaults in RAIL. We will do that below using the KNearNeighInformer and KNearNeighEstimator algorithms, after first loading the test file:

testFile = DOWNLOADS_DIR / "romanrubin_test_data.hdf5"
test_data = tables_io.read(testFile)

The dataset-specific parameters

We will need to specify several parameters to override the default values in RAIL. We can collect these in a dictionary and pass it as keyword arguments to our informer. Because we have Roman J and H, we will also demonstrate running with eight bands rather than the default six.

RAIL requires that we specify the names of the input columns as bands, and the input errors on those as err_bands. Most algorithms also require a ref_band. To handle non-detections, RAIL uses a dictionary of mag_limits, which must contain a key for every column in bands, each mapped to the float value that replaces a non-detection in that band. You may also need to specify a different nondetect_val if the dataset has a different convention for non-detections (in this dataset, our non-detections have a value of np.inf).
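
The non-detection handling described above can be sketched in plain Python. This is an illustration of the idea only, under the assumption that a non-detection is flagged by a sentinel value (here infinity, matching this dataset); the helper replace_nondetections is hypothetical and the toy catalog values are made up for the example.

```python
import math


def replace_nondetections(mags, mag_limit, nondetect_val=math.inf):
    """Replace every non-detection in one band with that band's limit.

    Hypothetical helper, not RAIL's implementation. NaN needs a special
    case because NaN never compares equal to itself.
    """
    cleaned = []
    for mag in mags:
        if math.isnan(nondetect_val):
            is_nondetect = math.isnan(mag)
        else:
            is_nondetect = mag == nondetect_val
        cleaned.append(mag_limit if is_nondetect else mag)
    return cleaned


# Toy two-band catalog; the infinities mark non-detections.
mag_limits = {"u": 27.8, "J": 26.4}
catalog = {"u": [24.3, math.inf, 26.9], "J": [22.1, 23.0, math.inf]}
cleaned = {
    band: replace_nondetections(catalog[band], mag_limits[band])
    for band in mag_limits
}
print(cleaned)
```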

NOTE: RAIL uses SHARED_PARAMS, a central location for specifying a subset of parameters that are common to a dataset, so that they can be set in one place when running multiple algorithms. However, any configuration parameter specified in SHARED_PARAMS can be overridden in the same way as any other parameter. There is nothing special about them, and we will do that here with bands, err_bands, and the other dataset-specific values.
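
As a rough picture of what "overriding" means here, the sketch below merges a dictionary of defaults with user-supplied keyword arguments, with the user values taking precedence. The shared_defaults dictionary and the resolve_config helper are hypothetical stand-ins, and the default ref_band value shown is an assumption; this is not how RAIL stores or resolves its configuration internally.

```python
# Hypothetical stand-in for shared defaults; the values follow the
# "mag_{band}_lsst" pattern, and the ref_band default is an assumption.
shared_defaults = {
    "hdf5_groupname": "photometry",
    "bands": [f"mag_{band}_lsst" for band in "ugrizy"],
    "ref_band": "mag_i_lsst",
}


def resolve_config(defaults, **overrides):
    """Return a new dict in which explicit overrides win over defaults."""
    return {**defaults, **overrides}


config = resolve_config(
    shared_defaults, hdf5_groupname="", bands=["u", "g", "r"], ref_band="i"
)
print(config)
```

The defaults dictionary itself is left untouched, so the same defaults can be reused with a different set of overrides for another dataset.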

Let’s set up our dictionary with these values:

bands = ["u", "g", "r", "i", "z", "y", "J", "H"]
errbands = []
maglims = {}
limvals = [27.8, 29.0, 29.1, 28.6, 28.0, 27.0, 26.4, 26.4]
for band, limval in zip(bands, limvals):
    errbands.append(f"{band}_err")
    maglims[band] = limval


print(bands)
print(errbands)
print(maglims)

['u', 'g', 'r', 'i', 'z', 'y', 'J', 'H']
['u_err', 'g_err', 'r_err', 'i_err', 'z_err', 'y_err', 'J_err', 'H_err']
{'u': 27.8, 'g': 29.0, 'r': 29.1, 'i': 28.6, 'z': 28.0, 'y': 27.0, 'J': 26.4, 'H': 26.4}

knn_dict = dict(
    hdf5_groupname="", bands=bands, err_bands=errbands, mag_limits=maglims, ref_band="i"
)
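
Before passing such a dictionary on, it can help to check that its pieces agree with each other. The sketch below is a hypothetical sanity check, not a RAIL function. It assumes that err_bands should line up one-to-one with bands, that every band needs a mag_limits entry (as stated above), and that the reference band should be one of the listed bands.

```python
def check_band_config(bands, err_bands, mag_limits, ref_band):
    """Return a list of problems; an empty list means the config is consistent.

    Hypothetical helper, not part of RAIL.
    """
    problems = []
    if len(err_bands) != len(bands):
        problems.append(f"{len(bands)} bands but {len(err_bands)} err_bands")
    missing = [band for band in bands if band not in mag_limits]
    if missing:
        problems.append(f"no mag_limits entry for: {missing}")
    if ref_band not in bands:
        problems.append(f"ref_band {ref_band!r} is not one of the bands")
    return problems


# Same values as in the notebook cells above, rebuilt so this runs on its own.
bands = ["u", "g", "r", "i", "z", "y", "J", "H"]
err_bands = [f"{band}_err" for band in bands]
limvals = [27.8, 29.0, 29.1, 28.6, 28.0, 27.0, 26.4, 26.4]
mag_limits = dict(zip(bands, limvals))

problems = check_band_config(bands, err_bands, mag_limits, ref_band="i")
print(problems or "configuration looks consistent")
```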

We can now feed this into our inform stage:

pz_model = ri.estimation.algos.k_nearneigh.k_near_neigh_informer(
    training_data=training_data, **knn_dict
)["model"]

Inserting handle into data store.  input: None, KNearNeighInformer
split into 11250 training and 3750 validation samples
finding best fit sigma and NNeigh...

best fit values are sigma=0.017222222222222222 and numneigh=7

Inserting handle into data store.  model: inprogress_model.pkl, KNearNeighInformer

We can use the same dictionary to specify overrides for the estimator stage:

results = ri.estimation.algos.k_nearneigh.k_near_neigh_estimator(
    input_data=test_data, model=pz_model, **knn_dict
)

Inserting handle into data store.  input: None, KNearNeighEstimator
Inserting handle into data store.  model: {'kdtree': <sklearn.neighbors._kd_tree.KDTree object at 0x562e56708600>, 'bestsig': np.float64(0.017222222222222222), 'nneigh': 7, 'truezs': array([0.61988401, 1.74063779, 1.08068781, ..., 0.25938554, 0.92907312,
       2.84295586], shape=(15000,)), 'only_colors': False}, KNearNeighEstimator
Process 0 running estimator on chunk 0 - 20,000
Process 0 estimating PZ PDF for rows 0 - 20,000

Inserting handle into data store.  output: inprogress_output.hdf5, KNearNeighEstimator

To check that things ran properly, let’s first pull the mode of each redshift PDF out of the results:

zmode = results["output"].ancil["zmode"].flatten()
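
For readers unfamiliar with the term, the mode of a PDF is the redshift at which it peaks. The sketch below illustrates this for a single PDF evaluated on a grid; the helper pdf_mode and the toy grid are hypothetical, and this is not necessarily how the zmode values in the output were computed.

```python
def pdf_mode(z_grid, pdf_values):
    """Return the grid redshift at which a gridded PDF is largest."""
    best_index = max(range(len(pdf_values)), key=pdf_values.__getitem__)
    return z_grid[best_index]


# Toy PDF that peaks at z = 1.0.
z_grid = [0.0, 0.5, 1.0, 1.5, 2.0]
pdf_values = [0.05, 0.20, 0.50, 0.20, 0.05]
print(pdf_mode(z_grid, pdf_values))  # prints 1.0
```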

Let’s plot the redshift mode against the true redshifts to see how they look:

plt.figure(figsize=(8, 8))
plt.scatter(test_data["redshift"], zmode, s=1, c="k", label="KNN mode")
plt.plot([0, 3], [0, 3], "r--")
plt.xlabel("true redshift")
plt.ylabel("KNN mode")

[Figure: scatter plot of KNN mode versus true redshift, with the one-to-one line dashed in red (Running_with_different_data_19_1.png)]

Things look good: the points follow the one-to-one line closely, and the inclusion of NIR photometry gives us very little scatter and very few outliers!
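
To put numbers on "scatter" and "outliers", one option is to compute summary statistics of the scaled residual (z_est - z_true) / (1 + z_true). The sketch below uses the median as a bias estimate, the normalized median absolute deviation as a scatter estimate, and the fraction of objects beyond a fixed cut as the outlier rate. The helper photoz_summary is hypothetical, and the cut of 0.15 is an assumed convention rather than a value taken from this notebook.

```python
import statistics


def photoz_summary(z_true, z_est, outlier_cut=0.15):
    """Summarize point-estimate quality with three simple statistics.

    Hypothetical helper, not part of RAIL; outlier_cut=0.15 is an
    assumed convention.
    """
    scaled = [(est - true) / (1.0 + true) for true, est in zip(z_true, z_est)]
    bias = statistics.median(scaled)
    # 1.4826 scales the median absolute deviation to a Gaussian sigma.
    sigma_nmad = 1.4826 * statistics.median(abs(v - bias) for v in scaled)
    outlier_fraction = sum(abs(v) > outlier_cut for v in scaled) / len(scaled)
    return {
        "bias": bias,
        "sigma_nmad": sigma_nmad,
        "outlier_fraction": outlier_fraction,
    }


# Toy example: three perfect estimates and one catastrophic outlier.
summary = photoz_summary([0.5, 1.0, 1.5, 2.0], [0.5, 1.0, 1.5, 3.0])
print(summary)
```

With the arrays from this notebook, the call would be photoz_summary(test_data["redshift"], zmode).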

Clean up downloaded files

for file in [training_file, trainFile, testFile]:
    file.unlink()