Contributor Concepts

Some concepts in RAIL are only exposed in development, and their purpose and usage are detailed below.

Data Concepts

Developing RAIL involves data structures that are not exposed in the user-facing modes of RAIL (interactive and pipeline). These data structures are described below.

tables_io

tables_io provides an interface for working with table data from a variety of non-ASCII file formats, including fits, hdf5, parquet, and tabular formats from astropy, pandas, pyarrow, and numpy. It allows for chunked reading of some file formats for large data.

For further reading, visit the tables_io documentation.

qp Ensemble

Redshift data products may take many forms; probability density functions (PDFs) characterizing the redshift distribution of a sample of galaxies or each galaxy individually are defined by values of parameters under a choice of parameterization. To enable parameterization-agnostic downstream analyses, the qp package provides a shared interface to many parameterizations of univariate PDFs and utilities for performing conversions, evaluating metrics, and executing at-scale input-output operations. RAIL stages provide and/or ingest their photo-z data products as qp.Ensemble objects, both for collections of individual galaxies and for the summarized redshift distribution of samples of galaxies (such as members of a tomographic bin or galaxy cluster members). The key features of a qp.Ensemble are the metadata of the type of parameterization and defining parameters shared by the entire ensemble, the objdata values unique to each row-wise member of the ensemble that specify its PDF given the metadata, and the ancil information associated to each row-wise member that isn’t part of the parameterized PDF.
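The three-part layout described above can be pictured with a small toy class. This is only a sketch of the idea: ToyEnsemble, its pdf method, and the example values are hypothetical and are not the qp.Ensemble API; the dictionary key names are likewise chosen for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class ToyEnsemble:
    """Toy stand-in for the three-part layout; NOT the qp.Ensemble API."""

    metadata: dict  # shared by every row, e.g. the redshift grid
    objdata: dict   # one entry per row, e.g. PDF values on that grid
    ancil: dict = field(default_factory=dict)  # per-row extras, not part of the PDF

    def pdf(self, row: int, z: float) -> float:
        """Linearly interpolate the PDF of one row at redshift z."""
        xs = self.metadata["xvals"]
        ys = self.objdata["yvals"][row]
        if z < xs[0] or z > xs[-1]:
            return 0.0
        i = min(bisect_right(xs, z), len(xs) - 1)
        x0, x1 = xs[i - 1], xs[i]
        return ys[i - 1] + (ys[i] - ys[i - 1]) * (z - x0) / (x1 - x0)


ens = ToyEnsemble(
    metadata={"pdf_name": "interp", "xvals": [0.0, 1.0, 2.0]},
    objdata={"yvals": [[0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]},
    ancil={"object_id": [101, 102]},
)
```

Here the grid lives once in metadata, each row contributes only its own values in objdata, and ancil carries one extra value per row.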

For further reading, visit the qp documentation.

Data Handles

One particularity of RAIL is that we wrap data in rail.core.DataHandle objects rather than passing the data directly to functions. There are a few reasons for this.

Potentially large data volume

One challenge that RAIL must address is the potentially very large size of the datasets involved. At times we will be dealing with billions of objects, and will not be able to load the object tables into the memory of a single processor.

Parallel processing

DataHandle Class

rail.core.DataHandle is the class that lets users connect data to RAIL.

class rail.core.DataHandle

Class that acts as a handle for a piece of data, associating it with a file and providing tools to read it from and write it to that file

__init__(tag, data=None, path=None, creator=None)

Constructor

Parameters:
  • tag (str) – The tag under which this data handle can be found in the store

  • data (DataLike | None) – The associated data

  • path (str | None) – The path to the associated file

  • creator (str | None) – The name of the stage that created this data handle

Return type:

None

classmethod __new__(*args, **kwargs)

Basic file-like operations

DataHandle.open(**kwargs)

Open and return the associated file

Parameters:

**kwargs (Any) – Passed to the call to open the file in question

Returns:

Newly opened file

Return type:

FileLike

Notes

This will simply open the file and return a FileLike object to the caller. It will not read or cache the data

DataHandle.close(**kwargs)

Close the associated file

Parameters:

kwargs (Any)

Return type:

None

DataHandle.read(force=False, **kwargs)

Read and return the data from the associated file

Parameters:
  • force (bool) – If true, force re-reading the data

  • **kwargs (Any) – Passed to the call to read the data

Returns:

Data that were read

Return type:

DataLike

Notes

This will read the entire file, and while useful for testing on small files, will not work on very large files.
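The force flag suggests that a handle keeps the data it has already read. The sketch below illustrates that behaviour with a hypothetical ToyHandle backed by a JSON file; it is an assumption about the caching semantics, not the rail.core.DataHandle implementation.

```python
import json
import os
import tempfile


class ToyHandle:
    """Hypothetical stand-in that mimics the read/cache behaviour described above."""

    def __init__(self, tag, data=None, path=None):
        self.tag = tag
        self.data = data
        self.path = path

    def read(self, force=False):
        # Return the cached data unless a re-read is explicitly requested.
        if self.data is not None and not force:
            return self.data
        with open(self.path, "r", encoding="utf-8") as fin:
            self.data = json.load(fin)
        return self.data


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.json")
    with open(path, "w", encoding="utf-8") as fout:
        json.dump({"redshift": [0.1, 0.2]}, fout)

    handle = ToyHandle("input", path=path)
    first = handle.read()

    # Change the file on disk; a plain read() still returns the cached copy.
    with open(path, "w", encoding="utf-8") as fout:
        json.dump({"redshift": [0.3]}, fout)
    cached = handle.read()
    reread = handle.read(force=True)
```

In this toy, only the call with force=True picks up the change made to the file after the first read.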

DataHandle.write(**kwargs)

Write the data to the associated file

Parameters:

kwargs (Any)

Return type:

None

Operations for parallelized access to data

DataHandle.iterator(**kwargs)

Iterator over the data

Parameters:

kwargs (Any)

Return type:

Iterable
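As a rough picture of what iterating over data in chunks means, the sketch below yields row ranges of a fixed chunk size and deals them out to parallel processes. The function iter_chunks, its arguments, and the round-robin assignment are assumptions made for illustration, not the behaviour of DataHandle.iterator.

```python
def iter_chunks(n_rows, chunk_size, rank=0, parallel_size=1):
    """Yield (start, end) row ranges, dealing chunks out round-robin by rank."""
    for i, start in enumerate(range(0, n_rows, chunk_size)):
        if i % parallel_size == rank:
            # The last chunk may be shorter than chunk_size.
            yield start, min(start + chunk_size, n_rows)


all_chunks = list(iter_chunks(25, 10))
rank1_chunks = list(iter_chunks(25, 10, rank=1, parallel_size=2))
```

Each process only ever holds one chunk in memory, which is what makes tables too large for a single processor tractable.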

DataHandle.size(**kwargs)

Return the size of the data associated to this handle

Parameters:

kwargs (Any)

Return type:

int

DataHandle.data_size(**kwargs)

Return the size of the in memory data

Parameters:

kwargs (Any)

Return type:

int

DataHandle.initialize_write(data_length, **kwargs)

Initialize file to be written by chunks

Parameters:
  • data_length (int) – Number of rows of data that we will write, used to reserve space

  • **kwargs (Any) – Information about the columns we will write

Return type:

None

DataHandle.write_chunk(start, end, **kwargs)

Write one chunk of the data to the associated file

Parameters:
  • start (int) – Index of starting row for this chunk of data

  • end (int) – Index of ending row for this chunk of data

  • **kwargs (Any) – Passed to call to write this chunk of data

Return type:

None

DataHandle.finalize_write(**kwargs)

Finalize and close file written by chunks

Parameters:

**kwargs (Any) – Passed to the call to finalize the file

Return type:

None
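The three chunked-write methods are meant to be called in sequence: reserve space, write each chunk, then finalize. The sketch below walks through that sequence with a hypothetical in-memory ToyChunkedWriter; the exclusive end index and the values argument are assumptions, and the real methods take format-specific keyword arguments instead.

```python
class ToyChunkedWriter:
    """Hypothetical stand-in for the initialize / write_chunk / finalize sequence."""

    def __init__(self):
        self.buffer = None
        self.closed = False

    def initialize_write(self, data_length):
        # Reserve space for every row up front, as a chunked file format would.
        self.buffer = [None] * data_length

    def write_chunk(self, start, end, values):
        # Assumption: rows [start, end) are written, i.e. end is exclusive.
        if end - start != len(values):
            raise ValueError("chunk length does not match [start, end)")
        self.buffer[start:end] = values

    def finalize_write(self):
        if any(v is None for v in self.buffer):
            raise RuntimeError("some rows were never written")
        self.closed = True


data = [0.1 * i for i in range(7)]
writer = ToyChunkedWriter()
writer.initialize_write(len(data))
for start in range(0, len(data), 3):
    end = min(start + 3, len(data))
    writer.write_chunk(start, end, data[start:end])
writer.finalize_write()
```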

Functions for working with DataHandles

DataHandle.set_data(data, partial=False)

Set the data for a chunk, and set the partial flag to true

Parameters:
  • data (rail.core.data.DataLike)

  • partial (bool)

Return type:

None

classmethod DataHandle.make_name(tag)

Construct and return file name for a particular data tag

Parameters:

tag (str)

Return type:

str

Ephemeral Data Stores

rail.core.DataStore is the class that is used by RAIL stages to keep track of data within a stage. Each stage should have its own DataStore, and other stages cannot access that DataStore.

class rail.core.DataStore

Class to provide a transient data store

This class:

  1. associates data products with keys

  2. provides functions to read and write the various data products to associated files

__init__(**kwargs)

Build from keywords

Note

All of the values must be data handles or this will raise a TypeError

Parameters:

kwargs (Any)

Return type:

None
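The rule that every value must be a data handle can be sketched with a small dictionary subclass. ToyStore and ToyHandle are hypothetical stand-ins written for this example; they are not the rail.core classes, and the error message is invented.

```python
class ToyHandle:
    """Minimal hypothetical handle: a tag plus optional in-memory data."""

    def __init__(self, tag, data=None):
        self.tag = tag
        self.data = data


class ToyStore(dict):
    """Hypothetical key -> handle mapping that rejects anything that is not a handle."""

    def __init__(self, **kwargs):
        super().__init__()
        for key, value in kwargs.items():
            self[key] = value

    def __setitem__(self, key, value):
        if not isinstance(value, ToyHandle):
            raise TypeError(
                f"Can only add handles to the store, not {type(value).__name__}"
            )
        super().__setitem__(key, value)


store = ToyStore(input=ToyHandle("input", data=[1, 2, 3]))

try:
    ToyStore(input=[1, 2, 3])  # raw data, not wrapped in a handle
    rejected = False
except TypeError:
    rejected = True
```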

classmethod __new__(*args, **kwargs)

DataStore Functionality

DataStore.add_handle(key, handle_class, path, creator='DataStore')

Create a handle for some data, and insert it into the DataStore

Parameters:
  • key (str)

  • handle_class (type[DataHandle])

  • path (str)

  • creator (str)

Return type:

DataHandle

DataStore.read_file(key, handle_class, path, creator='DataStore', **kwargs)

Create a handle, use it to read a file, and insert it into the DataStore

Parameters:
  • key (str)

  • handle_class (type[DataHandle])

  • path (str)

  • creator (str)

  • kwargs (Any)

Return type:

DataHandle

DataStore.read(key, force=False, **kwargs)

Read the data associated to a particular key

Parameters:
  • key (str)

  • force (bool)

  • kwargs (Any)

Return type:

rail.core.data.DataLike

DataStore.open(key, mode='r', **kwargs)

Open and return the file associated to a particular key

Parameters:
  • key (str)

  • mode (str)

  • kwargs (Any)

Return type:

rail.core.data.FileLike

DataStore.write(key, **kwargs)

Write the data associated to a particular key

Parameters:
  • key (str)

  • kwargs (Any)

Return type:

None

DataStore.write_all(force=False, **kwargs)

Write all the data in this DataStore

Parameters:
  • force (bool)

  • kwargs (Any)

Return type:

None

Shared Parameters

RAIL is designed to be used with a variety of different data. Column names vary from one dataset to another, for example the name of the column holding the true redshift of a simulated object, or the names of the columns holding the observed magnitudes in different filters. By enforcing consistent naming conventions across RailStage sub-classes, we have made it simple to configure RAIL to read data from a particular source in one place, rather than having to edit the configurations of many different RailStages.

When using a single stage (e.g. testing an algorithm in a Jupyter notebook), it is also possible to overwrite the default settings for the input data directly for the stage, without involving the shared parameters, by simply specifying catalog information in the make_stage step. For example, your input catalog may have band names like “{band}_gaap1p0Mag”, which is different from the default values in RAIL. To set this in MyFavouriteInformer, do:

MyFavouriteInformer.make_stage(bands=[f"{band}_gaap1p0Mag" for band in "ugrizy"])

Note that a stage typically requires changes to several input parameters at once (e.g. err_bands and ref_band need to be changed accordingly). Note also that if the user wants to run MyFavouriteEstimator next, they will need to repeat this in the make_stage call for the estimator. This is why, when running many stages, using the shared parameters described below is preferred.
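Because several column-name parameters usually have to change together, it can help to build them from one template. The helper below is a minimal sketch: catalog_overrides is a hypothetical function, and the error-column pattern "{band}_gaap1p0MagErr" is an assumed naming convention chosen for illustration.

```python
def catalog_overrides(mag_template, err_template, filters="ugrizy", ref_filter="i"):
    """Build a consistent set of column-name overrides from two name templates."""
    bands = [mag_template.format(band=f) for f in filters]
    err_bands = [err_template.format(band=f) for f in filters]
    return {
        "bands": bands,
        "err_bands": err_bands,
        "ref_band": mag_template.format(band=ref_filter),
    }


# The error-column template here is hypothetical.
overrides = catalog_overrides("{band}_gaap1p0Mag", "{band}_gaap1p0MagErr")
```

The resulting dictionary could then be unpacked as keyword arguments into each make_stage call, which keeps the informer and the estimator in step.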

class rail.core.common_params.SharedParams

Bases: object

Parameters:
  • hdf5_groupname ([str] default=photometry) – name of hdf5 group for data, if None, then set to ‘’

  • chunk_size ([int] default=10000) – Number of objects per chunk for parallel processing or to evaluate per loop in single node processing

  • zmin ([float] default=0.0) – The minimum redshift of the z grid or sample

  • zmax ([float] default=3.0) – The maximum redshift of the z grid or sample

  • nzbins ([int] default=301) – The number of gridpoints in the z grid

  • dz ([float] default=0.01) – delta z in grid

  • nondetect_val ([float] default=99.0) – value to be replaced with magnitude limit for non detects

  • nonobserved_val ([float] default=-99.0) – guard value for non-observations

  • bands ([list] default=['mag_u_lsst', 'mag_g_lsst', 'mag_r_lsst', 'mag_i_lsst', 'mag_z_lsst', 'mag_y_lsst']) – Names of columns for magnitude by filter band

  • err_bands ([list] default=['mag_err_u_lsst', 'mag_err_g_lsst', 'mag_err_r_lsst', 'mag_err_i_lsst', 'mag_err_z_lsst', 'mag_err_y_lsst']) – Names of columns for magnitude errors by filter band

  • err_dict ([dict] default={'mag_u_lsst': 'mag_err_u_lsst', 'mag_g_lsst': 'mag_err_g_lsst', 'mag_r_lsst': 'mag_err_r_lsst', 'mag_i_lsst': 'mag_err_i_lsst', 'mag_z_lsst': 'mag_err_z_lsst', 'mag_y_lsst': 'mag_err_y_lsst', 'redshift': None}) – dictionary that contains the columns that will be used to predict as the keys and the errors associated with that column as the values. If a column does not have an associated error, its value should be None

  • mag_limits ([dict] default={'mag_u_lsst': 27.79, 'mag_g_lsst': 29.04, 'mag_r_lsst': 29.06, 'mag_i_lsst': 28.62, 'mag_z_lsst': 27.98, 'mag_y_lsst': 27.05}) – Limiting magnitudes by filter

  • band_a_env ([dict] default={'mag_u_lsst': 4.81, 'mag_g_lsst': 3.64, 'mag_r_lsst': 2.7, 'mag_i_lsst': 2.06, 'mag_z_lsst': 1.58, 'mag_y_lsst': 1.31}) – Reddening parameters

  • ref_band ([str] default=mag_i_lsst) – band to use in addition to colors

  • redshift_col ([str] default=redshift) – name of redshift column

  • id_col ([str] default=object_id) – name of the object ID column

  • object_id_col ([str] default=objectId) – name of object id column

  • zp_errors ([list] default=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1]) – BPZ adds these values in quadrature to the photometric errors

  • calc_summary_stats ([bool] default=False) – Compute summary statistics

  • calculated_point_estimates ([list] default=[]) – List of strings defining which point estimates to automatically calculate using qp.Ensemble. Options include ‘mean’, ‘mode’, ‘median’.

  • recompute_point_estimates ([bool] default=False) – Force recomputation of point estimates

  • replace_error_vals ([list] default=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1]) – list of values to replace negative and nan mag err values

  • filter_list ([list] default=['DC2LSST_u', 'DC2LSST_g', 'DC2LSST_r', 'DC2LSST_i', 'DC2LSST_z', 'DC2LSST_y']) – list of filter file names (with no ‘.sed’ suffix). Filters must be in FILTER dir. MUST BE IN SAME ORDER as ‘bands’

  • leaf_size ([int] default=15) – The leaf size for tree algorithms.

  • max_wavelength ([float] default=12000) – The maximum rest-frame wavelength

  • min_wavelength ([float] default=250) – The minimum rest-frame wavelength.

  • redshift_key ([str] default=redshifts) – The keyword of the redshift group in the hdf5 dataset.
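The default values of zmin, zmax, nzbins and dz listed above are consistent with one another if the grid is evenly spaced and includes both end points. The short check below works through that arithmetic; the inclusive-grid convention is an assumption, and individual stages may build their grids differently.

```python
import math

zmin, zmax, nzbins = 0.0, 3.0, 301

# Assumption: the grid includes both end points, so there are nzbins - 1 intervals.
dz = (zmax - zmin) / (nzbins - 1)
zgrid = [zmin + i * dz for i in range(nzbins)]

# The spacing implied by the other three defaults matches the default dz of 0.01.
matches_default_dz = math.isclose(dz, 0.01)
```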

static copy_param(param_name)

Return a copy of one of the shared parameters

Parameters:

param_name (str) – Name of the parameter to copy

Returns:

Copied parameter

Return type:

Param

static set_param_default(param_name, default_value)

Change the default value of one of the shared parameters

Parameters:
  • param_name (str) – Name of the parameter to change

  • default_value (Any) – New default value

Return type:

None

static set_param_defaults(**kwargs)

Change the default value of several of the shared parameters

Parameters:

**kwargs (Any) – Key, value pairs of parameter names and default values

Return type:

None
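The interplay between copying a parameter and changing its default can be sketched with a toy registry. ToySharedParams and ToyParam are hypothetical and hold only two parameters; the sketch assumes each stage receives its own copy of a shared parameter, and the column name "z_true" is made up.

```python
import copy


class ToyParam:
    """Hypothetical parameter record: type, default value, and help message."""

    def __init__(self, dtype, default, msg=""):
        self.dtype = dtype
        self.default = default
        self.msg = msg


class ToySharedParams:
    """Hypothetical registry mimicking copy_param / set_param_defaults."""

    _params = {
        "zmax": ToyParam(float, 3.0, "The maximum redshift of the z grid or sample"),
        "redshift_col": ToyParam(str, "redshift", "name of redshift column"),
    }

    @staticmethod
    def copy_param(param_name):
        # Each caller gets its own copy, so later per-stage changes do not leak.
        return copy.copy(ToySharedParams._params[param_name])

    @staticmethod
    def set_param_defaults(**kwargs):
        for name, value in kwargs.items():
            ToySharedParams._params[name].default = value


before = ToySharedParams.copy_param("redshift_col")
ToySharedParams.set_param_defaults(redshift_col="z_true", zmax=6.0)
after = ToySharedParams.copy_param("redshift_col")
```

In this toy, a copy taken before the defaults change keeps the old value, which suggests setting shared defaults before any copies are taken.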

Catalog Tags

rail.utils.catalog_utils.CatalogConfigBase provides an interface to switch between different input catalogs.

class rail.utils.catalog_utils.CatalogConfigBase

Class that wraps the settings of shared configuration parameters needed to work with the particular column names in a given catalog type

__init__()

classmethod __new__(*args, **kwargs)

classmethod CatalogConfigBase.apply(tag)

Activate a particular tag

Parameters:

tag (str)

Return type:

None

classmethod CatalogConfigBase.active_class()

Return the currently active class

Return type:

type[T] | None

classmethod CatalogConfigBase.active_tag()

Return the currently active tag

Return type:

str | None
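One way to picture switching catalogs by tag is a base class whose subclasses register themselves under a tag. The sketch below is a toy under that assumption: ToyCatalogConfig, ToyGaapConfig, and the tag "toy_gaap" are hypothetical, and the registration mechanism is not taken from rail.utils.catalog_utils.

```python
class ToyCatalogConfig:
    """Hypothetical base class: subclasses register themselves under a tag."""

    tag = None
    bands = []
    _registry = {}
    _active = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.tag is not None:
            ToyCatalogConfig._registry[cls.tag] = cls

    @classmethod
    def apply(cls, tag):
        # Unknown tags raise KeyError rather than silently keeping the old catalog.
        ToyCatalogConfig._active = ToyCatalogConfig._registry[tag]

    @classmethod
    def active_class(cls):
        return ToyCatalogConfig._active

    @classmethod
    def active_tag(cls):
        active = ToyCatalogConfig._active
        return None if active is None else active.tag


class ToyGaapConfig(ToyCatalogConfig):
    tag = "toy_gaap"
    bands = [f"{b}_gaap1p0Mag" for b in "ugrizy"]


tag_before = ToyCatalogConfig.active_tag()
ToyCatalogConfig.apply("toy_gaap")
```

A real implementation would presumably also push the column names of the active class into the shared parameters when a tag is applied.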

Developing the Interactive Module

The contents of rail_base/src/rail/interactive should not be manually edited. The files in this folder are generated by the script create_interactive_structure.py in rail_base. That script contains a list of high-level submodules of RAIL, which will have __init__.py files created within the interactive directory.

Instead, modifications should be made either to that creation script, or to the functions in the utility folder rail_base/src/rail/utils/interactive.

All subclasses of RailStage are considered valid candidates to have interactive functions, and it is thus an error for a RailStage to lack the infrastructure required to have an interactive function generated. See the notes on adding new stages regarding requirements and best practices when writing RAIL stages to work well with the interactive module.