Contributor Concepts
Some RAIL concepts are exposed only during development. Their purpose and usage are described below.
Data Concepts
When developing RAIL, you work with data structures that are not exposed through the user-facing modes of RAIL (interactive and pipeline). These data structures are described below.
tables_io
tables_io provides an interface for working with table data from a variety
of non-ASCII file formats, including FITS, HDF5, and Parquet, and with tabular
formats from astropy, pandas, pyarrow, and numpy. It supports chunked
reading of some file formats, for data too large to load at once.
For further reading, visit the tables_io documentation.
qp Ensemble
Redshift data products may take many forms; probability density functions (PDFs)
characterizing the redshift distribution of a sample of galaxies or each galaxy
individually are defined by values of parameters under a choice of
parameterization. To enable parameterization-agnostic downstream analyses, the
qp package provides a shared interface to
many parameterizations of univariate PDFs and utilities for performing
conversions, evaluating metrics, and executing at-scale input-output operations.
RAIL stages provide and/or ingest their photo-z data products as qp.Ensemble
objects, both for collections of individual galaxies and for the summarized
redshift distribution of samples of galaxies (such as members of a tomographic
bin or galaxy cluster members). The key features of a qp.Ensemble are the
metadata of the type of parameterization and defining parameters shared by the
entire ensemble, the objdata values unique to each row-wise member of the
ensemble that specify its PDF given the metadata, and the ancil information
associated to each row-wise member that isn’t part of the parameterized PDF.
For further reading, visit the qp documentation.
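The split between metadata, objdata, and ancil can be illustrated with a toy example in plain Python. This is only a sketch of the layout and does not use the qp API: the dictionary keys (pdf_name, bins, pdfs, galaxy_id) and the helper functions evaluate_pdf and normalization are hypothetical names chosen for illustration, and a histogram is assumed as the parameterization.

```python
from bisect import bisect_right

# Toy stand-in for the three parts of an ensemble (not the qp API).
# metadata: shared by every row; here, a histogram parameterization.
metadata = {"pdf_name": "hist", "bins": [0.0, 0.5, 1.0, 1.5]}

# objdata: one row of parameter values per galaxy (bin heights here).
objdata = {"pdfs": [[1.0, 0.8, 0.2], [0.2, 0.6, 1.2]]}

# ancil: per-row extras that are not part of the parameterized PDF.
ancil = {"galaxy_id": [101, 102]}


def evaluate_pdf(row, z):
    """Return the histogram PDF value of one ensemble member at redshift z."""
    bins = metadata["bins"]
    if z < bins[0] or z >= bins[-1]:
        return 0.0
    # bisect_right locates the bin whose left edge is <= z.
    index = bisect_right(bins, z) - 1
    return objdata["pdfs"][row][index]


def normalization(row):
    """Integrate one member's PDF over all bins."""
    bins = metadata["bins"]
    widths = [hi - lo for lo, hi in zip(bins[:-1], bins[1:])]
    return sum(w * h for w, h in zip(widths, objdata["pdfs"][row]))
```

Because the bin edges live in the shared metadata, each additional galaxy costs only one more row of objdata and ancil, which is the property that keeps large ensembles compact.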
Data Handles
One particularity of RAIL is that we wrap data in
rail.core.DataHandle objects rather than passing the data directly
to functions. There are a few reasons for this.
Potentially large data volume
One of the challenges that RAIL must address is that its datasets can be very large. At times we will be dealing with billions of objects and will not be able to load the object tables into the memory of a single processor.
Parallel processing
DataHandle Class
rail.core.DataHandle is the class that lets users connect data to
RAIL.
- class rail.core.DataHandle
Class that acts as a handle for a piece of data, associating it with a file and providing tools to read and write it to that file
- __init__(tag, data=None, path=None, creator=None)
Constructor
- Parameters:
tag (str) – The tag under which this data handle can be found in the store
data (DataLike | None) – The associated data
path (str | None) – The path to the associated file
creator (str | None) – The name of the stage that created this data handle
- Return type:
None
- classmethod __new__(*args, **kwargs)
Basic file-like operations
- DataHandle.open(**kwargs)
Open and return the associated file
- Parameters:
**kwargs (Any) – Passed to the call to open the file in question
- Returns:
Newly opened file
- Return type:
FileLike
Notes
This will simply open the file and return a FileLike object to the caller. It will not read or cache the data
- DataHandle.close(**kwargs)
Close the associated file
- Parameters:
kwargs (Any)
- Return type:
None
- DataHandle.read(force=False, **kwargs)
Read and return the data from the associated file
- Parameters:
force (bool) – If true, force re-reading the data
**kwargs (Any) – Passed to the call to read the data
- Returns:
Data that were read
- Return type:
DataLike
Notes
This reads the entire file. It is useful for testing on small files, but will not work on very large files.
- DataHandle.write(**kwargs)
Write the data to the associated file
- Parameters:
kwargs (Any)
- Return type:
None
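The caching behavior implied by the force argument of read can be sketched with a toy handle. The class below is a hypothetical stand-in and not rail.core.DataHandle: ToyHandle, its n_reads counter, the demo_round_trip helper, and the use of JSON as the file format are all assumptions made so that the example runs with only the standard library.

```python
import json
import os
import tempfile


class ToyHandle:
    """Hypothetical stand-in for a data handle; not rail.core.DataHandle."""

    def __init__(self, tag, data=None, path=None, creator=None):
        self.tag = tag
        self.data = data
        self.path = path
        self.creator = creator
        self.n_reads = 0  # counts real file reads, for illustration only

    def read(self, force=False):
        # Reuse cached data unless the caller forces a re-read.
        if self.data is not None and not force:
            return self.data
        with open(self.path, "r", encoding="utf-8") as stream:
            self.data = json.load(stream)
        self.n_reads += 1
        return self.data

    def write(self):
        if self.path is None:
            raise ValueError(f"handle '{self.tag}' has no path to write to")
        with open(self.path, "w", encoding="utf-8") as stream:
            json.dump(self.data, stream)


def demo_round_trip():
    """Write a small table through one handle and read it through another."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "toy_table.json")
        writer = ToyHandle("input", data={"redshift": [0.1, 0.5]}, path=path)
        writer.write()

        reader = ToyHandle("input", path=path)
        first = reader.read()
        reader.read()            # served from the cached data
        reader.read(force=True)  # goes back to the file
        return first, reader.n_reads
```

In this sketch the handle carries the path and the cached data together, so code that receives a handle can decide when, or whether, to load the file.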
Operations for parallelized access to data
- DataHandle.iterator(**kwargs)
Iterator over the data
- Parameters:
kwargs (Any)
- Return type:
Iterable
- DataHandle.size(**kwargs)
Return the size of the data associated to this handle
- Parameters:
kwargs (Any)
- Return type:
int
- DataHandle.data_size(**kwargs)
Return the size of the in memory data
- Parameters:
kwargs (Any)
- Return type:
int
- DataHandle.initialize_write(data_length, **kwargs)
Initialize file to be written by chunks
- Parameters:
data_length (int) – Number of rows of data that we will write, used to reserve space
**kwargs (Any) – Information about the columns we will write
- Return type:
None
- DataHandle.write_chunk(start, end, **kwargs)
Write one chunk of the data to the associated file
- Parameters:
start (int) – Index of starting row for this chunk of data
end (int) – Index of ending row for this chunk of data
**kwargs (Any) – Passed to call to write this chunk of data
- Return type:
None
- DataHandle.finalize_write(**kwargs)
Finalize and close file written by chunks
- Parameters:
**kwargs (Any) – Passed to the call to finalize the file
- Return type:
None
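The chunked write sequence (initialize_write, then repeated set_data and write_chunk calls, then finalize_write) can be sketched as follows. This is a toy in-memory stand-in and not the RAIL implementation: ToyChunkedHandle, iterate_chunks, and double_in_chunks are hypothetical names, a Python list stands in for the output file, and the (start, end, chunk) triple yielded by the iterator is an assumption about how chunks are delivered.

```python
class ToyChunkedHandle:
    """Hypothetical in-memory stand-in; not rail.core.DataHandle."""

    def __init__(self, tag):
        self.tag = tag
        self.data = None      # the current chunk, set by set_data
        self.partial = False
        self.storage = None   # stands in for the output file
        self.closed = False

    def set_data(self, data, partial=False):
        self.data = data
        self.partial = partial

    def initialize_write(self, data_length):
        # Reserve space for every row up front.
        self.storage = [None] * data_length

    def write_chunk(self, start, end):
        if end - start != len(self.data):
            raise ValueError("chunk length does not match [start, end)")
        self.storage[start:end] = self.data

    def finalize_write(self):
        self.closed = True


def iterate_chunks(rows, chunk_size):
    """Yield (start, end, chunk) triples; the triple layout is an assumption."""
    for start in range(0, len(rows), chunk_size):
        end = min(start + chunk_size, len(rows))
        yield start, end, rows[start:end]


def double_in_chunks(rows, chunk_size):
    """Double every value, handing the handle one chunk at a time."""
    output = ToyChunkedHandle("output")
    output.initialize_write(len(rows))
    for start, end, chunk in iterate_chunks(rows, chunk_size):
        output.set_data([2 * value for value in chunk], partial=True)
        output.write_chunk(start, end)
    output.finalize_write()
    return output
```

Because each chunk is written to its own row range, chunks could in principle be produced by separate processes, which is why the total length is needed before any chunk is written.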
Functions for working with DataHandles
- DataHandle.set_data(data, partial=False)
Set the data for a chunk, and set the partial flag to the value of partial
- Parameters:
data (rail.core.data.DataLike)
partial (bool)
- Return type:
None
- classmethod DataHandle.make_name(tag)
Construct and return file name for a particular data tag
- Parameters:
tag (str)
- Return type:
str
Ephemeral Data Stores
rail.core.DataStore is the class that is used by RAIL stages to
keep track of data within a stage. Each stage should have its own DataStore,
and other stages cannot access that DataStore.
- class rail.core.DataStore
Class to provide a transient data store
This class:
associates data products with keys
provides functions to read and write the various data products to associated files
- __init__(**kwargs)
Build from keywords
Note
All of the values must be data handles or this will raise a TypeError
- Parameters:
kwargs (Any)
- Return type:
None
- classmethod __new__(*args, **kwargs)
DataStore Functionality
- DataStore.add_handle(key, handle_class, path, creator='DataStore')
Create a handle for some data, and insert it into the DataStore
- Parameters:
key (str)
handle_class (type[DataHandle])
path (str)
creator (str)
- Return type:
- DataStore.read_file(key, handle_class, path, creator='DataStore', **kwargs)
Create a handle, use it to read a file, and insert it into the DataStore
- Parameters:
key (str)
handle_class (type[DataHandle])
path (str)
creator (str)
kwargs (Any)
- Return type:
- DataStore.read(key, force=False, **kwargs)
Read the data associated to a particular key
- Parameters:
key (str)
force (bool)
kwargs (Any)
- Return type:
rail.core.data.DataLike
- DataStore.open(key, mode='r', **kwargs)
Open and return the file associated to a particular key
- Parameters:
key (str)
mode (str)
kwargs (Any)
- Return type:
rail.core.data.FileLike
- DataStore.write(key, **kwargs)
Write the data associated to a particular key
- Parameters:
key (str)
kwargs (Any)
- Return type:
None
- DataStore.write_all(force=False, **kwargs)
Write all the data in this DataStore
- Parameters:
force (bool)
kwargs (Any)
- Return type:
None
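The key-to-handle bookkeeping described above, including the TypeError raised for values that are not data handles, can be sketched with a small dictionary subclass. ToyHandle and ToyStore are hypothetical classes written for this illustration and are not rail.core.DataHandle or rail.core.DataStore; in particular, returning the new handle from add_handle is an assumption.

```python
class ToyHandle:
    """Hypothetical minimal handle; not rail.core.DataHandle."""

    def __init__(self, tag, data=None, path=None, creator=None):
        self.tag = tag
        self.data = data
        self.path = path
        self.creator = creator


class ToyStore(dict):
    """Hypothetical key-to-handle store; not rail.core.DataStore."""

    def __init__(self, **kwargs):
        super().__init__()
        for key, value in kwargs.items():
            self[key] = value

    def __setitem__(self, key, value):
        # Reject anything that is not a handle, mirroring the documented TypeError.
        if not isinstance(value, ToyHandle):
            raise TypeError(f"value for '{key}' is not a ToyHandle: {type(value)}")
        super().__setitem__(key, value)

    def add_handle(self, key, handle_class, path, creator="ToyStore"):
        # Build the handle from its class, then register it under the key.
        handle = handle_class(key, path=path, creator=creator)
        self[key] = handle
        return handle
```

A store built this way only ever holds handles, so code that looks up a key can rely on the handle attributes being present without checking the value type first.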
Developing the Interactive Module
The contents of rail_base/src/rail/interactive should not be manually edited.
The files in this folder are generated by the script create_interactive_structure.py
in rail_base. That script contains a list of high-level submodules of RAIL, which
will have __init__.py files created within the interactive directory.
To change the interactive module, modify either that creation script or the
functions in the utility folder rail_base/src/rail/utils/interactive.
All subclasses of RailStage are considered valid candidates to have interactive
functions, and it is thus an error for a RailStage to lack the required
infrastructure to have an interactive function generated. See the notes on
adding new stages regarding requirements and best practices when
writing RAIL stages to work well with the interactive module.