********
Creation
********

.. quick summary: covers everything that you would skip if you had real data
.. creating and modeling samples of photometric catalogs of galaxies and modifying them to add noise and biases

Creation is a type of RAIL stage which creates and models samples of photometric
catalogs of galaxies, and modifies them to add noise and biases. Creation stages
use creators to generate data, and degraders to add noise.

.. image:: /images/creation.png

.. contents:: Table of Contents
   :backlinks: top
   :local:

========
Creators
========

.. format and check

Mock DESC data are important for systematically testing the performance of
various photo-:math:`z` algorithms.  One of the lessons learned from DC1 is that it is
desirable for the mock data to include not only true redshifts and LSST
photometry (i.e., fluxes in the six LSST bands) but also true posterior PDFs,
:math:`p(z_t | \mathbf{p}_t)`, which are unavailable for spectroscopically confirmed
data sets as well as traditional simulations.  Furthermore, the mock photometry
data should contain realistic noise, selection effects, and biases. This is
critical for the training and validation of photo-:math:`z` algorithms.

To address these needs, :py:class:`rail.creation` enables us to create datasets
with true PDFs that allow PDF-to-PDF metrics computations and forward-modeling
of mock data for validating photo-z approaches under realistically complex
conditions.  This is realized by two main types of stages within
:py:class:`rail.creation`: (1) `engines` that forward-model photometric catalogs
and (2) `degraders` that modify such catalogs to introduce tunable physical
imperfections.


An engine is defined by a pair of stages that are subclasses of each of the
following superclasses: :py:class:`rail.creation.Modeler` makes a model of the
:math:`p(z, \mathrm{photometry})` joint probability space based on input parameters or
data, and :py:class:`rail.creation.Creator` samples :math:`(z, \mathrm{photometry})`
from the forward model.

.. ============
.. Creators API
.. ============

.. format and check

--------------------------------------------
FSPS (Flexible Stellar Population Synthesis)
--------------------------------------------

RAIL Package: https://github.com/LSSTDESC/rail_fsps

``FSPS`` is a RAIL module that creates an interface to the `Python` bindings of
the popular stellar population synthesis (SPS) code ``FSPS`` (Flexible Stellar
Population Synthesis, Conroy et al. 2009, 2010). ``FSPS`` aims at generating
realistic galaxy spectral energy distributions (SEDs) by modelling all the
components that contribute to the light from a galaxy: stars, gas, dust and AGN.
``FSPS`` is widely used both for stellar population inference (Johnson et al.
1)    and for forward modelling of galaxy SEDs (e.g., Alsing et al. 2023,
Tortorelli et al. 2024).

``FSPS`` provides substantial flexibility in terms of the prescription for
modelling each of the mentioned components. It also requires physical properties
of galaxies as input, such as star formation histories (SFHs), metallicities and
redshift, in order to generate their SEDs. We maintained this flexibility in the
interface we implemented in RAIL, allowing the user to change every possible
``FSPS`` parameter. The code has been parallelized to make efficient use of the
multiprocessing capabilities of CPUs.

The interface is integrated in the RAIL workflow, requiring as input a catalog
of galaxy physical properties in the form of :py:class:`Hdf5Handle`. These are galaxy
redshifts, stellar metallicities, velocity dispersions, gas metallicities and
ionization parameters (defined as the ratio of ionizing photons to the total
hydrogen density), dust attenuation and emission parameters, and star-formation
histories.

``FSPS`` follows the structure of `engines`. The :py:class:`Modeler` class requires galaxy
physical properties as input and produces as output an :py:class:`Hdf5Handle` that
contains the ``FSPS``-generated rest-frame SED for each galaxy and the common
rest-frame wavelength grid. The user can choose the units of the output
rest-frame SEDs by setting the appropriate keyword value. The default behavior
is to output the SEDs in a wavelength grid.

The output rest-frame SEDs constitute the input for the ``FSPS`` :py:class:`Creator` class.
The latter computes apparent AB magnitudes for a set of user-defined
waveband filters. Notice that the wavelength range spanned by the waveband
filters should be within the SED observed-frame wavelength ranges. A default set
of filters is implemented in :py:mod:`rail.fsps`, containing the Rubin LSST filters
among others.

.. autoclass:: rail.fsps.FSPSSedModeler
    :noindex:

.. autoclass:: rail.fsps.FSPSPhotometryCreator
    :noindex:

--------------------------------------------------
DSPS (Differentiable Stellar Population Synthesis)
--------------------------------------------------

RAIL Package: https://github.com/LSSTDESC/rail_dsps

``dsps`` is a module that creates an interface in RAIL to the code DSPS
(Differentiable Stellar Population Synthesis, Hearin et al. 2023). DSPS is
implemented natively in the JAX library as its main aim is to produce
differentiable predictions for the SED of a galaxy based on SPS. The
implementation in JAX allows DSPS to be a factor of 5 faster than standard SPS
codes, such as FSPS, and more than 300 times faster, if run on a modern GPU.
DSPS does not come with stellar population templates; they must be provided by
the user. The code contains a series of convenience functions that allow the
user to generate stellar population templates with FSPS. If no templates are
supplied, the implementation in RAIL automatically downloads a set of
FSPS-generated stellar population templates.

The :py:class:`Modeler` class of ``dsps`` requires as input a catalog of galaxy physical
properties in the form of :py:class:`Hdf5Handles`. In particular, the user provides, for
each galaxy, a star-formation history, a grid of Universe age over which the
stellar mass build-up takes place, and a value for the mean and scatter of the
stellar metallicity distribution. The output is an :py:class:`Hdf5Handle` that contains
galaxy rest-frame SEDs, produced over the stellar population template wavelength
grid.

The :py:class:`Creator` class of dsps uses the output rest-frame SEDs to compute
apparent and rest-frame AB magnitudes for a set of user-defined filters.
Rubin-LSST filters are present in the default filter suite. The magnitudes are
computed using the appropriate functions implemented in DSPS that, much like
the SED generation, can take advantage of multiprocessing capabilities.

.. autoclass:: rail.dsps.DSPSSingleSedModeler
    :noindex:

.. autoclass:: rail.dsps.DSPSPopulationSedModeler
    :noindex:

.. autoclass:: rail.dsps.DSPSPhotometryCreator
    :noindex:

-------------
PZFlow Engine
-------------

RAIL Package: https://github.com/LSSTDESC/rail_pzflow

``PZFlow`` is a generative model that simulates galaxy catalogs using normalizing
flows. Normalizing flows learn differentiable mappings between complex data
distributions and a simple latent distribution, for example, a Normal
distribution, hence the name *normalizing* flow.  In the creation module, a
normalizing flow is trained to map the distribution of galaxy colors and
redshifts onto a simple latent distribution.  New galaxy catalogs can then be
simulated by sampling from the latent distribution and applying the inverse flow
to the samples.  In addition, because the samples are generated by sampling from
a distribution we have direct access to, there is a natural notion of a *true*
redshift distribution for each galaxy in the catalog.  For more information, see
Crenshaw et al. 2024. Note that ``PZFlow`` is also used to perform photo-z
estimation.

.. autoclass:: rail.pzflow.FlowModeler
    :noindex:

.. autoclass:: rail.pzflow.FlowPosterior
    :noindex:

=========
Degraders 
=========

.. format and check

Each engine produces a catalog from some input information, but turning the
truth catalog into realistically imperfect observations necessitates additional
steps in a forward model. A degrader may be a subclass of either
:py:class:`rail.creation.noisifier` (later referred to as noisifier) or
:py:class:`rail.creation.selector` (later referred to as selector), the first of
which modifies data in place and the second of which removes rows from a
catalog. The only exception is the blending degrader, which changes both. We
provide several survey-specific shortcuts to mimic the selection functions of
precursor data sets. Specifically, the noisifier superclass imposes
realistically complex noise and bias to the (𝑧, photometry) columns, and the
selector superclass introduces biased selection on the sample to mimic, e.g., an
incomplete spectroscopic training sample.

----------------
LSST Error Model
----------------

The ``LSSTErrorModel`` is a wrapper of the ``PhotErr`` photometric error model (Crenshaw et
al. 2024). ``PhotErr`` is a generalization of the error model described in Ivezić et al.
(2019) that includes multiple methods for modeling photometric errors, non-detections,
and extended source errors. In addition to photometric error model for LSST, we also
include models for Euclid (Euclid Collaboration et al. 2022) and Nancy Grace Roman
(Spergel et al. 2015) space telescopes. The magnitude errors are estimated based on the
input galaxy properties and the survey conditions, such as 5𝜎 depth and seeing, and
each galaxy has noise added to its magnitude according to a Gaussian distribution with
mean zero and standard deviation equal to its magnitude error. For more information, see
Appendix B of Crenshaw et al. (2024).

.. autoclass:: rail.astro_tools.PhotoErrorModel
   :noindex:

.. autoclass:: rail.astro_tools.LSSTErrorModel
   :noindex:

.. autoclass:: rail.astro_tools.RomanErrorModel
   :noindex:

.. autoclass:: rail.astro_tools.RomanWideErrorModel
   :noindex:

.. autoclass:: rail.astro_tools.RomanMediumErrorModel
   :noindex:

.. autoclass:: rail.astro_tools.RomanDeepErrorModel
   :noindex:

.. autoclass:: rail.astro_tools.RomanUltraDeepErrorModel
   :noindex:

.. autoclass:: rail.astro_tools.EuclidErrorModel
   :noindex:

.. autoclass:: rail.astro_tools.EuclidWideErrorModel
   :noindex:

.. autoclass:: rail.astro_tools.EuclidDeepErrorModel
   :noindex:

----------------------------
Observing Condition Degrader
----------------------------

This degrader produces observed magnitude and magnitude errors for the truth sample,
based on the input survey condition maps (Hang et al. 2024). The user provides a series
of survey condition maps in ``HEALPix`` (Górski et al. 2005) format with specified
𝑁side, e.g. the 5𝜎 depth in each band. The galaxies in the truth sample will be
assigned survey conditions corresponding to their ``HEALPix`` pixel, either based on
their coordinates in the original catalog, or randomly if only photometry is available
(e.g., generated from the engines). In the latter case, a weight map can be specified to
adjust the number of galaxies assigned to each pixel. A key input for
``ObservingConditionDegrader`` is ``map_dict``. This is a dictionary containing keys
with the same names as parameters for ``LSSTErrorModel``. Under each key, one can pass a
series of paths for the survey condition maps for each band, or, if any quantity is held
constant throughout the footprint, one can also pass a float number. The degrader then
calls ``PhotErr`` to compute noisy magnitudes for each galaxy in each ``HEALPix`` pixel. The
output of this module is a table containing degraded magnitudes, magnitude errors, RA,
Dec, and the ``HEALPix`` pixel index of each galaxy.

.. autoclass:: rail.astro_tools.ObsCondition
   :noindex:

-----------------------
Spectroscopic Degraders
-----------------------

``SpectroscopicDegraders`` contains two simple degraders that simulate systematic errors
associated with the presence of spectroscopic redshifts in spectroscopic training
catalogs.

The first is ``InvRedshiftIncompleteness``. It is a toy model for redshift
incompleteness -- i.e., the failure of a particular spectrograph to obtain a redshift
estimate for a particular set of galaxies. It takes an input catalog and keeps all the
galaxies below a configurable redshift threshold while randomly removing galaxies above
it. The probability that a redshift :math:`z` galaxy is kept is:

.. math:: p(z) - \mathrm{min}\left( 1, \frac{z_\mathrm{th}}{z} \right),

where :math:`z_\mathrm{th}` is the threshold redshift.

The other degrader is ``LineConfusion``, which simulates redshift errors due to the
confusion of emission lines. For example, if the OII line at :math:`3727~\mathring{\mathrm{A}}` was
misidentified as the OIII line at :math:`5007~\mathring{\mathrm{A}}`, the assigned spectroscopic redshift
would be greater than the true redshift (Newman et al. 2013). The inputs of this
degrader are a `true` and `wrong` redshift, and an error rate. The degrader then
randomly simulates line confusion, ignoring galaxies for which the misidentification
would result in a negative redshift (which can occur when the wrong wavelength is
shorter than the true wavelength).

.. autoclass:: rail.astro_tools.LineConfusion
   :noindex:

.. autoclass:: rail.astro_tools.InvRedshiftIncompleteness
   :noindex:

-----------
QuantityCut
-----------

This degrader provides a trimmed version of the input catalog based on selection cuts
applied to the catalog quantities. The user provides the parameter cuts, which is a
dictionary with keys being the columns to which the selection is to be applied (e.g.,
the 𝑖-band magnitude), and the values being the specific cuts. Two types of values can
be provided: a single float number (e.g., 25.3), which is interpreted as a maximum value
(i.e., the cut will remove samples with 𝑖 > 25.3), and a tuple (e.g., (17, 25.3)),
which is interpreted as a range within which the sample is selected (i.e., the selected
sample has 17 < 𝑖 < 25.3). When multiple cuts are applied at the same time, only the
intersection of selected samples of each cut will be kept in the output.

.. autoclass:: rail.creation.degraders.quantityCut.QuantityCut
   :noindex:

-----------------------
Spectroscopic Selectors
-----------------------

The ``SpectroscopicSelection`` degrader applies the selection for a spectroscopic
survey. It provides tailored catalogs that match a particular spectroscopic survey for
subsequent calibration steps. It can also be used to generate selected mock catalogs
used as realistic reference samples. The selection criteria are cuts on magnitudes or
colors adopted for the associated spectroscopic survey targeting. The current available
selectors are for VVDSf02 (Le Fèvre et al. 2005), zCOSMOS (Lilly et al. 2009), GAMA
(Driver et al. 2011), BOSS (Dawson et al. 2013), and DEEP2 (Newman et al. 2013).
SpectroscopicSelection requires a 2-dimensional spectroscopic redshift success rate as
a function of two variables (often two of magnitude, color, or redshift), specific to
the redshift survey for which selection is being emulated. The degrader will draw the
appropriate fraction of samples from the input data and return an incomplete sample.
Additional redshift cuts based on percentile can be applied when using a color-based
redshift cut.

Similar functionality is provided by ``GridSelection`` (Moskowitz et al. 2024), which
can be used to model spectroscopic success rates for the training sets used for the
second data release of the Hyper Suprime Cam Subaru Strategic Program (HSC; Aihara et
al. 2019). Given a 2-dimensional grid of spectroscopic success ratio as a function of
two variables (often magnitude or color), the degrader will draw the appropriate
fraction of samples from the input data and return incomplete sample. Additional
redshift cuts can also be applied, where all redshifts above the cutoff are removed. In
addition to the default HSC grid, RAIL accepts user-defined setting files for the
success ratio grids appropriate for other surveys.

.. autoclass:: rail.astro_tools.SpecSelection
   :noindex:

.. autoclass:: rail.astro_tools.SpecSelection_GAMA
   :noindex:

.. autoclass:: rail.astro_tools.SpecSelection_BOSS
   :noindex:

.. autoclass:: rail.astro_tools.SpecSelection_DEEP2
   :noindex:

.. autoclass:: rail.astro_tools.SpecSelection_VVDSf02
   :noindex:

.. autoclass:: rail.astro_tools.SpecSelection_zCOSMOS
   :noindex:

.. autoclass:: rail.astro_tools.SpecSelection_HSC
   :noindex:

. .autoclass: rail.astro_tools.GridSelection

---------------
SOMSpecSelector
---------------

While ``GridSelection`` defines a selection mask in two dimensions, ``SOMSpecSelector``
can take any number of input features with which to define a spectroscopic selection.
This selector takes an initial complete sample (which we will call the input sample) and
return a subset that approximately matches the properties of an incomplete sample (we
will refer to this as the specz sample). The selector operates by taking the list of
features (which must be present in both the input and specz samples) and constructs a
self-organizing map (SOM; Kohonen 1982) from the input data, creating a mapping from the
higher-dimensional feature set to the 2D grid of SOM cells. It then finds the best cell
assignment for each galaxy in both the input and specz samples. The selector builds a
mask as it iterates over all cells, and for each cell returns a random subset of input
objects that lie in that cell that equal in number to specz objects in the cell. If the
cell has more specz objects than are available in the input catalog, then it returns all
that are available. By matching the number of objects cell by cell the selector
naturally mimics the features of the specz sample.

.. autoclass:: rail.creation.degraders.specz_som.SOMSpecSelector
   :noindex:

-----------------
Blending Degrader
-----------------

This degrader creates mock unrecognized blends based on source density. Unrecognized
blends are sources overlapping too closely in projection and are detected as one object
(referred to as `ambiguous blends` in Dawson et al. 2016). This degrader first searches
for close objects that are likely to become unrecognized blends, then merges their
fluxes to create one blended object. The source IDs of blend components are saved for
references.

The blending components are found by matching the RA and Dec coordinates of an input
catalog with itself via a Friends-of-Friends (FoF) algorithm (Mao et al. 2021). The
advantage of the FoF algorithm is that it can produce unrecognized blends from multiple
sources rather than just pairs. The algorithm groups sources such that within each
group, every source is separated from at lease one another group member by an angular
distance less than a specified `linking length`. By setting a small enough linking
length (e.g., 1 arcsec), we assume that all group members will be blended into one
detection. In the future, we might implement options for a more sophisticated
identification of blends using source sizes and shapes. In the current release, this
degrader simply sums up fluxes over all group members to create one blended object per
group. Note that we do not currently simulate the impact on aperture photometry due to
irregular profiles of blends either, but are motivated to conduct such a study in the
future.

Note that the truth redshifts of blended objects are ambiguous since they are composed
of multiple objects. We provide several summary columns for the truth: ``z_brightest``
is the redshift of the brightest component; ``z_mean`` is the average redshift of all
components; and ``z_weighted`` is the flux-weighted average redshift. For blended
objects composed of more than (including) two components, the standard deviation of
redshifts is provided. The decision on the truth redshift is left to the users. For more
complicated truth estimation -- e.g., considering the colors of components, as bluer
galaxies tend to have strong emission lines which are often used to infer redshifts from
spectroscopy -- users have the option to trace the components with source IDs. The
tutorial ``blending_degrader_demo`` illustrates how to match the output catalog with the
source IDs and the input catalog to access more information.

The order of application is particularly important for this degrader. Generally, this
degrader should be applied before any selections on the truth catalog, including any
magnitude, color, or signal-to-noise ratio cuts. The reason is that bright sources can
blend with fainter ones, and two faint sources might blend into a brighter object that
enters the target depth selection. For example, a magnitude difference of :math:`\sim2.5`
translates roughly into a flux contamination of 10%. However, applying this degrader to
the original truth catalog without any cuts can be a computational burden, because the
truth catalog is often much larger than the target-depth catalog. To mitigate this
issue, one can use a magnitude cut to decrease the target depth by {an arbitrary
threshold (e.g., 2 or 3 magnitudes)} before running this degrader.

While preliminary studies have addressed some aspects of blending on photo-z  (e.g.,
Nourbakhsh et al. 2022), a thorough quantitative exploration of this topic will be
important to develop a deeper understanding of the issue and its impacts on various
science cases.

.. autoclass:: rail.creation.degraders.unrec_bl_model.UnrecBlModel
   :noindex: