Clustering redshifts with yet_another_wizz

This notebooks summarises the steps to compute clustering redshifts for an unknown sample of galaxies using a reference sample with known redshifts. Additionally, a correction for the galaxy bias of the reference sample is applied (see Eqs. 17 & 20 in van den Busch et al. 2020).

This involves a number of steps (see schema below): 1. Preparing the input data (creating randoms, applying masks; simplfied here). 2. Splitting the data into spatial patches and cache them on disk for faster access. 3. Computing the autocorrelation function amplitude \(w_{\rm ss}(z)\), used as correction for the galaxy bias 4. Computing the cross-correlation function amplitude \(w_{\rm sp}(z)\), which is the biased redshift estimate. 5. Summarising the result by correcting for the refernece sample bias and producing a redshift estimate (not a PDF!).

Note: The cached data must be removed manually since its lifetime can currently not be handled by RAIL.

The aim of this notebook is to give an overview of the wrapper functionality, including a summary of all currently implemented optional parameters (commented). It is not meant to be a demonstaration of the performance of yet_another_wizz since the example data used here is very small and the resulting signal-to-noise ratio is quite poor.

Note: If you’re interested in running this in pipeline mode, see 10_YAW.ipynb in the pipeline_examples/estimation_examples/ folder.

import shutil
from pathlib import Path

import matplotlib.pyplot as plt
from rail.core.data import Hdf5Handle
from rail.estimation.algos.cc_yaw import create_yaw_cache_alias
from rail.yaw_rail.utils import get_dc2_test_data
from yaw.randoms import BoxRandoms

from rail import interactive as ri

DOWNLOADS_DIR = Path("../examples_data")
DOWNLOADS_DIR.mkdir(exist_ok=True)
CACHE_DIR = DOWNLOADS_DIR / "yaw_cache"
CACHE_DIR.mkdir(exist_ok=True)

Install FSPS with the following commands:
pip uninstall fsps
git clone --recursive https://github.com/dfm/python-fsps.git
cd python-fsps
python -m pip install .
export SPS_HOME=$(pwd)/src/fsps/libfsps

LEPHAREDIR is being set to the default cache directory:
/home/runner/.cache/lephare/data
More than 1Gb may be written there.
LEPHAREWORK is being set to the default cache directory:
/home/runner/.cache/lephare/work
Default work cache is already linked.
This is linked to the run directory:
/home/runner/.cache/lephare/runs/20260504T123336

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 758, in start
    self.io_loop.start()
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 211, in start
    self.asyncio_loop.run_forever()
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/ipykernel/utils.py", line 71, in preserve_context
    return await f(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 621, in shell_main
    await self.dispatch_shell(msg, subshell_id=subshell_id)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 478, in dispatch_shell
    await result
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 372, in execute_request
    await super().execute_request(stream, ident, parent)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 834, in execute_request
    reply_content = await reply_content
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 464, in do_execute
    res = shell.run_cell(
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 663, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3077, in run_cell
    result = self._run_cell(
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3132, in _run_cell
    result = runner(coro)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 128, in _pseudo_sync_runner
    coro.send(None)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3336, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3519, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3579, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_5366/2253991977.py", line 10, in <module>
    from rail import interactive as ri
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/rail/interactive/__init__.py", line 3, in <module>
    from . import calib, creation, estimation, evaluation, tools
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/rail/interactive/calib/__init__.py", line 3, in <module>
    from rail.utils.interactive.initialize_utils import _initialize_interactive_module
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/rail/utils/interactive/initialize_utils.py", line 17, in <module>
    from rail.utils.interactive.base_utils import (
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/rail/utils/interactive/base_utils.py", line 10, in <module>
    rail.stages.import_and_attach_all(silent=True)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/rail/stages/__init__.py", line 74, in import_and_attach_all
    RailEnv.import_all_packages(silent=silent)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/rail/core/introspection.py", line 541, in import_all_packages
    _imported_module = importlib.import_module(pkg)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/rail/som/__init__.py", line 1, in <module>
    from rail.creation.degraders.specz_som import *
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/rail/creation/degraders/specz_som.py", line 15, in <module>
    from somoclu import Somoclu
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/somoclu/__init__.py", line 11, in <module>
    from .train import Somoclu
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/somoclu/train.py", line 25, in <module>
    from .somoclu_wrap import train as wrap_train
  File "/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/somoclu/somoclu_wrap.py", line 11, in <module>
    import _somoclu_wrap

---------------------------------------------------------------------------

ImportError                               Traceback (most recent call last)

File /opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/numpy/core/_multiarray_umath.py:44, in __getattr__(attr_name)
     39     # Also print the message (with traceback).  This is because old versions
     40     # of NumPy unfortunately set up the import to replace (and hide) the
     41     # error.  The traceback shouldn't be needed, but e.g. pytest plugins
     42     # seem to swallow it and we should be failing anyway...
     43     sys.stderr.write(msg + tb_msg)
---> 44     raise ImportError(msg)
     46 ret = getattr(_multiarray_umath, attr_name, None)
     47 if ret is None:


ImportError:
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Warning: the binary library cannot be imported. You cannot train maps, but you can load and analyze ones that you have already saved.
The problem occurs because either compilation failed when you installed Somoclu or a path is missing from the dependencies when you are trying to import it. Please refer to the documentation to see your options.

VERBOSE = "debug"  # verbosity level of built-in logger, disable with "error"

1. Data preparation

Since this is a simple example, we are not going into the details of creating realistic randoms and properly masking the reference and unknown data to a shared footprint on sky. Instead, we are using a simulated dataset that serves as both, reference and unknown sample.

First, we download the small test dataset derived from 25 sqdeg of DC2, containing 100k objects on a limited redshift range of \(0.2 < z < 1.8\). We add the data as new handle to the datastore.

test_data = get_dc2_test_data()  # downloads test data, cached for future calls
redshifts = test_data["z"].to_numpy()
zmin = redshifts.min()
zmax = redshifts.max()
n_data = len(test_data)
print(f"N={n_data}, {zmin:.1f}<z<{zmax:.1f}")
handle_test_data = Hdf5Handle("input_data", test_data)

N=100000, 0.2<z<1.8

Next we generate a x10 enhanced uniform random dataset for the test data constrained to its rectangular footprint. We add redshifts by cloning the redshift column "z" of the dataset.

generator = BoxRandoms(
    test_data["ra"].min(),
    test_data["ra"].max(),
    test_data["dec"].min(),
    test_data["dec"].max(),
    redshifts=redshifts,
    seed=12345,
)
test_rand = generator.generate_dataframe(n_data * 10)
test_rand.rename(columns=dict(redshifts="z"), inplace=True)
handle_test_rand = Hdf5Handle("input_rand", test_rand)

2. Splitting and caching the data

This step is crucial to compute consistent clustering redshift uncertainties. yet_another_wizz uses spatial (jackknife) resampling and therefore, every input dataset must be split into the same exact spatial regions/patches. To improve the parallel performance, the datasets and randoms are pre-arranged into these patches and cached on disk for better random patch-wise access. While this is slow for small datasets, it is highly beneficial for large datasets with many patches and/or memory constraints.

The RAIL wrapper uses manually specified cache directories, each of which contains one dataset and optionally corresponding randoms. This ensures that the patch centers are defined consistently. To create a new cache, use the YawCacheCreate.create() method.

Note on names and aliasing in RAIL

We need to create separate caches for the reference and the unknown data, which means that we need to run the YawCacheCreate twice. Since that creates name clashes in the RAIL datastore, we need to properly alias the inputs (data/ rand) and the output (cache) by providing a dictionary for the aliases parameter when calling the make_stage(), e.g. by adding a unique suffix:

name = "stage_name"
aliases = dict(data="data_suffix", rand="rand_suffix", cache="cache_suffix")

There is a shorthand for convenience (from rail.yaw_rail.cache.AliasHelper) that allows to generate this dictionary by just providing a suffix name for the stage instance (see example below).

name = "stage_name"
aliases = create_yaw_cache_alias("suffix")

The reference data

To create a cache directory we must specify a path to the directory at which the data will be cached. This directory must not exist yet. We also have to specify a name for the stage to ensure that the reference and unknown caches (see below) are properly aliased to be distinguishable by the RAIL datastore.

Furthermore, a few basic column names that describe the tabular input data must be provided. These are right ascension (ra_name) and declination (dec_name), and in case of the reference sample also the redshifts (redshift_name). Finally, the patches must be defined and there are three ways to do so: 1. Stage parameter patch_file: Read the patch center coordinates from an ASCII file with pairs of R.A/Dec. coordinates in radian. 2. Stage parameter patch_num: Generating a given number of patches from the object positions (peferrably of the randoms if possible) using k-means clustering. 3. Stage parameter patch_name: Providing a column name in the input table which contains patch indices (using 0-based indexing). 4. Stage input patch_source: Using the patch centers from a different cache instance, given by a cache handle. When this input is provided it takes precedence over any of the stage parameters above.

In this example we choose to auto-generate five patches. In a more realistic setup this number should be much larger.

stage_cache_ref = ri.estimation.algos.cc_yaw.yaw_cache_create(
    data=handle_test_data,
    path=CACHE_DIR / "ref",
    rand=handle_test_rand,
    aliases=create_yaw_cache_alias("ref"),
    overwrite=True,  # default: False
    ra_name="ra",
    dec_name="dec",
    redshift_name="z",
    patch_source="none",
    # weight_name=None,
    # patch_name=None,
    patch_num=5,  # default: None
    # max_workers=None,
    verbose=VERBOSE,  # default: "info"
)

Inserting handle into data store.  data_ref: <class 'rail.core.data.Hdf5Handle'> None, (d), YawCacheCreate
Inserting handle into data store.  rand_ref: <class 'rail.core.data.Hdf5Handle'> None, (d), YawCacheCreate
Inserting handle into data store.  patch_source_ref: none, YawCacheCreate
YAW | yet_another_wizz v3.1.2

INF | running in multiprocessing environment with 2 workers

INF | loading 1M records in 1 chunks from memory

DBG | selecting input columns: ra, dec, z

DBG | creating 5 patches

INF | computing 5 patch centers from subset of 224K records

DBG | running preprocessing on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | using cache directory: ../examples_data/yaw_cache/ref/rand

INF | computing patch metadata

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | loading 100K records in 1 chunks from memory

DBG | selecting input columns: ra, dec, z

DBG | applying 5 patches

DBG | running preprocessing on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | using cache directory: ../examples_data/yaw_cache/ref/data

INF | computing patch metadata

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

Inserting handle into data store.  output_ref: inprogress_output_ref.path, YawCacheCreate

We can see from the log messages that yet_another_wizz processes the randoms first and generates patch centers (creating 5 patches) and then applies them to the dataset, which is processed last (applying 5 patches). Caching the data can take considerable time depending on the hardware and the number of patches.

The unknown data

The same procedure for the unknown sample, however there are some small, but important differences. We use a different path and name, do not specify the redshift_name (since we would not have this information with real data), and here we chose to not provide any randoms for the unknown sample and instead rely on the reference sample randoms for cross-correlation measurements.

Most importantly, we must ensure that the patch centers are consistent with the reference data and therefore provide the reference sample cache as a stage input called patch_source.

Important: Even if the reference and unknown data are the same as in this specific case, the automatically generated patch centers are not deterministic. We can see in the log messages that the code reports applying 5 patches.

handle_cache_unk = ri.estimation.algos.cc_yaw.yaw_cache_create(
    data=handle_test_data,
    path=CACHE_DIR / "unk",
    rand="none",
    patch_source=stage_cache_ref["output"],
    name="cache_unk",
    aliases=create_yaw_cache_alias("unk"),
    overwrite=True,  # default: False
    ra_name="ra",
    dec_name="dec",
    # redshift_name=None,
    # weight_name=None,
    # patch_name=None,
    # patch_num=None,
    # max_workers=None,
    verbose=VERBOSE,  # default: "info"
)["output"]

Inserting handle into data store.  data_unk: <class 'rail.core.data.Hdf5Handle'> None, (d), cache_unk
Inserting handle into data store.  rand_unk: none, cache_unk
Inserting handle into data store.  patch_source_unk: YawCache(path='../examples_data/yaw_cache/ref'), cache_unk
YAW | yet_another_wizz v3.1.2

INF | running in multiprocessing environment with 2 workers

INF | loading 100K records in 1 chunks from memory

DBG | selecting input columns: ra, dec

DBG | applying 5 patches

DBG | running preprocessing on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | using cache directory: ../examples_data/yaw_cache/unk/data

INF | computing patch metadata

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

Inserting handle into data store.  output_cache_unk: inprogress_output_cache_unk.path, cache_unk

3. Computing the autocorrelation / bias correction

The bias correction is computed from the amplitude of the angular autocorrelation function of the reference sample. The measurement parameters are the same as for the cross-correlation amplitude measurement, so we can define all configuration parameters once in a dictionary.

As a first step, we need to decide on which redshift bins/sampling we want to compute the clustering redshifts. Here we choose the redshift limits of the reference data (zmin/zmax) and, since the sample is small, only 8 bins (zbin_num) spaced linearly in redshift (default method="linear"). Finally, we have to define the physical scales in kpc (rmin/rmax, converted to angular separation at each redshift) on which we measure the correlation amplitudes.

Optional parameters: Bins edges can alternatively specifed manually through zbins. To apply scale dependent weights, e.g. \(w \propto r^{-1}\), specify the power-law exponent asrweight=-1. The parameter resolution specifies the radial resolution (logarithmic) of the weights.

corr_config = dict(
    rmin=100,  # in kpc
    rmax=1000,  # in kpc
    # rweight=None,
    # resolution=50,
    zmin=zmin,
    zmax=zmax,
    num_bins=8,  # default: 30
    # method="linear",
    # edges=np.linspace(zmin, zmax, zbin_num+1)
    # closed="right",
    # max_workers=None,
    verbose=VERBOSE,  # default: "info"
)

We then measure the autocorrelation using the YawAutoCorrelate.correlate() method, which takes a single parameter, the cache (handle) of the reference dataset.

result_auto_corr = ri.estimation.algos.cc_yaw.yaw_auto_correlate(
    sample=stage_cache_ref["output"], **corr_config
)

Inserting handle into data store.  sample: YawCache(path='../examples_data/yaw_cache/ref'), YawAutoCorrelate
YAW | yet_another_wizz v3.1.2

INF | running in multiprocessing environment with 2 workers

INF | building data trees

DBG | building patch-wise trees (using 8 bins)

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | building random trees

DBG | building patch-wise trees (using 8 bins)

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | computing auto-correlation from DD, DR, RR

DBG | computing patch linkage with max. separation of 1.42e-03 rad

DBG | created patch linkage with 19 patch pairs

DBG | using 1 scales without weighting

INF | counting DD from patch pairs

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | counting DR from patch pairs

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | counting RR from patch pairs

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

Inserting handle into data store.  output: inprogress_output.hdf5, YawAutoCorrelate

As the code is progressing, we can observe the log messages of yet_another_wizz which indicate the performed steps: getting the cached data, generating the job list of patches to correlate, and counting pairs. Finally, the pair counts are stored as custom data handle in the datastore.

We can interact with the returned pair counts (yaw.CorrFunc, documentation) manually if we want to investigate the results:

counts_auto = result_auto_corr["output"]  # extract payload from handle
counts_auto.dd

NormalisedCounts(auto=True, binning=8 bins @ (0.200...1.800], num_patches=5)

4. Computing the cross-correlation / redshift estimate

The cross-correlation amplitude, which is the biased estimate of the unknown redshift distribution, is computed similarly to the autocorrelation above. We measure the correlation using the YawCrossCorrelate.correlate() method, which takes two parameters, the cache (handles) of the reference and the unknown data.

result_cross_corr = ri.estimation.algos.cc_yaw.yaw_cross_correlate(
    reference=stage_cache_ref["output"], unknown=handle_cache_unk, **corr_config
)

Inserting handle into data store.  reference: YawCache(path='../examples_data/yaw_cache/ref'), YawCrossCorrelate
Inserting handle into data store.  unknown: YawCache(path='../examples_data/yaw_cache/unk'), YawCrossCorrelate
YAW | yet_another_wizz v3.1.2

INF | running in multiprocessing environment with 2 workers

INF | building reference data trees

DBG | building patch-wise trees (using 8 bins)

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | building reference random trees

DBG | building patch-wise trees (using 8 bins)

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | building unknown data trees

DBG | building patch-wise trees (unbinned)

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | computing cross-correlation from DD, RD

DBG | computing patch linkage with max. separation of 1.42e-03 rad

DBG | created patch linkage with 19 patch pairs

DBG | using 1 scales without weighting

INF | counting DD from patch pairs

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

INF | counting RD from patch pairs

DBG | running parallel jobs on 2 workers

WRN | os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.

Inserting handle into data store.  output: inprogress_output.hdf5, YawCrossCorrelate

As before, we can see the actions performed by yet_another_wizz. The main difference for the cross-correlation function is that the second sample (the unknown data/randoms) are not binned by redshift when counting pairs.

As for the autocorrelation, we can interact with the result, e.g. by evaluating the correlation estimator manually and getting the cross-correlation amplitude per redshift bin.

counts_cross = result_cross_corr["output"]  # extract payload from handle
corrfunc = counts_cross.sample()  # evaluate the correlation estimator
corrfunc.data

DBG | sampling correlation function with estimator 'DP'

array([0.0023559 , 0.00403239, 0.00684468, 0.01126758, 0.00945143,
       0.00898257, 0.00882812, 0.01273689])

5. Computing the redshift estimate

The final analysis step is combining the two measured correlation amplitudes to get a redshift estimate which is corrected for the reference sample bias. This estimate is not a PDF. Converting the result to a proper PDF (without negative values) is non-trivial and requires further modelling stages that are currently not part of this wrapper.

We use YawSummarize.summarize() method, which takes the pair count handles of the cross- and autocorrelation functions as input. In principle, the autocorrelation of the unknown sample could be specified to fully correct for galaxy bias, however this is not possible in practice since the exact redshifts of the unknown objects are not known.

result_summarize = ri.estimation.algos.cc_yaw.yaw_summarize(
    cross_corr=result_cross_corr["output"],
    auto_corr_ref=result_auto_corr["output"],  # default: None
    auto_corr_unk="none",
    verbose=VERBOSE,  # default: "info"
)

Inserting handle into data store.  cross_corr: CorrFunc(counts=dd|rd, auto=False, binning=8 bins @ (0.200...1.800], num_patches=5), YawSummarize
Inserting handle into data store.  auto_corr_ref: CorrFunc(counts=dd|dr|rr, auto=True, binning=8 bins @ (0.200...1.800], num_patches=5), YawSummarize
Inserting handle into data store.  auto_corr_unk: none, YawSummarize
YAW | yet_another_wizz v3.1.2

INF | running in multiprocessing environment with 2 workers

DBG | sampling correlation function with estimator 'DP'

DBG | sampling correlation function with estimator 'LS'

DBG | computing clustering redshifts from correlation function samples

DBG | mitigating reference sample bias

Inserting handle into data store.  output: inprogress_output.pkl, YawSummarize

The stage produces a single output which contains the redshift estimate with uncertainties, jackknife samples of the estimate, and a covariance matrix. These data products are wrapped as yaw.RedshiftData documentation which gets stored as pickle file when running a ceci pipeline. Some examples on how to use this data is shown below.

Remove caches

The cached datasets are not automatically removed, since the algorithm does not know when they are no longer needed. Additionally, the reference data could be resued for future runs, e.g. for different tomographic bins.

Since that is not the case here, we just delete the cached data with a built-in method.

stage_cache_ref["output"].data.drop()
handle_cache_unk.data.drop()

Inspect results

Below are some examples on how to access the redshift binning, estimate, estimte error, samples and covariance matrix produced by yet_another_wizz.

ncc = result_summarize["output"]
ncc.data / ncc.error  # n redshift slices

array([0.99751831, 2.37893624, 2.14804959, 7.20831464, 2.18093417,
       5.07480432, 2.19060571, 2.21372888])

# true n(z)
zbins = result_cross_corr["output"].binning.edges
plt.hist(test_data["z"], zbins, density=True, color="0.8", label="true n(z)")

# fiducial n(z)
normalised = ncc.normalised()  # copy of data with n(z) is normalised to unity
ax = normalised.plot(label="YAW estimate")

# jackknife samples
normalised.samples.shape  # m jackknife-samples x n redshift slices
z = normalised.binning.mids
plt.plot(z, normalised.samples.T, color="k", alpha=0.2)
# create a dummy for the legend
plt.plot([], [], color="k", alpha=0.2, label="jackknife samples")
ax.legend()

DBG | normalising RedshiftData

<matplotlib.legend.Legend at 0x7ff2ae8479d0>

ncc.covariance.shape  # n x n redshift slices
ncc.plot_corr()

<Axes: >

Clean up Files

shutil.rmtree(CACHE_DIR)