Demo: RAIL Evaluation
Authors: Sam Schmidt, Alex Malz, Julia Gschwend, others…
Last run successfully: June 16, 2023 (with qp-prob >= 0.8.2)
The purpose of this notebook is to demonstrate the application of the metrics scripts to be used on the photo-z PDF catalogs produced by the PZ working group. The first implementation of the evaluation module is based on the refactoring of the code used in Schmidt et al. 2020, available on Github repository PZDC1paper.
To run this notebook, you must install qp and have the notebook in the
same directory as utils.py (available in RAIL’s examples
directrory). You must also have installed all RAIL dependencies,
particularly for the estimation codes that you want to run, as well as
ceci, qp, tables_io, etc… See the RAIL installation instructions for
more info.
Note: If you’re planning to run this in a notebook, you may want to
use interactive mode instead. See
Single_Evaluation.ipynb
in the interactive_examples/evaluation_examples/ folder for a
version of this notebook in interactive mode.
import rail
from rail.evaluation.metrics.cdeloss import *
from rail.evaluation.evaluator import OldEvaluator
from rail.core.data import QPHandle, TableHandle, Hdf5Handle
from rail.core.stage import RailStage
from utils import plot_pit_qq, ks_plot
import qp
import tables_io
import os
%matplotlib inline
%reload_ext autoreload
%autoreload 2
Data
To compute the photo-z metrics of a given test sample, it is necessary
to read the output of a photo-z code containing galaxies’ photo-z PDFs.
Let’s use the toy data available in tests/data/
(test_dc2_training_9816.hdf5 and test_dc2_validation_9816.hdf5)
to generate a small sample of photo-z PDFs using the FlexZBoost
algorithm available on RAIL’s estimation module.
Photo-z Results
Run FlexZBoost
If you have run the notebook RAIL_estimation_demo.ipynb, this will
produce a file output_fzboost.hdf5, writen at the location:
<your_path>/RAIL/examples/estimation_examples/output_fzboost.hdf5.
Otherwise we can download the file from NERSC
Next we need to set up some paths for the Data Store:
from rail.utils.path_utils import find_rail_file
try:
pdfs_file = find_rail_file('examples_data/evaluation_data/data/output_fzboost.hdf5')
except ValueError:
pdfs_file = 'examples_data/evaluation_data/data/output_fzboost.hdf5'
try:
os.makedirs(os.path.dirname(pdfs_file))
except FileExistsError:
pass
curl_com = f"curl -o {pdfs_file} https://portal.nersc.gov/cfs/lsst/PZ/output_fzboost.hdf5"
os.system(curl_com)
ztrue_file = find_rail_file('examples_data/testdata/test_dc2_validation_9816.hdf5')
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
53 47.1M 53 25.0M 0 0 23.0M 0 0:00:02 0:00:01 0:00:01 23.0M
100 47.1M 100 47.1M 0 0 33.2M 0 0:00:01 0:00:01 –:–:– 33.2M
Read the data in, note that the fzdata is a qp Ensemble, and thus we
should read it in as a QPHandle type file, while the ztrue_data is
tabular data, and should be read in as a Tablehandle when adding to
the data store
fzdata = qp.read(pdfs_file)
ztrue_data = tables_io.read(ztrue_file)
ztrue = ztrue_data['photometry']['redshift']
zgrid = fzdata.metadata['xvals'].ravel()
photoz_mode = fzdata.mode(grid=zgrid)
truth = Hdf5Handle('truth', data=ztrue_data['photometry'])
ensemble = QPHandle('ensemble',data=fzdata)
Make an evaulator stage
Now let’s set up the Evaluator stage to compute our metrics for the FlexZBoost results
FZB_eval = OldEvaluator.make_stage(name='FZB_eval', truth=truth, output_mode="return")
FZB_results = FZB_eval.evaluate(ensemble(), truth)
Inserting handle into data store. input: None, FZB_eval
Inserting handle into data store. truth: <class 'rail.core.data.Hdf5Handle'> None, (d), FZB_eval
/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/qp/metrics/array_metrics.py:27: UserWarning: p-value floored: true value smaller than 0.001. Consider specifying method (e.g. method=stats.PermutationMethod().) return stats.anderson_ksamp([p_random_variables, q_random_variables], **kwargs)
Inserting handle into data store. output_FZB_eval: inprogress_output_FZB_eval.hdf5, FZB_eval
We can view the results as a pandas dataframe:
results_df= tables_io.convertObj(FZB_results(), tables_io.types.PD_DATAFRAME)
results_df
| PIT_AD_stat | PIT_AD_pval | PIT_AD_significance_level | PIT_CvM_stat | PIT_CvM_pval | PIT_KS_stat | PIT_KS_pval | PIT_OutRate_stat | POINT_SimgaIQR | POINT_Bias | POINT_OutlierRate | POINT_SigmaMAD | CDE_stat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84.956236 | NaN | 0.001 | 9.623352 | NaN | 0.03359 | NaN | 0.058738 | 0.020859 | 0.00027 | 0.106167 | 0.020891 | -6.74027 |
So, there we have it, a way to generate all of our summary statistics
for FZBoost. And note also that the results file has been written out to
output_FZB_eval.hdf5, the name we specified when we ran
make_stage (with output_ prepended).
As an alternative, and to allow for a little more explanation for each individual metric, we can calculate the metrics using functions from the evaluation class separate from the stage infrastructure. Here are some examples below.
CDF-based Metrics
PIT
The Probability Integral Transform (PIT), is the Cumulative Distribution Function (CDF) of the photo-z PDF
evaluated at the galaxy’s true redshift for every galaxy \(i\) in the catalog.
from qp.metrics.pit import PIT
pitobj = PIT(fzdata, ztrue)
quant_ens = pitobj.pit
metamets = pitobj.calculate_pit_meta_metrics()
/opt/hostedtoolcache/Python/3.10.20/x64/lib/python3.10/site-packages/qp/metrics/array_metrics.py:27: UserWarning: p-value floored: true value smaller than 0.001. Consider specifying method (e.g. method=stats.PermutationMethod().) return stats.anderson_ksamp([p_random_variables, q_random_variables], **kwargs)
The evaluate method PIT class returns two objects, a quantile distribution based on the full set of PIT values (a frozen distribution object), and a dictionary of meta metrics associated to PIT (to be detailed below).
quant_ens
Ensemble(the_class=quant,shape=(1, 96))
metamets
{'ad': Anderson_ksampResult(statistic=np.float64(84.95623553609381), critical_values=array([0.325, 1.226, 1.961, 2.718, 3.752, 4.592, 6.546]), pvalue=np.float64(0.001)),
'cvm': CramerVonMisesResult(statistic=9.62335199605935, pvalue=9.265039846440004e-10),
'ks': KstestResult(statistic=np.float64(0.033590049370962216), pvalue=np.float64(1.7621068075751534e-20), statistic_location=np.float64(0.9921210288809627), statistic_sign=np.int8(-1)),
'outlier_rate': np.float64(0.05873797877466336)}
PIT values
pit_vals = np.array(pitobj.pit_samps)
pit_vals
array([0.19392947, 0.36675619, 0.52017547, ..., 1. , 0.93189232,
0.4674437 ], shape=(20449,))
PIT outlier rate
The PIT outlier rate is a global metric defined as the fraction of galaxies in the sample with extreme PIT values. The lower and upper limits for considering a PIT as outlier are optional parameters set at the Metrics instantiation (default values are: PIT \(<10^{-4}\) or PIT \(>0.9999\)).
pit_out_rate = metamets['outlier_rate']
print(f"PIT outlier rate of this sample: {pit_out_rate:.6f}")
pit_out_rate = pitobj.evaluate_PIT_outlier_rate()
print(f"PIT outlier rate of this sample: {pit_out_rate:.6f}")
PIT outlier rate of this sample: 0.058738
PIT outlier rate of this sample: 0.058738
PIT-QQ plot
The histogram of PIT values is a useful tool for a qualitative assessment of PDFs quality. It shows whether the PDFs are: * biased (tilted PIT histogram) * under-dispersed (excess counts close to the boudaries 0 and 1) * over-dispersed (lack of counts close the boudaries 0 and 1) * well-calibrated (flat histogram)
Following the standards in DC1 paper, the PIT histogram is accompanied by the quantile-quantile (QQ), which can be used to compare qualitatively the PIT distribution obtained with the PDFs agaist the ideal case (uniform distribution). The closer the QQ plot is to the diagonal, the better is the PDFs calibration.
pdfs = fzdata.objdata['yvals']
plot_pit_qq(pdfs, zgrid, ztrue, title="PIT-QQ - toy data", code="FlexZBoost",
pit_out_rate=pit_out_rate, savefig=False)
The black horizontal line represents the ideal case where the PIT histogram would behave as a uniform distribution U(0,1).
Summary statistics of CDF-based metrics
To evaluate globally the quality of PDFs estimates, rail.evaluation
provides a set of metrics to compare the empirical distributions of PIT
values with the reference uniform distribution, U(0,1).
Kolmogorov-Smirnov
Let’s start with the traditional Kolmogorov-Smirnov (KS) statistic test, which is the maximum difference between the empirical and the expected cumulative distributions of PIT values:
Where \(\hat{f}\) is the PIT distribution and \(\tilde{f}\) is
U(0,1). Therefore, the smaller value of KS the closer the PIT
distribution is to be uniform. The evaluate method of the PITKS
class returns a named tuple with the statistic and p-value.
ks_stat_and_pval = metamets['ks']
print(f"PIT KS stat and pval: {ks_stat_and_pval}")
ks_stat_and_pval = pitobj.evaluate_PIT_KS()
print(f"PIT KS stat and pval: {ks_stat_and_pval}")
PIT KS stat and pval: KstestResult(statistic=np.float64(0.033590049370962216), pvalue=np.float64(1.7621068075751534e-20), statistic_location=np.float64(0.9921210288809627), statistic_sign=np.int8(-1))
PIT KS stat and pval: KstestResult(statistic=np.float64(0.033590049370962216), pvalue=np.float64(1.7621068075751534e-20), statistic_location=np.float64(0.9921210288809627), statistic_sign=np.int8(-1))
Visual interpretation of the KS statistic:
ks_plot(pitobj)
print(f"KS metric of this sample: {ks_stat_and_pval.statistic:.4f}")
KS metric of this sample: 0.0336
Cramer-von Mises
Similarly, let’s calculate the Cramer-von Mises (CvM) test, a variant of the KS statistic defined as the mean-square difference between the CDFs of an empirical PDF and the true PDFs:
on the distribution of PIT values, which should be uniform if the PDFs are perfect.
cvm_stat_and_pval = metamets['cvm']
print(f"PIT CvM stat and pval: {cvm_stat_and_pval}")
cvm_stat_and_pval = pitobj.evaluate_PIT_CvM()
print(f"PIT CvM stat and pval: {cvm_stat_and_pval}")
PIT CvM stat and pval: CramerVonMisesResult(statistic=9.62335199605935, pvalue=9.265039846440004e-10)
PIT CvM stat and pval: CramerVonMisesResult(statistic=9.62335199605935, pvalue=9.265039846440004e-10)
print(f"CvM metric of this sample: {cvm_stat_and_pval.statistic:.4f}")
CvM metric of this sample: 9.6234
Anderson-Darling
Another variation of the KS statistic is the Anderson-Darling (AD) test, a weighted mean-squared difference featuring enhanced sensitivity to discrepancies in the tails of the distribution.
ad_stat_crit_sig = metamets['ad']
print(f"PIT AD stat and pval: {ad_stat_crit_sig}")
ad_stat_crit_sig = pitobj.evaluate_PIT_anderson_ksamp()
print(f"PIT AD stat and pval: {ad_stat_crit_sig}")
PIT AD stat and pval: Anderson_ksampResult(statistic=np.float64(84.95623553609381), critical_values=array([0.325, 1.226, 1.961, 2.718, 3.752, 4.592, 6.546]), pvalue=np.float64(0.001))
PIT AD stat and pval: Anderson_ksampResult(statistic=np.float64(84.95623553609381), critical_values=array([0.325, 1.226, 1.961, 2.718, 3.752, 4.592, 6.546]), pvalue=np.float64(0.001))
print(f"AD metric of this sample: {ad_stat_crit_sig.statistic:.4f}")
AD metric of this sample: 84.9562
It is possible to remove catastrophic outliers before calculating the integral for the sake of preserving numerical instability. For instance, Schmidt et al. computed the Anderson-Darling statistic within the interval (0.01, 0.99).
ad_stat_crit_sig_cut = pitobj.evaluate_PIT_anderson_ksamp(pit_min=0.01, pit_max=0.99)
print(f"AD metric of this sample: {ad_stat_crit_sig.statistic:.4f}")
print(f"AD metric for 0.01 < PIT < 0.99: {ad_stat_crit_sig_cut.statistic:.4f}")
WARNING:root:Removed 1760 PITs from the sample.
AD metric of this sample: 84.9562
AD metric for 0.01 < PIT < 0.99: 89.9826
CDE Loss
In the absence of true photo-z posteriors, the metric used to evaluate individual PDFs is the Conditional Density Estimate (CDE) Loss, a metric analogue to the root-mean-squared-error:
where \(f(z | x)\) is the true photo-z PDF and \(\hat{f}(z | x)\) is the estimated PDF in terms of the photometry \(x\). Since \(f(z | x)\) is unknown, we estimate the CDE Loss as described in Izbicki & Lee, 2017 (arXiv:1704.08095). :
where the first term is the expectation value of photo-z posterior with respect to the marginal distribution of the covariates X, and the second term is the expectation value with respect to the joint distribution of observables X and the space Z of all possible redshifts (in practice, the centroids of the PDF bins), and the third term is a constant depending on the true conditional densities \(f(z | x)\).
cdelossobj = CDELoss(fzdata, zgrid, ztrue)
cde_stat_and_pval = cdelossobj.evaluate()
cde_stat_and_pval
stat_and_pval(statistic=np.float64(-6.725602928688286), p_value=nan)
print(f"CDE loss of this sample: {cde_stat_and_pval.statistic:.2f}")
CDE loss of this sample: -6.73
We note that all of the quantities as run individually are identical to the quantities in our summary table - a nice check that things have run properly.