FlexZBoost PDF Representation Comparison

Author: Drew Oldag

Last Run Successfully: September 26, 2023

This notebook does a quick comparison of storage requirements for Flexcode output using two different storage techniques. We’ll compare qp.interp (x,y interpolated) output against the native parameterization of qp_flexzboost.

Note: If you’re planning to run this in a notebook, you may want to use interactive mode instead. See FlexZBoost_PDF_Representation_Comparison.ipynb in the interactive_examples/estimation_examples/ folder for a version of this notebook in interactive mode.

import os
import matplotlib.pyplot as plt
import numpy as np

import qp
import tables_io

from rail.core.data import TableHandle
from rail.core.stage import RailStage

%matplotlib inline

Create references to the training and test data.

from rail.utils.path_utils import find_rail_file
trainFile = find_rail_file('examples_data/testdata/test_dc2_training_9816.hdf5')
testFile = find_rail_file('examples_data/testdata/test_dc2_validation_9816.hdf5')
training_data = tables_io.read(trainFile)
test_data = tables_io.read(testFile)

Define the configurations for the ML model to be trained by Flexcode. Specifically we’ll use Xgboost with a set of 35 cosine basis functions.

fz_dict = dict(zmin=0.0, zmax=3.0, nzbins=301,
               trainfrac=0.75, bumpmin=0.02, bumpmax=0.35,
               nbump=20, sharpmin=0.7, sharpmax=2.1, nsharp=15,
               max_basis=35, basis_system='cosine',
               hdf5_groupname='photometry',
               regression_params={'max_depth': 8,'objective':'reg:squarederror'})

fz_modelfile = 'demo_FZB_model.pkl'

Define the RAIL stage to train the model

from rail.estimation.algos.flexzboost import FlexZBoostInformer, FlexZBoostEstimator
inform_pzflex = FlexZBoostInformer.make_stage(name='inform_fzboost', model=fz_modelfile, **fz_dict)

Then we’ll run that stage to train the model and store the result in a file name demo_FZB_model.pkl.

%%time
inform_pzflex.inform(training_data)
Inserting handle into data store.  input: None, inform_fzboost
stacking some data...
read in training data
fit the model...
finding best bump thresh...
finding best sharpen parameter...
Retraining with full training set...
Best bump = 0.08947368421052632, best sharpen = 1.2
Inserting handle into data store.  model_inform_fzboost: inprogress_demo_FZB_model.pkl, inform_fzboost
CPU times: user 57.5 s, sys: 665 ms, total: 58.1 s
Wall time: 1min
<rail.core.data.ModelHandle at 0x7fe6f535df90>

Now we configure the RAIL stage that will evaluate test data using the saved model. Note that we specify qp_representation='flexzboost' here to instruct rail_flexzboost to store the model weights using qp_flexzboost.

pzflex_qp_flexzboost = FlexZBoostEstimator.make_stage(name='fzboost_flexzboost', hdf5_groupname='photometry',
                            model=inform_pzflex.get_handle('model'),
                            output='flexzboost.hdf5',
                            qp_representation='flexzboost')

Now we actually evaluate the test data, 20,449 example galaxies, using the trained model, and then print out the size of the file that was saved.

Note that the final output size will depend on the number of basis functions used by the model. Again, for this experiment, we used 35 basis functions.

%%time
output_file_name = './flexzboost.hdf5'
try:
    os.unlink(output_file_name)
except FileNotFoundError:
    pass

fzresults_qp_flexzboost = pzflex_qp_flexzboost.estimate(test_data)
file_size = os.path.getsize(output_file_name)
print("File Size is :", file_size, "bytes")
Inserting handle into data store.  input: None, fzboost_flexzboost
Inserting handle into data store.  model_inform_fzboost: <class 'rail.core.data.ModelHandle'> demo_FZB_model.pkl, (wd), fzboost_flexzboost
Process 0 running estimator on chunk 0 - 20,449
Process 0 estimating PZ PDF for rows 0 - 20,449
Inserting handle into data store.  output_fzboost_flexzboost: inprogress_flexzboost.hdf5, fzboost_flexzboost
File Size is : 3199236 bytes
CPU times: user 225 ms, sys: 32.8 ms, total: 258 ms
Wall time: 741 ms

Example calculating median and mode. Note that we’re using the %%timeit magic command to get an estimate of the time required for calculating median, but we’re using %%time to estimate the mode. This is because qp will cache the output of the pdf function for a given grid. If we used %%timeit, then the resulting estimate would average the run time of one non-cached calculation and N-1 cached calculations.

zgrid = np.linspace(0, 3., 301)
%%time
fz_medians_qp_flexzboost = fzresults_qp_flexzboost().median()
CPU times: user 8.83 s, sys: 19.5 ms, total: 8.85 s
Wall time: 8.49 s
%%time
fz_modes_qp_flexzboost = fzresults_qp_flexzboost().mode(grid=zgrid)
CPU times: user 10.9 s, sys: 29.5 ms, total: 10.9 s
Wall time: 10.5 s

Plotting median values.

fz_medians_qp_flexzboost = fzresults_qp_flexzboost().median()

plt.hist(fz_medians_qp_flexzboost, bins=np.linspace(-.005,3.005,101));
plt.xlabel("redshift")
plt.ylabel("Number")
Text(0, 0.5, 'Number')
../../../_images/01_FlexZBoost_PDF_Representation_Comparison_19_1.png

Example convertion to a qp.hist histogram representation.

%%timeit
bins = np.linspace(0, 3, 301)
fzresults_qp_flexzboost().convert_to(qp.hist_gen, bins=bins)
10.5 s ± 61.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now we’ll repeat the experiment using qp.interp storage. Again, we’ll define the RAIL stage to evaluate the test data using the saved model, but instruct rail_flexzboost to store the output as x,y interpolated values using qp.interp.

pzflex_qp_interp = FlexZBoostEstimator.make_stage(name='fzboost_interp', hdf5_groupname='photometry',
                            model=inform_pzflex.get_handle('model'),
                            output='interp.hdf5',
                            qp_representation='interp',
                            calculated_point_estimates=[])

Finally we evaluate the test data again using the trained model, and then print out the size of the file that was saved using the x,y interpolated technique.

The final file size will depend on the size of the x grid that defines the interpolation. However, we can see that in order to match the storage requirements of qp_flexzboost, the x grid would need to be smaller than the number of basis functions used by the model. For this experiment, we used 35 basis functions.

%%time
output_file_name = './interp.hdf5'
try:
    os.unlink(output_file_name)
except FileNotFoundError:
    pass

fzresults_qp_interp = pzflex_qp_interp.estimate(test_data)
file_size = os.path.getsize(output_file_name)
print("File Size is :", file_size, "bytes")
Inserting handle into data store.  input: None, fzboost_interp
Inserting handle into data store.  model_inform_fzboost: <class 'rail.core.data.ModelHandle'> demo_FZB_model.pkl, (wd), fzboost_interp
Process 0 running estimator on chunk 0 - 20,449
Process 0 estimating PZ PDF for rows 0 - 20,449
Inserting handle into data store.  output_fzboost_interp: inprogress_interp.hdf5, fzboost_interp
File Size is : 49576854 bytes
CPU times: user 11 s, sys: 138 ms, total: 11.1 s
Wall time: 11.3 s

Example calculating median and mode. Note that we’re using the %%timeit magic command to get an estimate of the time required for calculating median, but we’re using %%time to estimate the mode. This is because qp will cache the output of the pdf function for a given grid. If we used %%timeit, then the resulting estimate would average the run time of one non-cached calculation and N-1 cached calculations.

zgrid = np.linspace(0, 3., 301)
%%timeit
fz_medians_qp_interp = fzresults_qp_interp().median()
877 ms ± 3.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%time
fz_modes_qp_interp = fzresults_qp_interp().mode(grid=zgrid)
CPU times: user 210 ms, sys: 49.3 ms, total: 259 ms
Wall time: 259 ms

Plotting median values.

fz_medians_qp_interp = fzresults_qp_interp().median()
plt.hist(fz_medians_qp_interp, bins=np.linspace(-.005,3.005,101));
plt.xlabel("redshift")
plt.ylabel("Number")
Text(0, 0.5, 'Number')
../../../_images/01_FlexZBoost_PDF_Representation_Comparison_31_1.png

Example convertion to a qp.hist histogram representation.

%%timeit
bins = np.linspace(0, 3, 301)
fzresults_qp_interp().convert_to(qp.hist_gen, bins=bins)
311 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We’ll clean up the files that were produced: the model pickle file, and the output data file.

model_file_name = 'demo_FZB_model.pkl'
flexzboost_file_name = './flexzboost.hdf5'
interp_file_name = './interp.hdf5'

try:
    os.unlink(model_file_name)
except FileNotFoundError:
    pass

try:
    os.unlink(flexzboost_file_name)
except FileNotFoundError:
    pass

try:
    os.unlink(interp_file_name)
except FileNotFoundError:
    pass