FlexZBoost PDF Representation Comparison
Author: Drew Oldag
Last Run Successfully: September 26, 2023
This notebook does a quick comparison of storage requirements for
Flexcode output using two different storage techniques. We’ll compare
qp.interp (x,y interpolated) output against the native
parameterization of qp_flexzboost.
Note: If you’re planning to run this in a notebook, you may want to
use interactive mode instead. See
FlexZBoost_PDF_Representation_Comparison.ipynb
in the interactive_examples/estimation_examples/ folder for a
version of this notebook in interactive mode.
import os
import matplotlib.pyplot as plt
import numpy as np
import qp
import tables_io
from rail.core.data import TableHandle
from rail.core.stage import RailStage
%matplotlib inline
Create references to the training and test data.
from rail.utils.path_utils import find_rail_file
trainFile = find_rail_file('examples_data/testdata/test_dc2_training_9816.hdf5')
testFile = find_rail_file('examples_data/testdata/test_dc2_validation_9816.hdf5')
training_data = tables_io.read(trainFile)
test_data = tables_io.read(testFile)
Define the configurations for the ML model to be trained by Flexcode. Specifically we’ll use Xgboost with a set of 35 cosine basis functions.
fz_dict = dict(zmin=0.0, zmax=3.0, nzbins=301,
trainfrac=0.75, bumpmin=0.02, bumpmax=0.35,
nbump=20, sharpmin=0.7, sharpmax=2.1, nsharp=15,
max_basis=35, basis_system='cosine',
hdf5_groupname='photometry',
regression_params={'max_depth': 8,'objective':'reg:squarederror'})
fz_modelfile = 'demo_FZB_model.pkl'
Define the RAIL stage to train the model
from rail.estimation.algos.flexzboost import FlexZBoostInformer, FlexZBoostEstimator
inform_pzflex = FlexZBoostInformer.make_stage(name='inform_fzboost', model=fz_modelfile, **fz_dict)
Then we’ll run that stage to train the model and store the result in a
file name demo_FZB_model.pkl.
%%time
inform_pzflex.inform(training_data)
Inserting handle into data store. input: None, inform_fzboost
stacking some data...
read in training data
fit the model...
finding best bump thresh...
finding best sharpen parameter...
Retraining with full training set...
Best bump = 0.08947368421052632, best sharpen = 1.2
Inserting handle into data store. model_inform_fzboost: inprogress_demo_FZB_model.pkl, inform_fzboost
CPU times: user 57.5 s, sys: 665 ms, total: 58.1 s
Wall time: 1min
<rail.core.data.ModelHandle at 0x7fe6f535df90>
Now we configure the RAIL stage that will evaluate test data using the
saved model. Note that we specify qp_representation='flexzboost'
here to instruct rail_flexzboost to store the model weights using
qp_flexzboost.
pzflex_qp_flexzboost = FlexZBoostEstimator.make_stage(name='fzboost_flexzboost', hdf5_groupname='photometry',
model=inform_pzflex.get_handle('model'),
output='flexzboost.hdf5',
qp_representation='flexzboost')
Now we actually evaluate the test data, 20,449 example galaxies, using the trained model, and then print out the size of the file that was saved.
Note that the final output size will depend on the number of basis functions used by the model. Again, for this experiment, we used 35 basis functions.
%%time
output_file_name = './flexzboost.hdf5'
try:
os.unlink(output_file_name)
except FileNotFoundError:
pass
fzresults_qp_flexzboost = pzflex_qp_flexzboost.estimate(test_data)
file_size = os.path.getsize(output_file_name)
print("File Size is :", file_size, "bytes")
Inserting handle into data store. input: None, fzboost_flexzboost
Inserting handle into data store. model_inform_fzboost: <class 'rail.core.data.ModelHandle'> demo_FZB_model.pkl, (wd), fzboost_flexzboost
Process 0 running estimator on chunk 0 - 20,449
Process 0 estimating PZ PDF for rows 0 - 20,449
Inserting handle into data store. output_fzboost_flexzboost: inprogress_flexzboost.hdf5, fzboost_flexzboost
File Size is : 3199236 bytes
CPU times: user 225 ms, sys: 32.8 ms, total: 258 ms
Wall time: 741 ms
Example calculating median and mode. Note that we’re using the
%%timeit magic command to get an estimate of the time required for
calculating median, but we’re using %%time to estimate the
mode. This is because qp will cache the output of the pdf
function for a given grid. If we used %%timeit, then the resulting
estimate would average the run time of one non-cached calculation and
N-1 cached calculations.
zgrid = np.linspace(0, 3., 301)
%%time
fz_medians_qp_flexzboost = fzresults_qp_flexzboost().median()
CPU times: user 8.83 s, sys: 19.5 ms, total: 8.85 s
Wall time: 8.49 s
%%time
fz_modes_qp_flexzboost = fzresults_qp_flexzboost().mode(grid=zgrid)
CPU times: user 10.9 s, sys: 29.5 ms, total: 10.9 s
Wall time: 10.5 s
Plotting median values.
fz_medians_qp_flexzboost = fzresults_qp_flexzboost().median()
plt.hist(fz_medians_qp_flexzboost, bins=np.linspace(-.005,3.005,101));
plt.xlabel("redshift")
plt.ylabel("Number")
Text(0, 0.5, 'Number')
Example convertion to a qp.hist histogram representation.
%%timeit
bins = np.linspace(0, 3, 301)
fzresults_qp_flexzboost().convert_to(qp.hist_gen, bins=bins)
10.5 s ± 61.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now we’ll repeat the experiment using qp.interp storage. Again,
we’ll define the RAIL stage to evaluate the test data using the saved
model, but instruct rail_flexzboost to store the output as x,y
interpolated values using qp.interp.
pzflex_qp_interp = FlexZBoostEstimator.make_stage(name='fzboost_interp', hdf5_groupname='photometry',
model=inform_pzflex.get_handle('model'),
output='interp.hdf5',
qp_representation='interp',
calculated_point_estimates=[])
Finally we evaluate the test data again using the trained model, and then print out the size of the file that was saved using the x,y interpolated technique.
The final file size will depend on the size of the x grid that defines
the interpolation. However, we can see that in order to match the
storage requirements of qp_flexzboost, the x grid would need to be
smaller than the number of basis functions used by the model. For this
experiment, we used 35 basis functions.
%%time
output_file_name = './interp.hdf5'
try:
os.unlink(output_file_name)
except FileNotFoundError:
pass
fzresults_qp_interp = pzflex_qp_interp.estimate(test_data)
file_size = os.path.getsize(output_file_name)
print("File Size is :", file_size, "bytes")
Inserting handle into data store. input: None, fzboost_interp
Inserting handle into data store. model_inform_fzboost: <class 'rail.core.data.ModelHandle'> demo_FZB_model.pkl, (wd), fzboost_interp
Process 0 running estimator on chunk 0 - 20,449
Process 0 estimating PZ PDF for rows 0 - 20,449
Inserting handle into data store. output_fzboost_interp: inprogress_interp.hdf5, fzboost_interp
File Size is : 49576854 bytes
CPU times: user 11 s, sys: 138 ms, total: 11.1 s
Wall time: 11.3 s
Example calculating median and mode. Note that we’re using the
%%timeit magic command to get an estimate of the time required for
calculating median, but we’re using %%time to estimate the
mode. This is because qp will cache the output of the pdf
function for a given grid. If we used %%timeit, then the resulting
estimate would average the run time of one non-cached calculation and
N-1 cached calculations.
zgrid = np.linspace(0, 3., 301)
%%timeit
fz_medians_qp_interp = fzresults_qp_interp().median()
877 ms ± 3.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%time
fz_modes_qp_interp = fzresults_qp_interp().mode(grid=zgrid)
CPU times: user 210 ms, sys: 49.3 ms, total: 259 ms
Wall time: 259 ms
Plotting median values.
fz_medians_qp_interp = fzresults_qp_interp().median()
plt.hist(fz_medians_qp_interp, bins=np.linspace(-.005,3.005,101));
plt.xlabel("redshift")
plt.ylabel("Number")
Text(0, 0.5, 'Number')
Example convertion to a qp.hist histogram representation.
%%timeit
bins = np.linspace(0, 3, 301)
fzresults_qp_interp().convert_to(qp.hist_gen, bins=bins)
311 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We’ll clean up the files that were produced: the model pickle file, and the output data file.
model_file_name = 'demo_FZB_model.pkl'
flexzboost_file_name = './flexzboost.hdf5'
interp_file_name = './interp.hdf5'
try:
os.unlink(model_file_name)
except FileNotFoundError:
pass
try:
os.unlink(flexzboost_file_name)
except FileNotFoundError:
pass
try:
os.unlink(interp_file_name)
except FileNotFoundError:
pass