jackknife
- class jackknife.Jackknife(input_sample_generator_function, n_samples, transformation_function, correlation_axis=-1, output_file_prefix='', output_samples_are_stored=False)
A class implementing a jackknife estimation.
The jackknife is a resampling method that makes it possible to estimate the variance and bias of a parameter estimate. The bias estimate can further be used to obtain the bias-corrected jackknife estimator.
Given \(n\) input samples \(x_i\) and a transformation function \(f\), the goal is to estimate \(y=f(x)\), with \(x\) and \(y\) being the true values. A simple estimate would just be the transformed input mean \(f(\bar{x})\), with \(\bar{x}\) being the input sample mean. In general, this carries a bias of order \(1/n\). Using jackknife resampling, this leading \(1/n\) term can be estimated and removed, yielding the bias-corrected jackknife estimator.
First, the resampling of the input is done by
\[x_i \rightarrow x'_i = (n\bar{x} - x_i) / (n-1),\]where \(x'_i\) is the new resampled value, i.e. the mean of all input samples except \(x_i\). This is then transformed in the following way:
\[y_i = n f(\bar{x}) - (n-1) f(x'_i),\]where \(y_i\) is the transformed, bias-corrected output sample. The jackknife estimator \(\hat{y}_{\textrm{jackknife}}\) is then simply the sample mean of the \(y_i\). Other statistical quantities, such as the variance and covariance, can also be calculated from the output samples. For more details, see
do_estimation()
.
Both \(x_i\) and \(y_i\) can be scalar or high-dimensional quantities. To allow jackknifing multiple observables at once, each output sample must be a dictionary of numbers or numpy arrays, where each entry represents a different observable. The input samples are either numbers or numpy arrays; in the latter case, all calculations are done element-wise.
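The two formulas above can be sketched directly with numpy, independently of this class. This is a minimal illustration with made-up data and the arbitrary choice \(f(x)=x^2\), whose plug-in estimate \(f(\bar{x})\) carries a bias of exactly \(s^2/n\) that the jackknife removes:

```python
import numpy as np

# Made-up input samples and a simple transformation f(x) = x**2.
x = np.array([1.3, 0.7, 2.1, 1.8, 0.9, 1.5])
n = len(x)
x_bar = x.mean()

f = lambda m: m ** 2

# Resampling: x'_i = (n*x_bar - x_i) / (n - 1), the leave-one-out means.
x_loo = (n * x_bar - x) / (n - 1)

# Bias-corrected output samples: y_i = n*f(x_bar) - (n-1)*f(x'_i).
y = n * f(x_bar) - (n - 1) * f(x_loo)

# The jackknife estimator is the sample mean of the y_i.
y_jackknife = y.mean()
print(y_jackknife)
```

For this quadratic \(f\), the result equals \(\bar{x}^2 - s^2/n\) exactly, i.e. the naive estimate \(f(\bar{x})\) with its bias subtracted.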
- Variables
input_sample_generator_function – Function that returns a generator yielding the input samples \(x_i\). An input sample must either be a number or a numpy array.
n_samples (int) – \(n\)
transformation_function – \(f: x_i\rightarrow y_i\). Callable that takes a single input sample \(x_i\) as an argument and returns an output sample \(y_i\). An output sample must be a dictionary of numbers or numpy arrays. The keys in this dictionary are used as names for the observables in the output file (see
write_results_to_file()
) and the number of observables is deduced from the dictionary’s length.
correlation_axis – Axis along which the outer product for the covariance and correlation matrices is calculated. Must be an integer or an iterable with a length equal to the number of observables.
output_file_prefix – Prefix for the name of the HDF5 output file (see output_file_name).
stores_output_samples – If True, the output samples \(y_i\) are also stored in the HDF5 output file.
- Raises
ValueError – if \(n<2\)
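The input/output contract described above can be illustrated with a small self-contained sketch. It does not call `Jackknife` itself; the generator data and observable names are made up, and the pseudo-value formula is applied by hand to dict-valued output samples:

```python
import numpy as np

def input_sample_generator():
    """Yields the input samples x_i, here made-up 3-element numpy arrays."""
    rng = np.random.default_rng(seed=0)
    for _ in range(50):
        yield rng.normal(loc=1.0, scale=0.2, size=3)

def transformation(x):
    """f: x_i -> y_i. Returns a dict of observables (names are made up)."""
    return {"mean_squared": x ** 2, "exp_mean": np.exp(x)}

# Collect samples and apply the pseudo-value formula per observable.
samples = np.array(list(input_sample_generator()))
n = len(samples)
x_bar = samples.mean(axis=0)
x_loo = (n * x_bar - samples) / (n - 1)   # leave-one-out means

estimates = {}
for name, f_bar in transformation(x_bar).items():
    y = np.array([n * f_bar - (n - 1) * transformation(xp)[name]
                  for xp in x_loo])
    estimates[name] = y.mean(axis=0)      # element-wise jackknife estimate

print(estimates["mean_squared"].shape)    # same shape as an input sample
```

Because the input samples are arrays, everything is computed element-wise, so each estimate has the same shape as a single input sample.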
- property output_file_name
The name of the HDF5 output file: `output_file_prefix`_<YYYY-mm-dd_HH-MM-SS>.hdf5
- do_estimation()
Estimate \(y=f(x)\) and some common statistics.
The jackknife estimator \(\hat{y}_{\textrm{jackknife}}\) is simply given by the sample mean \(\bar{y}\),
\[\hat{y}_{\textrm{jackknife}} = \bar{y} = \frac{1}{n} \sum_i y_i,\]with output samples \(y_i\) and sample size \(n\). Other common statistical quantities are estimated with the sample variance \(s^2\), the sample standard deviation \(s\), the standard error of the mean \(\textrm{SEM}\), the sample covariance matrix \(q\) and the matrix of sample Pearson correlation coefficients \(r\). They are calculated using the following formulas:
\[\begin{split}s^2 &= \frac{1}{n-1}\left(\sum_i y_i^2 - \frac{1}{n} \left(\sum_i y_i\right)^2\right),\\ s &= \sqrt{s^2},\\ \textrm{SEM} &= \frac{s}{\sqrt{n}},\\ q &= \frac{1}{n-1} \left(\sum_i y_i\otimes y_i - \frac{1}{n}\left(\sum_i y_i\right) \otimes \left(\sum_i y_i\right)\right),\\ r_{ij} &= \frac{q_{ij}}{s_i s_j}.\end{split}\]Here, \(\otimes\) denotes the outer product along the
correlation_axis
. Since \(y\) is implemented as a dictionary that can hold multiple independent observables, the above statistics are calculated separately for each entry.

Note
While the sample variance \(s^2\) is an unbiased estimator, the sample standard deviation \(s\) is not: it generally underestimates the standard deviation. Consequently, the standard error of the mean is underestimated as well.
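The formulas above can be checked against numpy's built-in estimators (the sample data are made up; the \(n-1\) denominators correspond to `ddof=1`):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
y = rng.normal(size=(100, 4))   # made-up output samples, 4 components
n = len(y)

sum_y = y.sum(axis=0)
sum_y2 = (y ** 2).sum(axis=0)

# Sample variance, standard deviation, and standard error of the mean.
s2 = (sum_y2 - sum_y ** 2 / n) / (n - 1)
s = np.sqrt(s2)
sem = s / np.sqrt(n)

# Sample covariance matrix via outer products, and Pearson correlations.
outer_sum = np.einsum('ni,nj->ij', y, y)
q = (outer_sum - np.outer(sum_y, sum_y) / n) / (n - 1)
r = q / np.outer(s, s)

print(np.allclose(q, np.cov(y, rowvar=False)))
```

Here the outer product is taken along the second axis, which corresponds to `correlation_axis` for a 2-D observable.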
- write_results_to_file()
Write the results of the last jackknife estimation to a file.
Create an HDF5 file with the following structure:
`output_file_prefix`_<YYYY-mm-dd_HH-MM-SS>.hdf5
├── .config
│   ├── jk.n_samples
│   └── jk.store_output_samples
├── <name_of_observable_1>
│   ├── mean
│   ├── variance
│   ├── standard_deviation
│   ├── standard_error_of_mean
│   ├── transformed_input_mean
│   ├── covariance
│   └── correlation
├── <name_of_observable_2>
│   ├── mean
:   :
The time stamp is from the start of the last jackknife estimation, and the keys of the dictionary returned by
transformation_function
are used as the names of the observables.

Note
User code can and should expand the
.config
group with problem-specific metadata, e.g., with information about the input samples or the transformation function.
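Extending the .config group might look as follows. This is a sketch using h5py with a made-up file name, observable name, and metadata key; in practice the file is created by write_results_to_file() and user code only adds to it:

```python
import os
import tempfile

import h5py

# Write a file mimicking the documented layout (names are made up).
path = os.path.join(tempfile.mkdtemp(), "example_output.hdf5")
with h5py.File(path, "w") as f:
    cfg = f.create_group(".config")
    cfg["jk.n_samples"] = 100
    cfg["jk.store_output_samples"] = False
    obs = f.create_group("energy")          # made-up observable name
    obs["mean"] = 1.23
    obs["standard_error_of_mean"] = 0.04
    # Problem-specific metadata added by user code, as the note suggests:
    cfg["input.description"] = "Monte Carlo configurations, beta=2.0"

# Read the results and the metadata back.
with h5py.File(path, "r") as f:
    mean = f["energy/mean"][()]
    meta = f[".config/input.description"][()]

print(mean)
```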