jackknife

class jackknife.Jackknife(input_sample_generator_function, n_samples, transformation_function, correlation_axis=-1, output_file_prefix='', output_samples_are_stored=False)

A class implementing a jackknife estimation.

The jackknife is a resampling method for estimating the variance and bias of an estimator. The bias estimate can further be used to obtain the bias-corrected jackknife estimator.

Given \(n\) input samples \(x_i\) and a transformation function \(f()\), the goal is to estimate \(y=f(x)\), with \(x\) and \(y\) being the true values. A simple estimate would just be the transformed input mean \(f(\bar{x})\), with \(\bar{x}\) being the input sample mean. In general this carries a bias of order \(1/n\). Using jackknife resampling this leading \(1/n\) term can be estimated and removed, yielding the bias-corrected jackknife estimator.

First, the resampling of the input is done by

\[x_i \rightarrow x'_i = (n\bar{x} - x_i) / (n-1),\]

where \(x'_i\) is the new sample. This is then transformed in the following way:

\[y_i = n f(\bar{x}) - (n-1) f(x'_i),\]

where \(y_i\) is the transformed, bias-corrected output sample. The jackknife estimator \(\hat{y}_{\textrm{jackknife}}\) can now be obtained simply by calculating the sample mean of the \(y_i\). Other statistical quantities, such as the variance and covariance, can also be calculated from the output samples. For more details, see do_estimation().
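The resampling and bias correction above can be sketched in a few lines of numpy. The transformation \(f(m)=m^2\) used here is a hypothetical example chosen because its bias is exactly of order \(1/n\), so the jackknife removes it completely:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=0.5, size=1000)  # input samples x_i
n = x.size

def f(m):
    # Illustrative transformation; E[f(x_bar)] carries a bias of sigma^2 / n.
    return m ** 2

x_bar = x.mean()
x_prime = (n * x_bar - x) / (n - 1)      # resampled values x'_i (leave-one-out means)
y = n * f(x_bar) - (n - 1) * f(x_prime)  # bias-corrected output samples y_i

y_jackknife = y.mean()                   # jackknife estimator
```

For this particular \(f\), the result coincides with the closed form \(\bar{x}^2 - s^2/n\), i.e. the leading \(1/n\) bias of the naive estimate \(f(\bar{x})\) is subtracted exactly.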

Both \(x_i\) and \(y_i\) can be scalar or high-dimensional quantities. To allow jackknifing multiple observables at once, the output samples must be a dictionary of numbers or numpy arrays, where each entry represents a different observable. The input samples are either numbers or numpy arrays. In the latter case, naturally, all calculations are done element-wise.
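A sketch of what the two user-supplied callables might look like; the sample data and the observable names (`mean_component`, `squared`) are hypothetical illustrations, not part of the API:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 4))  # 100 input samples, each a length-4 numpy array

def input_sample_generator_function():
    """Return a generator yielding the input samples x_i one at a time."""
    yield from data

def transformation_function(x):
    """Map one input sample to a dictionary of output observables."""
    return {"mean_component": x.mean(), "squared": x ** 2}

# The output sample is a dictionary; its keys name the observables.
sample = next(input_sample_generator_function())
out = transformation_function(sample)
```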

Variables
  • input_sample_generator_function – Function that returns a generator yielding the input samples \(x_i\). An input sample must either be a number or a numpy array.

  • n_samples (int) – \(n\)

  • transformation_function – \(f: x_i\rightarrow y_i\). Callable that takes a single input sample \(x_i\) as an argument and returns an output sample \(y_i\). An output sample must be a dictionary of numbers or numpy arrays. The keys in this dictionary are used as names for the observables in the output file (see write_results_to_file()) and the number of observables is deduced from the dictionary’s length.

  • correlation_axis – Axis along which the outer product for the covariance and correlation matrix is calculated. Must be an integer or an iterable with a length equal to the number of observables.

  • output_file_prefix – Prefix of the output file name (see output_file_name).

  • stores_output_samples – If True, the output samples \(y_i\) are also stored in the HDF5 output file.

Raises

ValueError – if \(n<2\)

property output_file_name

`output_file_prefix`_<YYYY-mm-dd_HH-MM-SS>.hdf5

do_estimation()

Estimate \(y=f(x)\) and some common statistics.

The jackknife estimator \(\hat{y}_{\textrm{jackknife}}\) is simply given by the sample mean \(\bar{y}\),

\[\hat{y}_{\textrm{jackknife}} = \bar{y} = \frac{1}{n} \sum_i y_i,\]

with output samples \(y_i\) and sample size \(n\). Other common statistical quantities are estimated with the sample variance \(s^2\), the sample standard deviation \(s\), the standard error of the mean \(\textrm{SEM}\), the sample covariance matrix \(q\) and the matrix of sample Pearson correlation coefficients \(r\). They are calculated using the following formulas:

\[\begin{split}s^2 &= \frac{1}{n-1}\left(\sum_i y_i^2 - \frac{1}{n} \left(\sum_i y_i\right)^2\right),\\ s &= \sqrt{s^2},\\ \textrm{SEM} &= \frac{s}{\sqrt{n}},\\ q &= \frac{1}{n-1} \left(\sum_i y_i\otimes y_i - \frac{1}{n}\left(\sum_i y_i\right) \otimes \left(\sum_i y_i\right)\right),\\ r_{ij} &= \frac{q_{ij}}{s_i s_j}.\end{split}\]

Here, \(\otimes\) denotes the outer product along the correlation_axis. Since \(y\) is a dictionary that may hold multiple independent observables, the above statistics are calculated separately for each entry.
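The formulas above can be sketched directly in numpy for a single vector-valued observable. This sketch assumes the correlation axis is the last one, so the outer product runs over the components:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=(500, 3))  # n = 500 output samples of a 3-component observable
n = y.shape[0]

y_bar = y.mean(axis=0)                                             # jackknife estimator
s2 = (np.sum(y**2, axis=0) - np.sum(y, axis=0)**2 / n) / (n - 1)   # sample variance
s = np.sqrt(s2)                                                    # sample standard deviation
sem = s / np.sqrt(n)                                               # standard error of the mean

# Covariance matrix via outer products over the component axis.
sum_outer = np.einsum('ni,nj->ij', y, y)
sum_y = y.sum(axis=0)
q = (sum_outer - np.outer(sum_y, sum_y) / n) / (n - 1)

r = q / np.outer(s, s)  # Pearson correlation coefficients
```

These agree with numpy's built-in `np.var(..., ddof=1)`, `np.cov`, and `np.corrcoef`, which can serve as a cross-check.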

Note

While the sample variance \(s^2\) is an unbiased estimator, the sample standard deviation \(s\) is not: it generally underestimates the true standard deviation. Consequently, the standard error of the mean is underestimated as well.

write_results_to_file()

Write the results of the last jackknife estimation to a file.

Create an HDF5 file with the following structure:

`output_file_prefix`_<YYYY-mm-dd_HH-MM-SS>.hdf5
        ├── .config
        │       ├── jk.n_samples
        │       └── jk.store_output_samples
        ├── <name_of_observable_1>
        │       ├── mean
        │       ├── variance
        │       ├── standard_deviation
        │       ├── standard_error_of_mean
        │       ├── transformed_input_mean
        │       ├── covariance
        │       └── correlation
        ├── <name_of_observable_2>
        │       ├── mean
        :       :

The time stamp marks the start of the last jackknife estimation, and the keys of the dictionary returned by transformation_function are used as the names of the observables.

Note

User code can and should expand the .config group with problem-specific metadata, e.g., with information about the input samples or the transformation function.