stats

API reference for statistical functions.

Bootstrap estimates



bootstrap_sampling

 bootstrap_sampling (data:pandas.core.frame.DataFrame,
                     estimator:Callable=np.mean, n_boot:int=1000,
                     columns_to_exclude:List[str]=None)

Compute bootstrap estimates of the data distribution

Parameters:

data (DataFrame): Data containing the columns we want to generate bootstrap estimates from.
estimator (Callable, default np.mean): Estimator function that accepts an array-like argument.
n_boot (int, default 1000): Number of bootstrap estimates to compute.
columns_to_exclude (List[str], default None): Column names to exclude.
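
The sketch below illustrates the idea rather than the library implementation: resample the rows with replacement n_boot times and apply the estimator to each resample, skipping any excluded columns. The helper name simple_bootstrap is hypothetical.

import numpy as np
import pandas as pd

def simple_bootstrap(data, estimator=np.mean, n_boot=1000, columns_to_exclude=None):
    # Drop columns that should not be bootstrapped.
    if columns_to_exclude:
        data = data.drop(columns=columns_to_exclude)
    estimates = []
    for _ in range(n_boot):
        # Resample all rows with replacement and apply the estimator to each column.
        resample = data.sample(frac=1, replace=True)
        estimates.append(resample.apply(estimator))
    # One row per bootstrap estimate, one column per remaining data column.
    return pd.DataFrame(estimates).reset_index(drop=True)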

Usage:

Generate data with the columns we want to compute estimates from. The values in column a come from a Normal distribution with mean 0 and standard deviation 1; the values in column b come from a Normal distribution with mean 100 and standard deviation 10.

import numpy as np
import pandas as pd

data = pd.DataFrame(
    data={
        "a": np.random.normal(size=100),
        "b": np.random.normal(loc=100, scale=10, size=100)
    }
)
data.head()
a b
0 0.605639 92.817505
1 -0.775791 92.750026
2 -1.265231 107.981771
3 0.981306 101.388385
4 0.029075 122.700172

Compute the mean of the distribution by default

By default, the function computes the mean of each column n_boot times. Each value is the mean obtained from one bootstrap sample (a resample of the rows with replacement) of the original data.
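
For intuition, one such value can be reproduced by hand with plain pandas (a sketch, not the library call):

# One bootstrap resample of the rows, followed by the column means.
data.sample(frac=1, replace=True).mean()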

estimates = bootstrap_sampling(data, n_boot=100)
estimates
a b
0 0.012356 100.018394
1 0.143189 100.691872
2 -0.002554 99.874399
3 0.079395 99.539636
4 0.055096 100.452383
... ... ...
95 0.063409 100.439363
96 -0.024455 98.607045
97 0.209427 99.866736
98 0.061323 98.680469
99 0.289456 99.980295

100 rows × 2 columns

We can check whether the estimates make sense by computing the mean of the bootstrap estimates and comparing it with the means of the Normal distributions the data were generated from.

estimates.mean()
a      0.089538
b    100.099900
dtype: float64
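
As a quick cross-check (not part of the original example), these bootstrap means should also agree closely with the plain sample means of the columns:

data.mean()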

Specify the estimator function. Example: standard deviation.

We can pass other estimator functions, such as np.std, to compute the standard deviation.

estimates = bootstrap_sampling(data, estimator=np.std, n_boot=100)
estimates
a b
0 0.933496 10.126658
1 0.929125 9.852667
2 0.899762 10.307814
3 0.968039 10.416074
4 1.004349 10.441463
... ... ...
95 0.910904 10.357727
96 0.818276 12.358640
97 0.981826 9.622724
98 0.962237 10.897055
99 0.913994 11.096338

100 rows × 2 columns

If we take the mean of the bootstrap estimates of the standard deviation, we should recover values close to the standard deviations of the distributions the data were generated from.

estimates.mean()
a     0.943942
b    10.480457
dtype: float64

Exclude unwanted columns

Use columns_to_exclude to leave out columns that should not be resampled:

estimates = bootstrap_sampling(
    data, n_boot=100, columns_to_exclude=["b"]
)
estimates
a
0 0.259128
1 0.098232
2 0.087111
3 -0.131376
4 0.050997
... ...
95 0.129835
96 -0.004873
97 -0.046338
98 0.246239
99 0.355848

100 rows × 1 columns



compute_evaluation_estimates

 compute_evaluation_estimates (df:pandas.core.frame.DataFrame,
                               n_boot:int=1000, estimator:Callable=np.mean,
                               quantile_low:float=0.025,
                               quantile_high=0.975)

Compute point estimates and confidence intervals for per-query evaluation metrics.

Parameters:

df (DataFrame): Per-query evaluation data, usually obtained from the pyvespa evaluate method.
n_boot (int, default 1000): Number of bootstrap samples.
estimator (Callable, default np.mean): Estimator function that accepts an array-like argument.
quantile_low (float, default 0.025): Lower quantile used to compute the confidence interval.
quantile_high (float, default 0.975): Upper quantile used to compute the confidence interval.
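
A minimal sketch of the underlying idea, assuming the standard bootstrap percentile interval: for each model and metric, resample the per-query values with replacement, apply the estimator to each resample, and take the requested quantiles of those estimates. The helper name percentile_interval is hypothetical.

import numpy as np

def percentile_interval(values, estimator=np.mean, n_boot=1000,
                        quantile_low=0.025, quantile_high=0.975):
    values = np.asarray(values)
    # Bootstrap estimates of the chosen statistic.
    boot = [
        estimator(np.random.choice(values, size=len(values), replace=True))
        for _ in range(n_boot)
    ]
    return {
        "low": np.quantile(boot, quantile_low),
        "median": np.quantile(boot, 0.5),
        "high": np.quantile(boot, quantile_high),
    }

Applied per model, for example with data.groupby("model")["metric_1"].apply(percentile_interval), this yields one interval per model for metric_1.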

Usage:

Generate a sample data frame; it must contain a model column.

number_data_points = 1000
data = pd.DataFrame(
    data={
        "model": (
            ["A"] * number_data_points +
            ["B"] * number_data_points
        ),
        "query_id": (
            list(range(number_data_points)) +
            list(range(number_data_points))
        ),
        "metric_1": (
            np.random.binomial(size=number_data_points, n=1, p=0.3).tolist() +
            np.random.binomial(size=number_data_points, n=1, p=0.7).tolist()
        ),
        "metric_2": (
            np.random.binomial(size=number_data_points, n=1, p=0.1).tolist() +
            np.random.binomial(size=number_data_points, n=1, p=0.9).tolist()
        )
    }
).sort_values("query_id").reset_index(drop=True)
data
model query_id metric_1 metric_2
0 A 0 0 0
1 B 0 1 1
2 A 1 0 1
3 B 1 1 1
4 A 2 0 0
... ... ... ... ...
1995 A 997 1 0
1996 B 998 1 1
1997 A 998 1 0
1998 A 999 0 0
1999 B 999 0 1

2000 rows × 4 columns

Compute the confidence interval of the mean by default

compute_evaluation_estimates(data)
metric model low median high
0 metric_1 A 0.268000 0.296 0.325
1 metric_1 B 0.667000 0.696 0.724
2 metric_2 A 0.091000 0.109 0.129
3 metric_2 B 0.887975 0.907 0.924

Specify the estimator function. Example: standard deviation.

compute_evaluation_estimates(data, estimator=np.std)
metric model low median high
0 metric_1 A 0.442918 0.456491 0.468375
1 metric_1 B 0.448001 0.459983 0.470931
2 metric_2 A 0.289026 0.311639 0.335200
3 metric_2 B 0.264998 0.291829 0.315366

Specify interval coverage

Setting quantile_low=0.2 and quantile_high=0.8 gives a narrower 60% interval instead of the default 95% interval:

compute_evaluation_estimates(
    data, 
    quantile_low=0.2, 
    quantile_high=0.8
)
metric model low median high
0 metric_1 A 0.285 0.296 0.308
1 metric_1 B 0.684 0.696 0.708
2 metric_2 A 0.102 0.110 0.118
3 metric_2 B 0.898 0.906 0.914

The function also works if we drop the query_id column and keep only the model and metric columns:

compute_evaluation_estimates(data[["model", "metric_1", "metric_2"]])
metric model low median high
0 metric_1 A 0.269975 0.297 0.326000
1 metric_1 B 0.667975 0.696 0.726000
2 metric_2 A 0.091000 0.109 0.129025
3 metric_2 B 0.888000 0.907 0.923000