For parametric tests in `scipy.stats` (e.g. `ttest_ind`, `dunnett`), the first step is to use the raw data passed in (values for each observation) to compute summary statistics (values for each sample). These summary statistics are then used in subsequent steps to determine the test statistic, p-value, etc. For real-world data involving a large number of observations, users may only have the summary statistics for each sample (e.g. many-to-one comparisons for online advertising activity involving millions of records).

One way for SciPy to support tests when only sample statistics are available is to introduce a `_from_stats` version of each test where possible (like `ttest_ind_from_stats` for `ttest_ind`), but that approach may duplicate a significant chunk of code.
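For reference, the existing `ttest_ind_from_stats` already demonstrates the pattern: given only the per-sample mean, standard deviation, and number of observations, it reproduces the result of `ttest_ind` on the raw data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=500)
b = rng.normal(0.2, 1.0, size=500)

# Test computed from the full raw data
res_raw = stats.ttest_ind(a, b)

# Same test from summary statistics only
# (ddof=1 matches the sample std used internally)
res_stats = stats.ttest_ind_from_stats(
    a.mean(), a.std(ddof=1), a.size,
    b.mean(), b.std(ddof=1), b.size,
)

assert np.isclose(res_raw.statistic, res_stats.statistic)
assert np.isclose(res_raw.pvalue, res_stats.pvalue)
```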

A different approach, recommended by @mdhaber, is to refactor all applicable tests into two portions:

- Given raw data, compute summary statistics, or provide methods that can readily calculate them when needed (e.g. sample mean, standard deviation, mean absolute deviation, etc.)
- Given sample summary statistics, perform a statistical test (e.g. a two-sample t-test). This object should be able to produce a CI for the test statistic, the p-value, etc.
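A minimal sketch of what such a two-portion split could look like. All names here (`SampleStats`, `ttest_from_sample_stats`) are hypothetical illustrations, not an existing or proposed SciPy API; portion 2 simply delegates to the public `ttest_ind_from_stats` for demonstration.

```python
from dataclasses import dataclass

import numpy as np
from scipy import stats


@dataclass
class SampleStats:
    """Hypothetical container for per-sample summary statistics."""
    mean: float
    std: float   # sample standard deviation (ddof=1)
    nobs: int

    @classmethod
    def from_data(cls, x):
        # Portion 1: raw data -> summary statistics
        x = np.asarray(x, dtype=float)
        return cls(mean=x.mean(), std=x.std(ddof=1), nobs=x.size)


def ttest_from_sample_stats(s1: SampleStats, s2: SampleStats):
    # Portion 2: summary statistics -> test result (statistic, p-value);
    # delegates to the existing public function for illustration
    return stats.ttest_ind_from_stats(
        s1.mean, s1.std, s1.nobs, s2.mean, s2.std, s2.nobs
    )
```

With this split, a user who only has summary statistics can construct `SampleStats` directly and skip portion 1, while the raw-data path builds them via `from_data` and then reuses the same portion-2 code, avoiding duplication.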

Since refactoring the code for all tests that can follow such an approach may involve considerable effort, I wanted to seek feedback/suggestions from this forum before investing significant time in this direction.

For Context:

This guidance came up when I raised a PR for a new function, `dunnett_from_stats` (`dunnett_from_stats()` is to `dunnett()` what `ttest_ind_from_stats()` is to `ttest_ind()`).