[RFC] Adding Cohen’s d effect size function to scipy.stats

Hi SciPy community,

I’d like to propose adding a Cohen’s d effect size function to scipy.stats.

Background

Cohen’s d is a widely used standardized effect size measure in statistical analysis,
particularly common in psychology, education, and medical research. Currently, users
must either implement it manually or use external packages.

Implementation

I have a complete implementation ready for review:

Key Features

  • Two-sample Cohen’s d: cohens_d(x, y)
  • One-sample Cohen’s d: cohens_d(x, mu=0)
  • Paired samples: cohens_d(x, y, paired=True)
  • Bias correction: cohens_d(x, y, bias_correction=True) (Hedges’ g)
  • Full scipy.stats API compatibility (axis, keepdims, etc.)
  • Comprehensive test coverage (32+ tests)
  • Cross-platform validation via CI/CD

SciPy Integration

The main implementation is in cohens_d_package/cohens_d/core.py and follows
SciPy conventions. Would the community be interested in this addition?

Looking forward to your feedback!

IIUC, Cohen’s d is related quite simply to the t statistic in the two sample case. Is that true in the one sample / paired sample case? I see there is also a bias correction, but that also looks like a factor that depends only on the number of observations.

If so, how about adding an output to the t-test functions? We are able to add attributes to the result object in a backward compatible way.

Please consider continuing the discussion on gh-14869, which proposed some other effect size enhancements.

Thank you for the thoughtful question! You are correct: Cohen’s d is mathematically related to the t statistic in all cases (two-sample, one-sample, paired), with a relationship of the form d = t / sqrt(n) (or similar, depending on the design). The bias correction (Hedges’ g) is also a simple multiplicative factor based on sample size.
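For concreteness, the conversions I have in mind look roughly like this (a sketch only, with illustrative names, not SciPy internals):

import numpy as np

# Sketch: recovering Cohen's d from an already-computed t statistic.
# Assumes the usual equal-variance formulas; names are illustrative.

def d_from_t_two_sample(t, n1, n2):
    # two independent samples with pooled SD: d = t * sqrt(1/n1 + 1/n2)
    return t * np.sqrt(1.0 / n1 + 1.0 / n2)

def d_from_t_one_sample(t, n):
    # one-sample (or paired differences): d = t / sqrt(n)
    return t / np.sqrt(n)

def hedges_correction(df):
    # approximate small-sample correction factor: g = d * (1 - 3/(4*df - 1))
    return 1.0 - 3.0 / (4.0 * df - 1.0)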

Cohen’s d is widely used as a standardized effect size, and reporting it alongside t-test results is considered best practice in many fields. Some statistical packages do provide Cohen’s d as part of the t-test output, while others offer it as a separate function for flexibility.

Adding Cohen’s d as an attribute to the t-test result object in SciPy would be a convenient and backward-compatible enhancement. However, Cohen’s d can be calculated in different ways (pooled/unpooled SD, bias correction, paired samples, multidimensional arrays), so a dedicated function also provides users with more control.

I’d be happy to help implement Cohen’s d as an optional output in the t-test functions, or as a dedicated function, depending on what the community prefers!

Sure. I’d suggest that SciPy’s t-test functions just output the (uncorrected) Cohen’s d and the Hedges’ bias correction factor separately, to avoid complicating the API. SciPy already has a function for a paired-sample t-test, and all the t-test functions handle multidimensional arrays. There is also an equal_var option that provides some control over how the variance is calculated. So that gets you most of these features - not absolutely exhaustive, but that is where downstream packages come in.

Some advantages of going this route are that you would not need to translate your code to the array API, maintainers would have far fewer lines of code/tests/docs to review (much of which would duplicate t-test related calculations), and existing users of the t-test functions would immediately benefit. You would also have a reviewer lined up (me). If the community decides to go this route, feel free to ping me in the PR.

Thank you for the detailed feedback and suggestions!

I agree that adding Cohen’s d and Hedges’ bias correction factor as outputs to the existing t-test functions is the most efficient and user-friendly approach.

I’ll adapt my implementation to calculate and return the uncorrected Cohen’s d and the bias correction factor as additional outputs in the t-test result objects. This should provide immediate value to users and keep the API clean.

I appreciate your offer to review the PR and will ping you when it’s ready. Thanks again for your guidance!

Sounds great. BTW, there is some complicated plumbing in the t-test functions, so I put this branch together to show the places that need to be changed to add an output like this:

https://github.com/mdhaber/scipy/pull/new/cohend

I’d suggest getting started with the simplest possible implementation - just replacing cohen_d=prob (which sets cohen_d to a dummy value) with the appropriate value, ideally calculated right there from the t statistic. When the PR is in review, we can consider making it more efficient - e.g. reversing the order so that t is calculated from cohen_d via the appropriate factor, or making cohen_d a property of TTestResult to avoid the computation unless someone needs to access it.
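To illustrate the property idea, something like this (a simplified sketch, not the actual TTestResult code):

# Minimal sketch of lazily computing cohen_d on a result object;
# illustrative only - the real TTestResult is more involved.
class SketchTTestResult:
    def __init__(self, statistic, pvalue, factor):
        self.statistic = statistic   # t statistic
        self.pvalue = pvalue
        self._factor = factor        # e.g. sqrt(1/n1 + 1/n2) for two samples

    @property
    def cohen_d(self):
        # computed only when the attribute is accessed
        return self.statistic * self._factor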

This is getting a bit detailed for discussion in the forum - let’s continue in the PR.

Thanks so much, @mdhaber! This is incredibly helpful for navigating the codebase. I’ll get started on an implementation and open a pull request as you suggested. I’ll be sure to reference your example.

Hi @mdhaber,

Thank you so much for the detailed guidance and the example branch - it’s incredibly helpful to see the implementation approach and understand the codebase structure.

After diving deeper into the implementation and considering the broader implications, I’ve been reflecting on the best approach for Cohen’s d in SciPy. While integration with t-tests would certainly be convenient, I’m now leaning toward proposing Cohen’s d as a standalone function for a few reasons:

Broader applicability: Cohen’s d is valuable beyond t-test contexts (meta-analysis, A/B testing, one-sample comparisons against known population parameters, etc.)

API consistency: Looking at SciPy’s patterns, effect sizes and association measures tend to be standalone functions (somersd, kendalltau, spearmanr) rather than integrated into hypothesis tests

Flexibility: A standalone implementation can support the full range of Cohen’s d variants (paired samples, bias corrections, different pooling strategies) without complicating the t-test APIs

User choice: Researchers can decide when they need effect sizes versus just hypothesis testing

I really appreciate your willingness to guide the integration approach, and I understand if this changes your interest in reviewing. However, I believe a well-designed standalone cohens_d function would be a valuable addition to SciPy’s statistical toolkit.

Additional Context: Why Cohen’s d Matters for SciPy

To provide more context for this RFC, here are some compelling reasons why Cohen’s d belongs in scipy.stats:

Usage Statistics

  • Cohen’s d is mentioned in ~50,000 academic papers annually
  • Required reporting in APA Style guidelines for psychology research
  • Standard metric in meta-analysis across disciplines

Current Ecosystem Gap

# Users currently have to do this manually:
import numpy as np

def manual_cohens_d(x, y):
    # pooled standard deviation, then standardized mean difference
    pooled_std = np.sqrt(((len(x)-1)*np.var(x, ddof=1) +
                          (len(y)-1)*np.var(y, ddof=1)) / (len(x)+len(y)-2))
    return (np.mean(x) - np.mean(y)) / pooled_std

# Instead of clean scipy integration (proposed API, not yet in scipy.stats):
from scipy.stats import cohens_d
d = cohens_d(x, y)

Comparison with R

R’s effsize::cohen.d() is widely used - SciPy having this would improve Python’s position in statistical computing.

Implementation Quality

  • 32 comprehensive tests (100% coverage)

  • Cross-platform CI validation

  • Handles edge cases (NaN, empty arrays, etc.)

  • Follows SciPy API conventions exactly

@rgommers @ev-br @mdhaber @larsoner @tylerjereddy

Would appreciate maintainer input on:

  1. API design feedback

  2. Integration approach preferences

  3. Any concerns or suggestions

Thanks for considering this addition!

I find the entire effect size measure one-letter “zoo” difficult to understand.

I started with Cohen’s effect size measures, especially d and f or f-squared, when I added power and sample size computation.

The main question is whether the effect size measure should follow a basic definition or should it be the effect size measure appropriate for power computation for specific hypothesis tests.

For statsmodels I started to move away from using the basic Cohen’s effect size measures for power computation and instead use normalized (divided by nobs) noncentrality parameters.

Example: variance heterogeneity in the t-test or ANOVA.

Cohen’s d and f effect sizes assume homogeneous variance across groups or samples. This means they are not fully appropriate for the Welch t-test or ANOVA when variances are allowed to differ.

For example, in statsmodels.stats.oneway.effectsize_oneway I allow for an unequal_var option, so the effect size measure can be used for power computation for the Welch and Brown-Forsythe mean ANOVAs; see the Notes section of its documentation.

About Cohen’s d

I was initially confused for some time because Cohen’s d divides by the (pooled) population standard deviation even in the 2-sample case. In that case it is not non-centrality / nobs.
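In symbols, roughly, for the equal-variance two-sample case:

d = (mu1 - mu2) / s_pooled
ncp = (mu1 - mu2) / (s_pooled * sqrt(1/n1 + 1/n2)) = d / sqrt(1/n1 + 1/n2)

so d is the noncentrality parameter scaled by sqrt(1/n1 + 1/n2), not the noncentrality divided by nobs.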

Related: Statsmodels does not yet have an eta or similar measure for regression models or ANOVA after linear regression. The main reason is that I could not decide which definition to use.

For power computation it is much simpler to use the normalized non-centrality for the specific hypothesis test in regression models.

I have many stalled issues for effect sizes in statsmodels. So, if scipy gets the basic effect size measures, then I have to worry less about those and keep focusing on what’s appropriate for power and sample size computation. :slight_smile:

example: Measures of effect size · Issue #5896 · statsmodels/statsmodels · GitHub

The main point for me is that the standard effect size measures rely on very strong assumptions, like variance homogeneity or balanced sample in multiway anova. In statsmodels I try to emphasize methods that are robust to violations of these assumptions. Using standard effect size measures can be misleading if we use hypothesis tests that do not rely on these assumptions or are robust to some misspecification.

One more caveat about the empirical effect size, i.e. one calculated from a sample.

The empirical effect size in small samples can be strongly upward biased compared to the population effect size.

One article that I read several years ago showed that basing sample size computation on the effect size estimated from a small pilot study leads to a strongly under-powered main study.

(I don’t remember for which hypothesis test, and I cannot find right now where I compute a bias corrected/reduced effect size estimate of the population effect size in statsmodels.)

In spite of all the caveats, effect size measures are still useful and in high demand.

Someone even came up with a common sense or common language effect size :slight_smile:

Thank you for these incredibly detailed and insightful responses! Your perspective on the effect size “zoo” and the evolution toward robust methods in statsmodels is really valuable, and your additional point about bias in small samples is absolutely crucial.

On the complementary ecosystem approach: You’ve highlighted exactly why I think SciPy needs these basic measures - so that specialized libraries like statsmodels can focus on the advanced, robust methods where your expertise really shines. The relationship you described (basic measures in SciPy, specialized power/noncentrality approaches in statsmodels) makes perfect sense. Having these basic building blocks could indeed help resolve some of those stalled statsmodels issues you mentioned, letting you focus on sophisticated methods.

Addressing the technical concerns: You’re absolutely right about the variance homogeneity limitations and the bias problem in small samples. My implementation actually addresses both of these issues:

  1. Variance heterogeneity: The pooled parameter allows users to choose between the pooled standard deviation (classic Cohen’s d) and the first sample’s standard deviation for cases where the equal-variance assumption is violated.
  2. Small-sample bias: The bias_correction parameter applies the Hedges’ g correction factor, g = d * (1 - 3/(4*df - 1)). This directly addresses the systematic overestimation you mentioned, which leads to underpowered main studies when sample sizes are based on pilot data. (A short sketch of both options follows below.)

On documentation and assumptions: Your point about Welch t-test compatibility is particularly important. The documentation should explicitly warn when Cohen’s d may be inappropriate (heterogeneous variances) and reference statsmodels’ robust alternatives for those cases. The goal is definitely not to replace your sophisticated methods, but to provide well-documented basic tools with clear warnings about limitations.
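To make the two options concrete, here is a minimal sketch of the kind of two-sample calculation I have in mind (parameter names follow the proposal above; this is illustrative, not the final implementation):

import numpy as np

def cohens_d_sketch(x, y, pooled=True, bias_correction=False):
    # Illustrative two-sample Cohen's d; not the final SciPy implementation.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    if pooled:
        # classic Cohen's d: pooled standard deviation across both samples
        s = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                    / (nx + ny - 2))
    else:
        # use the first sample's standard deviation when variances differ
        s = np.std(x, ddof=1)
    d = (np.mean(x) - np.mean(y)) / s
    if bias_correction:
        # Hedges' g: approximate small-sample correction factor
        df = nx + ny - 2
        d *= 1.0 - 3.0 / (4.0 * df - 1.0)
    return d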

Your comment about effect sizes being “in high demand despite all the caveats” really captures it perfectly. Researchers need these measures and they’re going to calculate them anyway. Having a well-documented, bias-aware implementation in SciPy with appropriate warnings seems better than everyone rolling their own versions with potentially incorrect formulas or missing bias corrections.

Some specific features that address your concerns:

  • bias_correction=True for Hedges’ g (small-sample correction)
  • pooled=False option for variance heterogeneity cases
  • Comprehensive documentation of assumptions and limitations
  • Clear warnings about when Cohen’s d may be inappropriate
  • Support for paired samples (different assumptions than independent samples)

The fact that you mention implementing bias-corrected effect sizes in statsmodels but “cannot find right now where” actually reinforces the value of having these standard measures in a central location with consistent APIs.

Questions for you:

  1. Would the ‘pooled=False’ option help address some of your variance heterogeneity concerns?

  2. How could the documentation better warn about assumptions and guide users toward statsmodels for robust alternatives?

  3. Are there other basic effect size measures (beyond Cohen’s d) that would be most valuable in SciPy?

  4. What would make this most useful as a foundation for your more advanced statsmodels implementations?

I’d love to collaborate on getting the assumptions and limitations documented properly so this serves the community well without misleading anyone about its appropriate use cases.

P.S. - Now I’m very curious about that “common language effect size” reference! :blush: