New function: scipy.stats.xi_correlation

fancidev · October 19, 2024, 6:46am

Issue #21453 in scipy/scipy (“ENH: Add support for Xi Correlation in scipy”) is an enhancement request to add Chatterjee’s \xi correlation coefficient to scipy.stats. Pull Request #21658 by mdhaber (“ENH: `stats.xi_correlation`: add xi correlation function”) adds such a function, xi_correlation(x, y), which computes the correlation coefficient and the p-value under the null hypothesis that x and y are independent. Below is a short summary of the statistic and a proposal for its inclusion in scipy.stats.

Definition

The (sample) xi correlation coefficient was proposed by Chatterjee (2020). Let X and Y be (scalar-valued) random variables. Given paired observations (\mathbf{X},\mathbf{Y}) := (X_1, Y_1), \ldots, (X_n, Y_n) drawn independently from the joint distribution of (X,Y), it computes a number \xi_n(\mathbf{X},\mathbf{Y}) between -1 and 1 that, for large n, is close to 0 iff X and Y are independent and close to 1 iff Y is a (deterministic) function of X.
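To make the definition concrete, here is a minimal sketch of the no-ties formula from Chatterjee (2020), \xi_n = 1 - 3 \sum_{i=1}^{n-1} |r_{i+1} - r_i| / (n^2 - 1), where r_i is the rank of the Y value paired with the i-th smallest X. The name xi_n_sketch is hypothetical and this is not the code in the PR; it assumes 1-D inputs with no ties in x or y.

```python
import numpy as np
from scipy.stats import rankdata


def xi_n_sketch(x, y):
    """Chatterjee's xi_n for 1-D samples, assuming no ties in x or y."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = x.size
    # Sort the pairs by x, then rank the reordered y values (ranks 1..n).
    order = np.argsort(x, kind="stable")
    r = rankdata(y[order])
    # No-ties formula: 1 - 3 * (sum of adjacent absolute rank differences) / (n^2 - 1).
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)


x = np.arange(5.0)
print(xi_n_sketch(x, 2 * x + 1))  # increasing y gives 1 - 3/(n+1), i.e. 0.5 for n = 5
```

For a strictly monotone relationship the value is 1 - 3/(n+1), which approaches 1 only as n grows, and strongly zig-zagging ranks can give slightly negative values.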

Chatterjee (2024) gave a good survey of many follow-up results that elaborate or extend the statistic.

What’s good about it

An important feature of the xi correlation coefficient is that it is able to identify non-monotonic, especially oscillatory, dependence between two random variables. A practical example is given in Chatterjee (2020). Since most (all?) of the existing correlation coefficients in scipy.stats detect monotonic dependence, xi_correlation would be a valuable addition to fill the gap, in my opinion.
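As a small illustration of this point (a sketch on synthetic data, not the example from the paper), the snippet below compares Pearson's r and Spearman's rho with the xi coefficient on a noiseless oscillatory relationship y = cos(x). The helper xi_n_sketch is a hypothetical stand-in for the proposed function and assumes no ties.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr


def xi_n_sketch(x, y):
    """Chatterjee's xi_n for 1-D samples, assuming no ties (illustrative only)."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = x.size
    order = np.argsort(x, kind="stable")   # sort pairs by x
    r = rankdata(y[order])                 # ranks of y in that order
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)


rng = np.random.default_rng(0)
# Three full periods of a cosine, symmetric around zero: y is a
# deterministic but strongly non-monotonic function of x.
x = rng.uniform(-3 * np.pi, 3 * np.pi, size=1000)
y = np.cos(x)

r_pearson, _ = pearsonr(x, y)
rho_spearman, _ = spearmanr(x, y)
xi = xi_n_sketch(x, y)

print(f"Pearson r    = {r_pearson:+.3f}")
print(f"Spearman rho = {rho_spearman:+.3f}")
print(f"xi_n         = {xi:+.3f}")
```

On data like this the two classical coefficients should sit near 0 while xi_n is close to 1, since y is an exact function of x.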

What’s special about it

The introduction of the xi correlation coefficient attracted significant interest in the literature, I think for two reasons:

1. The construction is really clever. There already exists a population version of \xi, proposed by Dette, Siburg, and Stoimenov (2013), that takes a value between 0 and 1, ranging from independence to full dependence. Chatterjee’s xi coefficient \xi_n is a sample statistic that consistently estimates \xi, and it takes an extremely simple form: essentially the sum of absolute differences between adjacent ranks of Y after sorting by X. The simplicity seems surprising.
2. The coefficient has a simple asymptotic distribution. While other statistics exist that serve a similar purpose (i.e. identifying arbitrary functional dependence), none of them have a tractable distribution. In contrast, under independence, \sqrt{n} \xi_n is asymptotically normally distributed with mean 0 and variance 0.4 (= 2/5), and Chatterjee (2020) claims that the asymptotic approximation is good for sample sizes as small as 20. This makes the statistic highly tractable for theoretical investigation and extension.
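The asymptotic result can be turned into a test of independence in a few lines. The sketch below is my own illustration, not the PR's implementation: xi_n_sketch and xi_test_sketch are hypothetical names, ties are assumed absent, and the p-value is taken to be one-sided (large \xi_n rejects). It also checks by simulation that the variance of \sqrt{n} \xi_n under independence is near 2/5.

```python
import numpy as np
from scipy.stats import norm, rankdata


def xi_n_sketch(x, y):
    """Chatterjee's xi_n for 1-D samples, assuming no ties (illustrative only)."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = x.size
    order = np.argsort(x, kind="stable")
    r = rankdata(y[order])
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)


def xi_test_sketch(x, y):
    """Return (xi_n, one-sided asymptotic p-value) for H0: X and Y independent."""
    n = len(x)
    xi = xi_n_sketch(x, y)
    # Under H0 with continuous Y, sqrt(n) * xi_n is approximately N(0, 2/5).
    z = np.sqrt(n) * xi / np.sqrt(2 / 5)
    return xi, norm.sf(z)  # assumption: only large xi_n counts as evidence


rng = np.random.default_rng(0)
n, reps = 200, 2000

# Simulate the null distribution of sqrt(n) * xi_n with independent x and y.
null_draws = np.array([
    np.sqrt(n) * xi_n_sketch(rng.normal(size=n), rng.normal(size=n))
    for _ in range(reps)
])
null_var = null_draws.var()
print(f"simulated null variance of sqrt(n)*xi_n: {null_var:.3f} (theory: 0.4)")

# A noisy, non-monotonic functional relationship should be rejected decisively.
x = rng.normal(size=n)
xi_dep, p_dep = xi_test_sketch(x, x**2 + 0.1 * rng.normal(size=n))
print(f"dependent example: xi_n = {xi_dep:.3f}, p = {p_dep:.2e}")
```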

The above features also allow scipy’s implementation to be simple and efficient, a desirable property for inclusion.

What it’s not good at

The key deficiency of the statistic is that it is not as “powerful” as some of the competing statistics for testing independence (of X and Y). That is, if X and Y are in fact dependent, the xi correlation coefficient would need a much larger sample size to detect such dependence (by rejecting the null hypothesis) than some of the competing statistics. See Shi, Drton, and Han (2022) for the precise statement in terms of local alternatives. Auddy, Deb, and Nandy (2023) suggest the difference in power is of the order O(n^{2}).

The lack of power means that if Chatterjee’s coefficient detects dependence between X and Y, they are likely to be truly dependent; but if the coefficient fails to detect dependence, they may still be dependent, just not strongly enough to be detected. In this sense, the statistic is on the conservative side. This may or may not be desirable depending on the application.

Through numerical experiments, Chatterjee (2020) also noticed the weak power against non-oscillatory, especially monotone, alternatives for testing independence. On the other hand, the experiments also appeared to suggest that the statistic is sensitive to oscillatory alternatives. If this proves true, it could be a valuable property, as there are already many statistics dedicated to detecting monotone associations.

Do note that the above power analysis focuses on testing of independence, but the coefficient also quantifies the “strength” of dependence. This could be useful in data analysis.

Extensions

A possible extension of xi_correlation is to support a “symmetrized” version, defined by \max(\xi_n(\mathbf{X},\mathbf{Y}), \xi_n(\mathbf{Y},\mathbf{X})). Zhang (2023) showed that the asymptotic joint distribution of \xi_n(\mathbf{X},\mathbf{Y}) and \xi_n(\mathbf{Y},\mathbf{X}) under the null hypothesis of independence is bivariate normal with zero correlation. It follows that the asymptotic null distribution of the symmetrized statistic is skew normal.
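The skew-normal claim can be checked numerically. The maximum of two independent N(0, \sigma^2) variables has density 2 \phi(x/\sigma) \Phi(x/\sigma) / \sigma, which is the skew normal distribution with shape parameter 1 and scale \sigma. The sketch below (my own illustration, with the hypothetical helper xi_n_sketch and no ties assumed) compares both the idealized normal maximum and simulated values of \sqrt{n} times the symmetrized statistic against scipy.stats.skewnorm with \sigma = \sqrt{2/5}.

```python
import numpy as np
from scipy.stats import kstest, rankdata, skewnorm


def xi_n_sketch(x, y):
    """Chatterjee's xi_n for 1-D samples, assuming no ties (illustrative only)."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = x.size
    order = np.argsort(x, kind="stable")
    r = rankdata(y[order])
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)


rng = np.random.default_rng(0)
sigma = np.sqrt(2 / 5)
target = skewnorm(1, scale=sigma)  # shape a = 1, loc = 0, scale = sigma

# (a) Idealized case: maximum of two independent N(0, sigma^2) draws.
z = rng.normal(scale=sigma, size=(5000, 2)).max(axis=1)
ks_normal_max = kstest(z, target.cdf).statistic

# (b) Simulated symmetrized statistic under independence.
n, reps = 300, 1000
sym = np.empty(reps)
for k in range(reps):
    x, y = rng.normal(size=n), rng.normal(size=n)
    sym[k] = np.sqrt(n) * max(xi_n_sketch(x, y), xi_n_sketch(y, x))
ks_xi = kstest(sym, target.cdf).statistic

print(f"KS distance, max of two normals vs skewnorm: {ks_normal_max:.3f}")
print(f"KS distance, sqrt(n)*symmetrized xi vs skewnorm: {ks_xi:.3f}")
```

Small Kolmogorov-Smirnov distances in both cases are what the asymptotic result would lead one to expect; the second comparison also carries finite-sample error.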

Lin and Han (2023) proposed an extension of the xi correlation coefficient to improve its power by including more neighbors in the formula, rather than just the adjacent neighbor. (The number of neighbors to include is a “tuning parameter” of the statistic.) They showed that the extended coefficient achieves “near parametric” power for a certain class of local alternatives. This could be an interesting extension to xi_correlation, but I feel it’s less straightforward due to the need to fix the tuning parameter.

Another possible extension is to test the null hypothesis \xi(X,Y) = \xi_0 for some \xi_0 > 0. xyz suggests that the statistic is optimal (in some sense) for such hypothesis testing, though I have not looked into the details.

Chatterjee (2024) investigated the extension of the coefficient to multivariate X and Y. The XICOR R package implements this extension. It seems beyond the scope of scipy.stats in my opinion.

Picking a name for the function

The function is called xi_correlation in the current PR. From the reference list of Chatterjee (2024), it seems the coefficient is mostly referred to as “Chatterjee’s rank (correlation) coefficient”. On the other hand, in the XICOR R package co-authored by Chatterjee, the function is called “xicor”.

To distinguish the sample coefficient from its population version, it might be more specific to include “chatterjee” in the function name. On the other hand, I’m not sure whether it is a well-known term (like, say, Kendall’s tau). So I don’t have a concrete proposal for the name.

References

Auddy, A., Deb, N., and Nandy, S. (2023). “Exact Detection Thresholds and Minimax Optimality of Chatterjee’s Correlation Coefficient.” arXiv:2104.15140

Chatterjee, S. (2020). “A New Coefficient of Correlation.” Journal of the American Statistical Association, 116(536):2009-2022. arXiv:1909.10140

Chatterjee, S. (2024). “A Survey of Some Recent Developments in Measures of Association.” In Probability and Stochastic Processes, Springer. arXiv:2211.04702

Dette, H., Siburg, K. F., and Stoimenov, P. A. (2013). “A Copula-Based Non-parametric Measure of Regression Dependence.” Scandinavian Journal of Statistics, 40(1):21-41.

Lin, Z. and Han, F. (2023). “On boosting the power of Chatterjee’s rank correlation.” Biometrika, 110(2):283–299. arXiv:2108.06828

Shi, H., Drton, M. and Han, F. (2022). “On the power of Chatterjee’s rank correlation.” Biometrika, 109(2):317-333. arXiv:2008.11619

Zhang, Q. (2023). “On the asymptotic null distribution of the symmetrized Chatterjee’s correlation coefficient.” Statistics & Probability Letters, 194:109759. arXiv:2205.01769