Hi there,
I'd like to revive discussion about an old issue, scipy/scipy#12570 ("ENH: spatial: `cdist` is slow for `metric='mahalanobis'`"), which reports that scipy.spatial.distance.cdist(X, Y, 'mahalanobis') is slow for large inputs (X and Y both of size 500x500).
For (row) vectors u and v of length n and a symmetric positive-definite matrix \Sigma^{-1} of size n \times n, the Mahalanobis distance between u and v is defined as the square root of (u-v)\Sigma^{-1}(u-v)', where ' denotes transpose.
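As a quick sanity check of the definition above, here is a small sketch (using scipy.spatial.distance.mahalanobis, which takes the inverse covariance matrix directly; the data values are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

u = np.array([1.0, 0.0, 2.0])
v = np.array([0.0, 1.0, 1.0])
# A symmetric positive-definite inverse covariance matrix (Sigma^{-1}).
VI = np.array([[2.0, 0.5, 0.0],
               [0.5, 2.0, 0.5],
               [0.0, 0.5, 2.0]])

d = mahalanobis(u, v, VI)
delta = u - v
d_manual = np.sqrt(delta @ VI @ delta)  # sqrt((u-v) Sigma^{-1} (u-v)')
print(np.isclose(d, d_manual))
```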
The current cdist implementation computes the Mahalanobis (cross) distance between the rows of X and Y with a double loop, performing a vector-matrix-vector product in the inner loop. If X has shape p\times n and Y has shape q\times n, the time complexity is O(pqn^2).
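Conceptually, the double-loop computation looks like the following NumPy sketch (the actual implementation is in compiled code, but the asymptotic cost is the same):

```python
import numpy as np
from scipy.spatial.distance import cdist

def mahalanobis_cdist_naive(X, Y, VI):
    """Double loop with a vector-matrix-vector product per pair: O(p*q*n^2)."""
    p, q = X.shape[0], Y.shape[0]
    D = np.empty((p, q))
    for i in range(p):
        for j in range(q):
            delta = X[i] - Y[j]
            D[i, j] = np.sqrt(delta @ VI @ delta)  # O(n^2) per pair
    return D

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((10, 4)), rng.standard_normal((12, 4))
A = rng.standard_normal((4, 4))
VI = A @ A.T + 4 * np.eye(4)  # symmetric positive-definite
D = mahalanobis_cdist_naive(X, Y, VI)
```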
If p and q are large, a faster method is to first perform a Cholesky decomposition, writing \Sigma^{-1} = LL'. Then (u-v)\Sigma^{-1}(u-v)' = ||(u-v)L||^2, so cdist(X, Y, 'mahalanobis') = cdist(XL, YL, 'euclidean'). The time complexity of the entire computation is O(n^3) for the Cholesky decomposition, plus O((p+q)n^2) for the linear transform, plus O(pqn) for the Euclidean distance computation. This is an order of magnitude faster when p = q = n, which is significant for large inputs.
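The equivalence can be demonstrated directly with existing SciPy functions (a sketch of the proposed idea, not the current implementation):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n = 50
X = rng.standard_normal((200, n))
Y = rng.standard_normal((300, n))

# Build a symmetric positive-definite inverse covariance matrix Sigma^{-1}.
A = rng.standard_normal((n, n))
VI = A @ A.T + n * np.eye(n)

# Reference: the current O(p*q*n^2) path inside cdist.
d_ref = cdist(X, Y, 'mahalanobis', VI=VI)

# Cholesky trick: VI = L L', so (u-v) VI (u-v)' = ||(u-v)L||^2.
L = np.linalg.cholesky(VI)
d_fast = cdist(X @ L, Y @ L, 'euclidean')

print(np.allclose(d_ref, d_fast))  # the two methods agree
```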
In fact, a similar idea was proposed in scipy/scipy#4886 ("MAINT: svd-based mahalanobis distance" by argriffing), but that PR was closed without a conclusion.
So I'm writing this thread to ask whether you are aware of any prior discussion on the computation of the Mahalanobis distance, and whether you think the Cholesky-based method makes sense.
More details about the method are included in the comments on scipy/scipy#12570.
Thanks!