Support sparse crosstabs with >2 input sequences, convert to sparse arrays

I propose that we enhance stats.contingency.crosstab by switching from sparse matrix to sparse array output when sparse=True. This allows >2 input sequences (currently sparse matrix output must be 2D so only 2 input sequences are allowed). The PR adds (sparse=True) parametrization of tests for 1D and 3D output cases.

PR #24652

The change is backward incompatible because a user could expect the output to be a sparse matrix instead of a sparse array and try to matrix multiply using * or expect a min/max over an axis to be 2D. But most users won’t notice that change. It is unusual to try matrix multiplication (or matrix power) on crosstab counts. And min/max by axis of the counts is unusual as well.

We want to replace coo_matrix with coo_array instead of e.g. adding a keyword option to indicate the desired output to avoid changing the function’a current bool kwarg which indicates sparse/dense to a kwarg that accepts bools and strs. It would be ugly api and code. Also, we don’t want to add an option for sparse matrices that we will not want in the future anyway.

If there is any reason the user needs the old behavior, they should wrap the 2nd output in coo_matrix. Something like:

values, count = crosstab(S1, S2, sparse=True)
count = coo_matrix(count)

The idea is to announce this change in release-notes as a minor backward incompatibility. Something like:

Backward incompatible changes:

  • The output of the sparse option of scipy.stats.contingency.crosstab returns a sparse array instead of a sparse matrix to allow nD crosstabs. The resulting 2nd output of the function is now of type coo_array. If you need the sparse result to use numpy.marix api like * for matrix multiply and/or always 2D shape for min/max over an axis, you should wrap that output in coo_matrix. See the “Migration from spmatrix to sparray” guide.

Comments welcome!

Dan

1 Like

Thanks for the proposal @dschult. I think this change is probably reasonable, and better than cluttering the API long-term with a conversion keyword, plus having all users deal with a deprecation explicitly. The impact will likely be very minor.

That said, this is clearly against our backwards compatibility policy, so I do find it hard to judge. In particular because these changes come in so piecemeal. It’d be more comfortable to have a full overview of all the APIs that are left where we currently return sparse matrices, and do them all in a single release with a more prominent warning, rather than keep making one-off decisions like these.

It’d be more comfortable to have a full overview of all the APIs that are left where we currently return sparse matrices, and do them all in a single release with a more prominent warning, rather than keep making one-off decisions like these.

Thanks @gommers,
This reply is an attempt to have a full overview of all the APIs that are left where we currently return sparse matrices. The following areas currently return sparse matrices:

  1. io: hb_read, loadmat, mmread (two functions with that name) #24644 (merged)
  • kwarg spmatrix=bool (default True) determines sparse type in v1.17.
  • In v1.18 we warn that the default will change in v1.20.
  1. 'spartial.KDTree.sparse_distance_matrix`: #24642 (merged)
  • kwarg output_type Options: ‘dok_matrix’ (default), ‘coo_matrix’, ‘dict’, or ‘ndarray’ in v1.17
  • In v1.18 we add support for options ‘dok_array’, ‘coo_array’.
  • In v1.19 we deprecate warn the default value from dok_matrix to dok_array.
  • In v1.21 we replace the default value from dok_matrix to dok_array.
  1. stats.contingency.crosstab: #26452 (open)
  • kwarg sparse=bool with default False in v1.17. When True, coo_matrix with max 2 sequences.
  • proposal to switch from coo_matrix to coo_array without changing the kwarg at all. This allows more than 2 sequences in the crosstab (just like ndarray input) with nD sparse output. No change in func sign or default value. True has “backward incompatible change” that shouldn’t affect many users. A release note in section on “backward incompatible changes” and the doc_string would suggest that users can wrap the result in coo_matrix to regain previous behavior if needed. Practically can’t think of a reason someone would matrix multiply a crosstab. But if they use A.min(axis=0) it would produce 1D result in v1.18 but a 2D result in v1.17.
  • alternative is to add string options to the sparse kwarg to indicate desire for a sparse matrix or sparse array. The result of sparse=True would need a deprecation cycle to switch from coo_matrix to coo_array. The string option coo_matrix would then need to be deprecated and eventually removed when spmatrix is removed. It’s not clear how/when to deprecate a string value coo_array as it causes 2 code changes through the transition.
  • discussion in the PR suggests that the alternative is a little backwards safer, and the proposal creates less disruption in API.
  1. integrate.ivp and bvp sparse jacobian handling: mostly _ivp.bdf.num_jac #24656 (open)
  • In v1.17, sparse only uses csc_matrix in the num_jac and other sparse jacobian functions.
  • These are effectively internal functions. The output of the function is used by the solver. And while the functions do not start with _, they are not imported to any outer namespace. Other obviously private functions in the same module do not start with _ either.
  • To use the function directly, a user would need an import like: from scipy.integrate._ivp.bdf import num_jac which is accessing a private namespace.
  • A search across github found 240 repos with num_jac(. Most are copies of SciPy, including few DAE-scipy solvers build on copies. Many others are typestubs repos for scipy. I looked through many pages of repos and none used num_jac outside of the solver itself.
  • proposal: switch the sparse handling from csc_matrix to csc_array. Accept spmatrix input and if so, provide spmatrix output; But ndarray-like or sparray input leads to sparray output. No change in output of any solvers will result (solver doesn’t use any functionality that differs between types). If someone accesses the function using the private importing method they would get a sparse array instead of a sparse matrix in v1.18. But that is an implementation detail we don’t need to protect.
  • alternative: switch to some form of kwarg in this internal private function that doesn’t seem to be called outside of the solver.
  1. sparse._construct functions that rely on input type to determine output type. #24886 (open)
  • kron, kronsum, bmat, tril, triu rely on input to determine output. But when all input is dense, in v1.17 the result defaults to spmatrix.
  • proposal: Deprecate the default case when no sparse input is used in v1.18. Replace the default case output with sparray in v1.20.