Sparse array `has_canonical_format` and `has_sorted_indices` as read-only properties

Hi all,

I have been working on an update to the
scikit-sparse project, and have come
across a few instances where the scipy.sparse.csc_array properties
has_sorted_indices and has_canonical_format are inconsistent with the
underlying state of the indices array.

Further, the sort_indices() and sum_duplicates() methods, which are
purportedly used to get a matrix into sorted/canonical format, rely on these
flags to determine whether any work needs to be done. If the flags are
incorrect, then these methods do not do what they say they will. There are even
unit tests that demonstrate this behavior!

I have filed one such bug report.

It is frustrating as a developer to not be able to rely on these properties to
accurately reflect the state of the array.

I propose to remove the @property.setter methods for these two properties, so
that they are read-only. I see two possible options for how to handle
maintenance of these properties:

  1. Maintain the current method of caching the flags.
  • Pros:
    • Efficient access to these properties, as they can be retrieved in
      O(1) time.
    • Users who “know what they’re doing” can still manually set these
      properties if they are certain of the state of the array.
  • Cons:
    • This contract puts the onus on any method that modifies the array to
      maintain the correct state of these properties. It would take a lot of
      work to review the entire scipy.sparse codebase to ensure that this is
      the case.
  2. Remove the caching mechanism, and compute the values of these properties
    dynamically when accessed.
  • Pros:
    • Guarantees that the properties are always consistent with the state of the
      array.
    • Avoids the need to review and maintain the entire codebase to ensure
      consistency.
    • New methods that modify the array do not need to worry about
      maintaining these properties.
  • Cons:
    • Introduces computational overhead of O(nnz) when accessing.
    • Users who “know what they’re doing” can no longer manually set these
      properties.
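
A minimal sketch of the failure mode behind option 1 (assuming current SciPy behavior, where the property setter simply overwrites the cached flag):

```python
import numpy as np
from scipy.sparse import csr_array

# 1x3 CSR array whose column indices are deliberately unsorted.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([2, 0, 1])
indptr = np.array([0, 3])
A = csr_array((data, indices, indptr), shape=(1, 3))

A.has_sorted_indices = True  # lie to the cached flag
A.sort_indices()             # short-circuits: it trusts the flag and does no work
print(A.indices)             # still [2 0 1] -- the property misreports the state
```

This is exactly the contract problem described above: `sort_indices()` is only as trustworthy as the flag it consults.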

There is a potential third option of rewriting the entire contract of the
csc_array and csr_array classes (and potentially others) to guarantee that
the matrix is always in canonical format after every operation. This is how
MATLAB handles its sparse matrix class. This would be a massive undertaking, but
would ultimately eliminate a lot of logic and potential bugs in checking and
maintaining these flags. It could also simplify a lot of internal
code, as all algorithms can assume that the input matrices are in canonical
format. I am curious if anyone out there relies on the ability to create sparse
matrices that are not in canonical format for performance or other reasons?

I am leaning towards option 1, since O(1) access is quite appealing. We should
create separate methods to explicitly check whether the indices are
sorted or have duplicates, which would perform the O(nnz) checks. I will
leave it up for debate whether these methods need to be public, or can be
left as private methods for internal testing only.
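
For concreteness, such explicit O(nnz) checks could look like the following sketch (hypothetical helper names, not existing SciPy API; they operate on the raw CSR/CSC arrays):

```python
import numpy as np

def indices_are_sorted(indptr, indices):
    """O(nnz) check: are the indices nondecreasing within each row/column?"""
    return all(np.all(np.diff(indices[s:e]) >= 0)
               for s, e in zip(indptr[:-1], indptr[1:]))

def indices_are_canonical(indptr, indices):
    """O(nnz) check: strictly increasing, i.e. sorted with no duplicates."""
    return all(np.all(np.diff(indices[s:e]) > 0)
               for s, e in zip(indptr[:-1], indptr[1:]))
```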

I would love to hear your thoughts on this proposal, and any other suggestions
you might have.


It sounds like you’re suggesting that array.has_canonical_format = True would no longer be legal, but users would still be allowed to set the property under some other name, e.g. array._has_canonical_format = True.

Requiring users to do this doesn’t seem like it brings any benefits to them. I would expect that anyone who knows about has_canonical_format is already a power user of scipy.sparse.

It’s also not enough to forbid access to array.has_canonical_format, because users can change array.indptr or array.indices, and changing those can make a csr matrix no longer canonical. I would be against restricting those - it is occasionally useful to manually manipulate a matrix.

That seems like it would make these properties pointless for some operations.

For example, if one wanted to find the value within a csr_array() at row 10, column 5, and has_sorted_indices=False, then one must do a linear search over the values within indices[indptr[i]:indptr[i+1]]. If has_sorted_indices=True, then a binary search is enough.

However, if one has to verify that has_sorted_indices is correct by checking each value in the array, then there would be no point in doing this optimization: checking that has_sorted_indices is correctly set would take longer than using the linear search.

(csr_sample_values doesn’t implement this optimization, but this is an example of the kind of thing where has_sorted_indices would matter.)
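
A sketch of that lookup (a hypothetical helper, not the `csr_sample_values` implementation), valid only when the row's column indices really are sorted:

```python
import numpy as np

def csr_get(indptr, indices, data, i, j):
    """Return A[i, j] via binary search within row i of a CSR matrix.
    Correct only if row i's column indices are sorted."""
    start, end = indptr[i], indptr[i + 1]
    k = start + np.searchsorted(indices[start:end], j)
    if k < end and indices[k] == j:
        return data[k]
    return 0.0
```

If `has_sorted_indices` cannot be trusted without an O(nnz) verification pass, the binary search above loses to the plain linear scan, which is exactly the point being made.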

That’s an interesting idea. The question is, how do you do that in a performant manner that is easier to verify than the status quo? In particular, are internal MATLAB routines implemented using this contract, or do they get access to an internal API that allows the kind of bugs that showed up in the linked issue?

To give an example of a useful non-canonical matrix, if you are using minimum_spanning_tree(), you can use an element of the array that is set to zero to represent a path with zero cost. Example SO post. If it were impossible to create non-canonical matrices, you’d have no way to represent this.

(Arguably you could use an element which is set to the minimum positive floating point value instead, which would be pretty similar.)

That is a fair point. I guess the idea of forcing a user to use the underscored
attribute makes it clear that they are doing something that could invalidate the
state of the matrix object.

I agree that the ability of the user to manipulate
a sparse matrix format directly is the essence of using SciPy sparse matrices
over something like MATLAB, which is more restrictive (see below).

I do not know the explicit details of the MATLAB internal routines, since it is largely closed source. My understanding is based on these references:

  • Gilbert (1992) paper[1] discusses the design philosophy behind MATLAB’s sparse routines.
  • Tim Davis’s book[2] (and SuiteSparse package) are the best references for many of the sparse algorithms used in MATLAB.

Davis’s book explicitly mentions that the user-facing routines require canonical format in §2.5, p. 15: “MATLAB has no explicit function to sort its sparse matrices. Each function or operator that returns a sparse matrix is required to return it with sorted columns.” [2]

Gilbert’s paper goes into more specific detail about the internal implementation of MATLAB’s sparse routines. Section 3.1.3 describes the “sparse accumulator” [1], and the internal routines that manipulate it.

The way we would implement this interface in SciPy is to make the public
indices, indptr, and data arrays read-only as well, with underscored
versions that are used internally. Then, we can make the guarantee that a given
sparse matrix object is in a self-consistent state at all times. Power users
would subclass the sparse matrix classes if they need to manipulate those arrays.

Design Philosophy

All of this discussion gets into the philosophy of SciPy sparse matrices and
what use cases we want to support. MATLAB’s sparse matrices are more
restrictive by design, so that users can just call sparse(A) and get the
benefits of sparse matrix algorithms without worrying about the details.
Experimenting with, or extending MATLAB’s sparse matrix algorithms requires
writing MEX files in C.

In SciPy, by contrast, it is much easier to experiment with sparse matrix
algorithms, since the user can directly manipulate the sparse matrix formats. It
is also easier to make mistakes, as the linked issue shows. I think the correct
approach may be the current one (cached has_canonical_format flag), with more
documentation to make it clear that these flags can be invalidated by user
manipulation, and more testing to ensure that all user-facing routines check the
flags and sort indices as needed.

References

[1]: Gilbert, John R., Cleve Moler, and Robert Schreiber. “Sparse Matrices in MATLAB: Design and Implementation.” SIAM Journal on Matrix Analysis and Applications 13.1 (1992): 333-356.
https://epubs.siam.org/doi/abs/10.1137/0613024
[2]: Davis, Timothy A. (2006). Direct Methods for Sparse Linear Systems. SIAM.
https://epubs.siam.org/doi/book/10.1137/1.9780898718881, §2.5, p 15.



Interesting! That is a good example. I would note though that the
csr_has_canonical_format function does not check for explicit zeros,
but only for sorted indices and duplicates. So a matrix with explicit zeros
could still be “canonical”. MATLAB on the other hand, does not allow explicit
zeros in its sparse matrices at all, as they are simply removed when
the matrix is created or modified.
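
To illustrate the SciPy side of that distinction (assuming current behavior), an explicitly stored zero does not disqualify a matrix from being “canonical”:

```python
import numpy as np
from scipy.sparse import csr_array

# 1x2 CSR array storing an explicit zero in column 0.
A = csr_array((np.array([0.0, 5.0]), np.array([0, 1]), np.array([0, 2])),
              shape=(1, 2))
print(A.has_canonical_format)  # True: indices sorted, no duplicates
print(A.nnz)                   # 2 -- the explicit zero is still stored
```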

An interesting hack-around for this limitation is the SuiteSparse CHOLMOD MATLAB
interface. It is written with MEX files in C, which is the only way to access
MATLAB’s internal Jc, Ir, and Data arrays. It has a function
ldlchol that returns a sparse Cholesky factorization, but retains explicit zeros in
the factorization. There are explicit warnings in the documentation that the
user must not modify the matrix in MATLAB if they want to update the
factorization later, since MATLAB will automatically remove the explicit zeros
if the matrix is modified in any way.

(as a side note, there is no built-in MATLAB function to update a sparse Cholesky factorization, only the dense cholupdate).

Through all of this research, I have found a potential optimization for SciPy’s
csr_sort_indices function.

In §3.1.4, Gilbert et al. state:

The MATLAB implementation actually violates the “time proportional to flops”
philosophy in one systematic way. The list of occupied row indices in the SPA
is not maintained in numerical order, but the sparse matrix data structure
does require row indices to be ordered. Sorting the row indices when storing
the SPA would theoretically imply an extra factor of O(log n) in the worst-case
running times of many of the matrix operations. All our algorithms could avoid
this factor–usually by storing the matrix with unordered row indices, then
using a linear-time transposition sort to reorder all the rows of the final
result at once–but for simplicity of programming we included the sort in
“spastore.” [1]

MATLAB’s internal routines sort the indices one column at a time, once that column is operated on and re-stored into the sparse matrix.

SciPy uses a comparison sort O(n log n) in csr_sort_indices. As a separate
discussion point, I would actually suggest changing that function to use
a linear-time transposition sort, as Gilbert suggests. My own implementation of
a transpose sort is in my C++Sparse library.
Given that the CSparse source is GPL’d, and a transpose sort would be
derivative of that code (see Davis, Exercises 2.7, 2.8, and 2.11), it may,
however, be a non-starter to implement in SciPy. Out of my own curiosity, I will
run some experiments to see how converting CSC → CSR → CSC at the Python level
compares to a transpose sort implemented directly at the C++ level, as well as
to the existing std::sort implementation.

That’s quite interesting, as a way to construct CSR/CSC matrices. When I read StackOverflow, it often seems like people don’t grok how CSR format works, and regard it as deep magic. It seems like the most popular option on SO is LIL, which has the pedigree of having both the worst memory usage and worst performance.

Allowing people to specify a CSR matrix by giving one row at a time in a Python loop seems like it could make the creation of CSR matrices much more accessible, particularly if it didn’t require the user to specify the number of nonzeros up front.
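
A hedged sketch of what such a builder could look like (the `CSRBuilder` class and its methods are hypothetical, not SciPy API):

```python
import numpy as np
from scipy.sparse import csr_array

class CSRBuilder:
    """Accumulate a CSR matrix one row at a time; nnz need not be known
    up front because the index/data arrays grow as rows are appended."""
    def __init__(self, ncols):
        self.ncols = ncols
        self.indptr = [0]
        self.indices = []
        self.data = []

    def add_row(self, cols, vals):
        # Append one row's (column, value) pairs and close the row.
        self.indices.extend(cols)
        self.data.extend(vals)
        self.indptr.append(len(self.indices))

    def build(self):
        return csr_array(
            (np.asarray(self.data), np.asarray(self.indices),
             np.asarray(self.indptr)),
            shape=(len(self.indptr) - 1, self.ncols))

b = CSRBuilder(3)
b.add_row([0, 2], [1.0, 2.0])
b.add_row([1], [3.0])
A = b.build()
```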

I don’t think that’s very clear. When we prefix a method or function with _, we might have one of several different reasons why.

  • The author doesn’t want to document it. (This is the most common reason I see in PRs. :slight_smile: )
  • We plan to change it and don’t want to be locked into this design.
  • The API is bad and doesn’t meet our standards for quality / reliability / clarity of our public APIs.
  • Allowing the user to change it could break something. (Which thing?)

I don’t think there’s a good way for the user to tell what meaning the SciPy developers had in mind when reading a _ prefixed attribute.

Unless we add a bug, which evidently we have already done. :slight_smile:

I’m a little confused by this comment by Gilbert. My understanding is that transposition sort is not linear time, and that no sort based upon comparing elements can be O(n).

My second thought was that he was perhaps talking about a counting sort or radix sort.

Do you know which one he means? I would look at your source code, but I want to avoid looking at GPL code in case I want to write SciPy code in this area in the future.


Gilbert does not mean “transposition sort” in the sense that this Wikipedia article describes. He is referring to “transpose” as in transposing the matrix. Another way to think of “transposing” a CSC matrix is to “reformat” it to a CSR matrix (or vice versa); i.e., the transpose function takes a CSC matrix A and returns a new matrix C. That new matrix C can either be thought of as equal to A, but in CSR format, or as equal to A.T, in CSC format.

The algorithm Davis uses for the transpose is a linear-time bucket sort (though a formal complexity proof is beyond both my scope and Davis’s). Either way you interpret the format, the resulting matrix C has sorted indices. If you call transpose(C), you get the original CSC matrix A back, but with sorted indices.

If you don’t want to look at the source code, you can probably work out for yourself how to iterate through the columns of a CSC matrix and place the entries into the CSR format in such a way that they are sorted.
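
For those who want to avoid the GPL’d source entirely, here is an independent sketch of the standard counting idea, written from scratch rather than derived from CSparse. Walking the columns in order means that within each output row, column indices are placed in order, which is why the result comes out sorted:

```python
import numpy as np

def csc_transpose(m, n, indptr, indices, data):
    """Linear-time transpose of an m x n CSC matrix: returns the CSC arrays
    of A.T (equivalently, A reformatted as CSR). A counting pass over the
    row indices acts as a bucket sort, so the output indices are sorted."""
    counts = np.bincount(indices, minlength=m)          # entries per row of A
    out_indptr = np.concatenate(([0], np.cumsum(counts)))
    out_indices = np.empty(len(indices), dtype=indices.dtype)
    out_data = np.empty(len(data), dtype=data.dtype)
    next_slot = out_indptr[:-1].copy()
    for j in range(n):                      # columns visited in order...
        for p in range(indptr[j], indptr[j + 1]):
            i = indices[p]
            q = next_slot[i]                # ...so each output row fills
            out_indices[q] = j              # with increasing column indices
            out_data[q] = data[p]
            next_slot[i] += 1
    return out_indptr, out_indices, out_data
```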

In SciPy, you can write A = A.tocsr().tocsc() and you will get a sorted array (but will still have duplicates and explicit zeros!).
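
Concretely (a small demonstration, assuming current SciPy conversion behavior):

```python
import numpy as np
from scipy.sparse import csc_array

# One column with unsorted row indices: A[1, 0] = 1, A[0, 0] = 2.
A = csc_array((np.array([1.0, 2.0]), np.array([1, 0]), np.array([0, 2])),
              shape=(2, 1))
B = A.tocsr().tocsc()   # two linear-time conversion passes
print(B.indices)        # [0 1] -- row indices now sorted
```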


Ah, that’s what I was missing. I thought he was talking about sorting an individual column in the sparse accumulator prior to adding it to the sparse matrix.

My intuition is that the O(n log n) algorithm would beat the O(n) algorithm here, because

  1. The n in the first algorithm refers to the number of entries in a particular column or row, which is usually small, so the log(n) term matters less than it appears, and
  2. Sorting within a particular row / column has good data locality / a small working set.

…but there’s an interesting theoretical case that running csr_tocsc twice could be faster. Let us know what you find.

I have submitted gh-23676 to fix the compressed format bug. I am working on the sorting timing and will report back.
