Hi all,
I have been working on an update to the
scikit-sparse project, and have come
across a few instances where the scipy.sparse.csc_array
properties
has_sorted_indices
and has_canonical_format
are inconsistent with the
underlying state of the indices
array.
Further, the sort_indices()
and sum_duplicates()
methods, which are
purportedly used to get a matrix into sorted/canonical format, rely on these
flags to determine whether any work needs to be done. If the flags are
incorrect, then these methods do not do what they say they will. There are even
unit tests that demonstrate this behavior!
I have filed one such bug report.
It is frustrating as a developer to not be able to rely on these properties to
accurately reflect the state of the array.
I propose to remove the @property.setter
methods for these two properties, so
that they are read-only. I see two possible options for how to handle
maintenance of these properties:
- Maintain the current method of caching the flags.
- Pros:
- Efficient access to these properties, as they can be retrieved in
O(1) time. - Users who “know what they’re doing” can still manually set these
properties if they are certain of the state of the array.
- Efficient access to these properties, as they can be retrieved in
- Cons:
- This contract puts the onus on any method that modifies the array to
maintain the correct state of these properties. It would take a lot of
work to review the entirescipy.sparse
codebase to ensure that this is
the case.
- This contract puts the onus on any method that modifies the array to
- Remove the caching mechanism, and compute the values of these properties
dynamically when accessed.
- Pros:
- Guarantees that the properties are always consistent with the state of the
array. - Avoids the need to review and maintain the entire codebase to ensure
consistency. - New methods that modify the array do not need to worry about
maintaining these properties.
- Guarantees that the properties are always consistent with the state of the
- Cons:
- Introduces computational overhead of O(nnz) when accessing.
- Users who “know what they’re doing” can no longer manually set these
properties.
There is a potential third option of rewriting the entire contract of the
csc_array
and csr_array
classes (and potentially others) to guarantee that
the matrix is always in canonical format after every operation. This is how
MATLAB handles its sparse matrix class. This would be a massive undertaking, but
would ultimately eliminate a lot of logic and potential bugs in checking and
maintaining these flags. It could also simplify a lot of internal
code, as all algorithms can assume that the input matrices are in canonical
format. I am curious if anyone out there relies on the ability to create sparse
matrices that are not in canonical format for performance or other reasons?
I am leaning towards option 1, since O(1) access is quite appealing. We should
create separate methods to explicitly check whether the indices are
sorted or have duplicates, which would perform the O(nnz) checks. I will
leave it up for debate whether these methods need to be public, or can be
left as private methods for internal testing only.
I would love to hear your thoughts on this proposal, and any other suggestions
you might have.