Best practices regarding accepting ragged arrays

vnmabus · November 17, 2023, 7:59am

I am repeating here the question I asked on Discord, as I think this is a better place for future reference.

I want to ask what the best practice is for working with ragged arrays when arbitrary array backends must be supported (I use the Python array API standard).

Context: Suppose I want to accept an “array” in which each element contains a list of possible values for that position, and I want to evaluate all the combinations (that is, the Cartesian product, similar to meshgrid but returning a single array with the shape of the original “array” plus one additional leading dimension). For example, if I receive a 2x2 “array” that has:

[0, 1] at position (0, 0)
[2] at position (0, 1)
[3] at position (1, 0)
[4, 5, 6] at position (1, 1)

I want to return a 6x2x2 array that is:
[
[[0, 2], [3, 4]],
[[0, 2], [3, 5]],
[[0, 2], [3, 6]],
[[1, 2], [3, 4]],
[[1, 2], [3, 5]],
[[1, 2], [3, 6]],
]
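To make the expected behaviour concrete, here is a minimal NumPy-only sketch of the operation described above. The name `ragged_product` is hypothetical, the input is assumed to be an object-dtype NumPy array whose elements are plain sequences, and it does not attempt to handle other backends, which is exactly the open question.

```python
import itertools

import numpy as np


def ragged_product(choices):
    """Hypothetical helper: Cartesian product over an object-dtype array.

    `choices` is assumed to be a NumPy array of dtype object whose
    elements are sequences of candidate values for that position.
    """
    shape = choices.shape
    # `choices.flat` visits the positions in C (row-major) order, and
    # itertools.product varies the last position fastest.
    combos = list(itertools.product(*choices.flat))
    # One row per combination, then restore the original shape.
    return np.array(combos).reshape((-1, *shape))


# Build the 2x2 example explicitly so that NumPy keeps object dtype.
choices = np.empty((2, 2), dtype=object)
choices[0, 0] = [0, 1]
choices[0, 1] = [2]
choices[1, 0] = [3]
choices[1, 1] = [4, 5, 6]

result = ragged_product(choices)
print(result.shape)
```

The result has one leading entry per combination, followed by the shape of the input.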

My question is what is the recommended practice for accepting such an “array”. Should one:

Assume that the “array” is a NumPy array with object dtype, whose elements may be arrays from a different backend, such as PyTorch tensors.
Also support the case in which the array is NOT of object dtype, which arises when every element has the same number of values. A possible problem with this approach is that it masks the case in which the user calls np.array on a list of, for example, PyTorch tensors, converting them to NumPy arrays by mistake.
Also support nested sequences. I feel this is more complicated to deal with and less efficient.
Also support non-NumPy alternatives, such as Awkward arrays or NestedTensors. A problem in this case is which kind of protocol we could define that is common to all of them, as I think even basic properties like “shape” are absent in some of them.
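As an illustration of the ambiguity behind the second option, the following sketch (NumPy only, with made-up data) shows how calling np.array on nested lists whose per-position lists all have the same length yields a regular numeric array with an extra trailing dimension, whereas building the container explicitly keeps object dtype. The variable names are only for illustration.

```python
import numpy as np

# Every position holds exactly two candidate values.
same_length = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]

# Implicit conversion: NumPy absorbs the innermost lists into a third
# axis, so the "list per position" structure is no longer visible.
implicit = np.array(same_length)
print(implicit.dtype, implicit.shape)

# Explicit construction: allocate an object-dtype array of the intended
# shape and assign each list to its position.
explicit = np.empty((2, 2), dtype=object)
for index in np.ndindex(explicit.shape):
    explicit[index] = same_length[index[0]][index[1]]
print(explicit.dtype, explicit.shape)
```

A function that accepts both forms would have to decide whether a trailing axis of a non-object array means "candidate values" or is simply part of the data.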