Best practices for accepting ragged arrays

I will repeat here the question that I asked on the Discord, as I think this is better for future reference.

I want to ask what the best practice is for working with ragged arrays in the presence of arbitrary array backends (I use the Python array API standard).

Context: Suppose I want to accept an “array” where each element contains a list of possible values for that position, and I want to evaluate all the combinations (that is, the Cartesian product, similar to meshgrid but returning a single array with the same shape as the original “array” plus one additional dimension). For example, if I receive a 2x2 “array” that has:

  • [0, 1] at position (0, 0)
  • [2] at position (0, 1)
  • [3] at position (1, 0)
  • [4, 5, 6] at position (1, 1)

I want to return a 6x2x2 array that is:
[
[[0, 2], [3, 4]],
[[0, 2], [3, 5]],
[[0, 2], [3, 6]],
[[1, 2], [3, 4]],
[[1, 2], [3, 5]],
[[1, 2], [3, 6]],
]
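For concreteness, here is a rough plain-NumPy sketch of the operation I have in mind, assuming the per-position value lists arrive as separate 1-D arrays (cartesian_grid is just an illustrative name):

import numpy as np

def cartesian_grid(value_lists, shape):
    # Cartesian product of the per-position values; with indexing="ij" the
    # last position varies fastest, matching the expected output above.
    meshes = np.meshgrid(*value_lists, indexing="ij")
    # One row per combination, one column per position (row-major order),
    # then restore the original "array" shape plus the leading dimension.
    flat = np.stack([m.reshape(-1) for m in meshes], axis=-1)
    return flat.reshape(-1, *shape)

result = cartesian_grid(
    [np.array([0, 1]), np.array([2]), np.array([3]), np.array([4, 5, 6])],
    (2, 2),
)
assert result.shape == (6, 2, 2)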

My question is what is the recommended practice for accepting such an “array”. Should one:

  • Assume that the “array” is a NumPy array with object dtype, where its elements may be arrays from a different backend, such as PyTorch tensors.
  • Consider also the case in which the array is NOT of object dtype, for when the number of values per element is the same. A possible problem with this approach is that it masks the case in which the user calls np.array on a list of, for example, PyTorch tensors, converting them to NumPy arrays by mistake.
  • Consider also nested sequences. I feel this is more complicated to deal with and less efficient.
  • Consider also non-NumPy alternatives, such as Awkward arrays or NestedTensors. A problem in this case is what kind of protocol we can define that is common to all of them, as I think even basic properties like “shape” are absent in some of them.

/cc @seberg @rgommers @jpivarski

I honestly don’t have an answer. I think you need some wrapper object/class; what is inside as the “backend” could rely on whatever protocol you wish, e.g. the array API or NumPy’s __array_ufunc__.

So, maybe that thing has a numpy object array, but the objects inside could well be arbitrary array-likes.
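As a very rough illustration of what I mean (all names here are made up):

import numpy as np

class RaggedContainer:
    """Thin wrapper: a NumPy object array as the outer container, while
    each element may be an array from any backend (NumPy, PyTorch, ...)."""

    def __init__(self, shape):
        self._data = np.empty(shape, dtype=object)

    @property
    def shape(self):
        return self._data.shape

    def __getitem__(self, index):
        return self._data[index]

    def __setitem__(self, index, value):
        # `value` can follow whatever protocol you settle on, e.g. the
        # array API standard or NumPy's __array_ufunc__.
        self._data[index] = value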

The only other thing I see is looking at AwkwardArray, since it is existing prior art.

Definitely don’t do this. Object arrays are pretty bad, and we’re trying to create them less often in NumPy and to steer users away from them. There are several much better designed alternatives: AwkwardArray, NestedTensor, and perhaps to some extent PyArrow for 1-D/2-D cases (since Arrow has nested dtypes).

Indeed, I’d avoid those; if you need arrays for performance, don’t accept nested sequences. It should be one or the other.

This is the only real option, I think. However, I would say that it’s only now feasible to write array-library-agnostic code for regularly-shaped array libraries. No one, as far as I know, has yet taken the next step of including irregularly-shaped arrays. I’m also not sure in which cases you need both. I’d probably start with explicit support for only AwkwardArray and NestedTensor (or even just one of those) before considering something more complicated.

The thing is, I just need the object array to act mostly as a container holding the individual arrays (so that I know to which position each belongs). The only operations that I perform on such an array are reshape (to flatten it before calling meshgrid), indexing, and getting its shape (to reshape after calling meshgrid). I think an object NumPy array is the ideal type for that.

If I were in a language with a native array type, such as C, this would be a multidimensional array containing pointers to the corresponding NumPy or PyTorch arrays. However, as the array module in Python allows neither multidimensional arrays nor object dtype, I see NumPy object arrays as the second-best choice. Is that wrong?

NumPy dtype=object arrays are essentially a Python Sequence type with special performance characteristics: they’re mutable like lists but appending is expensive like tuples. They don’t provide the offloading of work from Python to C the way that non-object dtypes in NumPy do.

For your problem, the best you can do with dtype=object is:

>>> values = np.array([0, 1, 2, 3, 4, 5, 6])
>>> array = np.array(
...     [values[0:2], values[2:3], values[3:4], values[4:7]],
...     dtype=object,
... ).reshape(2, 2)
>>> array
array([[array([0, 1]), array([2])],
       [array([3]), array([4, 5, 6])]], dtype=object)

This at least stores the underlying values (0, 1, 2, 3, 4, 5, 6) contiguously. The nested “arrays” are Python objects, and you can see that like this:

>>> np.frombuffer(array, dtype=np.intp)
array([140148166124112, 140148166124496, 140148166122096, 140148166122288])

These are pointers to PyObject*s, not to the contiguous underlying values. Another way to illustrate this is to use the CPython implementation detail that id returns the integer value of the PyObject* pointer:

>>> [id(col) for row in array for col in row]
[140148166124112, 140148166124496, 140148166122096, 140148166122288]

Note that the pointers are not even increasing; they’re wherever malloc found the space to put 112-byte np.ndarray objects:

>>> [sys.getsizeof(col) for row in array for col in row]
[112, 112, 112, 112]

That may be fine for you, depending on how your use-case scales. These dtype=object arrays are only a little better than the builtin list type, but there are plenty of applications in which Python objects are performant enough.


If you are going to consider Awkward Array, here’s how you can do it: your data structure is regular 2×2 on the outside, with a variable-length array as the third dimension. You can build that like this:

>>> original = ak.to_regular([[[0, 1], [2]], [[3], [4, 5, 6]]], axis=1)
>>> original
<Array [[[0, 1], [2]], [[3], [4, ..., 6]]] type='2 * 2 * var * int64'>

The type is 2 * 2 * var * int64, reflecting the regularity of the first two dimensions and the irregularity of the third. There’s no attribute called shape on this array. In this particular case, one could imagine calling the shape (2, 2, None) or (2, 2, float("inf")) or (2, 2, float("nan")) or something like that, but you can’t make all three items of the tuple integers. Most cases are more complex than this one and don’t even have an obvious extension of “shape”, which is why it isn’t there at all.

If you have a data source that gives you the underlying values (as in my NumPy example above) along with the list lengths 2, 1, 1, 3, then you could use the ak.unflatten function to build it without Python iteration.
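For example, a quick sketch reusing the values buffer from the NumPy example above:

>>> values = np.array([0, 1, 2, 3, 4, 5, 6])
>>> inner = ak.unflatten(values, [2, 1, 1, 3])    # 4 * var * int64
>>> original = ak.to_regular(ak.unflatten(inner, 2), axis=1)
>>> original
<Array [[[0, 1], [2]], [[3], [4, ..., 6]]] type='2 * 2 * var * int64'>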

This array is different from the NumPy dtype=object one in how it’s laid out in memory:

>>> original.layout
<RegularArray size='2' len='2'>
    <content><ListOffsetArray len='4'>
        <offsets><Index dtype='int64' len='5'>
            [0 2 3 4 7]
        </Index></offsets>
        <content>
            <NumpyArray dtype='int64' len='7'>[0 1 2 3 4 5 6]</NumpyArray>
        </content>
    </ListOffsetArray></content>
</RegularArray>

The 0, 1, 2, 3, 4, 5, 6 values are contiguous, as I made them in the NumPy case, but instead of allocating a PyObject for each list, the lists are implicitly indicated by the 0, 2, 3, 4, 7 offsets in a second buffer. No third buffer is needed for the outer dimension because it’s regular.
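As a quick check, the list lengths that those offsets encode are what ak.num reports:

>>> ak.num(original, axis=2)
<Array [[2, 1], [1, 3]] type='2 * 2 * int64'>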

The computation you described is rather hard. There’s an ak.cartesian function for making Cartesian products of data from different arrays, but you want them all to come from different parts of the same array. That Cartesian product step can be performed as

>>> ak.cartesian(tuple(ak.flatten(original)), axis=0)
<Array [(0, 2, 3, 4), (...), ..., (1, 2, 3, 6)] type='6 * (int64, int64, in...'>

because tuple iterates over the lists of the flattened original, giving ak.cartesian the distinct arrays that it needs.

>>> tuple(ak.flatten(original))
(
    <Array [0, 1] type='2 * int64'>,
    <Array [2] type='1 * int64'>,
    <Array [3] type='1 * int64'>,
    <Array [4, 5, 6] type='3 * int64'>,
)

This step gives up on preventing the number of PyObjects from scaling with the size of the data, which is the reason the NumPy dtype=object approach is bad. (Each of the above ak.Arrays is a PyObject.) Nevertheless, we can keep going with some munging to get the final result that you want:

>>> ak.to_numpy(
...     ak.concatenate(
...         ak.unzip(
...             ak.cartesian(
...                 tuple(ak.flatten(original)), axis=0
...             )[:, np.newaxis]
...         ),
...         axis=1)
... ).reshape(-1, 2, 2)
array([[[0, 2],
        [3, 4]],

       [[0, 2],
        [3, 5]],

       [[0, 2],
        [3, 6]],

       [[1, 2],
        [3, 4]],

       [[1, 2],
        [3, 5]],

       [[1, 2],
        [3, 6]]])

The scalability of this depends on whether “2×2” is a realistic size in your use-case. If it’s “2000000×2000000”, then no. But in that case, I don’t see how any Cartesian product would be feasible, either.

Alternatively, you could fill the Cartesian product with Numba, since you can iterate over Awkward Arrays in Numba. That can reduce the memory overhead of intermediate arrays, but nothing can help if the output has to be huge.

import numba as nb
import numpy as np

@nb.njit
def compute_cartesian(array):
    num_rows = len(array)
    num_cols = len(array[0])  # careful about len(array) == 0
    # Cumulative product of list lengths in row-major order; the last
    # entry is the total number of combinations.
    cumprod = np.empty(num_rows * num_cols + 1, dtype=np.int32)
    cumprod[0] = 1
    k = 1
    for row in array:
        for col in row:
            cumprod[k] = cumprod[k - 1] * len(col)
            k += 1
    num_combinations = cumprod[-1]
    output = np.empty(
        (num_combinations, num_rows, num_cols),
        dtype=np.int32,
    )
    # Each position (i, j) cycles through its values in a tile/repeat
    # pattern, matching the odometer-style ordering of a Cartesian product.
    for i, row in enumerate(array):
        for j, col in enumerate(row):
            index = i * num_cols + j
            repeat = num_combinations // cumprod[index + 1]
            tile = cumprod[index]
            k = 0
            for _ in range(tile):
                for x in col:
                    for _ in range(repeat):
                        output[k, i, j] = x
                        k += 1
    return output

and

>>> compute_cartesian(original)
array([[[0, 2],
        [3, 4]],

       [[0, 2],
        [3, 5]],

       [[0, 2],
        [3, 6]],

       [[1, 2],
        [3, 4]],

       [[1, 2],
        [3, 5]],

       [[1, 2],
        [3, 6]]], dtype=int32)

First, thanks to @jpivarski for your detailed answer. I knew some of the things you mentioned, but others were new and rather interesting.

For my problem, I am using NumPy object arrays as a Python container of NumPy (or PyTorch, or any array API standard-compatible) arrays, albeit one that is multidimensional (it has tuple indexes, keeps a shape property, etc.). For that, I think NumPy arrays with object dtype are good enough. Performance should not be a problem, as this array needs to be small (otherwise the number of combinations in the Cartesian product is unbearable). Just to be clear about the problem I am dealing with: my objective is to evaluate functions (that take one array as input) at a grid of values. That is why I wanted to keep track of which grid corresponds to which array coordinate.
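In other words, the end goal looks roughly like this, where grid is the 6x2x2 Cartesian product computed above and f stands in for one of my real functions:

>>> f = lambda a: a.sum()              # stand-in for a real function
>>> [int(f(point)) for point in grid]  # one evaluation per grid point
[9, 10, 11, 10, 11, 12]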

I already knew that NumPy object arrays keep pointers to Python objects. By the way, notice that in your example the way you create the object array is error-prone, as the shape of the returned array depends on whether the inner arrays have the same shape or not. NumPy object arrays unfortunately need to be preallocated, as there is currently no way of letting NumPy know that the inner objects must be kept as they are.
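That is, something along these lines is safer, because the container’s shape is fixed up front and never depends on the inner lengths:

>>> array = np.empty((2, 2), dtype=object)
>>> array[0, 0] = values[0:2]
>>> array[0, 1] = values[2:3]
>>> array[1, 0] = values[3:4]
>>> array[1, 1] = values[4:7]
>>> array.shape
(2, 2)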

My question was more along the lines of “if I want to accept arrays of arrays, which ones should I accept, and which interface can I rely on for working with all of them?”. We have already seen that accepting NumPy object arrays is acceptable in this case. I am not familiar, however, with other solutions such as NestedTensor and Awkward arrays, or with other possibilities that users may consider reasonable to pass. From what I see, and from the answers I have received here, it seems that there is no common interface; moreover, I think there is no array API-compatible version of NestedTensor or Awkward arrays. Thus, I will limit myself to NumPy object arrays as the array container (the internal arrays may still belong to an arbitrary array API-compatible library), and will revisit that decision in the future if something better becomes available.

Thanks to everyone who commented for the quick and thorough responses.

Thanks for the feedback! (I thought it was better to say too much than too little. And anyway, someone else might find this thread useful someday.)

I don’t think there’s any way for ragged arrays to be Array-API compatible; I think it would have to be a different API, a third one after rectilinear arrays and DataFrames.

I got excited for a moment when I saw that Array-API’s array.shape is specified as

Tuple[Optional[int], ...]

Could None values in array.shape refer to ragged dimensions? No, they’re for unknown dimensions: dimensions whose size can only be determined from the data values. (Such data-dependent shapes are unfriendly to JAX, for instance, or would require a synchronize-and-read-from-GPU step in CuPy.)

I am not sure if it is possible, but I would rather have just one package implementing ragged-array behavior on top of any array API-compatible backend than have one implementation for each. If there is just one, there is no need for a standard (though I would try to make the implementation as similar as possible to the array API standard; in particular, it should behave identically when there are no ragged dimensions).

Another point that I wanted to make, after diving a bit more into this, is that an “array of arrays” is NOT exactly the same as a ragged array. The difference is subtle, similar to the difference between a “vector of vectors” and a “matrix”. They can be used almost interchangeably in many contexts, but in the first case the array or vector belongs to the “dtype”, while in the second case the “dtype” is a scalar. Thus, in the first case the dimensions of the inner array/vector are not included in the shape of the parent structure, nor can they be used as the “axis” parameter, etc. There are cases in which this distinction is relevant.
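To illustrate with a quick sketch (if I understood the Awkward examples above correctly):

>>> obj = np.empty(2, dtype=object)          # "array of arrays"
>>> obj[0], obj[1] = np.array([0, 1]), np.array([2, 3, 4])
>>> obj.shape                                # the inner dimension is invisible
(2,)
>>> rag = ak.Array([[0, 1], [2, 3, 4]])      # ragged array
>>> ak.num(rag, axis=1)                      # the ragged dimension is a real axis
<Array [2, 3] type='2 * int64'>
>>> rag[:, 0]                                # and it can be sliced
<Array [0, 2] type='2 * int64'>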

Absolutely. When I present on Awkward Array, I usually make two points: (1) performance, and (2) having a user interface in which you can slice all axes, including the ragged ones, improves usability.


Since I was looking at the Array API yesterday, I’ve been thinking about a new project: a small, pure-Python package that only does ragged arrays (regular and irregular dimensions only; no missing data, records, heterogeneous data, or anything else) and exactly follows the Array API standard. Slicing and reducer implementations are tricky, so this package would depend on Awkward Array and use its implementations (it might even contain an Awkward Array object), but it would present a different interface that excludes the fancy types and strictly adheres to the Array API.

This involves a reinterpretation of the Array API, so it might follow the letter of the Array API’s protocols but not their spirit. For instance, it’s clear that array.shape is a Tuple[Optional[int], ...] so that None can mean, “this dimension has a single int-valued length, but we don’t know what it is.”[1] The ragged array library would instead interpret it as, “this dimension has a different int-valued length for each element.”[2] I don’t know whether, when we work through the other parts of the API, we’ll run into a contradiction with this different interpretation, but I’ll at least look through it to see if it can be done.

And if it can’t, I’ll know where the contradiction is, and maybe we can negotiate a corner (special case) of Array-API to allow for raggedness by relaxing whatever’s preventing it.


  1. epistemological!

  2. ontological!

Follow-up on this topic at Why does this library exist? · jpivarski/ragged · Discussion #6 · GitHub