NumPy dtype=object arrays are essentially a Python sequence type with unusual performance characteristics: they're mutable like lists, but appending is expensive, as it is for tuples. They don't offload work from Python to C the way that non-object dtypes in NumPy do.
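To make that concrete, here's a minimal sketch: element assignment is cheap, like a list, but "appending" with np.append has to copy the whole array into a new allocation.

```python
import numpy as np

# dtype=object arrays hold pointers to Python objects, so elements
# can be reassigned in place, like a list...
obj = np.array([1, "two", 3.0], dtype=object)
obj[1] = 2

# ...but "appending" allocates and copies a whole new array, so it is
# O(n) per append, unlike list.append
bigger = np.append(obj, 4)
```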
For your problem, the best you can do with dtype=object is:
>>> import numpy as np
>>> values = np.array([0, 1, 2, 3, 4, 5, 6])
>>> array = np.array([
...     values[0:2], values[2:3], values[3:4], values[4:7]],
...     dtype=object,
... ).reshape(2, 2)
>>> array
array([[array([0, 1]), array([2])],
       [array([3]), array([4, 5, 6])]], dtype=object)
This at least stores the underlying values (0, 1, 2, 3, 4, 5, 6) contiguously. The nested "arrays" are Python objects, and you can see that like this:
>>> np.frombuffer(array, dtype=np.intp)
array([140148166124112, 140148166124496, 140148166122096, 140148166122288])
These are pointers to PyObject*s, not the contiguous underlying values. Another way to illustrate this is to use the CPython implementation detail that id returns the integer value of the PyObject* pointer.
>>> [id(col) for row in array for col in row]
[140148166124112, 140148166124496, 140148166122096, 140148166122288]
Note that the pointers are not even increasing; they're wherever malloc found space to put the 112-byte np.ndarray objects:
>>> import sys
>>> [sys.getsizeof(col) for row in array for col in row]
[112, 112, 112, 112]
That may be fine for you, depending on how your use-case scales. These dtype=object arrays are only a little better than the builtin list type, but there are plenty of applications in which Python objects are performant enough.
If you are going to consider Awkward Array, here’s how you can do it: your data structure is regular 2×2 on the outside, with a variable-length array as the third dimension. You can build that like this:
>>> import awkward as ak
>>> original = ak.to_regular([[[0, 1], [2]], [[3], [4, 5, 6]]], axis=1)
>>> original
<Array [[[0, 1], [2]], [[3], [4, ..., 6]]] type='2 * 2 * var * int64'>
The type is 2 * 2 * var * int64 to reflect the regularity of the first two dimensions and the irregularity of the third. There's no attribute called shape on this array. In this particular case, one could imagine calling the shape (2, 2, None) or (2, 2, float("inf")) or (2, 2, float("nan")) or something like that, but you can't get all three items of the tuple to be integers. Most cases are more complex than this and don't even have an obvious extension for "shape", so that's why it's not there at all.
If you have a data source that gives you the underlying values (as in my NumPy example above) along with the list lengths, 2, 1, 1, 3, then you could use the ak.unflatten function to build it without Python iteration.
This array is different from the NumPy dtype=object one in how it's laid out in memory:
>>> original.layout
<RegularArray size='2' len='2'>
    <content><ListOffsetArray len='4'>
        <offsets><Index dtype='int64' len='5'>
            [0 2 3 4 7]
        </Index></offsets>
        <content>
            <NumpyArray dtype='int64' len='7'>[0 1 2 3 4 5 6]</NumpyArray>
        </content>
    </ListOffsetArray></content>
</RegularArray>
The 0, 1, 2, 3, 4, 5, 6 values are contiguous, as I made them in the NumPy case, but instead of allocating a PyObject for each list, the lists are implicitly indicated by the 0, 2, 3, 4, 7 offsets in a second buffer. No third buffer is needed for the outer dimension because it's regular.
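To make the offsets encoding concrete, here's a minimal pure-NumPy sketch of how those two buffers reconstruct the four lists (the variable names are mine, not Awkward's):

```python
import numpy as np

# the two buffers from the layout above
content = np.array([0, 1, 2, 3, 4, 5, 6])  # contiguous values
offsets = np.array([0, 2, 3, 4, 7])        # list i is content[offsets[i]:offsets[i+1]]

# slicing with consecutive offsets recovers each variable-length list
# as a *view* of the contiguous buffer; no per-list PyObject is stored
lists = [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
```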
The computation you described is rather hard. There’s an ak.cartesian function for making Cartesian products of data from different arrays, but you want them all to come from different parts of the same array. That Cartesian product step can be performed as
>>> ak.cartesian(tuple(ak.flatten(original)), axis=0)
<Array [(0, 2, 3, 4), (...), ..., (1, 2, 3, 6)] type='6 * (int64, int64, in...'>
because tuple iterates over the lists of the flattened original, giving ak.cartesian the distinct arrays that it needs.
>>> tuple(ak.flatten(original))
(
<Array [0, 1] type='2 * int64'>,
<Array [2] type='1 * int64'>,
<Array [3] type='1 * int64'>,
<Array [4, 5, 6] type='3 * int64'>,
)
This step gives up on preventing the number of PyObjects from scaling with the size of the data, which is what makes NumPy's dtype=object bad. (Each of the above ak.Arrays is a PyObject.) Nevertheless, we can keep going with some munging to get the final result that you want:
>>> ak.to_numpy(
...     ak.concatenate(
...         ak.unzip(
...             ak.cartesian(
...                 tuple(ak.flatten(original)), axis=0
...             )[:, np.newaxis]
...         ),
...         axis=1,
...     )
... ).reshape(-1, 2, 2)
array([[[0, 2],
        [3, 4]],

       [[0, 2],
        [3, 5]],

       [[0, 2],
        [3, 6]],

       [[1, 2],
        [3, 4]],

       [[1, 2],
        [3, 5]],

       [[1, 2],
        [3, 6]]])
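As a sanity check (not a scalable approach, since it materializes every combination through Python iteration), itertools.product over the four lists gives the same result:

```python
import itertools
import numpy as np

lists = [[0, 1], [2], [3], [4, 5, 6]]               # the four variable-length cells
combos = np.array(list(itertools.product(*lists)))  # shape (6, 4), one row per combination
result = combos.reshape(-1, 2, 2)                   # back to 2x2 "grids"
```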
The scalability of this depends on whether "2×2" is a realistic size in your use-case. If it's "2000000×2000000", then no. But then I don't see how any Cartesian product would be feasible, either.
Alternatively, you could fill the Cartesian product with Numba, since you can iterate over Awkward Arrays in Numba. That can reduce the memory overhead of intermediate arrays, but nothing can help if the output has to be huge.
import numba as nb
import numpy as np

@nb.njit
def compute_cartesian(array):
    num_rows = len(array)
    num_cols = len(array[0])  # careful about len(array) == 0

    # running product of the list lengths, in row-major order of the cells
    cumprod = np.empty(num_rows * num_cols + 1, dtype=np.int32)
    cumprod[0] = 1
    k = 1
    for row in array:
        for col in row:
            cumprod[k] = cumprod[k - 1] * len(col)
            k += 1
    num_combinations = cumprod[-1]

    output = np.empty((num_combinations, num_rows, num_cols), dtype=np.int32)

    # tile/repeat pattern: values in earlier cells vary slowly down the
    # output, values in later cells vary quickly
    for i, row in enumerate(array):
        for j, col in enumerate(row):
            index = i * num_cols + j
            repeat = num_combinations // cumprod[index + 1]
            tile = cumprod[index]
            k = 0
            for _ in range(tile):
                for x in col:
                    for _ in range(repeat):
                        output[k, i, j] = x
                        k += 1
    return output
and
>>> compute_cartesian(original)
array([[[0, 2],
        [3, 4]],

       [[0, 2],
        [3, 5]],

       [[0, 2],
        [3, 6]],

       [[1, 2],
        [3, 4]],

       [[1, 2],
        [3, 5]],

       [[1, 2],
        [3, 6]]], dtype=int32)