|
|
|
|
|
by jpivarski
1646 days ago
|
|
Naturally, we considered this! :) We needed a larger set of tree node types than Apache Arrow in order to perform some of these operations without descending all the way down a subtree. For example, the implementation of the slice described in the video I linked above requires list offsets to be described as two separate arrays, which we call `starts` and `stops`, rather than a single set of `offsets`. So we have a ListArray (most general, uses `starts` and `stops`) and a ListOffsetArray (for the special case with `offsets`) and some operations produce one, other operations produce the other (as an internal detail, hidden from high-level users). Arrow's ListType is equivalent to the ListOffsetArray. If we had been forced to only use ListOffsetArrays, then the slice described in the video would have to propagate down to all of a tree node's children, and we want to avoid that because a wide record can have a lot of children. So Awkward Array has a superset of Arrow's node types. However, one of the great things about columnar data is that transformation between formats can share memory and be performed in constant time. In the ak.to_arrow/ak.from_arrow functions (https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_a...), ListOffsetArrays are replaced with Arrow's list node type with shared memory (i.e. we give it to Arrow as a pointer). Our ListArrays are rewritten as ListOffsetArrays, propagating down the tree, before giving it to Arrow. If you're doing some array slicing and your goal is to end up with Arrow arrays, what we've effectively done is delayed the evaluation of the changes that have to happen in the list node's children until they're needed to fit Arrow's format. You might do several slices in your workflow, but the expensive propagation into the list node's children happens just once when the array is finally being sent to Arrow. As far as what matters for users, Awkward Arrays are 100% compatible with Arrow through the ak.to_arrow/ak.from_arrow functions, and usually shares memory with O(1) cost (where "n" is the length of the array). When it isn't shared memory with O(1) conversion time, it's because it's doing evaluations that you were saved from having to do earlier. |
|