Hacker News new | ask | show | jobs
by jpivarski 1646 days ago
Naturally, we considered this! :) We needed a larger set of tree node types than Apache Arrow in order to perform some of these operations without descending all the way down a subtree. For example, the implementation of the slice described in the video I linked above requires list offsets to be described as two separate arrays, which we call `starts` and `stops`, rather than a single set of `offsets`. So we have a ListArray (most general, uses `starts` and `stops`) and a ListOffsetArray (for the special case with `offsets`) and some operations produce one, other operations produce the other (as an internal detail, hidden from high-level users). Arrow's ListType is equivalent to the ListOffsetArray. If we had been forced to only use ListOffsetArrays, then the slice described in the video would have to propagate down to all of a tree node's children, and we want to avoid that because a wide record can have a lot of children.

So Awkward Array has a superset of Arrow's node types. However, one of the great things about columnar data is that transformation between formats can share memory and be performed in constant time. In the ak.to_arrow/ak.from_arrow functions (https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_a...), ListOffsetArrays are replaced with Arrow's list node type with shared memory (i.e. we give it to Arrow as a pointer). Our ListArrays are rewritten as ListOffsetArrays, propagating down the tree, before giving it to Arrow. If you're doing some array slicing and your goal is to end up with Arrow arrays, what we've effectively done is delayed the evaluation of the changes that have to happen in the list node's children until they're needed to fit Arrow's format. You might do several slices in your workflow, but the expensive propagation into the list node's children happens just once when the array is finally being sent to Arrow.

As far as what matters for users, Awkward Arrays are 100% compatible with Arrow through the ak.to_arrow/ak.from_arrow functions, and usually shares memory with O(1) cost (where "n" is the length of the array). When it isn't shared memory with O(1) conversion time, it's because it's doing evaluations that you were saved from having to do earlier.

1 comments

That sounds great! I'm not working in Python though, but I am interested in the Awkward Array data structure. I'm trying to find example code that just uses libawkward from C++.
Then I need to warn you that libawkward.so is going away—it didn't serve the purposes that we had for it. The decision to downsize the C++ part is described in detail here: https://indico.cern.ch/event/855454/contributions/4605044/

I think it would be great to develop this concept in other languages, and I even tried to do some standardization before implementation at the start of the project, but when it came down to it, it had to be implemented in some language. Even then, as users needed specific functions, we added them directly in Python whenever possible, since the development time of doing it in C++ and only exposing it in Python was prohibitive. Clement Helsens found this out when he tried to use the C++ layer without Python (https://github.com/clementhelsens/FCCAnalyses/blob/awkward/a... a lot of what he needed wasn't there.

But the ideas are simple, and if you have a language in mind, building up a similar implementation could be pretty quick for a specific application—it's generality that takes a long time. This page: https://awkward-array.readthedocs.io/en/latest/ak.layout.Con... has links to all the node types that we've found useful and each link has a minimal, demo-style implementation—not the real implementation with all its baggage. They show what it looks like to start implementing such a thing.

I should also point out that Joosep Pata implemented an alternative Awkward Array: https://arxiv.org/abs/1906.06242 and https://github.com/hepaccelerate/hepaccelerate , not with a different interface language, but using CUDA for a real application faster than my group could get to it.

If the language you're talking about is Julia, I'm doubly interested. In high-energy physics, there's a group of us who think that Julia would be a great alternative to the mix of Python and C++ we currently have. Despite the comments above about Julia having this already, Awkward Array is a rather different thing from JIT-compiled data structures, and it would be as valuable in Julia as it is in Python, least of all because GPUs have to be addressed in a vectorized way.

> Despite the comments above about Julia having this already, Awkward Array is a rather different thing from JIT-compiled data structures, and it would be as valuable in Julia as it is in Python, least of all because GPUs have to be addressed in a vectorized way.

This is just done through multiple dispatch with type parameters in Julia. Vector{Vector{Float64}} is an "array of lists". Vector{Dict{Symbol,Float64}} is an "array of records". Vector{Union{Missing,Vector{Float64}} is an array of lists that can have missings, where the memory and generated code is optimized for this union. Vector{String} is a vector of strings. LazyArrays.jl provides lazy array implementations, which satisfy the AbstractArray interface so they can be used pretty much anywhere a standard Array can be. Primitives like flattening and grouping are just the built in `reduce(hcat,x)` etc. which have dispatches to optimize the operation to only allocate once. Etc.

As standard arrays, you can do all of the `sum`, `prod`, broadacsting, linear algebra, `\`, `transpose`, etc. on these objects because of dispatches on `::AbstractArray`. You can put CuArrays into these kinds of objects, which is one of the building blocks that Flux.jl uses to represent the neural network weights BTW. The entire DataFrames.jl is built on such generically typed arrays, so data filtering operations like Query.jl support these types along with their unions with missings. And these are the objects used across the whole language, so the JIT of course applies to these objects.

Then if you want heterogeneous arrays that don't mix implementation with representation, my favorite is https://github.com/jonniedie/ComponentArrays.jl which are arrays of arrays that are stored contiguously which thusly work well with linear algebra. https://jonniedie.github.io/ComponentArrays.jl/dev/examples/... that's an example of quickly implementing some neural networks and using it with AD, and it all works with broadcast etc. to build optimized GPU kernels.

I don't get what's missing? Seems like that's the list in the documentation.

I think the bigger piece is the reversal of structure. In Julia, additions to arrays are allowed by the user and not hardcoded into the Base library. "Allowing GPUs" or "complex numbers", or "dictionaries" as elements is not special cased. If you create a type for numbers with error bounds (https://github.com/JuliaPhysics/Measurements.jl), everything here still applies. Vector{Dict{Symbol,Measurements}} generates the efficient code. So by design it's impossible to enumerate all of the ways a Julia array can be used, and that's why it's built with multiple dispatch on parametric types. So while I listed out the examples going down the AckwardArray docs, note that by composability you also get all of the combinations of the above, along with combinations of any object a user builds in the language.

I commend the effort of hardcoding a lot of cases people might use in Python, but man this really makes clear how a general solution is usually the right idea after the 3rd specialization.

> I don't get what's missing? Seems like that's the list in the documentation.

The essential part of this is the fact that it is a columnar implementation. Although that's not interface-level, it's not a detail: the time complexities and degree of memory sharing between input and output are different for these algorithms. The video link above is fast-forwarded to the 3.5 minute section that describes this in detail. (I've talked about it in about a dozen talks, but that one is a pretty good presentation.)

Keep in mind, I think Julia is great! The composability and generality of the array abstract type make short work of a wide variety of non-columnar algorithms. I wish we were using it more in high energy physics. As I said above, I'm doubly interested if someone is thinking about doing columnar algorithms in Julia. (The multiple dispatch would make it a bit different: you probably wouldn't want a high-level wrapper like ak.Array, just a uniform interface on all the node types, since they don't have class methods to expose internals.)

The columnar approach has two main advantages, only one of which is to make up for Python's necessarily slow implementation. The other is that it improves opportunities for vectorization, the hardware equivalent of what we're doing for Python. Doing it in Julia would not be superfluous!

In Julia you would just use an array of views for this, which can represent slices of arrays without any copies:

julia> struct Muon p_T::Float64; phi::Float64; eta::Float64; end

julia> a = reinterpret(Muon, rand(3*7))

7-element reinterpret(Muon, ::Vector{Float64}):

Muon(0.5512393381972832, 0.9349789151451744, 0.006690464595502932) Muon(0.5856015732294971, 0.19023473269375601, 0.40764209748521973) Muon(0.14872954753560852, 0.12281085717049867, 0.9307398048388644) Muon(0.7885776521084014, 0.1392696530731592, 0.4054805743644767) Muon(0.4841152655677211, 0.053858886714772236, 0.9556610184833677) Muon(0.5325190758093583, 0.31100637434877343, 0.4364100043728055) Muon(0.8697751162452897, 0.07683143115108726, 0.49822326551511953)

julia> jagged = [view(a, idx) for idx in [1:3, 4:4, 5:5, 6:7]]

4-element Vector{SubArray{Muon, 1, Base.ReinterpretArray{Muon, 1, Float64, Vector{Float64}, false}, Tuple{UnitRange{Int64}}, true}}:

[Muon(0.5512393381972832, 0.9349789151451744, 0.006690464595502932), Muon(0.5856015732294971, 0.19023473269375601, 0.40764209748521973), Muon(0.14872954753560852, 0.12281085717049867, 0.9307398048388644)] [Muon(0.7885776521084014, 0.1392696530731592, 0.4054805743644767)] [Muon(0.4841152655677211, 0.053858886714772236, 0.9556610184833677)] [Muon(0.5325190758093583, 0.31100637434877343, 0.4364100043728055), Muon(0.8697751162452897, 0.07683143115108726, 0.49822326551511953)]

Also note how I was able to tell Julia to just reinterpret a bunch of contiguous floating point values as objects of type `Muon`, which produced a `ReinterpretArray`. Nowhere in there did I ever copy any data from the original array produced by the `rand(3*7)` call.*

As mentioned in my post, ComponentArrays.jl will represent it columnarly and uses that for doing this efficiently with GPUs and automatic differentiation.