Hacker News new | ask | show | jobs
by jpivarski 1648 days ago
Hi! I'm the original author of Awkward Array (Jim Pivarski), though there are now many contributors with about five regulars. Two of my colleagues just pointed me here—I'm glad you're interested! I can answer any questions you have about it.

First, sorry about all the TODOs in the documentation: I laid out a table of contents structure as a reminder to myself of what ought to be written, but haven't had a chance to fill in all of the topics. From the front page (https://awkward-array.org/), if you click through to the Python API reference (https://awkward-array.readthedocs.io/), that site is 100% filled in. Like NumPy, the library consists of one basic data type, `ak.Array`, and a suite of functions that act on it, `ak.this` and `ak.that`. All of those functions are individually documented, and many have examples.

The basic idea starts with a data structure like Apache Arrow (https://arrow.apache.org/)—a tree of general, variable-length types, organized in memory as a collection of columnar arrays—but performs operations on the data without ever taking it out of its columnar form. (3.5 minute explanation here: https://youtu.be/2NxWpU7NArk?t=661) Those columnar operations are compiled (in C++); there's a core of structure-manipulation functions suggestively named "cpu-kernels" that will also be implemented in CUDA (some already have, but that's in an experimental stage).

A key aspect of this is that structure can be manipulated just by changing values in some internal arrays and rearranging the single tree organizing those arrays. If, for instance, you want to replace a bunch of objects in variable-length lists with another structure, it never needs to instantiate those objects or lists as explicit types (e.g. `struct` or `std::vector`), and so the functions don't need to be compiled for specific data types. You can define any new data types at runtime and the same compiled functions apply. Therefore, JIT compilation is not necessary.

We do have Numba extensions so that you can iterate over runtime-defined data types in JIT-compiled Numba, but that's a second way to manipulate the same data. By analogy with NumPy, you can compute many things using NumPy's precompiled functions, as long as you express your workflow in NumPy's vectorized way. Numba additionally allows you to express your workflow in imperative loops without losing performance. It's the same way with Awkward Array: unpacking a million record structures or slicing a million variable-length lists in a single function call makes use of some precompiled functions (no JIT), but iterating over them at scale with imperative for loops requires JIT-compilation in Numba.

Just as we work with Numba to provide both of these programming styles—array-oriented and imperative—we'll also be working with JAX to add autodifferentiation (Anish Biswas will be starting on this in January; he's actually continuing work from last spring, but in a different direction). We're also working with Martin Durant and Doug Davis to replace our homegrown lazy arrays with industry-standard Dask, as a new collection type (https://github.com/ContinuumIO/dask-awkward/). A lot of my time, with Ianna Osborne and Ioana Ifrim at my university, is being spent refactoring the internals to make these kinds of integrations easier (https://indico.cern.ch/event/855454/contributions/4605044/). We found that we had implemented too much in C++ and need more, but not all, of the code to be in Python to be able to interact with third-party libraries.

If you have any other questions, I'd be happy to answer them!

2 comments

Do you have some examples of slicing, masking, un-padding, and (I suppose) “Haskell-like” ops, eg fmap, but also eg treemap, vmap, pmap that are in Jax? Also grouping, cutting, and interweaving, and also.. this is kind of a weird ask but suppose I had an operation in pure assembly that is extremely fast w two int64 parameters outputting one int64, what’s the easiest path for me to get awkward to apply that to two Arrays and give me one Array back as output?
Okay, that's a lot of questions.

There are slice examples here, for all the different ways these arrays can be sliced: https://awkward-array.readthedocs.io/en/latest/_auto/ak.Arra...

That includes what I think you mean by "masking." (You mean keeping only the array elements that are `true` in a boolean array? There's another function we call ak.mask that keeps all array elements, but replaces the ones that line up with `false` with a missing value: https://awkward-array.readthedocs.io/en/latest/_auto/ak.mask...)

If you have irregular-length lists and you want to make them all the same length, that's padding, ak.pad_none: https://awkward-array.readthedocs.io/en/latest/_auto/ak.pad_... What's "un-padding"?

Mapping is implicit, as it is in NumPy. If you use an Awkward Array in any NumPy ufunc, including binary operators like `+`, `-`, `*`, `==`, `<`, `&`, etc., then all the arrays will be broadcasted and computed element-by-element. This is true whether the data structure is flat or a deep tree. ("Array-oriented" is a different style from "functional.")

There hasn't been much call for grouping yet—Awkward Array is more like a NumPy extension than a Pandas extension—but there is a way to do it by combining a few functions, which is described in the ak.run_lengths documentation: https://awkward-array.readthedocs.io/en/latest/_auto/ak.run_...

For wrapping a function with an ABI interface, I think the easiest way to do that would be to use Numba and ctypes.

    import ctypes
    import awkward as ak
    import numba as nb

    libm = ctypes.cdll.LoadLibrary("/lib/x86_64-linux-gnu/libm.so.6")
    libm_exp = libm.exp
    libm_exp.argtypes = (ctypes.c_double,)
    libm_exp.restype = ctypes.c_double

    libm_exp(0)    # 1.0
    libm_exp(1)    # 2.718281828459045
    libm_exp(10)   # 22026.465794806718

    @nb.vectorize([nb.float64(nb.float64)])
    def ufunc_exp(x):
        return libm_exp(x)

    array = ak.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])

    ufunc_exp(array)   # calls libm_exp on every value in array, returns an array of the same structure
Numba's @vectorize decorator (https://numba.pydata.org/numba-doc/latest/user/vectorize.htm...) makes a ufunc, and Awkward Array knows how to implicitly map ufuncs. (It is necessary to specify the signature in the @vectorize argument; otherwise, it won't be a true ufunc and Awkward won't recognize it.)

When Numba's JIT encounters a ctypes function, it goes to the ABI source and inserts a function pointer in the LLVM IR that it's generating. Unfortunately, that means that there is function-pointer indirection on each call, and whether that matters depends on how long-running the function is. If you mean that your assembly function is 0.1 ns per call or something, then yes, that function-pointer indirection is going to be the bottleneck. If you mean that your assembly function is 1 μs per call and that's fast, given what it does, then I think it would be alright.

If you need to remove the function-pointer indirection and still run on Awkward Arrays, there are other things we can do, but they're more involved. Ping me in a GitHub Issue or Discussion on https://github.com/scikit-hep/awkward-1.0

Amazing.. 100 gratitudes for you from me for taking the time out to explain all of that. Very impressed by this and all the work that’s been done.

Oh and for un-padding, I meant like how do I do the inverse of fill_none . pad_none

Also saw there was some stuff about algebraic types (eg semigroup reductions) - is that kind of algorithm-level type annotation a direction you all are interested in exploring further?

Un-padding: something like string-trimming (e.g. `str,rstrip`), but for missing values at the ends of lists... There isn't a function for that.

If you happen to know that the only uses of missing values are at the ends of lists, `ak.is_none` and `ak.sum` (with the appropriate `axis`) can count them, and you could perhaps construct a slice from that (negative to count from the end, and therefore slice off the missing values only). I'd have to think about it, but that would be the beginning of a columnar implementation of "unpad_none".

As for the algebraic types, I was using the terminology to explain what the reducers do. Some operations, like sum and product, have identities, and some don't, like argmin.

As for type annotations, I don't know what you mean. We're not using Python type annotations, but they'd be too coarse to describe what these operations do. Awkward-specialized type annotations might be overkill. For Dask, which needs to be able to predict types, we're passing tracer objects through the codebase to observe the types change without actually computing values, so it's a type-propagation by execution.

Ah interesting. Like so w the algebraic stuff I meant like, well if you have a semigroup or a monoid homomorphism it translates nicely into a parallel distributed computation problem- hence the semigroup flag works nicely with the reduction ops

So I was wondering how I could exploit Awkward’s typing system to use/implement some goodies from Haskell a la https://wiki.haskell.org/Typeclassopedia

Like, for instance, what if I could make an array of heterogenous ufuncs, and apply that to a similarly shaped array (like an Applicative).. like if I wanted to implement eg graph re-writing by applying a rules ufunc array to an adjancency array, etc, or even , to get very meta, apply a rules function array to another rules function array

Or if I wanted to compute eg the fixed point of a series of those applications, etc.

Or maybe if I wanted to use Arrow types to abstractly represent computations within each cell, do some fancy stuff in each cell, perform some rudimentary ’compiler optimization’ by inspecting which cells would end up doing unnecessary work (in the context of whatever problem I am doing; eg suppose I only permitted 3 chained ufunc calls per cell or something weird like that), that would be really cool too

Or eg if for some unknown reason I wanted each cell to fire off 2 concurrent ufuncs within each cell, and I only was interested in the result that ‘won’ the data race for each cell, I could use eg an Alternative in the style of the Concurrently library.

Or if I wanted eg each cell to be like a MonadPlus; do some work in the cell but also provide builtin “recovery” capabilities per cell if the cell evaluated to empty/missing/None

Ah now another interesting possibility could be a matrix of lambda calculus statements..!

Musings and sketches.. :)

Very very cool work indeed!

Was there no way to do this in Apache Arrow, or with some modifications to Arrow?
Naturally, we considered this! :) We needed a larger set of tree node types than Apache Arrow in order to perform some of these operations without descending all the way down a subtree. For example, the implementation of the slice described in the video I linked above requires list offsets to be described as two separate arrays, which we call `starts` and `stops`, rather than a single set of `offsets`. So we have a ListArray (most general, uses `starts` and `stops`) and a ListOffsetArray (for the special case with `offsets`) and some operations produce one, other operations produce the other (as an internal detail, hidden from high-level users). Arrow's ListType is equivalent to the ListOffsetArray. If we had been forced to only use ListOffsetArrays, then the slice described in the video would have to propagate down to all of a tree node's children, and we want to avoid that because a wide record can have a lot of children.

So Awkward Array has a superset of Arrow's node types. However, one of the great things about columnar data is that transformation between formats can share memory and be performed in constant time. In the ak.to_arrow/ak.from_arrow functions (https://awkward-array.readthedocs.io/en/latest/_auto/ak.to_a...), ListOffsetArrays are replaced with Arrow's list node type with shared memory (i.e. we give it to Arrow as a pointer). Our ListArrays are rewritten as ListOffsetArrays, propagating down the tree, before giving it to Arrow. If you're doing some array slicing and your goal is to end up with Arrow arrays, what we've effectively done is delayed the evaluation of the changes that have to happen in the list node's children until they're needed to fit Arrow's format. You might do several slices in your workflow, but the expensive propagation into the list node's children happens just once when the array is finally being sent to Arrow.

As far as what matters for users, Awkward Arrays are 100% compatible with Arrow through the ak.to_arrow/ak.from_arrow functions, and usually shares memory with O(1) cost (where "n" is the length of the array). When it isn't shared memory with O(1) conversion time, it's because it's doing evaluations that you were saved from having to do earlier.

That sounds great! I'm not working in Python though, but I am interested in the Awkward Array data structure. I'm trying to find example code that just uses libawkward from C++.
Then I need to warn you that libawkward.so is going away—it didn't serve the purposes that we had for it. The decision to downsize the C++ part is described in detail here: https://indico.cern.ch/event/855454/contributions/4605044/

I think it would be great to develop this concept in other languages, and I even tried to do some standardization before implementation at the start of the project, but when it came down to it, it had to be implemented in some language. Even then, as users needed specific functions, we added them directly in Python whenever possible, since the development time of doing it in C++ and only exposing it in Python was prohibitive. Clement Helsens found this out when he tried to use the C++ layer without Python (https://github.com/clementhelsens/FCCAnalyses/blob/awkward/a... a lot of what he needed wasn't there.

But the ideas are simple, and if you have a language in mind, building up a similar implementation could be pretty quick for a specific application—it's generality that takes a long time. This page: https://awkward-array.readthedocs.io/en/latest/ak.layout.Con... has links to all the node types that we've found useful and each link has a minimal, demo-style implementation—not the real implementation with all its baggage. They show what it looks like to start implementing such a thing.

I should also point out that Joosep Pata implemented an alternative Awkward Array: https://arxiv.org/abs/1906.06242 and https://github.com/hepaccelerate/hepaccelerate , not with a different interface language, but using CUDA for a real application faster than my group could get to it.

If the language you're talking about is Julia, I'm doubly interested. In high-energy physics, there's a group of us who think that Julia would be a great alternative to the mix of Python and C++ we currently have. Despite the comments above about Julia having this already, Awkward Array is a rather different thing from JIT-compiled data structures, and it would be as valuable in Julia as it is in Python, least of all because GPUs have to be addressed in a vectorized way.

> Despite the comments above about Julia having this already, Awkward Array is a rather different thing from JIT-compiled data structures, and it would be as valuable in Julia as it is in Python, least of all because GPUs have to be addressed in a vectorized way.

This is just done through multiple dispatch with type parameters in Julia. Vector{Vector{Float64}} is an "array of lists". Vector{Dict{Symbol,Float64}} is an "array of records". Vector{Union{Missing,Vector{Float64}} is an array of lists that can have missings, where the memory and generated code is optimized for this union. Vector{String} is a vector of strings. LazyArrays.jl provides lazy array implementations, which satisfy the AbstractArray interface so they can be used pretty much anywhere a standard Array can be. Primitives like flattening and grouping are just the built in `reduce(hcat,x)` etc. which have dispatches to optimize the operation to only allocate once. Etc.

As standard arrays, you can do all of the `sum`, `prod`, broadacsting, linear algebra, `\`, `transpose`, etc. on these objects because of dispatches on `::AbstractArray`. You can put CuArrays into these kinds of objects, which is one of the building blocks that Flux.jl uses to represent the neural network weights BTW. The entire DataFrames.jl is built on such generically typed arrays, so data filtering operations like Query.jl support these types along with their unions with missings. And these are the objects used across the whole language, so the JIT of course applies to these objects.

Then if you want heterogeneous arrays that don't mix implementation with representation, my favorite is https://github.com/jonniedie/ComponentArrays.jl which are arrays of arrays that are stored contiguously which thusly work well with linear algebra. https://jonniedie.github.io/ComponentArrays.jl/dev/examples/... that's an example of quickly implementing some neural networks and using it with AD, and it all works with broadcast etc. to build optimized GPU kernels.

I don't get what's missing? Seems like that's the list in the documentation.

I think the bigger piece is the reversal of structure. In Julia, additions to arrays are allowed by the user and not hardcoded into the Base library. "Allowing GPUs" or "complex numbers", or "dictionaries" as elements is not special cased. If you create a type for numbers with error bounds (https://github.com/JuliaPhysics/Measurements.jl), everything here still applies. Vector{Dict{Symbol,Measurements}} generates the efficient code. So by design it's impossible to enumerate all of the ways a Julia array can be used, and that's why it's built with multiple dispatch on parametric types. So while I listed out the examples going down the AckwardArray docs, note that by composability you also get all of the combinations of the above, along with combinations of any object a user builds in the language.

I commend the effort of hardcoding a lot of cases people might use in Python, but man this really makes clear how a general solution is usually the right idea after the 3rd specialization.

> I don't get what's missing? Seems like that's the list in the documentation.

The essential part of this is the fact that it is a columnar implementation. Although that's not interface-level, it's not a detail: the time complexities and degree of memory sharing between input and output are different for these algorithms. The video link above is fast-forwarded to the 3.5 minute section that describes this in detail. (I've talked about it in about a dozen talks, but that one is a pretty good presentation.)

Keep in mind, I think Julia is great! The composability and generality of the array abstract type make short work of a wide variety of non-columnar algorithms. I wish we were using it more in high energy physics. As I said above, I'm doubly interested if someone is thinking about doing columnar algorithms in Julia. (The multiple dispatch would make it a bit different: you probably wouldn't want a high-level wrapper like ak.Array, just a uniform interface on all the node types, since they don't have class methods to expose internals.)

The columnar approach has two main advantages, only one of which is to make up for Python's necessarily slow implementation. The other is that it improves opportunities for vectorization, the hardware equivalent of what we're doing for Python. Doing it in Julia would not be superfluous!