by jpivarski 1646 days ago

Okay, that's a lot of questions.

There are slice examples here, for all the different ways these arrays can be sliced: https://awkward-array.readthedocs.io/en/latest/_auto/ak.Arra...

That includes what I think you mean by "masking." (You mean keeping only the array elements that are `true` in a boolean array? There's another function we call ak.mask that keeps all array elements, but replaces the ones that line up with `false` with a missing value: https://awkward-array.readthedocs.io/en/latest/_auto/ak.mask...)

If you have irregular-length lists and you want to make them all the same length, that's padding, ak.pad_none: https://awkward-array.readthedocs.io/en/latest/_auto/ak.pad_... What's "un-padding"?

Mapping is implicit, as it is in NumPy. If you use an Awkward Array in any NumPy ufunc, including binary operators like `+`, `-`, `*`, `==`, `<`, `&`, etc., then all the arrays will be broadcast against each other and computed element-by-element. This is true whether the data structure is flat or a deep tree. ("Array-oriented" is a different style from "functional.")

There hasn't been much call for grouping yet—Awkward Array is more like a NumPy extension than a Pandas extension—but there is a way to do it by combining a few functions, which is described in the ak.run_lengths documentation: https://awkward-array.readthedocs.io/en/latest/_auto/ak.run_...

For wrapping a function with an ABI interface, I think the easiest way to do that would be to use Numba and ctypes.

    import ctypes
    import awkward as ak
    import numba as nb

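    # Load the C math library (this path is specific to x86-64 Debian/Ubuntu systems).
    libm = ctypes.cdll.LoadLibrary("/lib/x86_64-linux-gnu/libm.so.6")
    libm_exp = libm.exp
    # Declare the C signature, double exp(double), so ctypes converts arguments correctly.
    libm_exp.argtypes = (ctypes.c_double,)
    libm_exp.restype = ctypes.c_double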

    libm_exp(0)    # 1.0
    libm_exp(1)    # 2.718281828459045
    libm_exp(10)   # 22026.465794806718

    @nb.vectorize([nb.float64(nb.float64)])
    def ufunc_exp(x):
        return libm_exp(x)

    array = ak.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])

    ufunc_exp(array)   # calls libm_exp on every value in array, returns an array of the same structure

Numba's @vectorize decorator (https://numba.pydata.org/numba-doc/latest/user/vectorize.htm...) makes a ufunc, and Awkward Array knows how to implicitly map ufuncs. (It is necessary to specify the signature in the @vectorize argument; otherwise, it won't be a true ufunc and Awkward won't recognize it.)

When Numba's JIT encounters a ctypes function, it goes to the ABI source and inserts a function pointer in the LLVM IR that it's generating. Unfortunately, that means that there is function-pointer indirection on each call, and whether that matters depends on how long-running the function is. If you mean that your assembly function is 0.1 ns per call or something, then yes, that function-pointer indirection is going to be the bottleneck. If you mean that your assembly function is 1 μs per call and that's fast, given what it does, then I think it would be alright.

If you need to remove the function-pointer indirection and still run on Awkward Arrays, there are other things we can do, but they're more involved. Ping me in a GitHub Issue or Discussion on https://github.com/scikit-hep/awkward-1.0

pizza 1646 days ago

Amazing.. 100 gratitudes for you from me for taking the time out to explain all of that. Very impressed by this and all the work that’s been done.

Oh, and for un-padding, I meant: how do I do the inverse of `fill_none . pad_none`?

Also saw there was some stuff about algebraic types (eg semigroup reductions) - is that kind of algorithm-level type annotation a direction you all are interested in exploring further?

jpivarski 1646 days ago

Un-padding: something like string-trimming (e.g. `str.rstrip`), but for missing values at the ends of lists... There isn't a function for that.

If you happen to know that the only uses of missing values are at the ends of lists, `ak.is_none` and `ak.sum` (with the appropriate `axis`) can count them, and you could perhaps construct a slice from that (negative to count from the end, and therefore slice off the missing values only). I'd have to think about it, but that would be the beginning of a columnar implementation of "unpad_none".

As for the algebraic types, I was using the terminology to explain what the reducers do. Some operations, like sum and product, have identities, and some don't, like argmin.

As for type annotations, I don't know what you mean. We're not using Python type annotations, but they'd be too coarse to describe what these operations do. Awkward-specialized type annotations might be overkill. For Dask, which needs to be able to predict types, we're passing tracer objects through the codebase to observe the types change without actually computing values, so it's type propagation by execution.

pizza 1645 days ago

Ah, interesting. With the algebraic stuff, I meant that if you have a semigroup or a monoid homomorphism, it translates nicely into a parallel, distributed computation problem. Hence the semigroup flag works nicely with the reduction ops.
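To make that concrete, here is a plain-Python sketch (not Awkward's implementation; `chunked_reduce` is a made-up name). Because the operation is associative and has an identity, each chunk can be reduced independently, possibly on different workers, and the partial results combined afterwards:

```python
from functools import reduce
import operator

def chunked_reduce(values, op, identity, chunk_size):
    """Reduce fixed-size chunks independently, then combine the partial results."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    # Each partial reduction is independent, so it could run in parallel.
    partials = [reduce(op, chunk, identity) for chunk in chunks]
    return reduce(op, partials, identity)

values = list(range(1, 11))
total = chunked_reduce(values, operator.add, 0, 3)
product = chunked_reduce(values, operator.mul, 1, 4)
print(total, product)   # 55 3628800
```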

So I was wondering how I could exploit Awkward’s typing system to use/implement some goodies from Haskell a la https://wiki.haskell.org/Typeclassopedia

Like, for instance, what if I could make an array of heterogeneous ufuncs, and apply that to a similarly shaped array (like an Applicative)? For example, if I wanted to implement graph rewriting by applying a rules ufunc array to an adjacency array, or even, to get very meta, apply a rules function array to another rules function array.

Or if I wanted to compute eg the fixed point of a series of those applications, etc.

Or maybe if I wanted to use Arrow types to abstractly represent computations within each cell, do some fancy stuff in each cell, and perform some rudimentary ‘compiler optimization’ by inspecting which cells would end up doing unnecessary work (in the context of whatever problem I am doing; e.g. suppose I only permitted 3 chained ufunc calls per cell or something weird like that), that would be really cool too.

Or eg if for some unknown reason I wanted each cell to fire off 2 concurrent ufuncs within each cell, and I only was interested in the result that ‘won’ the data race for each cell, I could use eg an Alternative in the style of the Concurrently library.

Or if I wanted eg each cell to be like a MonadPlus; do some work in the cell but also provide builtin “recovery” capabilities per cell if the cell evaluated to empty/missing/None

Ah now another interesting possibility could be a matrix of lambda calculus statements..!

Musings and sketches.. :)

Very very cool work indeed!
