Hacker News new | ask | show | jobs
by ffriend 2868 days ago
I have quite limited experience with Cython and tried Numba just a couple of times, but I'm curious how much would it take to rewrite one of my Julia libraries to them.

The library is for reverse-mode automatic differentiation, but let's put AD itself aside and talk about code generation. As an input to code generator, I have a computational graph (or "tape") - a list of functions connecting input and intermediate variables. As an output I want a compiled function for CPU/GPU. (Note: Theano used to do exactly this, but that's a separate huge system not relevant no Numba or Cython).

In Julia I follow the following steps:

1. Convert all operations on the tape to expressions (~approx 1 line per operation type).

2. Eliminate common subexpressions.

3. Fuse broadcasting, e.g. rewrite:

    c .= a .* b
    e .= c .+ d
into

    e .= a .* b .+ d
Dot near operations means that they are applied elementwise without creating intermediate arrays. On CPU, Julia compiler / LLVM then generates code that reads and writes to memory exactly once (unlike e.g. what you would get with several separate operations on numpy arrays). On GPU, CUDAnative generates a single CUDA kernel which on my tests is ~1.5 times faster then several separate kernels. Note that `.=` also means that the result of operation is directly written to a (buffered) destination, so it no memory is allocated in the hot loop.

4. Rewrite everything I can into in-place operations. Notably, matrix multiplication `A * B` is replaced with BLAS/CUBLAS alternative.

5. Add to the expression function header, buffers and JIT-compile the result.

In Python, I imagine using `ast` module for code parsing and transformations like common expression elimination (how hard it would be?). Perhaps, Numba can be used to compile Python code to fast CPU and GPU code, but does it fit with AST? Also, do Numba or Cython do optimizations like broadcasting and kernel fusion? I'd love to see side-by-side comparison of capabilities in such a scenario!

1 comments

> In Julia I follow the following steps: … does it fit with AST?

I'm fairly certain the steps you've listed can be accomplished through AST manipulations, and would go something like

    def manip_ast(fn):
        import ast, inspect
        fn_ast = ast.parse(inspect.getsource(fn))
        new_fn_ast = …
        return compile(new_fn_ast, …)

    def rewrite(fn):
        fn = manip_ast(fn)
        fn = numba.jit(fn)

    @rewrite
    def func(*args):
        …
there's nothing in the language that prevents this from working with the autograd package, except no one's taken the time to implement it (https://github.com/HIPS/autograd/issues/47). That said, for many tasks with wide vector data, a DL framework is going to do ok, e.g. PyTorch.

> Julia compiler / LLVM then generates code that reads and writes to memory exactly once (unlike e.g. what you would get with several separate operations on numpy arrays)

Numba's gufuncs address exactly this + broadcasting over arbitrary input shapes. I've used this extensively. That said, I don't find fusing broadcasting is always a win, especially when arrays exceed cache size. Numba's CUDA support will also fuse jit functions into a single kernel, or generate device functions.

Sometimes you want manual control over kernel fusion, and I've found the Loopy (https://documen.tician.de/loopy/) to be fairly flexible in this regard, but it's a completely different approach compared to Numba/Julia.

I'd be interested in a side by side comparison as well, and I was thinking that the main difficulty would be that I couldn't write good Julia code, but maybe we can pair up, if that'd be interesting, to address several common topics that come up (fusion, broadcasting, generics but specialization, etc).

> there's nothing in the language that prevents this from working with the autograd package, except no one's taken the time to implement it (https://github.com/HIPS/autograd/issues/47).

I believe it's more complicated than most posters there realize, especially in the context of PyTorch (which uses a fork of autograd under the hood) with its dynamic graphs... Anyway, AD deserves its own discussion, that's I didn't want to concentrate on it.

> I'd be interested in a side by side comparison as well, and I was thinking that the main difficulty would be that I couldn't write good Julia code, but maybe we can pair up, if that'd be interesting, to address several common topics that come up (fusion, broadcasting, generics but specialization, etc).

Sounds good! Do you have a task at hand that would involve all the topics and could be implemented in limited time? Maybe some kind of Monte Carlo simulation or Gibbs sampling to get started?