Hacker News new | ask | show | jobs
by abeppu 2478 days ago
I'm not a Swift programmer, so perhaps my confusion is just a symptom of broader ignorance, but I find two things unclear here: - What does 'first-class' mean, really? - Which of these benefits are unique to integrating notions of derivatives into the language, and which could be enjoyed well-written libraries?

The mega-proposal links out to a separate doc on embedded DSLs, with broad statements about what's "typical" or "often" true of existing DSLs -- but which of those issues are insurmountable? That section mentions Swift's limited metaprogramming facilities. Why choose to carve out this added support for a single family of algorithms rather than add in some more general metaprogramming abilities that enable better EDSLs?

3 comments

From the proposal:

> While the differentiation APIs are flexible and fully dynamic, differentiation is based on a program transformation that happens at compile-time. This enables many static analyses that not only help produce more efficient programs, but also detect common numerical programming mistakes such as non-differentiable functions and zero derivatives.

> With a first-class differentiable programming language, some of the most common runtime errors in machine learning become directly debuggable without library boundaries. Simply step through backpropagation using LLDB to debug derivatives.

I still don't get it. Why can't I use a debugger to step through derivatives when autodiff is implemented as a library?
Reverse mode autodiff is best implemented with a non-local transformation of the program: First you run the original operations forward; then you run corresponding operations in reverse order.

You can do this with a library by implementing "number" types that, as a side effect of arithmetic operations, record those operations onto a "tape", so that corresponding (different) operations can be played back later in reverse order.

Unfortunately, those side effects have a cost during the forward pass. And during the backwards pass you are essentially running a little interpreter to execute your instructions.

Putting this stuff into the compiler lets the forward pass just be normal, native code on doubles/floats, and the same for the reverse-pass code. Moreover, all of this can now be worked on by the optimizing compiler.

This is especially important for embedded applications where you can't afford all this interpretation.

I guess the original TensorFlow "computation graph" approach is a little different to the "tape" I described because the graph is built explicitly rather than through side effects, but that just makes even more clear that you're really assembling an AST for some /other/ language and (when not using XLA) running an interpreter.

In principle some suitably powerful macro language with full access to the AST might be able to do good compile-time reverse-mode AD, but I am not aware of any language with powerful enough macros. Again, this is because the transformation is not just a local pattern replacement; it involves flipping the code upside down.

What's funny is that physicists have been able to do this kind of program transformation in a slightly clunky(?) way -- FORTRAN in, FORTRAN out -- since the 70s, supposedly.

It all makes me think that our ideas about what a "language" is, what a "compiler" is, what a "library" is, are all stuck in convention and prematurely ossified. The first compilers, which we celebrate, looked to their users like codegen (which we detest, I think)! I would love for the boundary between "language" and "compiler" to be broken down more so that AD could be more easily done "within the language"; maybe one day that will happen in Julia, or Nim, or Jai, or Terra.

But for now I think I agree with the designers that the best hope for good results, and really the most straightforward way, is to just do it in the compiler.

Just complementing the other reply, there is a small article about how you can implement a source to source AD in Julia [1]. Basically Julia has a special type of macro called generated function [2] which instead of executing during the AST lowering phase (when the compiler still didn't evaluate the symbols) it executes during the final step of compilation (when type inference already ran and you have all the exact types), and in that function you can return either the AST or Julia's SSA IR directly (which is good for AD since it closely resembles the execution graph since it avoids mutability). And you can also inspect the IR of any function call and manipulate it within the language [3], so you can recursively create the tape entirely at compile time.

[1] http://blog.rogerluo.me/2019/07/27/yassad/

[2] https://docs.julialang.org/en/v1/manual/metaprogramming/inde...

[3] https://mikeinnes.github.io/IRTools.jl/latest/#Evaluating-IR...

It's already bring done in Julia. See zygote.jl.

Works with typed IR equivalent to a core compiler pass but from a third party package with regular Julia staged programming.

I think it's mostly due to a couple of reasons:

1. The debugging experience is probably better. Computing a derivative can be complex— you might be seeing a high value where you were expecting a low one, and you want to "step through the equation and how it changes. Having to do that when "the equation" is a bunch of obscure data structures, through the internal representation of functions in 3-rd party library could get very complex very quickly.

2. You might not catch non-differentiable functions and zero derivatives. Moreover, given just the nature of math, you could see millions of inputs that wouldn't trigger an exception, ship your model, and then one day the one that yields a 0 shows up, your model crashes, and you don't know why. Having the compiler essentially act as proof that something will _never_ be zero is awesome for correctness and reliability.

If you think about it, derivatives aren't really an operation "through" the equation, but "on" the equation. You're writing some function, but instead of passing a value through it, you're changing the function itself.

So the functions are the values, and need to be changed, morphed, combined, split, etc. If a library wanted to do this, my guess is either:

a) devx would suffer since wouldn't be writing functions normally, but rather defining them as objects with verbose constructors, etc.

b) for the sake of devx, the library would have to do some hacky introspection and jump hoops to get to work on the functions themselves, not with them, at the cost of performance or debuggability.

There's already software we use all the time that takes these functions, breaks them apart, understands them and does things with them though— the compiler. It transforms functions to machine code. Let's have it add a step in the middle there, and if a function is marked as being derived, let's have the compiler take it, transform it to its derivative, and then to machine code ¯\_(ツ)_/¯.

>Why choose to carve out this added support for a single family

> of algorithms rather than add in some more general

> metaprogramming abilities that enable better EDSLs?

It is The Swift Way

The answer is always to add something to the language and the compiler, the question seems to be largely irrelevant.

Of course, that's a consequence of not heeding Alan Kay's advice from 1998, that when you design a new language, you need to focus on the metasystem first, the rest will follow.

http://wiki.c2.com/?AlanKayOnMessaging

When you don't do that, every new requirement comes as a surprise that you need to hack into the compiler somehow. Objective-C compatibility: hack the language; Python integration: hack the language; SwiftUI: hack the language; Differentiable Programming: hack the language.

And of course, each additional hack makes the already not-to-elegant base language even more difficult for implementing stuff without hacking the language.

What kind of a metasystem would we need in order to not have to hack the language for all of these features? That, detective, is the right question!

https://www.youtube.com/watch?v=ZKxr0wyIic4

My understanding of automatic differentiation (AD) is that it's only really possible at the compiler level, since you need the ability to interpret and manipulate function definitions themselves. Certainly, no library would be able to offer the same level of guarantees telling you if you've done it wrong, nor the same opportunities for optimisation.
It's certainly not the case that autodiff is only possible at the compiler level. I've implemented forward mode (via dual numbers) and reverse mode (via tapes / wengert nodes) autodiff in libraries before.
notice the qualifier "really". obviously you can implement autodiff kind of outside the complier since pytorch and tensor flow exist. but those implementations constrain you to a select few compositions (please no comments on Turing completeness with just loops and conditionals). so for example if statements in pytorch are not differentiable (they might have piece wise continuous derivates) because pytorch doesn't actually trace the ast. I'm not a languages expert but outside of implementing in the compiler I imagine you'd need a homoiconic language to implement as a library.
If statements aren't really meaningfully differentiable, regardless of how you do it.

Take

    if x == 59:
        return 1000
    else if x > 59:
        return -x
    else:
        return x
How do you optimize this to maximize x, regardless of what language you're in?

It's true that you can get a derivative, but the derivative is essentially meaningless.

  if x <= 0:
      return 0
  else:
      return x
Is in the core of most neural networks today (relu activation), so it is definitely useful.
I don't understand? It's a piecewise differentiable function and you maximize how you maximize any such function: do gradient ascent where it's differentiable and compare against values at the boundary points (ie start, end of interval and points at which there's a removable discontinuity).
I don't disagree that a gradient exists. As the other commenter commented, the gradient/subgradient will usually exist.

What I'm arguing is that this gradient will not allow you to optimize anything of interest for the vast majority of programs.

Yeah sometimes this is called a "subgradient".
There is a derivative, but since the function is non continuous, the derivative will likewise be messed up (but just around x=59). You really don’t want to be climbing a gradient around non continuous functions!
The function has a derivative by some notions of derivative.

But the function's derivative can't be derived by an application of the chain rule and the know derivatives of primitive functions, which is what Algorithmic/automatic differentiation ultimately does (though it does this at run time, not compile time, since ordinary, symbolic differentiation explodes in memory for a complicated functions).

Also:

The continuous function

    int f(int x) {
        if(x > 2) 
            return x + x;
        else 
            return x*2;
    }
Is not automatically or symbolically differentiable when represented that way.
Yeah, Automatic differentiation is essentially only usable on functions specified the way mathematicians specify functions; the compositions of a series of primitives (plus operators like the integral and the differential itself, as well inverse relations).

AD is not usable on loops, conditionals or recursive calls.

So basically, whatever way you specify your functions, you are effectively going to have DSL (within a general purpose language or otherwise) since not all the functions you form are going to be differentiable by AD (and there's some confusion between differentiable in the abstract and differentiable by the methods of AD).

Edit: actually, it's pretty easy to extend AD to functions defined piecewise on intervals to be in the class of function amenable of AD. What's hard/impossible is extending functions defined by loops or recursion.

AD on loops/recursion works fine when you implement it using dual numbers (see for https://news.ycombinator.com/item?id=20893414 an example). If you used this to implement x^n using a for loop, you would get the correct derivative (n * x ^ (n - 1)).
AD is totally possible at the library level (I did it in C# 10 years ago using Conal Elliott’s ideas), however if you want to autodiff more than an expression-only language, compiler support is useful.
Maybe this is the piece I don't understand. What do you mean by "more than an expression-only language"? What is a non-expression in a language? And what would it mean to have a derivative for your non-expression?
You can easily use expressions to create trees rather than values in a library. Eg a + b need not compute a value, through a bit of operating overloading it can compute the tree plus(tree-a, tree-b).

This “trick” does not extend to statements, however. You can’t override if or semicolon in most languages. You can encode statements as expressions, but then you have to worry about things like variable bindings on your own.

Building expression trees is not the only way to do automatic differentiation. A simpler way is to just carry the value and it's derivative as a pair:

    #include <math.h>
    #include <stdio.h>

    struct autodiff { double value, deriv; };

    autodiff just(double x) { return { x, 1 }; }

    autodiff operator +(autodiff a, autodiff b) {
        return { a.value + b.value, a.deriv + b.deriv };
    }

    autodiff operator *(autodiff a, autodiff b) {
        return { a.value * b.value, a.deriv*b.value + a.value*b.deriv };
    }

    autodiff sin(autodiff a) {
        return { sin(a.value), cos(a.value)*a.deriv };
    }

    int main() {
        autodiff x = just(.1);
        for (int ii = 0; ii<4; ii++) {
            x = x*x + sin(x);
        }
        printf("value: %lf, deriv: %lf\n", x.value, x.deriv);
        return 0;
    } 
There is no need to differentiate the for loop or the semicolons. This way is not doing symbolic differentiation. It's implementing the differentiation rules in parallel to calculating the values at run time.

This generalizes to partial derivatives for multivariate functions too:

     template<int dims>
     struct autograd { double value, grad[dims}; };
This only works if you don't have x as a value determining the length of the for loop. If you try to take the derivative of x^x using a loop multiplying x times, you'll get a wrong answer (admittedly, expecting a right answer would be foolish).

But this goes to question at hand, whether AD should be a library/DSL or whether it should be a primitive of a general purpose language. The thing is a general purpose language lets you present sorts of things as functions that can't be even dual numbers won't take the derivative of correctly - a dual scheme can't distinguish variable order loops from constant order loops.