| HN Mirror


	by dkislyuk 1164 days ago
	As another commenter said, viewing a neural network as a computation graph is how all automatic differentiation engines work (particularly reverse-mode, where one needs to traverse all the previous computations to correctly apply the gradient), and there were several libraries predating TensorFlow that followed this idea. The initial contribution of TensorFlow and PyTorch was more about making the developer interface much cleaner and enabling training on a wider range of hardware by developing a bunch of useful kernels as part of the library.

1 comments

dekhn 1163 days ago

I've always thought of most function evaluations as computation graphs. At least, when I used MathLink to connect Mathematica to Python, it basically gave you a protocol for breaking any Mathematica function into its recursively expanded definition. Konrad Hinsen suggested using Python's built-in operator overloading, so if you said "1 + Symbol('x')" it would get converted to "Plus(1, Symbol('x'))" and then sent over MathLink to Mathematica, which would evaluate Plus(1, x) and return an expression which I'd then convert back to a Python object representation.

I don't think we talked about doing any sort of automatic differentiation (in my day we figured out our own derivatives!), but after I made a simple eigendecomposition example for a matrix of floats, the Mathematica folks contributed an example that did the eigendecomposition of a matrix with symbols (i.e., some of the terms weren't 5.7 but "1-x"). It still kind of blows my mind how much Mathematica can do with computation graphs.
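
A minimal sketch of the operator-overloading trick described above, in plain Python. The class names `Expr`, `Const`, and `Plus`, and the helpers `wrap` and `to_mathematica`, are made up for illustration; this is not the original MathLink binding and it makes no MathLink calls. It only shows how "1 + Symbol('x')" can turn into a Plus node and then into Mathematica-style text.

```python
class Expr:
    """Base class: '+' builds a tree node instead of computing a number."""

    def __add__(self, other):
        return Plus(self, wrap(other))

    def __radd__(self, other):
        # Called for `1 + Symbol('x')`, because int does not know how to add an Expr.
        return Plus(wrap(other), self)


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return repr(self.value)


class Symbol(Expr):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"Symbol({self.name!r})"


class Plus(Expr):
    def __init__(self, *args):
        self.args = args

    def __repr__(self):
        return "Plus(" + ", ".join(repr(a) for a in self.args) + ")"


def wrap(value):
    """Turn plain Python numbers into Const nodes; leave Expr nodes alone."""
    return value if isinstance(value, Expr) else Const(value)


def to_mathematica(expr):
    """Recursively render the tree as Mathematica-style text (hypothetical helper)."""
    if isinstance(expr, Const):
        return str(expr.value)
    if isinstance(expr, Symbol):
        return expr.name
    return "Plus[" + ", ".join(to_mathematica(a) for a in expr.args) + "]"


tree = 1 + Symbol("x")
print(repr(tree))            # Plus(1, Symbol('x'))
print(to_mathematica(tree))  # Plus[1, x]
```

In a real bridge the last step would send the expression over the link rather than build a string, and the reply would be parsed back into the same kind of tree.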

IIUC this is the basis of LISP as well.

dkislyuk 1163 days ago

The one distinction I would add with neural networks is that it's not just a recursive tree traversal that one would get when evaluating an arithmetic statement, but an actual graph: a computation node can have gradients from multiple sources (e.g. if a skip connection is added), so each node needs to keep accumulated state around that can be updated by arbitrary callers.

Of course, optimized autograd / autodiff is more parallelized than node-based message passing, but it's a useful model to start with.
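
To make the accumulation point concrete, here is a toy reverse-mode sketch in plain Python. `Node`, `add`, `mul`, and `backward` are illustrative names, not any library's API, and real autograd engines are organized quite differently. The input `x` feeds both the multiply and, through a skip edge, the final add, so its gradient has to be summed from two callers.

```python
class Node:
    """A scalar value plus the state needed for reverse-mode differentiation."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, local_derivative)
        self.grad = 0.0         # accumulated state, updated by every consumer


def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))


def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))


def backward(out):
    # Post-order DFS gives a topological order (inputs before outputs).
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(out)
    out.grad = 1.0
    # Walk outputs first, so a node's grad is complete before it is pushed on.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad  # accumulate, do not overwrite


x = Node(3.0)
h = mul(x, x)   # h = x^2
y = add(h, x)   # y = x^2 + x, with x arriving again over a skip edge
backward(y)
print(y.value, x.grad)  # 12.0 7.0, since dy/dx = 2x + 1 = 7 at x = 3
```

Replacing `+=` with `=` in `backward` would silently drop one of the contributions, which is the failure mode the accumulated state is there to prevent.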

dekhn 1163 days ago

I'd have to think about this for a while, but I'm not sure I see that as a distinction. If you have a skip connection, that's just another node in the graph that you have to execute, in topological order, before your dependent node, and then you pass the data over the edge when the child node is ready to consume it.
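
As a rough sketch of that view, the forward pass below runs each node only after all of its inputs, using the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The `graph` dictionary layout and the `run` helper are assumptions made up for this example. The skip connection is just one more edge from "x" into "y".

```python
from graphlib import TopologicalSorter

# Each node maps to (function, names of the nodes it reads from).
graph = {
    "x": (lambda: 3.0, ()),
    "h": (lambda x: x * x, ("x",)),
    "y": (lambda h, x: h + x, ("h", "x")),  # second input is the skip edge
}


def run(graph):
    """Evaluate every node once, after all of its inputs are available."""
    deps = {name: set(inputs) for name, (_, inputs) in graph.items()}
    values = {}
    for name in TopologicalSorter(deps).static_order():
        fn, inputs = graph[name]
        values[name] = fn(*(values[i] for i in inputs))
    return values


values = run(graph)
print(values["y"])  # 12.0
```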

What you're describing with node-based message passing sounds much more like a petri net, or other agent-based discrete event modelling system. Which is another powerful mental paradigm, but challenging to reason about.

pmoriarty 1163 days ago

"What you're describing with node-based message passing sounds much more like a petri net, or other agent-based discrete event modelling system."

It sounds like Smalltalk to me.

garganzol 1163 days ago

In terms of abstract algebra, there are no distinctions. What you call gradients are actually data flows. "From multiple sources" means that a function can take multiple parameters (= inbound gradients, inflows).
