Hacker News
Deep physical neural networks trained with backpropagation (nature.com)
108 points by groar 1604 days ago
8 comments

Let me see if I can describe the laser part of the paper correctly. They made a laser pulse consisting of a bunch of different frequencies mixed together. The intensity of each frequency represents a controllable parameter of the system. The pulse was sent through a crystal that performs a complex transformation that mixes all the frequencies together in a nonlinear and noisy way. Then they measure the frequency spectrum of the output. By itself, this system performs computations of a sort, but they are not useful.

To make the computations useful, first they trained a conventional digital neural network to predict the outputs given the input controllable parameters. Then they arbitrarily assigned some of the controllable parameters to be the inputs of the neural network and others were arbitrarily assigned to be the trainable weights. Then they used the crystal to run forward passes on the training data. After each forward pass, they used the trained regular neural network to do the reverse pass and estimate the gradients of the outputs with respect to the weights. With the gradients they update the weights just like a regular neural net.

Although the gradients computed by the neural nets are not a perfect match to the real gradients of the physical system (which are unknown), they don't need to be perfect. Any drift is corrected because the forward pass is always run by the real physical system, and stochastic gradient descent is naturally pretty tolerant of noise and bias.
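The loop described above can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions: `physical_forward` is a made-up noisy function standing in for a real measurement, `surrogate_grad_w` is the derivative of a hand-written and deliberately mismatched digital model (in the paper the digital model is a trained neural net), and all names and constants are hypothetical. The point it tries to show is that the weight still ends up near the right value, because the error is always measured on the "physical" output.

```python
import math
import random


def physical_forward(x, w, rng, noise_std=0.02):
    # Stand-in for the real, noisy physical system (hypothetical toy function).
    # In a real setup this would be a measurement, not a formula.
    return math.tanh(1.2 * w * x + 0.1) + rng.gauss(0.0, noise_std)


def surrogate_grad_w(x, w):
    # dy/dw of an imperfect digital model y = tanh(w * x).
    # It deliberately has the wrong gain and no offset (assumption for the demo).
    return x * (1.0 - math.tanh(w * x) ** 2)


def train(steps=2000, lr=0.1, w_true=0.8, seed=0):
    rng = random.Random(seed)
    w = 0.0  # trainable "weight" parameter of the physical system
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)                  # input parameter
        target = math.tanh(1.2 * w_true * x + 0.1)  # noiseless reference output
        y = physical_forward(x, w, rng)             # forward pass on the "physical" system
        err = y - target                            # dLoss/dy for loss = 0.5 * err**2
        w -= lr * err * surrogate_grad_w(x, w)      # backward pass through the digital surrogate
    return w


if __name__ == "__main__":
    print(train())  # ends up close to w_true despite the mismatched gradient
```

The surrogate gradient has the wrong magnitude but the right sign, which is enough for SGD here; a surrogate whose gradient pointed the wrong way would not be rescued by this scheme.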

Since they're just using neural nets to estimate the behavior of the physical system rather than modeling it with physics, they can use literally any physical system and the behavior of the system does not have to be known. The only requirement of the system is that it does a complex nonlinear transformation on a bunch of controllable parameters to produce a bunch of outputs. They also demonstrate using vibrations of a metal plate.

Seems like this method may not lead to huge training speedups since regular neural nets are still involved. But after training, the physical system is all you need to run inference, and that part can be super efficient.

> They made a laser pulse consisting of a bunch of different frequencies mixed together

This is how ultra short pulses are made when the waves cancel out appropriately. Now I'm not sure if they are training a network to calculate the filter efficiently for even shorter pulses, or if the purpose is supposed to be an optical neural network, or why not both.

> regular neural net

You used these words several times, and, considering the title "physical neural networks", I kept wondering whether you meant regular as in real, or as in artificial. If it's artificial, I'm not sure which one of them is "regular" -- LSTM, full, transformers?

I thought it was pretty clear in context that "regular neural net" was a short form of "conventional digital neural network" which I did spell out explicitly the first time.

Any type of artificial neural net could be used. LSTM, transformer, convolutional, fully connected, whatever you want.

This uses a physical system with controllable parameters to compute a forward pass and

> using a differentiable digital model, the gradient of the loss is estimated with respect to the controllable parameters.

So e.g. they have a tunable laser that shifts the spectrum of an encoded input based on a set of parameters, and then they update the parameters based on a gradient computed from a digital simulation of the laser (physics aware model).

When I read the headline I imagined they had implemented back propagation in a physical system

Right,

> Here we introduce a hybrid in situ–in silico algorithm, called physics-aware training, that applies backpropagation to train controllable physical systems. Just as deep learning realizes computations with deep neural networks made from layers of mathematical functions, our approach allows us to train deep physical neural networks made from layers of controllable physical systems, even when the physical layers lack any mathematical isomorphism to conventional artificial neural network layers.

To my naive understanding, and please someone correct me if I'm wrong, the point is that they are not controlling the parameters that compute the NN forward pass directly (hence "no mathematical isomorphism to conventional NNs"), but "hyper-parameters" that guide the physical system to do so. For example, rotation angles of mirrors, or distance between filters, instead of intensity values of light. This leads to the non-linear transformations happening in situ, while simpler transformations in the backprop are still computed in-silico.

> When I read the headline I imagined they had implemented back propagation in a physical system

They touch on that by observing you could train a second physical neural network to compute the gradients for the first. So it could all be physical.

> Improvements to PAT could extend the utility of PNNs. For example, PAT’s backward pass could be replaced by a neural network that directly estimates parameter updates for the physical system. Implementing this ‘teacher’ neural network with a PNN would allow subsequent training to be performed without digital assistance.

So you need to use in silico training at first, but can get rid of it in deployment.

If you can train a non-linear physical system with this method, in principle, you could also train real brains. You can't update the parameters of the brain, but you can inject signal. Assuming real brains to be black box functions for which you could learn a noisy estimator of gradients, it could be used for neural implants that supplement lost brain functionality, or a Matrix-like skill loading system.
You need a differentiable forward model of the process, which is not available for the human brain.
> Deep-learning models have become pervasive tools in science and engineering. However, their energy requirements now increasingly limit their scalability.[1]

They make this claim first, and cite one source. I haven't heard of this as an issue before. Is there anywhere else I could read more on this?

[1]https://arxiv.org/abs/2104.10350

I don't have a specific reference but I'd say it's a common knowledge assertion based on the growth in the number of parameters in models over the last 10 years. There are lots of places where you can see how the number of parameters, especially in language and vision models, has increased, and find the amount of training time quoted. Normally it's framed in terms of compute instead of energy.
Got me wondering how this compares with neural efficiency, realizing ofc that there's nothing really apples-to-apples here.

Training one of these big models takes 100 kWh for 1e19 flops, so that's 100k Wh, 360M Ws, or 360 MJ, or 3.6e8 J. 3.6e8 J / 1e19 flops ≈ 3.6e-11 J/flop, so on the order of 1e-11 J/flop.

Neurons take 1e-8J/spike.[1]

Math check appreciated :)

Does seem plausible to think of a single neuron spike (hodgkin-huxley cable model) being modeled with ~1k flops. Though I'm firmly of the opinion that nobody really knows how the brain works.. the neural spike activity could be pure epiphenomenon.. who knows!
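For anyone who wants to redo the arithmetic above, here is a small sketch. The inputs are just the figures quoted in this comment (100 kWh, 1e19 flops, ~1e-8 J per spike, ~1k flops per simulated spike); none of them are checked against other sources, and the function and variable names are made up for the example.

```python
KWH_TO_J = 3.6e6  # 1 kWh = 1000 W * 3600 s = 3.6e6 J


def joules_per_flop(energy_kwh, flops):
    # Convert a training energy budget into energy per floating-point operation.
    return energy_kwh * KWH_TO_J / flops


# Figures taken from the comment, treated as rough assumptions.
energy_per_flop = joules_per_flop(100.0, 1e19)           # J per flop
flops_per_spike = 1e3                                    # guess: ~1k flops to model one spike
digital_j_per_spike = energy_per_flop * flops_per_spike  # J per simulated spike
neuron_j_per_spike = 1e-8                                # quoted order of magnitude for a real spike

if __name__ == "__main__":
    print(f"digital: {energy_per_flop:.2e} J/flop")
    print(f"digital: {digital_j_per_spike:.2e} J per simulated spike")
    print(f"ratio digital/neuron: {digital_j_per_spike / neuron_j_per_spike:.1f}")
```

With these inputs the digital side comes out at about 3.6e-11 J/flop, so a 1k-flop spike model would cost a few times what the quoted figure gives for a biological spike. That is the same order of magnitude, which is about all this kind of comparison can say.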

[1] “Finally, the energy supply to a neuron by ATP is 8.31 × 10^-9 J. Meanwhile, integrating the total power with respect to time we will get the consumed electric power, which is 8.75 × 10^-9 J. This is more energy than the ATP supplied. The energy efficiency is 105.3%. This is an anomaly…” - 2017 Feb 16 Wang, Xu, Institute for Cognitive Neurodynamics, East China University of Science and Technology https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5337805/

The neural spike is definitely not an epiphenomenon. The action potential / neurotransmitter release / receptor activation process is understood and can be manipulated with electric probes.
Sorry, didn't mean it quite like that. It's clear neural spike activity exists as a physical process. I'm suggesting that spiking activity may be an epiphenomenon of more primary brain functions, i.e. information processing, consciousness, etc.

As far as I know, we're closest to showing information processing in the visual cortex (which is highly linear) and we're still a long way from knowing how it works at a neural level. But maybe someone here can update on this?

But much of the cortex is highly recurrent (non-linear) and the idea that it's doing something like sending bits between synapses, encoded in spike timing or something.. well, I think that's highly speculative and has plenty of problems. But even if so, that's just "information processing".

I'm personally a fan of electromagnetic theories of consciousness[], where the synaptic activity could be an epiphenomenon of supporting a standing EM field.

[]https://en.wikipedia.org/wiki/Electromagnetic_theories_of_co...

>But much of the cortex is highly recurrent (non-linear) and the idea that it's doing something like sending bits between synapses, encoded in spike timing or something.. well, I think that's highly speculative and has plenty of problems.

I am not sure how much is known about information processing, but it's clear that motor impulses and sensory information are encoded in the spikes. Higher spike frequency = stronger signal. Synapses are how signals are passed from neuron to neuron.

Ok, that's fair. That's i/o and yes, that's known to be highly linear by the time it gets to the efferent nerves, and makes sense it is before that as well. I think that still leaves the vast majority of the cortex using undefined mechanisms.
For those who are curious, consciousness is an epiphenomenon (an emergent property of brains), while neural spikes are just physics.

See more: https://en.wikipedia.org/wiki/Neural_correlates_of_conscious...

I think it would be better to say something like, "paranoia is an epiphenomenon," when nobody knows what consciousness is.
Training a state of the art model typically involves keeping a very large computer around at near 100% power load. Roughly about 10MW.

The actual limits on DL models (and any simulation or optimization) are: power density and the speed of light, plus the maximum amount of power you can deliver to the area. The speed of light limits how long your cables can be while still doing collective reductions, and the power density limits how much compute power you can fit per unit volume. One could imagine a fully liquid cooled supercomputer at 100MW (located near a very reliable and large power source) with optical fiber interconnect, this would completely change the state of the art in large models overnight.

All true.

I cannot cite a source here, but it is generally believed that the effective GPU utilization in AI training clusters which are "100% utilized" is actually quite poor - 23%-26% - due to data movement, non-essential serial execution, and scheduling issues. So at least for now there is low-hanging fruit to improve the performance of the capital expenses.

Long term, though, DL clusters are basically CAPEX and energy limited.

IMHO, for now, return on the investment is not really a limiting factor, but it will become one once the shine is off the field.

I think they may have provided fewer citations because it felt like a less controversial claim. I think the choice of words was just a bit awkward. To me, it seems like they were asserting that deep learning requires lots of computational resources which is common knowledge. In general, this translates to higher energy requirements.
It's more of an inference and practical thing. If you want to equip something with limited energy (e.g. a drone using a small battery) with the ability to use a neural network for inference, their system could use much less energy than the typical computational setup.
If they can scale it up to GPT-3 like sizes, it would be amazing. Foundation models like GPT-3 will be the operating system of tomorrow. But now they are too expensive to run.

They can be trained once and then frozen and you can develop new skills by learning control codes (prompts), or adding a retrieval subsystem (search engine in the loop).

If you shrink this foundation model to a single chip, something small and energy efficient, then you could have all sorts of smart AI on edge devices.

Physical/analog computers always suffer from noise limiting their usefulness. So I think it would be natural to apply this to a network architecture that includes noise as an integral part, such as GANs or VAEs.
“Noise” is integral to all ML systems. You can view this through many lenses, but generalization can be thought of as decoding a noisy signal.
This is true, though what I was getting at was methods that make use of a noise source separate from the input.
How is this different from the good old “chip in the loop” training method?
The paper is interesting
Mystic crystals - The age of Aquarius