Hacker News new | ask | show | jobs
Thermodynamic Linear Algebra (arxiv.org)
234 points by aifer4 1039 days ago
19 comments

This looks quite amazing at first glance but looking a bit deeper some elephant-in-the-room questions pop up:

1. Simple Gradient Descent is already linear in time w.r.t. # of parameters (but it is an iterative method). It seems to be missing from Table I too. This method requires waiting for equilibration, so could it be seen as another form of iterative method? If so, wouldn't the proper comparison be against known O(n) first-order iterative methods like GD as opposed to exact methods (O(n^3)) or pseudo-2nd-order iterative methods like CG (O(n^2))?

2. In 2.B.3 the paper says "Note that this includes a compilation step that scales as O(d2)". I think this needs some clarification. Is this saying that, in order to run this on actual hardware, there's a compilation step involved that is O(n^2)? Of course O notation says nothing about linear factors but wouldn't that contradict what's stated in Table I?

You're right, that iterative methods may be linear in time with respect to the number of parameters. In this paper, we provide a method which is linear in time with respect to the dimension (d) of the vector (x) we are solving for in the equation A x = b. The number of parameters is d^2 + d, as there are d^2 parameters for the matrix A and d for the vector b. Gradient descent would require a matrix-vector multiplication just to compute the gradient, which is already O(d^2).

You make an important point, regarding the compilation. In this case, we are talking about the time it takes to upload the matrix A and vector b to the hardware. This requires O(d^2) numbers to be updated, but assuming it is done in parallel it could be done in O(d) time, and the coefficient of this scaling is independent of the physical parameters of the analog hardware. For this reason, in the analysis of the algorithm, we are generally ignoring the time to update the parameters, as is clarified in the Methods section.

OK, got it. I think what you're describing is Gradient Descent on the Normal Equations to solve an overdetermined linear system. Indeed in such a system dim(x) == dim(b) == d. Matrix A is fixed though and not part of the estimation but you're correct about the complexity of gradient computation which is indeed O(d^2).

Thanks for the clarification of the uploading / compilation step.

This is very interesting, especially considering the energy-time tradeoff. I wonder how it compares to optical computing [1], where FFT can be done efficiently [2], or to analog computing for linear algebra [3].

[1]: https://en.wikipedia.org/wiki/Optical_computing#Optical_Four...

[2]: https://doi.org/10.1038/s41598-017-13733-1

[3]: https://doi.org/10.1109/MM.2017.55

Had a quick read of [3]. This work comments on three contributions to the error in an analog scheme to solve linear systems via an ODE that is encoded in the circuit dynamics, although the specific ODE being solved is not given in the paper. The three sources addressed are: gain error, offset error, and nonlinearity. It is mentioned that the first two can be corrected by calibration, while the nonlinearity error can be mitigated by scaling down the inputs to the problem (the matrix A and vector b in the equation Ax = b). It says that scaling down the problem results in lower accuracy, which I suspect can be captured by the tradeoff we show analytically between time, energy, and accuracy. It is also mentioned that “when the analog accelerator outputs are steady, we can sample the solutions once with higher-precision ADCs. However, the method here does not involve time-averaging the output of the circuit. A core result of our paper is that the accuracy converges with the length of time over which the output is averaged, so I suspect that taking a single sample is a drawback of the method presented here.
The Optical Fourier Transform (OFT) in [2] is a way to compute the Discrete Fourier Transform (DFT), and is an alternative to the Fast Fourier Transform (FFT). The OFT basically does a matrix-vector multiplication, where the matrix has a special form, which is accomplished physically using the diffraction of light as it propagates from one plane to another. Although the paper claims constant time for this operation, it’s likely that when the optical array gets bigger, the light has to propagate further for the diffraction pattern to emerge, meaning the time gets longer too. But the coefficient of this scaling is extremely small, because light travels fast. This also accelerates linear algebra using physical dynamics, but can only be used to multiply one particular matrix (the DFT matrix) by an arbitrary vector.
Does this hold up when taking quantum mechanics into account?

Let's assume you need at least m = n^2 particles for a physical system modelling a n by n matrix and model the change of the system from setting the state of the particles (to the matrix elements) to measurement by a finite number of interactions between particles (by exchanging a photon):

- a particle can interact with a particle of the heat bath

- a particle can interact with another particle of the m particles of the system

I guess this result holds up if the second interaction kind does not matter because the first interaction alone then takes a constant time for each particle. The whole thing becomes a massively parallel computation (with m threads).

But the second interaction should matter, otherwise how can the system capture/model dependencies between variables (I guess)?

My intuition would be that subsystems of particles get closer to the equilibrium by interaction with the heat bath and then two subsystems combine their wave functions to one by the second kind of interaction. You got subsystems that are in local thermal equilibrium that combine and split their wave functions and as time goes to t_0 the subsystems sizes that are in local equilibrium get larger and larger until they reach size m at time t_0. This does seem to take longer for more particles (not that massively parallel anymore). Anyone got any insight into how this scales?

(This only matters under the assumption that the number of photon exchanges (that each particle experiences) for each of the m particles is finite and constant (or gets larger with larger m) for a fixed temperature. I could easily have missed some things that could make these thoughts irrelevant.)

These results probably would not hold in the same form for a quantum system. By a quantum system, I mean a system where the decoherence time is on the order of the other timescales present in the system (e.g. the correlation time). In fact, it would be much more difficult to engineer such a system, and we would not want one for this purpose; the results rely on convergence to a classical canonical equilibrium distribution, which has to be generalized in the quantum case, meaning it may not have the properties we want. Also, we would have to deal with the measurement backaction on the system in the quantum limit, which we definitely don't want. In the classical limit, where the energy is much larger than Planck's constant divided by the timescale of the system, this is not an issue. One more thing: our algorithms use continuous measurement of the system. For a quantum system, due to the quantum Zeno effect, the system would be effectively "frozen", so we would definitely not sample the full distribution.
> But the second interaction should matter, otherwise how can the system capture/model dependencies between variables (I guess)?

Keep in mind that the venerable (and enormously successful!) gradient descent method does not model dependencies between variables either and manages to find solutions too. It just has to iterate a bit on it -- actually not unlike the method presented here.

The coupling between variables is given by the (quadratic) potential of the system.

I think there is some confusion because they have two tiers of “particles” going on. The first is the masses coupled in the spring-mass system. In that system, each component of x is a particle. However, any specific vector x, in the parlance of thermodynamics, is a single microstate of the system. You can then form a macrostate, I.e., a distribution, of microstate particles, by considering an infinite (or near infinite) number of particles. The dynamics of the macrostate are given by the Fokker-Planck equation, where interactions where both interactions you mention come from the diffusion term only present due to connection with a heat bath.

So the n coupled masses are viewed as a single particle in an abstract system with (stochastic) gradient dynamics.

I've heard complaints about other alternative computing approaches in that they rely on the assumption of arbitrarily precise measurements, which ends up being impossible or taking an extreme amount of time due to the way the physics of measurement works. Could you explain a bit why this is different?
Notice that the complexity depends on the error as eps^(-2), since the method is based on integrating a stochastic differential equation (SDE) over time. So this is an approximation algorithm, and not something that requires "arbitrarily precise measurements".

You could simulate the SDE digitally, but you would probably need d^2 time per iteration, where this approach just initializes the systems and waits for it to converge to a sufficient precision. Turns out the convergence time depends on sqrt(condition number) similar to the best iterative linear solvers, conjugate gradient (CG).

You can debate whether it's fair to assume a fully connected d^2 chip, since a similar size cpu or gpu could perhaps do each iteration of CG in constant time, and so would have the same (or better) complexity as the thermodynamic method. However, each cell the the proposed chip is way simpler than a cpu cell, so it should be cheaper/more energy efficient.

Great explanation!
Good question. There are different ways that errors come into analog computations, including thermal noise, measurement imprecision, and imprecision of the device's physical parameters. This work addresses thermal noise (which is always present at finite temperature), and provides algorithms which are indifferent to thermal noise, or even benefit from it. The other sources of error can, in principle, be made arbitrarily small (at least down to the quantum limit), but in practice are also limiting factors. Progress has been made on error mitigation methods to deal with these other sources of error, so stay tuned.
https://arxiv.org/abs/quant-ph/0502072

Your comment reminded me of this classic of the genre.

An extremely approachable read.

I'm a little late to the party but hopefully someone can still answer this question.

In order to solve the linear system of equations in this framework, you need to integrate a measurement of the system over some time greater than t_0 and tau. However in equation 12 you can see that t_0 and tau are functions of the eigenvalues and the norm of the matrix.

AFAIK the best runtime algorithms we have for computing matrix eigenvalues is still O(d²), so even if the thermodynamic part of algorithm is linear in d, computing how long you would need to run the algorithm for is still quadratic in d, so there's no real gain.

Or am I missing something here?

Indeed the bounds on convergence times t_0 and tau depend on quantities that are expensive to compute. For example, if we consider the norm of A, this quantity can itself be bound by the values the elements A can take itself, which is a requirement on the type of hardware we would be using (as the elements of A are mapped to component values in the hardware that is considered in the appendix). There are other heuristic tricks one could use.

To give a comparison with conjugate gradients, there the condition number is in the convergence bound, however computing it requires the maximum and minimum eigenvalues, hence people never compute it, and rely on heuristics for convergence.

thankyou, that actually cleared things up for me
I’m not one of the authors and have only skimmed the paper thus far and found it interesting enough to warrant further study. I guess one could run the simulation in the hardware and monitor convergence as a function of time. I can also imagine cases where one could work on a class of problems that have a priori known worst case condition, or cases where the exact solution every time is not a hard requirement.
In my layman’s naive view the promise of generative AI is the ability to estimate highly dimensional non linear systems effectively and efficiently. At some level these can be viewed as solving non linear systems. In my career a specific class of important problems has been estimating systems of partial different equations, and specially stochastic partial differential equations. Monte Carlo methods are often the most computable estimations. I’ve often found we approach these things by either actually linearizing or estimating a linearization of the system and using linear algebra to solve, then transforming back. All these techniques requires enormous amount of computation and extremely complex math and numerical methods. To my naive understanding (I’m more of a core systems person that’s been adjacent to the work) quantum promises to help here by directly simulating the system.

Would this thermodynamic technique provide solutions to these sorts of non linear optimization and system solving problems? It feels from my reading it might, and in a simpler way to express.

Forgive my likely display of extraordinary ignorance.

One way to think about these methods is that we are essentially implementing a Monte-Carlo algorithm physically, where on each "iteration" there is a matrix-vector multiplication. The physical system does this matrix-vector multiplication for us in constant time, so it does have an advantage over these digital methods. Not only that, but the "clock speed" of the physical system can be almost arbitrarily short, although this comes with an energy cost.
Yes that’s precisely what made me harken back to my solving of large stochastic PDE system questions :-)

The different though is instead of a matrix multiplication it’s a nonlinear optimization. The crucial part is the nonlinearity. But I assume given this technique is as you say Monte Carlo at its root, that shouldn’t specifically matter?

You mentioned that such problems may be solved by solving a linear approximation of the non-linear problem (and I am in no way an expert in non-linear optimization). To the extent that the bottleneck in that approach is solving the resulting linear system, this method offers a speedup. We are also thinking about using similar thermodynamic methods to solve non-linear systems directly, but some of the nice properties of the harmonic oscillator are not present in that case, so it's currently not clear how much (if any) speedup is there.
A classical computer can solve a linear system in O(N) time on O(N^2) processors, too.
This is an important observation, and is one of the reasons we included an energy-time tradeoff analysis in the paper. To our knowledge, this is the first result where the product energy * time has been shown to scale with dimension for solving linear systems of equations (in any computational paradigm).
The time-energy trade off is of particular interest to me. I’m a roboticist, but I’ve been looking at adapting similar thermodynamic results to research problems in my field (Crooks’ work has been quite inspirational). I’m looking forward to reading this paper in greater detail in a few days when I have time (I’m in the middle of a move) and might want to get in touch about research ideas if that’s of interest.
Impressive work! You give me hope that we'll be able to continue scaling computing in domains which require it.

Also makes me wonder if Google's DWave could be more similar to this method rather than true quantum computing.

Here there are N cells with N^2 couplings. So as you say, a normal computer with this many processors could also solve a linear system in O(N sqrt(kappa)) time. (Using conjugate gradient, since matrix-vector-mult on the system would be O(N) time.)

However, (1) these thermodynamic cells are much simpler than processors, and (2) it seems the overall energy required to simulate the SDE, once the couplings are initialized, only scales with O(N), not O(N^2) as in your digital case.

You can come close to that time complexity on a single CPU by using multigrid methods.

https://en.wikipedia.org/wiki/Multigrid_method

Unless of course your matrix has a prohibitively complicated structure.

Latest research from https://normalcomputing.ai/. Feedback welcome!
We have mass-energy equivalence.

Do we also have information-energy equivalence? Can we use the Landauer bounds to prove by transitivity that information and energy are equivalent?

Straying a bit off topic, but I think one of the more sensible approaches to information-energy equivalence is a thermodynamic engine with an information reservoir (in addition to the heat and work reservoirs normally considered). https://arxiv.org/pdf/1408.1224.pdf https://journals.aps.org/prx/pdf/10.1103/PhysRevX.3.041003
Maxwell's demon?

more of information-entropy equivalence, but close enough

The RC circuit looks familiar. Is it related to neuromorphic computing hardware? Can it be implemented with existing hardware?
An interesting feature of this approach is that the proposed hardware doesn't rely on non-linear elements, memristors, or even active elements (besides an optional noise source). It is simply a passive network of oscillators with a DC bias on each cell. That said, the hardware to implement this at scale does not currently seem to exist. To my knowledge, the state of the art is https://app.normalcomputing.ai/composer
From what I see, it's like mimicking the annealing process and the "derivative" automatically drives you to the solution. If that's the case, implementing such hardware should be not that hard except for the programmable coupling part. A bit off-topic, this reminds me of the duality between any deep forward network and a modern Hopfield network with some special energy functions, in which the duality is based on the fact that the forward running process can be seen as an energy minimization process.
The relationship with Hopfield networks sounds fascinating, would love to discuss further. As you mentioned, there is a connection to annealing in that we are encoding the solution to our problem in the minimization of a physical system's energy. Indeed, the all-to-all coupling is the hard part!
I haven't read the paper, so maybe completely irrelevant, but isn't there an analytical solution for a system of N coupled oscillators?
Isn't there a relation between entropy and computation? It would be interesting to see how these are related.
Definitely. Landauer's principle gives a lower bound on the amount of energy a computation requires, which is k_B T ln(2) times the number of bits erased in the process, the decrease in Shannon entropy. Our energy cost analysis is not based on the Landauer limit, but simply on the energy difference between equilibrium states. But our algorithm for estimating the determinant is based on effectively measuring the entropy difference between equilibrium states.
I always have trouble thinking of computation in terms of “bits erased”. How many bits are erased when I sort an array? Or invert a matrix? Or compute a function like f(x)=1, which seems to maximally erase information, but doesn’t intuitively seem like it should cost a lot of energy.
To answer your question about sorting an array: For an array of length n, where each element takes one of m possible values, there are n^m possible arrays. But there are only O(n^m/n!) possible sorted arrays, which could be crudely approximated as O(n^(m-n)). The decrease in information is proportional to the log of the ratio of the number of possible states before and after the computation, which is in this case log(n^n) = n log n. See another explanation here https://tildesites.bowdoin.edu/~ltoma/teaching/cs231/fall07/...
I really like how the "Results" section immediately follows the introduction. It neatly accommodates the recommended reading order for scientific papers.
At its core, How closely related is this to Hamiltonian Chain Monte Carlo(HCMC) methods?

Reading about waiting to reach equilibrium state, extracting paths and integrating it(IIRC this is analogus to estimating the expected value) in a space made me think of that :). Sorry if the comparison seems naive.

At zero temperature with damped oscillators, can’t I just set the system up and wait for it to reach mechanical equilibrium and read off the positions?
Other than simulating the Hamiltonian on a digital computer, are there any prototype hardware devices that can perform this computation?
Thanks. Do you have somewhere a demo of the hardware performing these computations?
We do not (yet?). Here is a simulation I made of similar hardware demonstrating the equilibration step of the algorithm https://app.normalcomputing.ai/composer/playground
Patrick Coles gives an excellent walkthrough on this as well. https://www.youtube.com/live/dd1jURhLR8Y?feature=share&t=284...
Thanks!
What is the difference between Linear Algebra and "Thermodynamic" Linear Algebra?
This is awesome. I just trained a large neural net on top of my toaster while waiting for my morning pop tarts.
can we factor 35 with Shor's algorithm yet lol