Hacker News
by _hark 435 days ago
Can anyone comment on where efficiency gains come from these days at the architecture level, i.e. not from process-node improvements?

Are there a few big things, many small things...? I'm curious what low-hanging fruit is left for fast SIMD matrix multiplication.

3 comments

One big area over the last two years has been algorithmic improvements feeding hardware improvements. Supercomputer folks use fp64 for everything, or did. Most training was done at fp32 four years ago. As algorithm teams have shown that fp8 can be used for training and inference, hardware has been updated to accommodate it, yielding big gains.
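To put rough numbers on the precision drop, here is a back-of-the-envelope sketch. The 4096 x 4096 matrix size is a made-up example, not something from the thread, and the code only counts storage, so it says nothing about compute throughput.

```python
# Bytes per element for the formats mentioned above.
BYTES_PER_ELEMENT = {"fp64": 8, "fp32": 4, "fp8": 1}


def matrix_mib(rows, cols, fmt):
    """Storage for a dense rows x cols matrix, in MiB."""
    return rows * cols * BYTES_PER_ELEMENT[fmt] / 2**20


if __name__ == "__main__":
    # Hypothetical weight matrix size, chosen only for illustration.
    for fmt in BYTES_PER_ELEMENT:
        print(f"{fmt}: {matrix_mib(4096, 4096, fmt):.0f} MiB")
```

The same matrix shrinks from 128 MiB in fp64 to 64 MiB in fp32 and 16 MiB in fp8, so every byte of memory capacity and bandwidth carries 8x as many values as it did in fp64.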

NB: Hobbyist, take all with a grain of salt

In a lot of supercomputer algorithms, fp error accumulates as you go. Gradient-descent-based algorithms don't need as much precision: any fp error still shows up at the next loss function calculation, where it gets corrected. That lets you make do with much lower precision.
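A toy illustration of that difference, as a minimal sketch rather than a model of any real hardware. The helper names and the test problem are hypothetical, and rounding every result to 4 significant bits is only loosely inspired by fp8 (it ignores exponent range entirely).

```python
import math


def round_to_bits(x, sig_bits):
    """Round x to sig_bits significant binary digits (toy low-precision model)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** sig_bits
    return math.ldexp(round(m * scale) / scale, e)


def low_precision_sum(value, n, sig_bits):
    """Open-loop accumulation: add the same value n times, rounding each result."""
    v = round_to_bits(value, sig_bits)
    total = 0.0
    for _ in range(n):
        total = round_to_bits(total + v, sig_bits)
    return total


def low_precision_gd(target, steps, lr, sig_bits):
    """Gradient descent on f(w) = (w - target)**2, rounding every intermediate."""
    w = 0.0
    for _ in range(steps):
        grad = round_to_bits(2.0 * round_to_bits(w - target, sig_bits), sig_bits)
        w = round_to_bits(w - round_to_bits(lr * grad, sig_bits), sig_bits)
    return w


if __name__ == "__main__":
    print("sum of 1000 x 0.01:", low_precision_sum(0.01, 1000, sig_bits=4))
    print("gd towards 3.0:    ", low_precision_gd(3.0, 50, lr=0.25, sig_bits=4))
```

The running sum stalls far short of the exact total of 10, because once the total is large enough each new term rounds away and nothing ever corrects for it. The gradient descent loop recomputes the gradient from the current `w` on every step, so earlier rounding errors are not carried forward and it ends up within one rounding step of the target.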
Much lower indeed. Even Boolean functions (e.g. AND) are differentiable (though not exactly in the Newton/Leibniz sense) which can be used for backpropagation. They allow for an optimizer similar to stochastic gradient descent. There is a paper on it: https://arxiv.org/abs/2405.16339

It seems to me that floating point math (matrix multiplication) will over time mostly disappear from ML chips, as Boolean operations are much faster in both training and inference. But currently the chips are still optimized for FP rather than Boolean operations.

In-memory computing (analog or digital). Still doing SIMD matrix multiplication but using more efficient hardware: https://arxiv.org/html/2401.14428v1 https://www.nature.com/articles/s41565-020-0655-z
This is very interesting, but not what the Ironwood TPU is doing. The blog post says that the TPU uses conventional HBM RAM.
There's been some talk/rumour of next-gen HBMs having some compute capability on the base die. But again, that's not what they're doing here; this is regular HBM3/HBM3e.

https://semiengineering.com/speeding-down-memory-lane-with-c...

Specialization, i.e. specialized for inference.