| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by vessenes 437 days ago
	One big area the last two years has been algorithmic improvements feeding hardware improvements. Supercomputer folks use f64 for everything, or did. Most training was done at f32 four years ago. As algo teams have shown fp8 can be used for training and inference, hardware has updated to accommodate, yielding big gains. NB: Hobbyist, take all with a grain of salt

1 comments

jmalicki 437 days ago

Unlike a lot of supercomputer algorithms, where fp error accumulates as you go, gradient descent based algorithms don't need as much precision since any fp errors will still show up at the next loss function calculation to be corrected, which allows you to make do with much lower precision.

cubefox 436 days ago

Much lower indeed. Even Boolean functions (e.g. AND) are differentiable (though not exactly in the Newton/Leibniz sense) which can be used for backpropagation. They allow for an optimizer similar to stochastic gradient descent. There is a paper on it: https://arxiv.org/abs/2405.16339

It seems to me that floating point math (matrix multiplication) will over time mostly disappear from ML chips, as Boolean operations are much faster both in training an inference. But currently they are still optimized for FP rather than Boolean operations.