by mjw 3743 days ago
It's more an empirical observation than a mathematical fact; there's nothing magic about 16 bits AFAIK. Empirically, 16 bits seems to work well enough for some tasks, while taking it down to 8 bits is usually taking it too far, and performance-wise there's not a lot of point playing with values in between, e.g. 12 bits.

(Half-float arithmetic is implemented natively in recent CUDA CC5 architectures and is quite convenient; in particular it halves the memory bandwidth needed, which is often the bottleneck.)
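
To put a number on the bandwidth point, here is a minimal NumPy sketch (on the CPU, not CUDA; the array shape and variable names are made up for illustration). It only shows that the same tensor stored as float16 takes half the bytes, and what precision you give up for that.

```python
import numpy as np

# A hypothetical weight matrix, stored in single and half precision.
weights32 = np.ones((1024, 1024), dtype=np.float32)
weights16 = weights32.astype(np.float16)

# Bytes that have to be moved to read each copy once.
bytes32 = weights32.nbytes
bytes16 = weights16.nbytes
ratio = bytes16 / bytes32

# Machine epsilon: the gap between 1.0 and the next representable value.
eps16 = float(np.finfo(np.float16).eps)
eps32 = float(np.finfo(np.float32).eps)

print(bytes32, bytes16, ratio)
print(eps16, eps32)
```

The price is precision: float16 has a 10-bit stored mantissa against 23 bits for float32, which is where the rounding noise discussed here comes from.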

Stochastic gradient descent is fairly robust to noisy gradients: any numerical or quantisation error that you can model approximately as independent zero-mean noise can be 'rolled into the noise term' for SGD without affecting the theory around convergence [0]. It will of course increase the variance, which when taken too far could in practice mean divergence, or slow convergence under a reduced learning rate, perhaps to a poorer local minimum.
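
A toy illustration of that robustness, as a sketch under stated assumptions: a 1-D quadratic objective, Gaussian noise standing in for quantisation error, and Robbins-Monro style step sizes 1/n. The function name `noisy_sgd` and all the constants are hypothetical, chosen only for the demo.

```python
import numpy as np

def noisy_sgd(noise_std, steps=20000, target=3.0, seed=0):
    """Minimise 0.5 * (x - target)**2 using gradients corrupted by
    independent zero-mean Gaussian noise (a stand-in for rounding error)."""
    rng = np.random.default_rng(seed)
    x = 0.0
    for n in range(1, steps + 1):
        grad = x - target                        # exact gradient
        noisy_grad = grad + rng.normal(0.0, noise_std)
        x -= noisy_grad / n                      # step size 1/n
    return x

# Same seed, so both runs see the same noise pattern at different scales.
err_low = abs(noisy_sgd(noise_std=0.5) - 3.0)
err_high = abs(noisy_sgd(noise_std=2.0) - 3.0)
print(err_low, err_high)
```

Both runs end up close to the minimiser at 3.0, and the noisier one ends up further away. That is the variance cost: the noise doesn't break convergence here, it just makes it slower.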

With extreme quantisation (like binarisation) the error can't really be modelled as independent zero-mean, UNLESS you do the kind of stochastic quantisation mentioned. From what I hear this works well enough to allow convergence, but accuracy can take quite a hit. I don't think it has to be 'implemented natively', although no doubt that would speed it up. A large part of the benefit of quantisation during training is not so much to speed up arithmetic as to reduce memory bandwidth and communication latency.
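
To make the zero-mean point concrete, here is a minimal sketch of one common form of stochastic binarisation (an assumption on my part, not necessarily the exact scheme discussed upthread): a value x in [-1, 1] becomes +1 with probability (1 + x) / 2 and -1 otherwise, so its expected value is x. The function names are hypothetical.

```python
import numpy as np

def binarise_deterministic(x):
    # Plain sign function: for a fixed x the error sign(x) - x is a bias.
    return np.where(x >= 0, 1.0, -1.0)

def binarise_stochastic(x, rng):
    # +1 with probability (1 + x) / 2, else -1, so E[output] = x.
    p_plus = (1.0 + np.clip(x, -1.0, 1.0)) / 2.0
    return np.where(rng.random(x.shape) < p_plus, 1.0, -1.0)

rng = np.random.default_rng(0)
x = np.full(200_000, 0.3)          # many copies of the same value

det_mean = binarise_deterministic(x).mean()
sto_mean = binarise_stochastic(x, rng).mean()
print(det_mean, sto_mean)
```

The deterministic version maps every 0.3 to 1.0, a systematic error of 0.7. The stochastic version averages out close to 0.3, so its error behaves like zero-mean noise with a large variance, which fits with convergence being possible but accuracy taking a hit.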

[0] https://en.wikipedia.org/wiki/Stochastic_approximation#Robbi...