by alevskaya 1354 days ago
There's a big difference between not caring about stability and being willing to trade precision for better memory bandwidth in an application that doesn't benefit from the extra precision. When doing large training jobs on TPUs, stability is paramount! It's true that you have to know more about what you're doing when you reduce bit-depth - the horrors of floating point are harder to ignore, and it's wildly inappropriate for many scientific computations. However, the reduction of bit-depth is likely to continue as we seek to make modern models more efficient and economical to train and use.

What does this mean in practice? For ML, we usually don't care if a weight is 0.05 or 0.10, because we have millions of weights. We do care if one is 1.237e+27 instead of 1.237e-3, though.
Numerical errors have the annoying tendency to accumulate if you're not careful. So doing one matrix operation with low precision might be okay, while doing a dozen might completely garble your result.
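A minimal sketch of the accumulation effect described above, assuming IEEE half precision (binary16) as the low-precision format and emulating it by round-tripping through Python's `struct` module with the `e` format code. The helper names `to_half` and `accumulate`, and the step size 0.001, are illustrative choices and not anything from the thread.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def accumulate(step: float, n: int, low_precision: bool) -> float:
    """Add `step` to a running total `n` times.

    With low_precision=True, the step and every partial sum are rounded
    to half precision, mimicking an accumulator stored in that format.
    """
    if low_precision:
        step = to_half(step)
    total = 0.0
    for _ in range(n):
        total = total + step
        if low_precision:
            total = to_half(total)
    return total

full = accumulate(0.001, 10_000, low_precision=False)
half = accumulate(0.001, 10_000, low_precision=True)
print(f"float64 total:        {full:.6f}")
print(f"half-precision total: {half:.6f}")
```

A single low-precision addition is off by a tiny amount. Repeated thousands of times, the half-precision total stops growing once 0.001 is smaller than half the spacing between adjacent representable values, so it ends up far short of 10.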
This is not that relevant for ML. Each gradient pass re-computes your cost function and the gradients, so errors are not likely to accumulate. The main thing is to not make errors big enough that you end up in a completely different part of the parameter space, derailing progress, which is what the commenter above points out.
It is extremely relevant for ML.

I am familiarizing myself with recurrent neural networks, and getting them trained online is a pain - I get NaNs all the time except for very small learning rates that actually prevent my networks from learning anything.
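The link between learning rate and NaNs shows up even in a toy problem. This is a hedged sketch, not an RNN: plain gradient descent on the hypothetical loss f(w) = w**2, where any learning rate above 1.0 makes each update overshoot and grow. The function name `descend` and the two learning rates are illustrative assumptions.

```python
import math

def descend(lr: float, steps: int, w: float = 1.0) -> float:
    """Run gradient descent on f(w) = w**2 and return the final weight."""
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # standard gradient step
    return w

small = descend(lr=0.1, steps=2000)  # each step multiplies w by 0.8
large = descend(lr=1.5, steps=2000)  # each step multiplies w by -2

print("lr=0.1 ->", small)
print("lr=1.5 ->", large, "| is NaN:", math.isnan(large))
```

With the large learning rate the weight doubles in magnitude every step until it overflows to infinity, and the following update computes inf - inf, which is NaN. Real recurrent networks fail through more complicated dynamics, but the overflow-then-NaN pattern is often similar.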

The deeper the network is, the more pronounced the accumulation of errors in online training is. Add 20-30 fully connected (not highway or residual) layers before softmax and you'll see wonders there; you won't be able to get anything stable.

This isn't true in general. Very specific ML algorithms that were likely developed with years of blood and sweat and tears may have this kind of resiliency, but I've been in the numerical weeds enough here that I wouldn't bet on even that without a real expert weighing in on it - and I wonder what the tradeoff is if it's true there. It's very easy to have numerical stability issues absolutely crater ML results; been there, done that.

I have some ~15 year old experience with the math behind some of this, but actually none with day-to-day deep learning applications using any of the now-conventional algorithms, so my perspective here is perhaps not that of the most pragmatic user. The status quo may have improved, at least de facto.

I'm not really sure there is evidence for that. In fact, depending on your interpretation of why posits[1] work, we may even have empirical evidence that the opposite is true.

1. https://spectrum.ieee.org/floating-point-numbers-posits-proc...

When building an MCMC sampler, I was too lazy to properly code a matrix approximation needed to avoid some mathematical black hole and the corresponding underflow. It was cheaper to just ignore the faulty simulations.

Turns out our results were better than the papers we compared to, both in time and precision.

I am not that familiar with ML, but can't you just ignore those faulty weights?

With MCMC, depending on application, it seems risky to just toss out the NaN/inf results. I'd guess these numerical issues are more likely to occur in certain regions of the state space you're sampling from, so your resulting sample could end up a bit biased. In some cases the bias may be small or otherwise unimportant, so the speed-up and simpler code of filtering NaN/inf results is worth it, but in other cases (like when the MCMC samples feed into some chain of downstream computations) the bias may have sneaky insidious effects.
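A rough sketch of the bias concern above. It is not an MCMC sampler: independent uniform draws stand in for sampler output, and the failure rules (everything above 0.9 fails, or 10% of runs fail at random) are hypothetical, chosen only to contrast region-dependent failures with failures unrelated to the state.

```python
import random

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Stand-in for sampler output: uniform draws on [0, 1), true mean 0.5.
draws = [rng.random() for _ in range(100_000)]

# Case 1 (hypothetical): failures concentrated in one region of the
# state space; every draw above 0.9 "returns NaN" and is discarded.
kept_regional = [x for x in draws if x <= 0.9]

# Case 2 (hypothetical): failures unrelated to the state; roughly 10%
# of draws are discarded at random.
kept_random = [x for x in draws if rng.random() >= 0.1]

print(f"all draws:         {mean(draws):.3f}")
print(f"regional failures: {mean(kept_regional):.3f}")
print(f"random failures:   {mean(kept_random):.3f}")
```

Both filters discard about the same fraction of draws, but only the region-dependent one shifts the estimate. Whether discarding is safe therefore depends on where the failures occur, not just on how many there are.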
I didn't think deeply about this back then, since my parameter estimates were close to or better than the literature I compared to, but now I'm interested in checking the distribution of those NaN/inf. If I recall correctly, they were uniformly distributed throughout an adaptive phase.
When people talk about AI taking over the world, a funny image pops up in my head where a robot is trying to enter a frying pan. When you ask it why it's doing that, it says "because I feel like [NaN, NaN, 2.45e24, NaN]", which is a perfectly valid reason.

I'm not at all caught up with this side of ML, but my first instinct is that faulty weights would lead to interpretability issues. The numbers represented by NaN/Inf vastly outnumber the ones within precision range, so interpreting them is much more of a guess.

Weight changes in one neuron can have a dramatic, nonlinear, and not obviously predictable impact on the performance of the full model.
in numerical analysis 101 you learn not to use algorithms that don't have certain properties and numerical stability is one of them

what good will it do to compute something if its error is unbounded?

the accumulation of roundoff errors is, generally speaking, unavoidable when it's linear, but fortunately the errors tend to be small

"A considerable group of numerical analysts still believes in the folk “theorem” that fast MM is always numerical unstable, but in actual tests loss of accuracy in fast MM algorithms was limited, and formal proofs of quite reasonable numerical stability of all known fast MM algorithms is available (see [23], [90], [91], [62], and [61])." https://arxiv.org/abs/1804.04102