by dbcurtis 3191 days ago
It sounds like the author is ignoring denormals?

-edit-

Yes, the author is ignoring gradual underflow and the resulting denormal numbers.

So as you move from one binade to the next, the spacing between floating point numbers doubles or halves depending on whether you are increasing or decreasing the exponent. When you reach the binade with the most negative possible exponent, you have two choices: a) flush to zero, which leads to a huge non-monotonic jump in the spacing of numbers on the floating point number line. This is a great annoyance to numerical analysts and leads to convergence instabilities. That is why any modern computer used for numerical work incorporates choice b) gradual underflow, which implies that you must allow non-normalized numbers in the two binades (the two being + and - sign bit) of the most negative exponent, which has the effect of creating another pair of binades around zero. This keeps the spacing of numbers on the floating point number line the same in the four binades around zero. Numerical algorithms are then much more stable.
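A minimal sketch of the spacing argument above, using Python doubles (binary64) rather than the single precision discussed elsewhere in the thread. The helper name `spacing` is made up for illustration, and the sketch assumes Python 3.9+ (for `math.nextafter` and `math.ulp`) on a platform where gradual underflow is enabled, which is the usual CPU default.

```python
import math
import sys

MIN_NORMAL = sys.float_info.min   # 2**-1022, smallest positive normal double
MIN_SUBNORMAL = math.ulp(0.0)     # 2**-1074, smallest positive subnormal double


def spacing(x: float) -> float:
    """Distance from x to the next representable double above it."""
    return math.nextafter(x, math.inf) - x


# Moving up one binade doubles the spacing.
print(spacing(1.0), spacing(2.0))

# With gradual underflow, the spacing stays constant from the lowest
# normal binade all the way down to zero.
print(spacing(MIN_NORMAL), spacing(MIN_NORMAL / 2), spacing(0.0))

# Without denormals, the first number above zero would be MIN_NORMAL,
# so the gap next to zero would be 2**52 times the spacing just above it.
print(MIN_NORMAL / spacing(MIN_NORMAL))
```

The last ratio is the "huge jump" in choice a): every number between zero and the smallest normal number collapses into a single gap.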

I haven't looked at what GPUs do, but I strongly suspect that they flush to zero: first, it doesn't matter much to graphics applications, and second, the typical method of handling denormals is to take a trap and drop into software-emulated floating point, because the cost of the additional hardware to handle denormals is very large and the hardware complexity is crazy-making. A GPU isn't going to want to break the pipeline for a denormal.

It looks like nvidia GPUs treat denormals as zeros for single-precision floating point math: http://developer.download.nvidia.com/assets/cuda/files/NVIDI... (sections 4.1 and 4.2)
In the context of graphics processing that trade-off totally makes sense.

Thanks for doing the homework that I was too lazy to do :)

It seems to me that in the context of NN computations, using the lack of gradual underflow as a non-linear element is going to severely limit the dynamic range of the neurons. On the plus side, the non-linear element is a computational freebie. But in addition to limited dynamic range, it makes the NN ridiculously non-portable across hardware implementations.
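To make the "non-linear element" concrete, here is a rough software emulation of single-precision flush to zero. It is only a sketch: the helpers `to_f32` and `ftz` are hypothetical names, the rounding goes through `struct` rather than real GPU hardware, and actual FTZ behaviour on a given device may differ in detail.

```python
import struct

F32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal float32, about 1.18e-38


def to_f32(x: float) -> float:
    """Round a Python float to float32, keeping denormals (gradual underflow)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ftz(x: float) -> float:
    """Round to float32, then flush any denormal result to zero (emulated FTZ)."""
    y = to_f32(x)
    return 0.0 if abs(y) < F32_MIN_NORMAL else y


x = 2.0 ** -127  # a float32 denormal

# A linear map f must satisfy f(2 * x) == 2 * f(x).
print(to_f32(2 * x) == 2 * to_f32(x))  # holds with gradual underflow
print(ftz(2 * x) == 2 * ftz(x))        # fails with flush to zero
```

Away from the 1e-38 scale `ftz` is just float32 rounding, which is why the effect only appears when the computation is pushed down toward the smallest normal number.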

Actually, if you read section 4.6 of that paper you'll see that denormals are the default on sm_20 and above. But you can see in that same section that this can easily be disabled with the ftz flag.

I had to give Jakob custom gemm kernels to do this research. Not sure why the denormal point was left out of this blog as it's pretty critical to the whole experiment.

So a minor correction here. We did explore placing ftz on various instructions inside the matmul ops, but it turns out you don't need anything more than what is already baked into tf by default. All tf gpu primitives are built with -nvcc_options=ftz=true. This means you have an implicit non-linearity after any non-matmul op (provided the scale of computation is near 1e-38). Matmul ops are called through cublas and have denormals enabled.
I'm not sure why you say "ignore".

As I read this, the author claims to have created a naive "linear" network akin to regular deep learning networks but without the explicitly added non-linearity, and shows it's trainable. He acknowledges it has to operate through non-linearity (indeed underflow), and so the mechanisms you mention sound compatible with his findings.

The point I'd see for the article isn't some magic non-linear to linear transformation but that, for all we know, incidental underflow effects might be operating in regular "non-linear" networks as well.

They added a note at the end clarifying that they have flush to zero mode enabled.

quote from the article: "EDIT: This blogpost assumes that we enable flush to zero (FTZ) which treats denormal numbers as zeros. It’d be interesting to see reseachers try without FTZ!"

you could (in theory) do the same thing anywhere but disabling denormals is the fastest (cpu count) way to create a big (relative displacement from linear) nonlinearity in the IEEE - and family fp representations.

Using evolutionary methods to trap nonlinearities is already hard; I imagine it would be even harder to find functions which exploit even more subtle nonlinearities.

The main problem here is that you're depending on implementation-specific behaviour. If you train on a device, you have to run on a device with exactly the same behaviour. On top of that, some FPUs have very slow (trapping) denormal handling. I'm also unsure how accurate the gradient computation can be when the signal itself has numerical issues.
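As a rough illustration of the portability concern, the sketch below runs the same scale-down-then-scale-up computation under two emulated float32 "devices", one with gradual underflow and one with flush to zero. Everything here is hypothetical scaffolding (`to_f32`, `flush`, `scale_down_then_up`, the chosen constants), not how any particular framework or GPU behaves.

```python
import struct

F32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal float32


def to_f32(x: float) -> float:
    """Emulated device A: float32 rounding with gradual underflow."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush(x: float) -> float:
    """Emulated device B: float32 rounding with denormals flushed to zero."""
    y = to_f32(x)
    return 0.0 if abs(y) < F32_MIN_NORMAL else y


def scale_down_then_up(x: float, rounder) -> float:
    """Two 'linear layers' whose intermediate value lands in the denormal range."""
    hidden = rounder(x * 2.0 ** -10)
    return rounder(hidden * 2.0 ** 10)


x = 2.0 ** -120
on_gradual = scale_down_then_up(x, to_f32)  # input is recovered exactly
on_ftz = scale_down_then_up(x, flush)       # intermediate is flushed, output is 0.0
print(on_gradual == x, on_ftz)
```

The weights are identical in both runs; only the denormal handling differs, which is enough to change the output.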

I don't deny it's a cool hack, but beyond that I don't think I see the point or the problem this is trying to solve.

no gradients =D of course, that makes it even harder to train.