by jmvalin 3181 days ago
The main problem here is that you're depending on implementation-specific behaviour. If you train on a device, you have to run on a device with exactly the same behaviour. On top of that, some FPUs have very slow (trapping) denormal handling. I'm also unsure how accurate the gradient computation can be when the signal itself has numerical issues.
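
To make the portability concern concrete, here is a minimal sketch in Python. It assumes the host CPU uses IEEE 754 gradual underflow (the usual default), and it models a flush-to-zero device with a hypothetical helper, `flush_to_zero`. None of these names come from the original hack; they are only for illustration.

```python
import struct

FLT_MIN = 2.0 ** -126  # smallest normal float32 magnitude

def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def flush_to_zero(x):
    """Hypothetical model of a flush-to-zero device: subnormals become 0."""
    return 0.0 if abs(x) < FLT_MIN else x

def mul_ieee(a, b):
    """float32 multiply with gradual underflow (subnormals are kept)."""
    # The product of two float32 values is exact in a double,
    # so only one rounding happens when converting back to float32.
    return to_f32(to_f32(a) * to_f32(b))

def mul_ftz(a, b):
    """The same multiply on the modelled flush-to-zero device."""
    return flush_to_zero(mul_ieee(a, b))

a, b = 2.0 ** -100, 2.0 ** -30   # product is 2**-130, a float32 subnormal
print(mul_ieee(a, b))            # tiny but nonzero
print(mul_ftz(a, b))             # flushed to zero
```

The same weights therefore produce different activations depending on how the hardware treats subnormals, which is why a network trained on one kind of device may not carry over to another.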

I don't deny it's a cool hack, but beyond that I don't think I see the point or the problem this is trying to solve.

1 comment

No gradients =D Of course, that makes it even harder to train.