I'd like to mention a thought I had some time ago about using a one-byte FP format for ML training: instead of splitting the byte into sign/exponent/mantissa fields, it might be advantageous to map each of the byte's 256 possible codes, via a lookup table, to ideally chosen values. The curve could be a sigmoid, for example. This would reduce quantization effects, likely giving not just better convergence but consistently better convergence.
The curve might need adjusting to make the reverse lookup (value back to byte) easier, and to reduce the time and silicon needed.
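To make the idea concrete, here is a rough sketch of what I have in mind, not a worked-out format. The assumptions are mine: the encode direction (value to byte) follows a sigmoid, so the decode table holds the inverse (logit) values, and `SCALE`, `build_table`, `encode` and `decode` are hypothetical names. The reverse lookup is done in software with a binary search; hardware would presumably want something cheaper.

```python
import bisect
import math

SCALE = 1.0  # hypothetical knob: stretches or shrinks the representable range


def build_table(scale=SCALE):
    """Return 256 increasing values, dense near zero and sparse in the tails.

    Code i is placed at the logit of a point evenly spaced inside (0, 1),
    so that value -> code follows a sigmoid (an assumption of this sketch).
    """
    table = []
    for i in range(256):
        p = (i + 0.5) / 256.0  # strictly between 0 and 1, so the log is finite
        table.append(scale * math.log(p / (1.0 - p)))
    return table


TABLE = build_table()


def decode(code):
    """Forward lookup: byte code -> real value."""
    return TABLE[code]


def encode(x):
    """Reverse lookup: real value -> nearest byte code, saturating at the ends."""
    j = bisect.bisect_left(TABLE, x)
    if j == 0:
        return 0
    if j == 256:
        return 255
    # Pick whichever neighbour is closer to x.
    return j if TABLE[j] - x < x - TABLE[j - 1] else j - 1
```

With an even number of codes and a symmetric curve there is no exact zero in this table; the two middle codes sit just below and just above it. Whether that matters for training is one of the things the curve would need tuning for.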
Interesting read. I wonder whether this is only a bandwidth optimization that lets you throw more hardware at the problem, or an actual shift in perspective, given that there is no NaN/Inf and values instead clamp to maxval. Could this introduce artifacts that math libraries will need to code around, or will it enable some new insight?
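A toy illustration of the kind of artifact I mean, under my own assumptions: `MAXVAL` is a made-up limit, and `sat_add` / `inf_add` are hypothetical helpers that model "clamp to maxval" and "overflow to Inf" on ordinary Python floats. It ignores rounding entirely and only looks at overflow.

```python
import math

MAXVAL = 100.0  # hypothetical largest representable magnitude


def sat_add(a, b):
    """Add, then clamp the result into [-MAXVAL, MAXVAL] (no Inf)."""
    return max(-MAXVAL, min(MAXVAL, a + b))


def inf_add(a, b):
    """Add, and turn any result beyond MAXVAL into a signed infinity."""
    s = a + b
    if abs(s) > MAXVAL:
        return math.copysign(math.inf, s)
    return s


# (80 + 60) - 60 should be 80 if nothing overflowed.
clamped = sat_add(sat_add(80.0, 60.0), -60.0)  # overflow is silently absorbed
flagged = inf_add(inf_add(80.0, 60.0), -60.0)  # overflow stays visible as inf
```

In the clamped version the result is a plausible-looking finite number that happens to be wrong, while the Inf version keeps a marker that something overflowed. That is the trade-off I'd expect libraries to have to think about.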