Hacker News
by localhost 311 days ago
Even with t=0 they are stochastic, e.g., because floating-point operations are non-associative.
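
To make the floating-point point concrete, here is a minimal sketch in plain Python (the variable names are illustrative, not taken from any inference stack). It only shows that the grouping of additions changes the result, not how any particular LLM runtime schedules its reductions.

```python
# Floating-point addition is not associative: the grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # rounds after a + b, then again after adding c
right = a + (b + c)  # rounds after b + c, then again after adding a

print(left, right, left == right)
```

On IEEE 754 doubles the two sums differ in the last bit, which is the kind of discrepancy that can flip an argmax between two nearly tied logits.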
1 comment

That is an artifact of the implementation. You can absolutely implement it using strict floating-point semantics. But even if you don't, any given implementation will still perform its operations in a specific order, which can be documented. And if you're running a quantized model (including the KV cache), there's a lot less floating point involved.
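
A rough sketch of the reply's two points, under stated assumptions: `sum_in_order` is a hypothetical helper standing in for a reduction with a documented order, and the integer list stands in for quantized values accumulated in integer arithmetic. It is not a model of any real kernel.

```python
import random

def sum_in_order(values, order):
    """Accumulate values left to right in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

rng = random.Random(0)  # seeded so the example itself is repeatable
values = [rng.uniform(-1e6, 1e6) for _ in range(10_000)]

fixed_order = list(range(len(values)))
shuffled_order = fixed_order[:]
rng.shuffle(shuffled_order)

# Same inputs, same order: the float result is bit-identical on every run.
run_a = sum_in_order(values, fixed_order)
run_b = sum_in_order(values, fixed_order)
print("fixed order repeatable:", run_a == run_b)

# A different order may round differently, so the result can change.
run_c = sum_in_order(values, shuffled_order)
print("shuffled order matches:", run_a == run_c)

# Integer accumulation (a stand-in for quantized math) is exact,
# so the order of the additions does not matter.
ints = [round(v) for v in values]
int_fixed = sum(ints[i] for i in fixed_order)
int_shuffled = sum(ints[i] for i in shuffled_order)
print("integer sums match:", int_fixed == int_shuffled)
```

The float result depends on the order but not on chance: fixing the order fixes the output, and integer accumulation removes the order dependence altogether.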