by londons_explore
2608 days ago
Both. Some operations split and join data in non-deterministic ways (especially the order of operations, which leads to different floating-point rounding). If you shard across multiple machines, for example, the weight accumulation order will depend on network latency. Also, GPUs aren't anywhere near as reliable as CPUs when it comes to running for hours without any random bit flips or errors.
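
To make the rounding point concrete, here is a minimal sketch in plain Python (no GPU involved). The `partials` values and the `accumulate` helper are made up for illustration; they just stand in for contributions arriving from different workers in a different order.

```python
# Hypothetical contributions, as if three workers each sent one partial sum.
partials = [1e16, 1.0, -1e16]

def accumulate(values, arrival_order):
    """Add values in the order they 'arrive' (indices into values)."""
    total = 0.0
    for i in arrival_order:
        total += values[i]
    return total

# Same three numbers, two arrival orders.
result_a = accumulate(partials, [0, 1, 2])  # (1e16 + 1.0) - 1e16
result_b = accumulate(partials, [0, 2, 1])  # (1e16 - 1e16) + 1.0

print(result_a)  # 0.0, the 1.0 is lost when added to 1e16
print(result_b)  # 1.0
```

Floating-point addition is not associative, so the same inputs give different totals depending only on the order in which they are added.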
Ah, of course! A very timely reminder, thanks!
> GPUs aren't anywhere near as reliable as CPUs when it comes to running for hours without any random bit flips or errors.
Now that's worrying. A bit flip can't be expected to be skewed towards any particular bit within a float, so it could easily happen in the exponent, skewing a single value by orders of magnitude one way or the other. Combine that with the rest of your 'good' results and yuck. That's very concerning. Thanks for the warning.
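
For a rough sense of scale, here is a small sketch that flips individual bits of a value stored as an IEEE 754 float32. The `flip_bit` helper and the example weight of 0.5 are assumptions for illustration, not anything taken from a real training run.

```python
import struct

def flip_bit(value, bit):
    """Flip one bit (0 = least significant) of value's float32 encoding."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

weight = 0.5  # hypothetical weight; float32 bits are 0x3F000000

low_mantissa = flip_bit(weight, 0)    # lowest mantissa bit: barely changes
high_exponent = flip_bit(weight, 30)  # top exponent bit: becomes 2**127
sign = flip_bit(weight, 31)           # sign bit: becomes -0.5

print(low_mantissa, high_exponent, sign)
```

A flip in the low mantissa bits is lost in the noise, but a flip in the top exponent bit turns 0.5 into roughly 1.7e38.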