by sohojoe
212 days ago
Yeah, the stochasticity is there because we give up control over the order of operations in exchange for speed. The order in which floating-point additions happen is not fixed: it depends on how threads are scheduled and how reductions are structured (tree reduction vs. warp shuffle vs. block reduction). Floating-point addition is not associative (because of rounding), so:
- (a + b) + c can differ slightly from a + (b + c).
- Different execution orders → slightly different results → tiny changes in logits → occasionally different argmax token.
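
A rough CPU-only sketch of the same effect in plain Python, no GPU involved. The helper names (`sequential_sum`, `tree_sum`, `argmax`) and the input values are made up for illustration, and the numbers are deliberately contrived so the difference is large enough to see. In a real model the discrepancy would usually be far smaller and would only matter when two logits are nearly tied.

```python
# Non-associativity with three ordinary doubles.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c     # 0.6000000000000001
right = a + (b + c)    # 0.6


def sequential_sum(xs):
    """Add left to right, like a single thread walking the array."""
    total = 0.0
    for x in xs:
        total += x
    return total


def tree_sum(xs):
    """Add neighbouring pairs repeatedly, like a tree reduction."""
    xs = list(xs)
    while len(xs) > 1:
        paired = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:          # carry an unpaired last element forward
            paired.append(xs[-1])
        xs = paired
    return xs[0]


def argmax(logits):
    """Index of the largest logit (first one wins on ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Contrived values with heavy cancellation; the exact sum is 2.0.
values = [1e16, 1.0, -1e16, 1.0]
seq = sequential_sum(values)   # 1.0
tree = tree_sum(values)        # 0.0

# Treat the sum as one token's logit and compare it with a fixed rival logit.
other_logit = 0.5
token_seq = argmax([seq, other_logit])    # 0
token_tree = argmax([tree, other_logit])  # 1

print(left, right)
print(seq, tree, token_seq, token_tree)
```

Same inputs, same math on paper, but the two reduction orders round differently, and here that is enough to flip which token wins.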