|
|
|
|
|
by geeksinthewoods
194 days ago
|
|
The attention lottery framing feels especially timely now that DeepSeek's V3.2 tech report is out in the open. Seeing the actual top-k sparse routing and the post-training RL numbers spelled out makes the trade-offs concrete. Huge wins on speed and context, but every pruned token really is a quiet bet against the weird tail stuff that sometimes sparks real leaps... What struck me most is how much DeepSeek's transparency accidentally lights up the closed models too. Long-context traces and million-token windows almost certainly lean on some variant of this under the hood. This article makes those black boxes feel a lot less mysterious. It leaves me both impressed by the engineering and quietly worried about the curiosity cost. Also, the song / music video at the end is absurd in the best way! |
|