Hacker News new | ask | show | jobs
by earslap 819 days ago
The way I worded it, it might seem wrong - and I agree with you. When I said "constant" I meant without any optimizations to speed up shorter contexts, so with full designed context, architecturally, it is constant. You can pad shorter active contexts with zeroes and avoid attending to empty spaces as an optimization, but that is just an optimization, not an architectural property. If you want "more computation" you fill the context with relevant data (chain of thought, or n-shot stuff), which is the "trick" Karpathy alluded to (it provides more context to attend to), and I agree with that analysis.