by meghan_rain 1180 days ago
Ah, so you think the 32k context window works differently than e.g. the 4k davinci context window? They didn't just increase ${hyperparam}?
1 comments

Training compute goes up with approximately the 3rd power of the window size.

So turning a 4k window into a 32k window makes it 8x longer, and 8^3 = 512, which means a 512x increase in the compute they'd need (just to maintain similar output quality).
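
For anyone who wants to check the arithmetic, here is a rough sketch. The function name and the `exponent` parameter are made up for illustration, and the cubic exponent is just the estimate from this comment, not a published figure. It's left adjustable so other scaling assumptions can be plugged in.

```python
def compute_multiplier(old_window: int, new_window: int, exponent: float = 3.0) -> float:
    """Relative training compute, ASSUMING cost scales as window_size ** exponent.

    exponent=3.0 is the rough estimate from the comment above, not a measured value.
    """
    return (new_window / old_window) ** exponent


# 4k -> 32k is an 8x longer window
print(compute_multiplier(4096, 32768))              # 8 ** 3 = 512.0
# Same jump under a quadratic assumption, for comparison
print(compute_multiplier(4096, 32768, exponent=2))  # 8 ** 2 = 64.0
```

Either way the 4k to 32k jump is expensive if nothing else changes. The exact multiplier depends entirely on which exponent you believe.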

I suspect they must have found a better solution to be able to scale the window so big. They haven't announced what it is.

Very interesting, thanks