| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by vishal0123 1204 days ago

From the paper

> For contexts and models with d_model > n_ctx/12, the context-dependent computational cost per token is a relatively small fraction of the total compute.

For GPT3, n_ctx is 4096 and d_model is 12228 >> 4096/12.