by vishal0123
1204 days ago
From the paper:

> For contexts and models with d_model > n_ctx/12, the context-dependent computational cost per token is a relatively small fraction of the total compute.

For GPT-3, n_ctx is 4096 and d_model is 12288 >> 4096/12.
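
A rough sanity check of that inequality, as a minimal sketch. It assumes the paper's per-token forward-pass estimate of roughly 2N + 2 * n_layer * n_ctx * d_model FLOPs with N ≈ 12 * n_layer * d_model^2, takes n_ctx = 4096 as quoted in this comment, and takes d_model = 12288 as reported for the largest GPT-3 model. The function name `context_fraction` is made up for illustration.

```python
def context_fraction(n_ctx: int, d_model: int) -> float:
    """Ratio of context-dependent FLOPs to parameter FLOPs per token.

    Per layer (assumed estimate): the parameter term is 2 * 12 * d_model**2
    and the context-dependent term is 2 * n_ctx * d_model, so n_layer cancels
    and the ratio reduces to n_ctx / (12 * d_model).
    """
    parameter_flops = 2 * 12 * d_model ** 2
    context_flops = 2 * n_ctx * d_model
    return context_flops / parameter_flops


# Values as used in the comment above (n_ctx) and for GPT-3 175B (d_model).
ratio = context_fraction(n_ctx=4096, d_model=12288)
print(f"context-dependent share relative to parameter FLOPs: {ratio:.4f}")
```

Under these assumptions the ratio is 1/36, so the context-dependent term is under 3% of the parameter term. It reaches parity only when n_ctx = 12 * d_model, which is where the d_model > n_ctx/12 threshold comes from.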