by weinzierl
1035 days ago
Their disadvantage is that they are more resource-intensive (time and memory) in both training and inference. For a fair comparison you would have to compare an 8k to a 32k pre-trained model with otherwise similar hyper-parameters. OP is about a 32k sugar-coated Llama 2, so I would expect it to be similar in performance to other Llama 2 derivatives.
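To put rough numbers on the memory and time cost, here is a back-of-the-envelope sketch. It assumes the Llama 2 7B shape (32 layers, 32 attention heads, head dimension 128), fp16 values, batch size 1, and no grouped-query attention; the function names are made up for illustration. It only counts the KV cache and the relative size of the attention score matrix, so treat it as an estimate, not a measurement.

```python
# Rough cost estimate for 8k vs 32k context.
# Assumed (not from the thread): Llama 2 7B shape, fp16, batch size 1.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


def attention_score_ratio(long_len, short_len):
    """Relative number of query-key pairs; grows with the square of length."""
    return (long_len / short_len) ** 2


GIB = 1024 ** 3
short_ctx = 8 * 1024
long_ctx = 32 * 1024

print(f"KV cache at 8k:  {kv_cache_bytes(short_ctx) / GIB:.1f} GiB")
print(f"KV cache at 32k: {kv_cache_bytes(long_ctx) / GIB:.1f} GiB")
print(f"Attention score work, 32k vs 8k: "
      f"{attention_score_ratio(long_ctx, short_ctx):.0f}x")
```

Under these assumptions the cache grows linearly with context length (4x from 8k to 32k), while the attention score computation grows quadratically (16x). Model weights and other activations come on top of that.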
Sorry for the random question. I've been curious about this for a while and haven't been able to find an answer, and you seem knowledgeable about these extended models.