by rasbt
849 days ago
Yes, it's 8.5B params if you account for weight tying, and 9.3B if you count the embedding layer and output layer weights separately, as shown in the 2nd figure in the article. In the paper, I think they justified 7B by only counting the non-embedding parameters (7,751,248,896), which is kind of cheating in my opinion, because if you do that, then Llama 2 is basically a 5B-6B param model.

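For anyone who wants to check the arithmetic, here is a rough sketch of how the 8.5B and 9.3B figures can fall out of the non-embedding count quoted above. The vocabulary size (256,000) and hidden width (3,072) are assumptions on my part, not numbers from the comment, and `count_params` is just a hypothetical helper for illustration.

```python
# Non-embedding parameter count quoted in the comment above.
NON_EMBEDDING_PARAMS = 7_751_248_896

# Assumed values (not stated in the comment): vocabulary size and hidden width.
VOCAB_SIZE = 256_000
HIDDEN_DIM = 3_072


def count_params(non_embedding: int, vocab_size: int, hidden_dim: int, tied: bool) -> int:
    """Total parameters, counting the output layer only when it is not tied."""
    embedding = vocab_size * hidden_dim
    # With weight tying, the output layer reuses the embedding matrix,
    # so it contributes no additional parameters.
    output_layer = 0 if tied else vocab_size * hidden_dim
    return non_embedding + embedding + output_layer


tied_total = count_params(NON_EMBEDDING_PARAMS, VOCAB_SIZE, HIDDEN_DIM, tied=True)
untied_total = count_params(NON_EMBEDDING_PARAMS, VOCAB_SIZE, HIDDEN_DIM, tied=False)

print(f"tied:   {tied_total:,} (~{tied_total / 1e9:.1f}B)")
print(f"untied: {untied_total:,} (~{untied_total / 1e9:.1f}B)")
```

Under those assumptions the embedding matrix alone is roughly 0.8B parameters, which is why leaving it out moves the headline number so much.
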
That's the danger of too much abstraction: it's easy to have big gaps in one's understanding of what's really going on.