Hacker News new | ask | show | jobs
by rasbt 849 days ago
Yes, it's 8.5B params if you account for weight tying, and 9.3B if you count the embedding layer and output layer weights separately as shown in the 2nd figure in the article. In the paper, I think they justified 7B by only counting the non-embedding parameters (7,751,248,896), which is kind of cheating in my opinion, because if you do that, then Llama 2 is basically a 5B-6B param model.
1 comments

Is the 2B measured like that as well? I did use it with llama.cpp and noticed it ran slower than I expected.

That's the danger of too much abstraction, it's easy to have big gaps in one's understanding of what's really going on.

Yes, it's somewhat similar to the 2B model as it uses the same vocabulary size.