by andy99 846 days ago
Haha really? I almost added the caveat that I didn't count the parameters myself. And I couldn't see the weights file size because it requires login (because of their restrictive licensing choice). If true and it's 9B that's really dishonest.

Yes, it's 8.5B params if you account for weight tying, and 9.3B if you count the embedding layer and output layer weights separately as shown in the 2nd figure in the article. In the paper, I think they justified 7B by only counting the non-embedding parameters (7,751,248,896), which is kind of cheating in my opinion, because if you do that, then Llama 2 is basically a 5B-6B param model.
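For anyone who wants to check the arithmetic, here's a rough sketch. The vocabulary size (256,128) and hidden width (3,072) are assumptions taken from the published Gemma 7B config, not from this thread; the non-embedding count is the figure quoted above.

```python
# Assumed (not stated in the thread): Gemma 7B vocabulary size and hidden width.
VOCAB_SIZE = 256_128
D_MODEL = 3_072

# Non-embedding parameter count quoted in the comment above.
NON_EMBEDDING = 7_751_248_896

# One vocab-by-hidden matrix; with weight tying it is shared by input and output.
embedding = VOCAB_SIZE * D_MODEL

tied = NON_EMBEDDING + embedding        # embedding counted once
untied = NON_EMBEDDING + 2 * embedding  # input and output counted separately

print(f"embedding matrix: {embedding / 1e9:.2f}B")
print(f"tied total:       {tied / 1e9:.2f}B")
print(f"untied total:     {untied / 1e9:.2f}B")
```

Under those assumptions the tied total rounds to 8.5B and the untied total to 9.3B, matching the two figures above.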
Is the 2B measured like that as well? I did use it with llama.cpp and noticed it ran slower than I expected.

That's the danger of too much abstraction, it's easy to have big gaps in one's understanding of what's really going on.

Yes, the 2B model is somewhat similar, since it uses the same vocabulary size.
Practically speaking, I OOM'd running Gemma on a 3090 using a config that had VRAM to spare for Mistral 7B. It kinda surprised me at first, until I realized why.
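A back-of-the-envelope sketch of the weight footprint alone, assuming 16-bit weights and ignoring the KV cache and activations. The helper name `weights_gib` is hypothetical, and 7.0B is just the nominal size for comparison, not a measured count for any particular model.

```python
def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GiB (16-bit by default)."""
    return num_params * bytes_per_param / 2**30

# 8.5B is the tied-weights figure from the thread; 7.0B is the nominal "7B".
actual = weights_gib(8.5e9)
nominal = weights_gib(7.0e9)

print(f"8.5B params: {actual:.1f} GiB")
print(f"7.0B params: {nominal:.1f} GiB")
print(f"difference:  {actual - nominal:.1f} GiB")
```

A gap of a few GiB is enough to tip a config that was already close to the 3090's 24 GB limit.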