Hacker News new | ask | show | jobs
by rileyphone 787 days ago
The bigger size is probably from the bigger vocabulary in the tokenizer. But most people are running this model quantized at least to 8 bits, and still reasonably down to 3-4 bpw.
1 comments

> The bigger size is probably from the bigger vocabulary in the tokenizer.

How does that affect anything? It still uses 16 bit floats in the model doesn't it?