Hacker News
by brokensegue 848 days ago
I'm just looking for ballpark figures, maybe on a common AWS instance type.
2 comments

Not sure if this is of any value to you, but a Ryzen 7 generates 2 tokens per second for the 7B-Instruct model.

The model itself is very unimpressive and I see no reason to play with it over the worst alternative from Hugging Face. I can only imagine this was released for some bizarre compliance reasons.

The metrics suggest it's much better than that.
For the 7B IT (instruction-tuned) model and a short factual query, I see 5.3 tokens per second (tps) on a 5-year-old Skylake Gold 6154 CPU @ 3.00GHz with 16 threads. Expect a slight increase as we improve scalability.

FYI using the NUQ (4.5-bit) quantization improves throughput by about 1.4x.
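
For a rough ballpark, the figures quoted in this thread can be turned into time-per-reply estimates. This is only a sketch: it assumes the ~1.4x NUQ speedup applies on top of the 5.3 tps Skylake number (the thread does not confirm which baseline it refers to), and the 200-token reply length and all names below are hypothetical.

```python
# Throughput figures quoted in this thread, in tokens per second (tps).
REPORTED_TPS = {
    "Ryzen 7, 7B-Instruct": 2.0,
    "Skylake Gold 6154 (16 threads), 7B IT": 5.3,
}

# Assumption: the ~1.4x NUQ speedup applies to the 5.3 tps baseline.
NUQ_SPEEDUP = 1.4
NUQ_LABEL = "Skylake Gold 6154 + NUQ (assumed)"


def seconds_for(tokens: int, tps: float) -> float:
    """Time to generate `tokens` tokens at a steady rate of `tps`."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return tokens / tps


def estimates(tokens: int = 200) -> dict:
    """Estimated generation time in seconds for each quoted configuration."""
    out = {name: seconds_for(tokens, tps) for name, tps in REPORTED_TPS.items()}
    base = REPORTED_TPS["Skylake Gold 6154 (16 threads), 7B IT"]
    out[NUQ_LABEL] = seconds_for(tokens, base * NUQ_SPEEDUP)
    return out


if __name__ == "__main__":
    reply_tokens = 200  # hypothetical reply length
    for name, secs in estimates(reply_tokens).items():
        print(f"{name}: ~{secs:.0f} s for {reply_tokens} tokens")
```

These are steady-state generation estimates only; prompt processing time and longer contexts are not accounted for.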