Hacker News new | ask | show | jobs
by ingenieroariel 841 days ago
I just did:

./mixtral-8x7b-instruct-v0.1.Q8_0.llamafile --cli -t 16 -n 200 -p "In terms of Lasso"

I got 15 tokens per second for prompt evaluation and 8 tokens per second for regular eval.

The same hardware can run things much faster on OSX, or if you use more quantization but I prefer to run things at Q8 or f16 even if they are slow. In the future I how to use GPU, ANE and the crazy 1.58 or 0.68 bit quantization but for now this does the trick handsomely.