by ingenieroariel
841 days ago
I just ran: ./mixtral-8x7b-instruct-v0.1.Q8_0.llamafile --cli -t 16 -n 200 -p "In terms of Lasso" and got 15 tokens per second for prompt evaluation and 8 tokens per second for regular eval. The same hardware can run things much faster on OSX, or with more aggressive quantization, but I prefer to run things at Q8 or f16 even if they are slow. In the future I hope to use the GPU, the ANE and the crazy 1.58 or 0.68 bit quantization, but for now this does the trick handsomely.
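For a rough sense of what those rates mean in wall-clock time, here is a back-of-the-envelope sketch. The function name and the idea of simply adding the two phases are my own simplification; it ignores model load time and anything else not covered by the two throughput figures above.

```python
def estimated_seconds(prompt_tokens, gen_tokens, prompt_tps=15.0, eval_tps=8.0):
    """Rough run time from the two reported rates.

    prompt_tps and eval_tps default to the figures quoted above
    (15 tok/s prompt evaluation, 8 tok/s regular eval).
    Assumption: total time is just prompt time plus generation time.
    """
    return prompt_tokens / prompt_tps + gen_tokens / eval_tps


# Generation only, for the -n 200 tokens requested on the command line.
print(estimated_seconds(prompt_tokens=0, gen_tokens=200))
```

So the 200 generated tokens alone should take on the order of 25 seconds at 8 tokens per second; the short prompt adds very little on top of that.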