Hacker News
by ingenieroariel 837 days ago
I have.

On a Mac Studio with NixOS-based Asahi Linux and 128GB of RAM, mixtral 8x7b uses 49GB of RAM. At the same time I load airflow tasks that deal with worldwide datasets (using ~60GB on 16 parallel streams with the performance cores). The data format is parquet, and it is also mmapped.
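For anyone unfamiliar with mmap, here is a minimal sketch of the idea using only the Python standard library. It is not the actual pipeline from this setup: the demo file and the `read_slice_mmapped` helper are hypothetical, and a real parquet reader would sit on top of something like this. It only shows that a mapped file can be sliced without reading the whole thing into memory first.

```python
import mmap
import os
import tempfile


def read_slice_mmapped(path: str, offset: int, length: int) -> bytes:
    """Return `length` bytes starting at `offset` via a read-only memory map."""
    with open(path, "rb") as f:
        # A length of 0 maps the whole file; the OS pages data in on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


# Hypothetical demo file standing in for a large dataset.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 1000)
    demo_path = tmp.name

chunk = read_slice_mmapped(demo_path, 10, 5)
os.remove(demo_path)
print(chunk)
```

Because mapped pages live in the OS page cache, the kernel can evict them under memory pressure, which is one reason a large mapped dataset can coexist with a large model in RAM.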

The computer still has 8 efficiency cores and the whole GPU free for visualizing the maps using lonboard, browsing, etc.

The computer uses 8-10W when idle, ~100W when running jobs or actively using the LLM, and ~200W when really using the GPU.

This makes it very efficient energy-wise in my book, compared to keeping a beast with a modern CPU and NVIDIA GPU powered on when idle. My electricity bill is unaffected.
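To put those wattages in perspective, here is a rough sketch that converts the figures quoted above into energy per day. It assumes the machine sits at one power level for a full 24 hours, which is a simplification; `daily_kwh` is just an illustrative helper, and no electricity price is assumed.

```python
def daily_kwh(watts: float, hours: float = 24.0) -> float:
    """Energy in kWh for a constant draw of `watts` over `hours`."""
    return watts * hours / 1000.0


# Power levels taken from the comment above (upper end of the idle range).
for label, watts in [("idle", 10), ("jobs or LLM", 100), ("heavy GPU", 200)]:
    print(f"{label}: {daily_kwh(watts):.2f} kWh/day")
```

At the idle figure that works out to about a quarter of a kWh per day.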


Interesting, thanks for sharing that! Out of curiosity, what kind of performance do you get with that setup and mixtral 8x7b, in terms of tokens/second?
I just did:

./mixtral-8x7b-instruct-v0.1.Q8_0.llamafile --cli -t 16 -n 200 -p "In terms of Lasso"

I got 15 tokens per second for prompt evaluation and 8 tokens per second for regular eval.
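As a back-of-the-envelope check on what those rates mean in practice, the sketch below estimates how long the 200 requested tokens take and the energy per generated token. It assumes the rounded 8 tokens per second above and the ~100W figure quoted earlier for active LLM use; whether that wattage applied to this exact run is an assumption, and both helper names are illustrative.

```python
def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate `n_tokens` at a steady rate."""
    return n_tokens / tokens_per_second


def joules_per_token(watts: float, tokens_per_second: float) -> float:
    """Energy per generated token at a constant power draw."""
    return watts / tokens_per_second


# -n 200 from the command above, 8 tokens/s eval rate, ~100W assumed draw.
print(generation_seconds(200, 8))
print(joules_per_token(100, 8))
```

So the 200-token run is roughly 25 seconds of generation, at around 12.5 joules per token under these assumptions.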

The same hardware can run things much faster on OSX, or if you use more quantization, but I prefer to run things at Q8 or f16 even if they are slow. In the future I hope to use the GPU, the ANE and the crazy 1.58 or 0.68 bit quantization, but for now this does the trick handsomely.