|
|
|
|
|
by fbdab103
807 days ago
|
|
I think the llamafile[0] system works the best. Binary works on the command line or launches a mini webserver. Llamafile offers builds of Mixtral-8x7B-Instruct, so presumably they may package this one up as well (potentially a quantized format). You would have to confirm with someone deeper in the ecosystem, but I think you should be able to run this new model as is against a llamafile? [0] https://github.com/Mozilla-Ocho/llamafile |
|
My recent work optimizing CPU evaluation https://justine.lol/matmul/ may have come at just the right time. Mixtral 8x7b always worked best at Q5_K_M and higher, which is 31GB. So unless you've got 4x GeForce RTX 4090's in your computer, CPU inference is going to be the best chance you've got at running 8x22b at top fidelity.