| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by freeqaz 806 days ago
	What's the easiest way to run this assuming that you have the weights and the hardware? Even if it's offloading half of the model to RAM, what tool do you use to load this? Ollama? Llama.cpp? Or just import it with some Python library? Also, what's the best way to benchmark a model to compare it with others? Are there any tools to use off-the-shelf to do that?

5 comments

fbdab103 806 days ago

I think the llamafile[0] system works the best. Binary works on the command line or launches a mini webserver. Llamafile offers builds of Mixtral-8x7B-Instruct, so presumably they may package this one up as well (potentially a quantized format).

You would have to confirm with someone deeper in the ecosystem, but I think you should be able to run this new model as is against a llamafile?

[0] https://github.com/Mozilla-Ocho/llamafile

link

jart 806 days ago

llamafile author here. I'm downloading Mixtral 8x22b right now. I can't say for certain it'll work until I try it, but let's keep our fingers crossed! If not, we'll be shipping a release as soon as possible that gets it working.

My recent work optimizing CPU evaluation https://justine.lol/matmul/ may have come at just the right time. Mixtral 8x7b always worked best at Q5_K_M and higher, which is 31GB. So unless you've got 4x GeForce RTX 4090's in your computer, CPU inference is going to be the best chance you've got at running 8x22b at top fidelity.

link

moffkalast 806 days ago

Correct me if I'm wrong, but in the tests I've run, the matmul optimizations only have an effect if there's no other blas acceleration. If one can at least offload the KV cache to cublas or run with openblas it's not really used, right? At least I didn't see any speedup in with that config when comparing that PR to the main llama.cpp branch.

link

jart 806 days ago

The code that launches my code (see ggml_compute_forward_mul_mat) comes after CLBLAST, Accelerate, and OpenBLAS. The latter take precedence. So if you're not seeing any speedup in enabling them, it's probably because tinyBLAS has reached terms of equality with the BLAS. It's obviously nowhere near as fast as cuBLAS, but maybe PCIE memory transfer overhead explains it. It also really depends on various other factors, like quantization type. For example, the BLAS doesn't support formats like Q4_0 and tinyBLAS does.

link

noman-land 806 days ago

+1 on llamafile. You can point it to a custom model.

link

varunvummadi 806 days ago

The easiest is to use vllm (https://github.com/vllm-project/vllm) to run it on a Couple of A100's, and you can benchmark this using this library (https://github.com/EleutherAI/lm-evaluation-harness)

link

sheepscreek 806 days ago

In that regard, it’s even easier to use one Apple Studio with sufficient RAM and llama.cpp or even PyTorch for inference.

link

hmottestad 806 days ago

LM Studio is a great way to test out LLMs on my MacBook: https://lmstudio.ai/

Really easy to search huggingface for new models to test directly in the app.

link

LeoPanthera 806 days ago

Make sure you get the prompt template set correctly, the defaults are wrong for a lot of models.

link

unifer1 806 days ago

Could you explain how to do this properly ? I've been having problems with the app and am wondering if this is ehy

link

LeoPanthera 805 days ago

Look at the HuggingFace page for the model you are using. (The original page, not the page for the GGUF conversion, if necessary.) This will explain the chat format you need to use.

link

bevekspldnw 805 days ago

There is a user called The Bloke on hugging face- they release pre quantized models pretty soon after the full size drop. Just watch their page and pray you can fit the 4 bit in your GPU.

I’m sure they are already working on it.

link

nathanasmith 805 days ago

TheBloke stopped uploading in January. There are others that have stepped up though.

link

bevekspldnw 805 days ago

Oh really? Who else should I be looking at?

That person is a hero, super bummed!

link

fzzzy 805 days ago

TheBloke's grant ran out.

link