| HN Mirror

	by antirez 919 days ago
	Use llama.cpp for quantized model inference. It is simpler (neither Docker nor Python required), faster (works well on CPUs), and supports many models. There are also better models than the one suggested: Mistral at 7B parameters, or Yi if you want to go larger and happen to have 32GB of memory. Mixtral MoE is the best but requires too much memory right now for most users.

4 comments

j-bos 919 days ago

I'm curious: what do you use these small LLMs for? Can you give some examples of (not too) personal use cases from the past month?

SOLAR_FIELDS 919 days ago

My understanding (I haven’t used a fine-tuned one) is that you can use one that you fine-tune yourself for narrow automation tasks, kind of like a superpowered script. In my Llama 2 7B experiments I have not gotten great results on coding tasks from the non-fine-tuned versions of the model. I haven’t tried Code Llama yet.

physicsgraph 919 days ago

Thanks for the suggestion. I'm new to running LLMs so I'll take a look at your suggestion [0]. My ~10 year old MacBook Air has 4GB of RAM, so I'm primarily interested in smaller LLMs.

[0] https://github.com/ggerganov/llama.cpp
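A rough way to judge whether a quantized model could fit in a given amount of RAM is to multiply the parameter count by the bits stored per weight. The sketch below is only a back-of-the-envelope estimate: the function names are made up for illustration, the bits-per-weight values are assumptions rather than properties of any particular llama.cpp file format, and the result ignores the context cache, runtime overhead, and whatever the OS itself needs.

```python
def approx_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


def approx_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Same estimate expressed in GiB (2**30 bytes)."""
    return approx_weight_bytes(n_params, bits_per_weight) / 2**30


# Hypothetical comparison for a 7B-parameter model at a few assumed precisions.
for bits in (16, 8, 4):
    print(f"7B params at {bits} bits/weight: ~{approx_weight_gib(7e9, bits):.1f} GiB")
```

By this estimate even a 4-bit 7B model leaves very little headroom on a 4GB machine, which is why smaller models are the natural starting point there.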

akx 919 days ago

You don't necessarily need to fit the whole model in memory – llama.cpp supports mmapping the model directly from disk in some cases. Naturally, inference speed will be affected.
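To illustrate the general mechanism (this is not llama.cpp's code), here is a minimal sketch using Python's standard `mmap` module. A memory-mapped file is paged in by the OS on demand, so reading a small slice does not require reading the whole file into RAM first. The helper name `read_slice_via_mmap` and the stand-in file are hypothetical; a real model file would be far larger.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Map a file read-only and return the bytes in [offset, offset + length)."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS loads pages lazily as they are touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


# Stand-in for a weights file: 1 MiB of a repeating 0..255 byte pattern.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)
    demo_path = tmp.name

# Read 4 bytes from the middle without loading the rest of the file explicitly.
chunk = read_slice_via_mmap(demo_path, offset=256 * 100 + 5, length=4)
os.remove(demo_path)
```

When the mapped file is larger than free RAM, the OS has to evict and re-read pages, which is where the slowdown mentioned above comes from.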

TotalCrackpot 919 days ago

Btw, shouldn't it in theory be possible to run Mixtral MoE by loading each submodel sequentially, storing its outputs, and then doing the rest of the algorithm, so that it can run on machines that cannot fit the whole model in memory?
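The idea in this question can be sketched with a toy example. Everything here is an assumption made for illustration: the "experts" are two-number scale-and-shift functions stored as JSON files, each token is routed to exactly one expert, and the routing is given up front. This is not Mixtral's architecture or any real inference engine; in a real MoE model the routing happens at every layer, so the load cost would be paid repeatedly.

```python
import json
import os
import tempfile


def save_experts(dirpath, experts):
    """Write each toy expert's weights to its own file; return the file paths."""
    paths = []
    for i, weights in enumerate(experts):
        path = os.path.join(dirpath, f"expert_{i}.json")
        with open(path, "w") as f:
            json.dump(weights, f)
        paths.append(path)
    return paths


def apply_expert(weights, x):
    """Toy expert: scale and shift every element of the token vector."""
    scale, shift = weights
    return [scale * v + shift for v in x]


def run_sequentially(expert_paths, tokens, routing):
    """Load one expert at a time, process the tokens routed to it, store outputs."""
    outputs = [None] * len(tokens)
    for expert_id, path in enumerate(expert_paths):
        wanted = [i for i, r in enumerate(routing) if r == expert_id]
        if not wanted:
            continue  # no token needs this expert, so skip the load entirely
        with open(path) as f:
            weights = json.load(f)  # only this expert's weights are resident now
        for i in wanted:
            outputs[i] = apply_expert(weights, tokens[i])
        del weights  # drop the reference before loading the next expert
    return outputs


experts = [[1.0, 0.0], [2.0, 1.0], [-1.0, 0.5]]  # hypothetical (scale, shift) pairs
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
routing = [1, 0, 1]  # hypothetical: expert index chosen for each token

with tempfile.TemporaryDirectory() as d:
    paths = save_experts(d, experts)
    sequential_out = run_sequentially(paths, tokens, routing)

# Reference result with all experts held in memory at once.
in_memory_out = [apply_expert(experts[r], t) for r, t in zip(routing, tokens)]
```

Both paths give the same outputs; the sequential one trades peak memory for one disk read per expert that is actually used.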

wfhpw 919 days ago

Yes, but loading weights into memory takes time.

TotalCrackpot 919 days ago

Yeah, I imagine sequential inference would be slower. How long do you have to wait to load these weights on a personal PC? I have not tried using those systems so far.

PeterStuer 919 days ago

Python is only used in the toolchain; the inference engine is entirely C/C++.
