| HN Mirror


by nacs 1026 days ago

Yes, 7B is perfectly usable on low-end hardware if you're using it for instruction tuning/chat.

But for code completion in an IDE, where it has to react as you type, every 100-millisecond delay in response time is noticeable.

Even with a 24GB GPU, a 7B model doesn't feel snappy enough for code-completion in an IDE.

2 comments

evolve7942 1026 days ago

GPU RAM capacity isn’t typically correlated with inference rate. Precision/quantization levels do affect model size, which in turn affects inference rate. However, I would expect a smaller model to be faster (less RAM).


brucethemoose2 1025 days ago

Llama (and many other LLMs, I presume) is so memory-bandwidth-bound that model size is a decent indicator of inference rate.

The smaller the model, the less has to be read from RAM for every single token.

Batching mixes up this calculus a bit.
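
A rough back-of-the-envelope sketch of that relationship, assuming single-stream decoding where every weight is read from memory once per generated token and nothing else limits throughput. The function names and the 1000 GB/s bandwidth figure are illustrative assumptions, not measurements from any particular GPU.

```python
def max_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on decode rate if each token requires reading all weights once.

    Assumes batch size 1 and ignores compute, KV-cache reads, and overhead.
    """
    model_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 = GB
    return bandwidth_gb_s / model_gb


def ms_per_token(params_billion, bytes_per_param, bandwidth_gb_s):
    """Best-case latency per generated token, in milliseconds."""
    return 1000.0 / max_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s)


# Hypothetical 1000 GB/s memory bandwidth, 7B parameters.
fp16_rate = max_tokens_per_second(7, 2.0, 1000)   # 16-bit weights, 14 GB
q4_rate = max_tokens_per_second(7, 0.5, 1000)     # 4-bit weights, 3.5 GB
print(f"fp16: {fp16_rate:.1f} tok/s, 4-bit: {q4_rate:.1f} tok/s")
```

Under these assumptions the 4-bit model has a ceiling four times higher than the fp16 one, purely because it is a quarter of the size. Batching changes the picture because one pass over the weights then serves several sequences at once.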


brucethemoose2 1026 days ago

This can be addressed with token streaming and input caching.

Would that be enough? shrug
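
A minimal sketch of the input-caching idea, assuming the prompt is a list of token IDs and that the editor resends the whole buffer on each keystroke. `PrefixCache` is a hypothetical name; a real runtime would keep the attention key/value state for the cached prefix, whereas this only tracks which tokens would still need processing.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Remembers the last prompt so only the changed suffix is reprocessed."""

    def __init__(self):
        self.tokens = []

    def tokens_to_process(self, prompt):
        # Tokens before `keep` match the previous prompt, so their state can be reused.
        keep = common_prefix_len(self.tokens, prompt)
        new_tokens = list(prompt[keep:])
        self.tokens = list(prompt)
        return keep, new_tokens


cache = PrefixCache()
first = cache.tokens_to_process([1, 2, 3])       # nothing cached yet
second = cache.tokens_to_process([1, 2, 3, 4])   # one token typed at the end
third = cache.tokens_to_process([1, 2, 9])       # edit near the end
print(first, second, third)
```

When typing appends to the end of the buffer, only the newly typed tokens need a forward pass; an edit earlier in the buffer invalidates everything after it.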
