| HN Mirror



	by meatmanek 53 days ago
	This model is pretty cool if you don't have a GPU: I was able to get, I think, 20 or 30 tokens per second on CPU alone (DDR4 RAM). I don't remember whether that was with Q4 or Q8. Otherwise, if you have a GPU with more than about 4 GB of VRAM, there are better models. Gemma4 and Qwen3.6 (or Qwen3.5 if you need the smaller dense models that haven't yet been released for 3.6) are a good place to start.


aziis98 53 days ago

> I was able to get I think 20 or 30 tokens per second on CPU (DDR4 ram) alone

What are you using for inference? I have a recent Intel laptop with 32 GB of DDR5, and I am getting at most 25 tps with the llama.cpp Vulkan backend. (That is the fastest; I also tried SYCL, but it is a bit slower.)

meatmanek 52 days ago

OK, I double-checked, and I get 21-22 tps with lmstudio-community/LFM2-24B-A2B-Q4_K_M.gguf running under LM Studio on my i5-12400 with 2x32GB sticks of DDR4-3200. This is with a small context (just "Write me a poem about a language model named Liquid" in `lms chat`):

    Prediction Stats:
      Stop Reason: eosFound
      Tokens/Second: 21.10
      Time to First Token: 1.827s
      Prompt Tokens: 42
      Predicted Tokens: 187
      Total Tokens: 229
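
For anyone comparing their own runs, here is a rough sketch of the wall-clock numbers implied by those stats. It assumes that Tokens/Second counts only generated tokens over decode time and that Time to First Token is mostly prompt processing; neither is confirmed by the output above, and the function names are just for illustration.

```python
# Figures copied from the Prediction Stats block above.
tokens_per_second = 21.10
time_to_first_token_s = 1.827
prompt_tokens = 42
predicted_tokens = 187
total_tokens = 229


def generation_seconds(predicted: int, tps: float) -> float:
    """Decode time, ASSUMING tps is measured over generated tokens only."""
    return predicted / tps


def prompt_rate_lower_bound(prompt: int, ttft_s: float) -> float:
    """Prompt tokens/s if all of TTFT were prompt processing.

    TTFT may also include other startup overhead, so treat this as a
    lower bound on the real prompt-processing speed (an assumption).
    """
    return prompt / ttft_s


gen_s = generation_seconds(predicted_tokens, tokens_per_second)
total_s = time_to_first_token_s + gen_s
prompt_tps = prompt_rate_lower_bound(prompt_tokens, time_to_first_token_s)

# Sanity check: the reported total is prompt + predicted tokens.
assert prompt_tokens + predicted_tokens == total_tokens

print(f"generation time: {gen_s:.1f} s")      # generation time: 8.9 s
print(f"end-to-end time: {total_s:.1f} s")    # end-to-end time: 10.7 s
print(f"prompt rate >= {prompt_tps:.1f} tok/s")  # prompt rate >= 23.0 tok/s
```

So under those assumptions the poem took about 9 seconds to generate and just under 11 seconds end to end. With a prompt this short the prompt-rate figure says very little; a longer prompt would give a more meaningful number.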
