by vikp 1102 days ago

  - I wouldn't use anything larger than a 7B model if you want decent speed.
  - Quantize to 4-bit to save RAM and run inference faster.
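
To put rough numbers on the RAM savings, here's a back-of-the-envelope sketch. It counts weight storage only and assumes 16-bit weights as the unquantized baseline (my assumption, not something stated above). Real quantized files come out somewhat larger because of per-block scale factors, and the KV cache and activations need memory on top of this. `weight_memory_gb` is just an illustrative helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # a 7B-parameter model

# Assumed baseline: 16-bit weights (hypothetical, for comparison only).
fp16_gb = weight_memory_gb(N_PARAMS, 16)
# 4-bit quantized weights, ignoring scale/zero-point overhead.
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

So under those assumptions a 7B model goes from roughly 14 GB to roughly 3.5 GB of weights, which is the difference between not fitting and fitting on a lot of consumer machines.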

Expect around 15 tokens per second on CPU (tolerable), and 5-10x that with a GPU.
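
For a sense of what those rates mean in wall-clock time, here's a quick sketch. The 300-token reply length is a made-up example, and this ignores prompt processing time, so treat the results as rough lower bounds rather than benchmarks.

```python
def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a steady decoding rate."""
    return n_tokens / tokens_per_second


CPU_TPS = 15.0               # approximate CPU rate quoted above
GPU_TPS_LOW = 5 * CPU_TPS    # 5x faster  -> 75 tokens/s
GPU_TPS_HIGH = 10 * CPU_TPS  # 10x faster -> 150 tokens/s

N_TOKENS = 300  # hypothetical reply length, chosen only for illustration

cpu_s = generation_seconds(N_TOKENS, CPU_TPS)
gpu_slow_s = generation_seconds(N_TOKENS, GPU_TPS_LOW)
gpu_fast_s = generation_seconds(N_TOKENS, GPU_TPS_HIGH)

print(f"CPU: ~{cpu_s:.0f} s")
print(f"GPU: ~{gpu_fast_s:.0f}-{gpu_slow_s:.0f} s")
```

That works out to about 20 seconds on CPU versus 2-4 seconds on GPU for the same reply.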