HN Mirror



	by selfhoster11 383 days ago
	DDR3 workstation here - R1 generates at 1 token per second. In practice, this means that for complex queries, the speed of replying is closer to an email response than a chat message, but this is acceptable to me for confidential queries or queries where I need the model to be steerable. I can always hit the R1 API from a provider instead, if I want to. Given that R1 uses 37B active parameters (compared to 32B for K2), K2 should be slightly faster than that - around 1.15 tokens/second.
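The K2 estimate above can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: it assumes generation speed scales inversely with the number of active parameters (that is, both models run at the same quantisation on the same machine and are limited by memory bandwidth), which the comment implies but does not state. The function name is made up for illustration.

```python
def scaled_tokens_per_second(baseline_tps, baseline_active_b, target_active_b):
    """Estimate throughput for another model from a measured baseline.

    Assumption: tokens/second is inversely proportional to the number of
    active parameters (same hardware, same quantisation).
    """
    return baseline_tps * baseline_active_b / target_active_b


# Figures from the comment: R1 runs at 1 token/s with 37B active parameters,
# and K2 has 32B active parameters.
k2_estimate = scaled_tokens_per_second(1.0, 37, 32)
print(f"Estimated K2 speed: {k2_estimate:.2f} tokens/s")

# Rough wait time for a reply at the measured R1 speed. The 500-token reply
# length is an arbitrary illustrative value, not a number from the thread.
reply_tokens = 500
wait_minutes = reply_tokens / 1.0 / 60
print(f"A {reply_tokens}-token reply takes about {wait_minutes:.1f} minutes")
```

The ratio 37/32 comes out just above 1.15, in line with the figure quoted in the comment.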


CamperBob2 382 days ago

That's pretty good. Are you running the real 600B+ parameter R1, or a distill, though?


selfhoster11 380 days ago

The full thing, 671B. It loses some intelligence at 1.5 bit quantisation, but it's acceptable. I could actually go for around 3 bits if I max out my RAM, but I haven't done that yet.
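For a sense of why RAM is the limit here, the weight footprint at a given quantisation can be sketched as parameters times bits per weight. This is a rough estimate under simplifying assumptions: it treats every weight as stored at the same average bit width and ignores the KV cache, quantisation scales, and any layers kept at higher precision, so real model files will differ. The function name is hypothetical.

```python
def weight_footprint_gb(total_params, bits_per_weight):
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes).

    Assumption: a uniform average bit width across all weights, with no
    overhead for scales, metadata, or the KV cache.
    """
    return total_params * bits_per_weight / 8 / 1e9


R1_TOTAL_PARAMS = 671e9  # 671B total parameters, as stated in the comment

for bits in (1.5, 3.0):
    size_gb = weight_footprint_gb(R1_TOTAL_PARAMS, bits)
    print(f"{bits} bits/weight: roughly {size_gb:.0f} GB of weights")
```

Under these assumptions, moving from 1.5 to 3 bits doubles the memory needed for the weights, which fits the remark about having to max out the RAM first.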


apitman 380 days ago

I've seen people say the models get more erratic at higher (lower?) quantization levels. What's your experience been?


selfhoster11 379 days ago

If you mean clearly, noticeably erratic or incoherent behaviour, then that hasn't been my experience for >=4-bit inference of 32B models, or in my R1 setup. I think the others might have been referring to this happening with smaller models (sub-24B), which suffer much more after being quantised below 4 or 5 bits.

My R1 most likely isn't as smart as the output coming from an int8 or FP16 API, but that's just a given. It still holds up pretty well on what I have tried.
