| HN Mirror


by amelius 85 days ago

This is <1 tok/s for the 40GB model.

Come on, "Run" is not the right word. "Crawl" is.

Headlines like that are misleading.

2 comments

feznyng 85 days ago

Could still be useful; maybe for overnight async workloads? Tell your agent to research xyz at night and wake up to a report.


maleldil 85 days ago

Assuming 1 token per second and "overnight" being 12 hours, that's 43,200 tokens. I'm not sure what you can meaningfully achieve with that.
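For anyone who wants to plug in their own numbers, here is a rough sketch of that arithmetic. It assumes a constant decode rate for the whole run and ignores prompt processing time; `tokens_generated` is just an illustrative helper name, not something from the linked project.

```python
def tokens_generated(tokens_per_second: float, hours: float) -> int:
    """Upper bound on tokens produced at a constant decode rate."""
    seconds = hours * 3600  # 3600 seconds per hour
    return int(tokens_per_second * seconds)


# The case from the comment: 1 tok/s for a 12-hour "overnight" run.
overnight = tokens_generated(1.0, 12)
print(overnight)  # 43200
```

Since the original claim was "<1 tok/s", the real figure would be lower; at 0.5 tok/s the same call gives half as many tokens.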


zozbot234 85 days ago

Sure, but if long-term throughput is a real limitation, there are plenty of ways to address that without keeping anywhere close to all model weights in RAM (keeping them all resident is still the conventional approach with MoE). So the gain of a smaller memory footprint is quite real.


smlacy 85 days ago

Yes, and that's with virtually zero context, which makes an enormous difference for TTFT (time to first token) on MoE models.
