| HN Mirror



	by marksully 86 days ago
	Where does "1T parameter model" come from? I can only see models with 70B params or less mentioned in the repo.

2 comments

tatef 86 days ago

I'm referencing it as being possible; however, I didn't share benchmarks because, candidly, the performance would be so slow that it would only be useful for very specific tasks over long time horizons. The more practical use cases are less flashy but can achieve multiple tokens/sec (i.e., smaller MoE models where not all experts need to be loaded in memory simultaneously).
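To make the "not all experts need to be loaded in memory simultaneously" point concrete, here is a minimal sketch of on-demand expert loading with a least-recently-used cache. It is not taken from the linked repo: `ExpertCache`, `fake_load`, the capacity of 2, and the routing decisions are all hypothetical, chosen only to show the idea.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache that keeps at most `capacity` experts resident in memory."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn         # hypothetical loader (e.g. reads weights from disk)
        self.resident = OrderedDict()  # expert_id -> weights, least recently used first
        self.loads = 0                 # count of slow loads from storage

    def get(self, expert_id):
        if expert_id in self.resident:
            # Cache hit: mark this expert as most recently used.
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        # Cache miss: load the expert, then evict the oldest one if over capacity.
        weights = self.load_fn(expert_id)
        self.loads += 1
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)
        return weights


def fake_load(expert_id):
    # Stand-in for reading one expert's weights from storage.
    return f"weights-{expert_id}"


cache = ExpertCache(capacity=2, load_fn=fake_load)

# Made-up router output: the experts selected for each of four tokens.
routed = [[0, 1], [0, 1], [1, 2], [2, 3]]
for experts in routed:
    for expert_id in experts:
        cache.get(expert_id)

print(cache.loads, list(cache.resident))
```

In this toy run only two experts are ever resident at once, and repeated routing to the same experts avoids reloading them. Real throughput would depend on how often the router switches experts and on storage bandwidth, neither of which this sketch models.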


causal 86 days ago

Yeah, the title doesn't come from anything in the link. No doubt it's possible, but all that matters is speed, and we learn nothing about that here...
