Hacker News
by marksully 86 days ago
Where does "1T parameter model" come from? I can only see models with 70B params or less mentioned in the repo.
2 comments

I'm referencing it as being possible; however, I didn't share benchmarks because, candidly, the performance would be so slow that it would only be useful for very specific tasks over long time horizons. The more practical use cases are less flashy but capable of achieving multiple tokens/sec (i.e., smaller MoE models, where not all experts need to be loaded in memory simultaneously).
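To make the MoE point concrete, here is a minimal sketch, not taken from the repo, of why only a few experts need to be in memory at once: a router picks the top-k experts per token, and a small LRU cache loads just those on demand. Everything here (`ExpertCache`, `fake_loader`, `moe_forward`, the toy "experts" that just scale their input) is hypothetical and only illustrates the idea.

```python
from collections import OrderedDict


class ExpertCache:
    """Keeps at most `capacity` experts resident, evicting the least recently used."""

    def __init__(self, load_expert, capacity):
        self.load_expert = load_expert  # hypothetical loader, e.g. reads weights from disk
        self.capacity = capacity
        self.resident = OrderedDict()   # expert_id -> loaded expert
        self.loads = 0                  # how many times we had to hit "disk"

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as recently used
            return self.resident[expert_id]
        expert = self.load_expert(expert_id)
        self.loads += 1
        self.resident[expert_id] = expert
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)     # evict least recently used
        return expert


def top_k(scores, k):
    """Indices of the k highest router scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def moe_forward(x, router_scores, cache, k=2):
    """Run only the top-k experts and mix their outputs by normalized score."""
    chosen = top_k(router_scores, k)
    total = sum(router_scores[i] for i in chosen)
    return sum(router_scores[i] / total * cache.get(i)(x) for i in chosen)


def fake_loader(expert_id):
    # Stand-in for loading real weights: expert i just multiplies by (i + 1).
    return lambda x: (expert_id + 1) * x


# Four experts exist, but only two are ever resident at a time.
cache = ExpertCache(fake_loader, capacity=2)
y = moe_forward(1.0, [0.1, 0.6, 0.05, 0.25], cache, k=2)
```

With a cache this small, throughput depends on how often consecutive tokens route to experts that are already resident; every miss pays the cost of a load.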
Yeah, the title's claim doesn't appear anywhere in the link. No doubt it's possible, but all that matters is speed, and we learn nothing about that here...