by tgtweak 620 days ago
Yeah, combining these two would make a lot of sense. There is a big appetite to run larger models, even if slower, on clustered hardware. This way you can add compute to speed up token generation, rather than adding it just to be able to run the model at all.

It is also possible that some of these optimizations could help decide how to distribute the model across nodes based on the latency and bandwidth between them.