Hacker News new | ask | show | jobs
by SuchAnonMuchWow 844 days ago
The issue with their approach is that the whole LLM must fit in the chips to run at all: you need hundreds of cards to run a 7B LLM.

This approach is very good if you want to spend several millions building a large inference server to achieve the lowest latency possible. But it doesn't make sense for a lone customer buying a single card, since you wouldn't really be able to run anything on it.