| HN Mirror


	by SuchAnonMuchWow 891 days ago
	The issue with their approach is that the whole LLM must fit on the chips to run at all: you need hundreds of cards to run a 7B LLM. That works very well if you want to spend several million building a large inference server with the lowest possible latency. But it doesn't make sense for a lone customer buying a single card, since you wouldn't really be able to run anything on it.
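
To make the "whole model must fit on the chips" arithmetic concrete, here is a rough sketch that counts how many cards are needed just to hold the weights. The function name `cards_needed`, the 16-bit weight size, and the 100 MB of on-chip memory per card are all assumptions chosen for illustration; they are not specs of any real card, and the sketch ignores activations, KV cache, and any other overhead.

```python
def cards_needed(num_params: int, bytes_per_param: int, card_bytes: int) -> int:
    """Minimum number of cards whose combined on-chip memory holds the weights.

    Hypothetical helper: counts weight storage only, so a real deployment
    would need more cards than this lower bound.
    """
    if card_bytes <= 0:
        raise ValueError("card_bytes must be positive")
    total_bytes = num_params * bytes_per_param
    # Ceiling division with integers, so a partially filled card still counts.
    return -(-total_bytes // card_bytes)


# Assumed values: 7B parameters, 2 bytes per weight, 100 MB usable per card.
print(cards_needed(7_000_000_000, 2, 100_000_000))  # 140
# A single card under the same assumptions holds only 50M such parameters.
print(100_000_000 // 2)  # 50000000
```

Under those placeholder numbers a single card covers well under 1% of a 7B model, which is the point about a lone buyer: the count scales linearly with model size and inversely with per-card memory, so the exact figure depends on the real hardware and precision.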