| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by menaerus 602 days ago
	It's maybe because the assumption about low latency because everything fits in SRAM is not valid? CS-1 had 18G of SRAM, CS-2 extended it to 40G and CS-3 has 44G of SRAM. None of these is sufficient to run the inference of Llama 70B and much less so of even larger models.

1 comments

latchkey 601 days ago

Exactly. Latency is less relevant if you have to have 4 literal servers (each taking up a whole rack) to push out one single 70B model and we don't know how many concurrent user requests that actually services (probably 1).

link