


	by krasin 597 days ago
	You're correct on $/bandwidth. The point about low latency continues to be ignored, though.


menaerus 596 days ago

Maybe that's because the assumption behind it, that latency is low since everything fits in SRAM, isn't valid?

CS-1 had 18 GB of SRAM, CS-2 extended it to 40 GB, and CS-3 has 44 GB. None of these is enough to run inference on Llama 70B, let alone even larger models.
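For a rough sense of scale, here is a back-of-the-envelope sketch of the arithmetic. It assumes 16-bit weights (2 bytes per parameter), counts weights only (no KV cache or activations), and uses decimal gigabytes. Those are my assumptions, not vendor figures, and the function names are purely illustrative.

```python
import math

# On-chip SRAM per system in GB, as quoted in the comment above.
SRAM_GB = {"CS-1": 18, "CS-2": 40, "CS-3": 44}


def weights_gb(params_billion, bytes_per_param=2):
    """Weight footprint in GB: billions of params times bytes per param.

    bytes_per_param=2 is an assumption (FP16/BF16 weights).
    """
    return params_billion * bytes_per_param


def systems_needed(params_billion, sram_gb, bytes_per_param=2):
    """Minimum number of systems whose combined SRAM holds the weights alone."""
    return math.ceil(weights_gb(params_billion, bytes_per_param) / sram_gb)


if __name__ == "__main__":
    for name, sram in SRAM_GB.items():
        n = systems_needed(70, sram)
        print(f"{name}: {weights_gb(70)} GB of weights vs {sram} GB SRAM -> {n} systems")
```

Under those assumptions a 70B model needs about 140 GB for weights, so even at 44 GB per system it takes at least 4 of them. Lower-precision weights would shrink that, and the KV cache would push it back up.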


latchkey 596 days ago

Exactly. Latency matters less if you need 4 literal servers (each taking up a whole rack) to serve a single 70B model, and we don't know how many concurrent user requests that setup actually handles (probably 1).
