by krasin 597 days ago
You're correct on $/bandwidth. The point about low latency continues to be ignored, though.

Maybe that's because the assumption behind the low-latency claim, that everything fits in SRAM, doesn't hold?

CS-1 had 18 GB of SRAM, CS-2 raised that to 40 GB, and CS-3 has 44 GB. None of these is enough to run inference on Llama 70B, let alone larger models.
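
A rough sketch of the arithmetic behind that claim, using the SRAM figures above. The 2 bytes per parameter (FP16/BF16), the decimal gigabytes, and the helper names `weights_gb` and `systems_needed` are my assumptions, and KV cache and activations are ignored, so treat the result as a lower bound rather than a real sizing:

```python
# Back-of-the-envelope check: do a model's weights fit in on-wafer SRAM?
# Assumptions (not from the thread): 2 bytes per parameter (FP16/BF16),
# decimal gigabytes, and no allowance for KV cache or activations.

SRAM_GB = {"CS-1": 18, "CS-2": 40, "CS-3": 44}  # figures quoted above


def weights_gb(params_billions: int, bytes_per_param: int = 2) -> int:
    """Size of the weights alone, in decimal GB."""
    return params_billions * bytes_per_param


def systems_needed(params_billions: int, sram_gb: int, bytes_per_param: int = 2) -> int:
    """Minimum number of systems whose combined SRAM holds the weights."""
    total = weights_gb(params_billions, bytes_per_param)
    return -(-total // sram_gb)  # ceiling division on integers


if __name__ == "__main__":
    for name, sram in SRAM_GB.items():
        print(name, systems_needed(70, sram))
```

Under these assumptions a 70B model needs about 140 GB for weights alone, so it would take at least four CS-3 systems to hold it entirely in SRAM. A lower-precision format would shrink that number, and KV cache for concurrent requests would push it back up.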

Exactly. Latency matters less if it takes 4 literal servers (each taking up a whole rack) to serve a single 70B model, and we don't know how many concurrent user requests that setup actually handles (probably 1).