Hacker News new | ask | show | jobs
by menaerus 602 days ago
It's maybe because the assumption about low latency because everything fits in SRAM is not valid?

CS-1 had 18G of SRAM, CS-2 extended it to 40G and CS-3 has 44G of SRAM. None of these is sufficient to run the inference of Llama 70B and much less so of even larger models.

1 comments

Exactly. Latency is less relevant if you have to have 4 literal servers (each taking up a whole rack) to push out one single 70B model and we don't know how many concurrent user requests that actually services (probably 1).