It's maybe because the assumption about low latency because everything fits in SRAM is not valid?
CS-1 had 18G of SRAM, CS-2 extended it to 40G and CS-3 has 44G of SRAM. None of these is sufficient to run the inference of Llama 70B and much less so of even larger models.
Exactly. Latency is less relevant if you have to have 4 literal servers (each taking up a whole rack) to push out one single 70B model and we don't know how many concurrent user requests that actually services (probably 1).
CS-1 had 18G of SRAM, CS-2 extended it to 40G and CS-3 has 44G of SRAM. None of these is sufficient to run the inference of Llama 70B and much less so of even larger models.