| > is it a matter of absolute capacity still being insufficient for current model sizes This. Additionally, models aren't getting smaller, they are getting bigger, and to be useful to a wider range of users they also need more context to work from, which takes even more memory. Previously:
https://news.ycombinator.com/item?id=42003823 It could be partially the DC, but look at the rack density... to get to an equal amount of GPU compute and memory, you need 10x the rack space... https://www.linkedin.com/posts/andrewdfeldman_a-few-weeks-ag... Previously:
https://news.ycombinator.com/item?id=39966620 Now compare that to an NVL72 and the direction Dell/CoreWeave/Switch are going in with the EVO containment... far better. One can imagine that AMD might do something similar. https://www.coreweave.com/blog/coreweave-pushes-boundaries-w... |
What I’m still trying to understand is the economics.
From this benchmark: https://artificialanalysis.ai/models/llama-4-scout/providers...
Groq seems to offer near-lowest prices per million tokens and near-fastest end-to-end response times. That’s surprising because, in my understanding, speed (latency) and cost are trade-offs.
So I’m wondering: why can’t GPU-based providers offer cheaper but slower (higher-latency) APIs? Or do you think Groq/Cerebras are pricing well below cost (loss-leader style)?
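To make the trade-off I have in mind concrete, here is a rough toy model in Python. Everything in it is an assumption for illustration: the hourly cost, the single-stream speed, and the "saturation batch" are made-up numbers, not measurements from any provider, and the rule that per-stream speed stays flat until the hardware saturates and then drops in proportion is a simplification.

```python
def per_stream_speed(batch_size, single_stream_tok_s, saturation_batch):
    """Toy assumption: each stream keeps full speed until the accelerator
    saturates, then speed falls in proportion to the extra load."""
    if batch_size <= saturation_batch:
        return single_stream_tok_s
    return single_stream_tok_s * saturation_batch / batch_size


def cost_per_million_tokens(hourly_cost_usd, batch_size, stream_tok_s):
    """Cost per 1M tokens = hourly cost / tokens produced per hour."""
    tokens_per_hour = batch_size * stream_tok_s * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    HOURLY_COST = 10.0      # hypothetical USD per accelerator-hour
    SINGLE_STREAM = 50.0    # hypothetical tokens/s for one request
    SATURATION = 64         # hypothetical batch size where throughput tops out

    for batch in (1, 8, 64, 256):
        speed = per_stream_speed(batch, SINGLE_STREAM, SATURATION)
        cost = cost_per_million_tokens(HOURLY_COST, batch, speed)
        print(f"batch={batch:4d}  per-stream={speed:6.1f} tok/s  "
              f"cost=${cost:7.3f} per 1M tokens")
```

In this sketch, cost per token drops sharply as the batch grows, but once the hardware is saturated, making each request slower no longer makes it cheaper. If real GPUs behave anything like that, a "cheaper but slower" tier would have limited room below the saturated price, which is part of what I'm asking about.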