Cost-effective in what sense? Groq doesn't achieve high efficiency, only low latency, and it doesn't get there in a cost-effective way. Compare SambaNova, which achieves the same performance with 8 chips instead of 568, and at higher precision.
The most important metric, even ignoring latency, is throughput (tokens) per dollar. And according to their own benchmark [1] (famous last words :)), they're quite cost-efficient.
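To make "tokens per dollar" concrete, here's a rough back-of-the-envelope sketch. The function name, the cost model (hardware plus electricity over a fixed lifetime) and every number in it are my own placeholders, not figures from the benchmark or from either vendor, so plug in real values before drawing conclusions.

```python
def tokens_per_dollar(tokens_per_second, hardware_cost_usd, power_kw,
                      usd_per_kwh, lifetime_years):
    """Aggregate tokens served per dollar over the system's lifetime.

    Deliberately simplified: assumes 100% utilization and ignores
    hosting, networking, staffing and depreciation schedules.
    """
    hours = lifetime_years * 365 * 24
    total_tokens = tokens_per_second * hours * 3600
    total_cost = hardware_cost_usd + power_kw * hours * usd_per_kwh
    return total_tokens / total_cost


# Hypothetical systems; all values are made up for illustration only.
many_small_chips = tokens_per_dollar(
    tokens_per_second=50_000, hardware_cost_usd=10_000_000,
    power_kw=200, usd_per_kwh=0.10, lifetime_years=3)
few_big_chips = tokens_per_dollar(
    tokens_per_second=5_000, hardware_cost_usd=1_000_000,
    power_kw=10, usd_per_kwh=0.10, lifetime_years=3)

print(f"many small chips: {many_small_chips:,.0f} tokens/$")
print(f"few big chips:    {few_big_chips:,.0f} tokens/$")
```

The point is that chip count alone doesn't settle the question: what matters is the aggregate throughput the whole system sustains, divided by everything you pay for it.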
No doubt fast SRAM helps, but from a computation POV, IMHO it's that they've statically planned the computation and eliminated all locks.
Short explainer here: https://www.youtube.com/watch?v=H77tV1KcWIE (Based on their paper).
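For anyone wondering what "statically planned" means in practice, here's a toy sketch of the general idea. It is not Groq's compiler: the function names, the greedy list-scheduling heuristic and the fixed per-op latencies are all my own assumptions. Given a dependency graph with known latencies, every op gets a unit and a start cycle ahead of time, so nothing has to lock or wait on a flag at run time.

```python
from graphlib import TopologicalSorter


def plan_static_schedule(deps, latency, num_units):
    """Assign each op a (unit, start_cycle) before anything runs.

    deps:    {op: iterable of ops it depends on}
    latency: {op: cycles the op takes}, assumed known and fixed
    """
    unit_free = [0] * num_units   # cycle at which each unit becomes idle
    finish = {}                   # op -> cycle at which its result is ready
    schedule = {}
    for op in TopologicalSorter(deps).static_order():
        ready = max((finish[d] for d in deps.get(op, ())), default=0)
        # Greedy choice: the unit that can start this op earliest.
        unit = min(range(num_units), key=lambda u: max(unit_free[u], ready))
        start = max(unit_free[unit], ready)
        schedule[op] = (unit, start)
        finish[op] = start + latency[op]
        unit_free[unit] = finish[op]
    return schedule


def check_schedule(schedule, deps, latency):
    """Verify the timetable needs no run-time synchronization."""
    for op, (unit, start) in schedule.items():
        for d in deps.get(op, ()):
            # Every input is finished before the consumer's start cycle.
            assert schedule[d][1] + latency[d] <= start
    by_unit = {}
    for op, (unit, start) in schedule.items():
        by_unit.setdefault(unit, []).append((start, start + latency[op]))
    for slots in by_unit.values():
        slots.sort()
        # No two ops overlap on the same unit.
        assert all(a[1] <= b[0] for a, b in zip(slots, slots[1:]))
    return True


# Tiny example graph: c needs a and b, d needs c.
deps = {"c": ("a", "b"), "d": ("c",)}
latency = {"a": 2, "b": 1, "c": 1, "d": 1}
schedule = plan_static_schedule(deps, latency, num_units=2)
check_schedule(schedule, deps, latency)
for op, (unit, start) in sorted(schedule.items(), key=lambda kv: kv[1][1]):
    print(f"cycle {start}: unit {unit} runs {op}")
```

This only works because the latencies are deterministic. Once you have caches, dynamic arbitration or variable-latency memory, you can't fix the timetable up front, and you're back to synchronizing at run time.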