Hacker News new | ask | show | jobs
by visarga 975 days ago
The Groq AI chip startup has solved this problem. They don't use hand written kernels at all, instead they use a compiler, and they have the top speed in the world on LLaMA2-70B, 240tokens/s.

https://www.youtube.com/@GroqInc/videos

Other interesting Groq tidbits - their models are deterministic, the whole system up to thousands of chips runs in sync on the same clock, memory access and network are directly controlled without any caches or intermediaries so they also run deterministically.

That speeds up communication and allows automatic synchronisation across thousands of chips running as one single large chip. The compiler does all the orchestration/optimisation. They can predict the exact performance of an architecture from compile time.

What makes Groq different is that they started from the compiler, and only later designed the hardware.

1 comments

What is the pass rate on torchbench? This gives a more realistic measure of how good a vendor's pytorch support is.

All the big chip startups have their own pytorch compiler that works on the examples they write themselves. From what I've seen of Groq it doesn't appear to be any different.

The problem is that pytorch is incredibly permissive in what it lets users do. torch.compile is itself very new and far from optimal.