| In AI, that doesn't sound too surprising to me right now. I'm just experimenting with some local LLMs, but the differences are pretty huge: Llama 3 8B on a Raspberry Pi 5: 2-3 tokens/second (but it works!); Llama 3 8B on an RTX 4080: ~60 tokens/second; Llama 3 8B on the groq.com LPU: ~1300 tokens/second; Llama 3 70B on an AMD 7800X3D: 1-2 tokens/second; Llama 3 70B on the groq.com LPU: ~330 tokens/second. There seem to be huge gaps between CPUs, GPUs and specialized inference ASICs. I'm guessing that right now there aren't many genius-level architecture breakthroughs, and that it's more about how much memory and silicon real estate you're willing to dedicate to AI inference. |
I think Groq doesn't use quantization, so on a like-for-like basis the gap between your hardware and Groq would be even larger.
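
To put rough numbers on those gaps, here is a quick sketch that turns the quoted figures into speedup factors. The throughput values are the ones from the comment above; taking the midpoint of ranges like "2-3" is my own assumption, and `speedups` is just a hypothetical helper name.

```python
# Throughput figures quoted in the comment above (tokens/second).
# Ranges such as "2-3" are represented by their midpoint (an assumption).
throughput = {
    "8B": {"Raspberry Pi 5": 2.5, "RTX 4080": 60.0, "Groq LPU": 1300.0},
    "70B": {"AMD 7800X3D": 1.5, "Groq LPU": 330.0},
}


def speedups(results):
    """Return each platform's throughput relative to the slowest platform."""
    baseline = min(results.values())
    return {name: tps / baseline for name, tps in results.items()}


if __name__ == "__main__":
    for model, results in throughput.items():
        # Print platforms from slowest to fastest for each model size.
        ranked = sorted(speedups(results).items(), key=lambda kv: kv[1])
        for name, factor in ranked:
            print(f"Llama 3 {model:>3} on {name:<15} {factor:7.1f}x")
```

With those midpoints, the Groq figure for the 8B model comes out at roughly 520x the Raspberry Pi and about 22x the RTX 4080, and for the 70B model at roughly 220x the 7800X3D. None of this accounts for quantization differences, so treat it as an order-of-magnitude comparison only.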