Lots of comments talking about the model itself. This is Llama 2 70B, a model that has been around for a while now, so we're not seeing anything in terms of model quality (or model flaws) we haven't seen before.

What's interesting about this demo is the speed at which it is running, which demonstrates the "Groq LPU™ Inference Engine". That's explained here: https://groq.com/lpu-inference-engine/

> This is the world’s first Language Processing Unit™ Inference Engine, purpose-built for inference performance and precision. How performant? Today, we are running Llama-2 70B at over 300 tokens per second per user.

I think the LPU is a custom hardware chip, though the page talking about it doesn't make that as clear as it could. https://groq.com/products/ makes it a bit more clear - there's a custom chip, "GroqChip™ Processor".
https://groq.com/wp-content/uploads/2023/05/GroqISCAPaper202...
EDIT: i work at Groq, but i’m commenting in a personal capacity.
happy to answer clarifying questions or forward them along to folks who can :)