by simonw 909 days ago
Lots of comments talking about the model itself. This is Llama 2 70B, a model that has been around for a while now, so we're not seeing anything in terms of model quality (or model flaws) we haven't seen before.

What's interesting about this demo is the speed at which it is running, which demonstrates the "Groq LPU™ Inference Engine".

That's explained here: https://groq.com/lpu-inference-engine/

> This is the world’s first Language Processing Unit™ Inference Engine, purpose-built for inference performance and precision. How performant? Today, we are running Llama-2 70B at over 300 tokens per second per user.

I think the LPU is a custom hardware chip, though the page talking about it doesn't make that as clear as it could.

https://groq.com/products/ makes it a bit more clear - there's a custom chip, "GroqChip™ Processor".


this is running on custom hardware, if you’re curious about the underlying architecture check the publication below.

https://groq.com/wp-content/uploads/2023/05/GroqISCAPaper202...

EDIT: i work at Groq, but i’m commenting in a personal capacity.

happy to answer clarifying questions or forward them along to folks who can :)

Is it fixed to a certain LLM architecture like Llama 2? How does it deal with different architectures like MoE, for example?
It's not fixed and our chip wasn't designed with LLMs in mind. It's a general purpose, low latency, high throughput compute fabric. Our compiler toolchain is also general purpose and can compile arbitrary high performance numerical programs without the need for handwritten kernels. Because of the current importance of ML/AI we're focusing on PyTorch and ONNX models as input, but it really could be anything.

We can also deploy speech models like Whisper, for example, or image generation models. I don't know if we have any MOE architectures, but we'll be implementing Mixtral soon for sure!

Will you be selling individual cards? Are you looking for use cases in the healthcare vertical (noticed it's not on your current list)? Working in the medical imaging space and could use this tech as part of the offering. Reach out at 16bit.ai
You can buy individual cards. For example Bittware is a reseller: https://www.bittware.com/products/groq/

But it might be best if you just contact us to explain your needs: https://groq.com/contact/

I can also pass your details on to our sales team.

How easy is it for companies to set up private local servers using Groq hardware (cost and complexity)? I've got money. I want throughput.
We've built and deployed racks at a number of organizations. Can you write a message to sales explaining your needs? https://groq.com/contact/

Or if you give me your contact details I can pass them on.

How many chips are used for this demo? Do they have dram? I remember the earlier versions did not have dram.

Are they also used for training or just inference?

I think we use a system with 576 Groq chips for this demo (but I am not certain). There is no DRAM on our chip. We have 220 MB of SRAM per chip, so at 576 chips that would be 126 GB in total.
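
As a quick check of the arithmetic above, here is a minimal Python sketch. It takes the 576-chip count (which the commenter was not certain of) and the 220 MB per chip at face value, and it assumes decimal units (1 GB = 1000 MB). The function name is made up for illustration.

```python
def total_sram_gb(num_chips: int, sram_mb_per_chip: float) -> float:
    """Total on-chip SRAM across all chips, in decimal gigabytes."""
    return num_chips * sram_mb_per_chip / 1000  # assumes 1 GB = 1000 MB


if __name__ == "__main__":
    # 576 chips and 220 MB per chip are the figures quoted in the comment above.
    print(f"{total_sram_gb(576, 220):.2f} GB")
```

Under those assumptions the total is 126.72 GB, which matches the roughly 126 GB quoted.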

Graphics processors are still the best for training, but our language processors (LPUs) offer by far the best performance for inference!

Could you explain the blockers to getting back-propagation working well on your chips?
Our language processors have much lower latency and higher throughput than graphics processors so we have a massive advantage when it comes to inference. For language models particularly, time to first token is hugely important (and will probably become even more important as people start combining models to do novel things). Additionally, you probably care mostly about batch size 1. For training, latency is not the key issue. You generally want raw compute with a larger batch size. Backpropagation is just a numerical computation so you can certainly implement it on language processors, but the stark advantage we have over graphics processors in inference wouldn't carry over to training.

Does that answer your question?

Everything you say makes sense. Training is definitely more compute intensive than inference.

Training is both memory throughput and compute constrained. Much research in speeding up training goes into optimizing HBM to SRAM communication. The equivalent for your chips would be communication from the SRAM of one chip to the SRAM of another, where it sounds like your architecture has a major memory throughput advantage over GPUs. So I assume you don't have a proportional compute advantage?

By the way, it's great to see a non-von Neumann architecture showing a major performance advantage in a real-world application. And your chips are conceptually equivalent to chiplets; you should have a major cost advantage on bleeding-edge process nodes if you scale up manufacturing. Overall very impressive!

what’s the cost?
right now we’re providing this access to public, anonymous users via this demo chat interface as an alpha test.

we’ll be publishing information about API access, and pricing, shortly after the new year.

Yup, we will be price competitive with OpenAI, and much faster!
You should add latex rendering
This is really impressive. For reference, inference for llama 70b on together’s api generates text at roughly 60 tokens/second.

I can’t find any information about an api, though I’m guessing that the costs are eye watering.

If they offered a Mixtral endpoint that did 300-400 tokens per second at a reasonable cost, I can’t imagine ever using another provider.
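
To put those rates side by side, here is a rough sketch of how long a reply takes to stream at a steady generation rate. The 500-token reply length is an arbitrary assumption, the function name is hypothetical, and the model ignores time to first token, queueing, and network overhead.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to emit num_tokens at a constant generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    reply_tokens = 500  # assumed reply length, not a figure from the thread
    # 60 and 300 tokens/s are the rates quoted in the thread.
    for label, rate in [("60 tokens/s", 60), ("300 tokens/s", 300)]:
        print(f"{label}: {generation_seconds(reply_tokens, rate):.1f} s")
```

With these numbers the same reply takes about 8.3 s at 60 tokens/s and about 1.7 s at 300 tokens/s, a 5x difference.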

We don't have an API in public availability yet but that's coming soon in the new year. We will be price competitive with OpenAI but much faster. Deploying Mixtral is work in progress so keep your eyes open for that too!
Also make a long context Mistral-7B that spits 1000T/s
I'll do it if you promise to say "wow!" :D
Here you go:

https://www.youtube.com/watch?v=9c078xKGwdU

It's 850 tokens per second, so you don't have to say "wow" yet!