It's not fixed and our chip wasn't designed with LLMs in mind. It's a general purpose, low latency, high throughput compute fabric. Our compiler toolchain is also general purpose and can compile arbitrary high performance numerical programs without the need for handwritten kernels. Because of the current importance of ML/AI we're focusing on PyTorch and ONNX models as input, but it really could be anything.
We can also deploy speech models like Whisper, for example, or image generation models. I don't know if we have any MOE architectures, but we'll be implementing Mixtral soon for sure!
Will you be selling individual cards? Are you looking for use cases in the healthcare vertical (noticed its not on your current list)? Working in the medical imaging space and could use this tech as part of the offering. Reach out at 16bit.ai
I think we use a system with 576 Groq chips for this demo (but I am not certain). There is no DRAM on our chip. We have 220 MB of SRAM per chip, so at 576 chips that would be 126 GB in total.
Graphics processors are still the best for training, but our language processors (LPUs) are by far the best performance for inference!
Our language processors have much lower latency and higher throughput than graphics processors so we have a massive advantage when it comes to inference. For language models particularly, time to first token is hugely important (and will probably become even more important as people start combining models to do novel things). Additionally, you probably care mostly about batch size 1. For training, latency is not the key issue. You generally want raw compute with a larger batch size. Backpropagation is just a numerical computation so you can certainly implement it on language processors, but the stark advantage we have over graphics processors in inference wouldn't carry over to training.
Everything you say makes sense. Training is definitely more compute intensive than inference.
Training is both memory throughput and compute constrained. Much research in speeding up training goes into optimizing HBM to SRAM communication. The equivalent for your chips would be communication from the SRAM of one chip to the SRAM of another, where it sounds like your architecture has a major memory throughput advantage over GPUs. So I assume you don't have a proportional compute advantage?
By the way, it's great to see a non von Neumann architecture showing a major performance advantage in a real world application. And your chips are conceptually equivalent to chiplets; you should have a major cost advantage on bleeding edge process nodes if you scale up manufacturing. Overall very impressive!
I'm not an expert on the system architecture side of things. Maybe a Groqster who is can chime in. But the way I understand it is that you can't improve latency just by scaling, whereas you can improve throughput just by scaling, as long as it's acceptable to increase batch size. Increasing batch size is generally fine for training. It's a batch process! On the other hand, if someone comes up with a novel training process that is highly sequential then I'd expect Groq chips to do better than graphics processors in that scenario.