Hacker News
by kolinko 640 days ago
The chips are optimised for matmuls, but not for the transformer architecture per se. With dedicated ASICs and the weights hardcoded (or stored in SRAM), we could theoretically get one token per clock cycle, so millions or billions of tokens per second, not hundreds.

Etched, for example, claims to have a chip in the works that reaches 500k tok/s. That is still far from the theoretical maximum with current technology.
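A rough back-of-the-envelope sketch of that gap, in Python. The clock rates are illustrative assumptions (they are not Etched specs), and `tokens_per_second` is a hypothetical helper that simply takes the "one token per cycle" idea at face value:

```python
# Idealised ceiling: throughput = clock frequency * tokens emitted per cycle.
# Ignores memory bandwidth, pipelining stalls, sampling, and everything else.
def tokens_per_second(clock_hz: float, tokens_per_cycle: float = 1.0) -> float:
    return clock_hz * tokens_per_cycle

CLAIMED_TOK_S = 500_000  # the 500k tok/s figure quoted above

# Assumed clock rates, chosen only to span "millions" to "billions".
for label, clock_hz in [("1 MHz", 1e6), ("1 GHz", 1e9)]:
    ceiling = tokens_per_second(clock_hz)
    ratio = ceiling / CLAIMED_TOK_S
    print(f"{label}: {ceiling:,.0f} tok/s ceiling, {ratio:,.0f}x the claimed figure")
```

Under these assumptions, a 1 GHz part would have a ceiling about 2,000 times the claimed 500k tok/s, which is the sense in which the claim is "still far" from the theoretical maximum.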

A similar scenario played out with Bitcoin mining's move from GPUs to FPGAs to ASICs: the current ASICs are millions of times faster than GPUs.

1 comments

That’s fine if you never need to improve the model, which is valid in some use cases, but for chat-style interaction or even code generation you’ll regularly have to update the weights.
Depends on the chip architecture: Etched claims 0.5M tok/s with weights that can be updated. The main constraint is on the model architecture, which needs to be a specific transformer-based model. But they claim the chip can run both Mixtral and Llama, so the constraints are not too strict.