That’s fine if you never need to improve the model, which is valid in some use cases, but for chat-style interaction or even code generation you’ll regularly have to update the weights.
Depends on the chip architecture. Etched claims 0.5M tok/s with weights that can be updated. The main constraint is on the model architecture: it has to be a specific kind of transformer-based model. But they claim the chip can run both Mixtral and Llama, so the constraints aren't too stiff.