> GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data.
Where are you getting that? All the citations I've seen say the opposite, e.g.:
> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.
https://massedcompute.com/faq-answers/
> The only non-TPU fast models I'm aware of are things running on Cerebras can be much faster because of their CPUs, and Grok has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.
Both Cerebras and Grok have custom AI-processing hardware (not CPUs). The knowledge grounding thing seems unrelated to the hardware, unless you mean something I'm missing.
The citation link you provided takes me to a sales form, not an FAQ, so I can't see any further detail there.
> Both Cerebras and Grok have custom AI-processing hardware (not CPUs).
I'm aware of Cerebras' custom hardware. Like the other commenter here, I haven't heard of Grok having any. My point about knowledge grounding was simply that Grok may be achieving its low latency through guardrail/knowledge/safety trade-offs rather than custom hardware.