by danpalmer
178 days ago
Hardware is a factor here. GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data. There are lots of other factors here, but latency specifically favours TPUs. The only non-TPU fast models I'm aware of are things running on Cerebras can be much faster because of their CPUs, and Grok has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.
Where are you getting that? All the citations I've seen say the opposite, e.g.:
> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.
https://massedcompute.com/faq-answers/
> The only non-TPU fast models I'm aware of are things running on Cerebras can be much faster because of their CPUs, and Grok has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.
Both Cerebras and Grok have custom AI-processing hardware (not CPUs).
The knowledge grounding thing seems unrelated to the hardware, unless you mean something I'm missing.