Hacker News new | ask | show | jobs
by varunkmohan 1274 days ago
The magical number for performance is actually memory bandwidth which is actually lower for TPUs compared to A100s. They have more aggregate compute, but it's not trivial to use that to get very low latency on a per request basis.
1 comments

But they have highly likely internal prototypes with higher bandwidth and latency. Also, with distilled latent diffusion one can probably generate text(-images) much faster anyhow as it could produce long chunks of text at once, rather than needing recurrently feed back the new token to the inputs.