by campers 589 days ago
On Google Cloud, a server with 8 TPU v5e chips will do 2,175 tokens/second on Llama 2 70B.

https://cloud.google.com/blog/products/compute/updates-to-ai...

From https://cloud.google.com/tpu/pricing and https://cloud.google.com/vertex-ai/pricing#prediction-prices (search for ct5lp-hightpu-8t on the page), the cost for that appears to be $11.04/hr, which is just under $100k for a year (about $96.7k at 8,760 hours), or roughly half that on a 3-year commit.
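
For anyone who wants to check the arithmetic, here is a rough sketch using only the two figures quoted above ($11.04/hr and 2,175 tokens/second). It assumes the machine runs 24x7 at that benchmark throughput, which real traffic is unlikely to sustain, and it treats the 3-year commit as a flat 50% discount, which is my approximation rather than a quoted price. The function names are made up for illustration.

```python
# Back-of-the-envelope cost check for the 8x TPU v5e figures quoted above.
# Assumptions (not from the pricing pages): 100% utilization at the benchmark
# throughput, and a flat 50% discount for the 3-year commit.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_cost(hourly_usd: float) -> float:
    """Cost in USD of running one instance continuously for a year."""
    return hourly_usd * HOURS_PER_YEAR


def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


HOURLY_USD = 11.04        # on-demand price quoted in the comment
TOKENS_PER_SECOND = 2175  # Llama 2 70B throughput quoted in the comment

on_demand = annual_cost(HOURLY_USD)
committed = on_demand * 0.5  # assumed 50% discount, see note above

print(f"On-demand, per year:     ${on_demand:,.0f}")
print(f"3-year commit, per year: ${committed:,.0f}")
print(f"Per million tokens:      ${cost_per_million_tokens(HOURLY_USD, TOKENS_PER_SECOND):.2f}")
```

That works out to about $96.7k per year on demand and roughly $1.41 per million tokens at full utilization; at lower utilization the per-token cost rises proportionally.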

That seems like a better deal than millions for a few CS-3 nodes.

And they've just announced the v6 TPU:

  Compared to TPU v5e, Trillium delivers: 
  Over 4x improvement in training performance 
  Up to 3x increase in inference throughput 
  A 67% increase in energy efficiency
  An impressive 4.7x increase in peak compute performance per chip 
  Double the High Bandwidth Memory (HBM) capacity 
  Double the Interchip Interconnect (ICI) bandwidth 
https://cloud.google.com/blog/products/compute/trillium-sixt...