by janalsncm 804 days ago
Translation: you don’t need to serve 96-layer transformers for ranking and recommendation. You’re probably using a neural net with around 10-20 million parameters. But it needs to be fast and highly parallelizable, and perhaps perform well in lower precisions like f16. And it would be great to have a very large vector LUT on the same chip.

Is there a better way to compare performance across these high-end chips? The only comparable numbers I was able to find were the TFLOPS.

Meta seems to have reported these numbers for this v2 chip:

    708 TFLOPS/s (INT8) (sparsity)
    354 TFLOPS/s (INT8)
And I see Nvidia reporting these numbers for its latest Blackwell chips https://www.anandtech.com/show/21310/nvidia-blackwell-archit...

    4500 T(FL)OPS INT8/FP8 Tensor 
Am I understanding correctly that Nvidia's upcoming Blackwell chips are 5-10x faster than this one Meta just announced?
To a rough approximation, yes. The Blackwell chip also has roughly 10x the surface area of MTIA, so its cost scales up roughly in proportion.
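
For what it's worth, the ratios are easy to check from the figures quoted in this thread. A quick sketch, with two assumptions: the thread doesn't say whether the 4500 Blackwell figure is dense or with sparsity, so it is compared against both MTIA numbers, and the ~10x area ratio is the rough estimate from the reply above, not a measured die size. The variable names are just for illustration.

```python
# Throughput figures as quoted in the thread (INT8, tera-ops per second).
MTIA_V2_DENSE = 354.0
MTIA_V2_SPARSE = 708.0
BLACKWELL_INT8 = 4500.0  # assumption: thread does not say dense vs. sparse

# Rough area ratio from the reply above (an estimate, not a measurement).
AREA_RATIO = 10.0

# Raw throughput ratios, Blackwell over MTIA v2.
ratio_vs_dense = BLACKWELL_INT8 / MTIA_V2_DENSE
ratio_vs_sparse = BLACKWELL_INT8 / MTIA_V2_SPARSE

# Same ratios normalized by the ~10x area difference.
per_area_vs_dense = ratio_vs_dense / AREA_RATIO
per_area_vs_sparse = ratio_vs_sparse / AREA_RATIO

print(f"raw ratio vs MTIA dense:  {ratio_vs_dense:.1f}x")
print(f"raw ratio vs MTIA sparse: {ratio_vs_sparse:.1f}x")
print(f"per-area vs MTIA dense:   {per_area_vs_dense:.2f}x")
print(f"per-area vs MTIA sparse:  {per_area_vs_sparse:.2f}x")
```

So on the quoted numbers the raw gap is somewhere around 6x to 13x depending on which MTIA figure you compare against, and it mostly disappears once you divide by the ~10x area estimate. Peak ops alone say nothing about memory bandwidth or power, so treat this as a sanity check on the arithmetic rather than a benchmark.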