by jiggawatts
836 days ago
Or to put it another way: they’ve built a compute substrate with the right ratio of processing power to memory capacity for this workload. NVIDIA GPUs were optimised for different workloads, such as 3D rendering, that have different optimal ratios. This “supercomputer” isn’t brute force or wasteful, because it serves more requests per second: by processing each response faster, it can pipeline more of them through per unit time and per unit of silicon area.
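To make the "per unit time and per unit of silicon area" point concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name, the two hypothetical configurations, and all the numbers are made up, not measurements of any real chip. It only shows the arithmetic (throughput = requests in flight / latency, then divided by area).

```python
def requests_per_second_per_mm2(latency_s: float, in_flight: int, area_mm2: float) -> float:
    """Throughput density: requests completed per second per mm^2 of silicon.

    Uses Little's law (throughput = requests in flight / latency).
    All inputs are hypothetical; this is not a model of any real hardware.
    """
    if latency_s <= 0 or area_mm2 <= 0:
        raise ValueError("latency and area must be positive")
    throughput = in_flight / latency_s  # requests per second
    return throughput / area_mm2        # requests per second per mm^2


# Hypothetical config A: many requests batched together, but each one is slow.
slow_batched = requests_per_second_per_mm2(latency_s=0.5, in_flight=8, area_mm2=800.0)

# Hypothetical config B: fewer requests in flight, but each one finishes much sooner.
fast_pipelined = requests_per_second_per_mm2(latency_s=0.125, in_flight=4, area_mm2=800.0)

print(f"slow/batched:   {slow_batched:.3f} req/s/mm^2")
print(f"fast/pipelined: {fast_pipelined:.3f} req/s/mm^2")
```

With these made-up inputs the lower-latency configuration comes out ahead on throughput per area even with fewer requests in flight. Whether that holds for real hardware depends entirely on the actual latency, concurrency, and die-area figures.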
https://youtu.be/WQDMKTEgQnY?si=W0E9Kq6P280l3Wcl
IMO we still need an MLPerf submission or similar to really understand whether this is more efficient in general, or only more efficient if you also want to minimize latency.
Nvidia has pulled enough rabbits out of the hat when it comes to MLPerf that I’m still not convinced they can’t work some CUDA magic and undercut them on efficiency.