by moffkalast
836 days ago
It's not so much an accelerator as a way of attacking the main inference bottleneck (i.e. memory latency) with sheer brute force, by throwing money at the problem. They've built accelerators out of what is effectively pure L3 cache, with a whopping 230 MB per card. They cited something like 500 cards to load a single Mixtral instance, which probably cost over $10M to build. It's essentially a supercomputer.
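For a rough sense of where a figure like 500 cards could come from, here is a back-of-the-envelope sketch. The 230 MB per card is taken from the comment above; the parameter count (about 46.7 billion for Mixtral 8x7B) and the bytes-per-parameter values are assumptions for illustration, and the estimate covers weights only, ignoring activations, KV cache and any replication across cards.

```python
import math

SRAM_PER_CARD_BYTES = 230e6   # 230 MB per card, as quoted in the comment
MIXTRAL_PARAMS = 46.7e9       # assumed total parameter count for Mixtral 8x7B


def min_cards_for_weights(params, bytes_per_param,
                          sram_per_card=SRAM_PER_CARD_BYTES):
    """Smallest number of cards whose combined on-chip memory holds the weights."""
    return math.ceil(params * bytes_per_param / sram_per_card)


def aggregate_sram_gb(cards, sram_per_card=SRAM_PER_CARD_BYTES):
    """Total on-chip memory across `cards` cards, in GB (10^9 bytes)."""
    return cards * sram_per_card / 1e9


if __name__ == "__main__":
    # Assumed precisions: 2 bytes (16-bit) and 1 byte (8-bit) per parameter.
    print("cards for 16-bit weights:", min_cards_for_weights(MIXTRAL_PARAMS, 2))
    print("cards for 8-bit weights: ", min_cards_for_weights(MIXTRAL_PARAMS, 1))
    print("SRAM across 500 cards (GB):", aggregate_sram_gb(500))
```

Under these assumptions, 16-bit weights alone need roughly 400 cards, so a deployment of around 500 cards is at least the right order of magnitude.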
NVIDIA GPUs were optimised for different workloads, such as 3D rendering, which have different optimal ratios of compute to memory.
This "supercomputer" isn't brute force or wasteful, because it allows more requests per second. By processing each response faster, it can pipeline more of them through per unit time and per unit of silicon area.
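The throughput argument can be sketched with Little's law: for a system holding a fixed number of requests in flight, steady-state throughput is that number divided by the time each request spends in the system. The slot count and latencies below are made-up illustrative values, not measurements of any real hardware.

```python
def throughput_rps(in_flight_slots, latency_s):
    """Little's law: requests per second = requests in flight / latency per request."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return in_flight_slots / latency_s


if __name__ == "__main__":
    slots = 8  # hypothetical number of requests the hardware holds at once
    for latency in (2.0, 0.5):  # hypothetical per-response latencies in seconds
        print(f"latency {latency}s -> {throughput_rps(slots, latency)} req/s")
```

With the same number of slots, cutting latency by 4x raises requests per second by 4x. Whether that wins per unit of silicon area depends on how much area each slot costs, which this sketch does not model.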