by orbital-decay
7 days ago
Their models are organized around inference efficiency from the start; it's what they're focusing on. They also come from HFT and are good at low-level optimization. For v3, they literally reverse-engineered Nvidia GPUs to find undocumented behavior that helped against memory bottlenecks, wrote file systems for efficient model serving, and did a ton of low-level grunt work at a time when everyone else just relied on torch. Being compute-constrained helped as well - necessity is the mother of invention.

Every little improvement would save them billions, so it's hard to imagine they aren't pouring a lot of resources into that already.