| HN Mirror

I'm not an expert at hardware, so take this with a grain of salt, but there are two main reasons:

- Discrete optimisation is always going to be harder than continuous optimization. Learning the right sparsity mask is fundamentally a very discrete operation. So even just matching fully continuous dense models in optimization efficiency is likely to be difficult. Though perhaps we can get some hope from the fact that MoE is also similarly fundamentally discrete, and it works in practice (we can think of MoE as incurring some penalty from imperfect gating, which is more than offset by the systems benefits of not having to run all the experts on every forward pass). Also, the optimization problem gets harder when the backwards pass needs to be entirely sparsified computation (see appendix B).

- Dense matmuls are just fundamentally nicer to implement in hardware. Systolic arrays have nice predictable data flows that are very local. Sparse matmuls with the same number of flops nominally only need (up to a multiplicative factor) the same memory bandwidth as an equivalent dense matmul, but they need to be able to route data from any memory unit to any vector compute unit - the locality of dense matmuls means that the computation of each tile only requires a small slice of both input matrices, so we only need to load those slices into shared memory; on the other hand, because GPU-to-GPU transfers are way slower, when we op-shard matmuls, we replicate the data that is needed. Sparse matmuls would need either more replication within each compute die, or more all-to-all internal bandwidth. This means spending way more die space on huge crossbars and routing. This would cost a lot of die space, though thankfully, the crossbars consume much less power than actual compute, so perhaps this could match dense in energy efficiency and not make thermals worse.

It also seems very likely that once we create the interpretable GPT-1 (or 2, or 3) we will find that making everything unstructured sparse was overkill, and there are much more efficient pretraining constraints we can apply to models to 80/20 the interpretability. In general, a lot of my hope routes through learning things like this from the intermediate artifact (interpretable GPT-n).

To be clear, it doesn't seem literally impossible that with great effort, we could create custom hardware, and vastly improve the optimization algorithms, etc, such that weight-sparse models could be vaguely close in performance to weight-dense models. It's plausible that with better optimization the win from arbitrary connectivity patterns might offset the hardware difficulties, and I could be overlooking something that would make the cost less than I expect. But this would require immense effort and investment to merely match current models, so it seems quite unrealistic compared to learning something from interpretable GPT-3 that helps us understand GPT-5.