Random Machines: Why the Optimizer Is the Least Important Part of Deep Learning

Author here. The core idea is that when you train the same model twice with different random seeds, both runs reach the same accuracy but disagree on ~10% of predictions. The explanation connects three well-established results (loss landscape geometry, the lottery ticket hypothesis, and mode diversity in weight space) into a picture where the architecture and overparameterization are doing the real work. SGD is just rolling downhill to reveal whichever sparse subnetwork you happened to initialize near.
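
For anyone who wants to check this on their own models, here is a minimal sketch of how the disagreement number can be measured. It assumes you already have hard class predictions from two seeds on the same test set. The function names and the toy arrays are illustrative only and are not taken from the linked code.

```python
import numpy as np

def accuracy(preds, labels):
    """Fraction of examples where the predicted class matches the label."""
    return float(np.mean(np.asarray(preds) == np.asarray(labels)))

def disagreement_rate(preds_a, preds_b):
    """Fraction of examples where two models predict different classes."""
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    if preds_a.shape != preds_b.shape:
        raise ValueError("prediction arrays must cover the same examples")
    return float(np.mean(preds_a != preds_b))

# Toy data (made up): two "seeds" that are each wrong on one example,
# but on a different example.
labels  = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
preds_a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 0])  # wrong on the last example
preds_b = np.array([1, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # wrong on the first example

print(accuracy(preds_a, labels), accuracy(preds_b, labels))
print(disagreement_rate(preds_a, preds_b))
```

In the toy data both runs score the same accuracy, yet they disagree on more examples than either one gets wrong, because their errors land in different places. That gap between matching accuracy and mismatched predictions is the quantity the post is about.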

I reproduced the key findings on an RTX 3090 (ResNet20, CIFAR-10), including the cross-seed disagreement and MIMO's behavior when you try to fit multiple "tickets" into a network that's too small. Wandb logs and code are linked in the post.

Curious if anyone has seen the seed sensitivity problem bite them in production, especially on small on-device models where the landscape is more rugged and you can't afford an ensemble.