Random Machines: Why the Optimizer Is the Least Important Part of Deep Learning (sotaverified.org)
1 points by uberdavid 63 days ago
1 comments

Author here. The core idea is that when you train the same model with two different random seeds, both runs reach the same accuracy but disagree on ~10% of individual predictions. The explanation connects three well-established results (loss landscape geometry, the lottery ticket hypothesis, and mode diversity in weight space) into a picture where the architecture and overparameterization are doing the real work. SGD is just rolling downhill to reveal whichever sparse subnetwork you happened to initialize near.
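
For anyone who wants to check the disagreement number on their own runs, here is a minimal sketch of how it can be measured. It assumes you have already dumped each seed's predicted class labels on the same test set. The function names and the toy data are made up for illustration and are not taken from the linked code.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("need at least one example")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def disagreement_rate(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Fraction of examples on which two models predict different classes.

    Labels are not used: two models can have identical accuracy and still
    disagree, because they can be wrong on different examples.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("both prediction lists must have the same length")
    if len(preds_a) == 0:
        raise ValueError("need at least one example")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Toy data (hypothetical): ten test examples, one per class.
labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
seed_a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 0]  # wrong only on the last example
seed_b = [1, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # wrong only on the first example

acc_a = accuracy(seed_a, labels)
acc_b = accuracy(seed_b, labels)
disagree = disagreement_rate(seed_a, seed_b)

print(f"acc A={acc_a:.2f}  acc B={acc_b:.2f}  disagreement={disagree:.2f}")
```

In the toy data both "models" score 90% but make their single mistake on different examples, so they disagree on 2 of 10 predictions. That is the same effect at small scale: matching aggregate accuracy says nothing about which examples each seed gets wrong.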

I reproduced the key findings on an RTX 3090 (ResNet20, CIFAR-10), including the cross-seed disagreement and MIMO's behavior when you try to fit multiple "tickets" into a network that's too small. W&B logs and code are linked in the post.

Curious if anyone has seen the seed sensitivity problem bite them in production, especially on small on-device models where the landscape is more rugged and you can't afford an ensemble.