Hacker News
by bconsta 253 days ago
There is a study that gives a rule of thumb of ~2 bits per param for a model's memorization capacity: https://arxiv.org/abs/2404.05405
3 comments

Seems they have replicated Gardner's work without mentioning it: "Maximum Storage Capacity in Neural Networks" (1987), which established that the storage capacity of a neural network is about 2N patterns for N weights (2 bits per parameter).
Elizabeth Gardner for those looking.
I had no idea about this. Thanks for sharing
Recent: 3.6 bits per param

https://arxiv.org/abs/2505.24832

You're both right. The classical capacity measure (Gardner's capacity limit) is defined as the maximum number of patterns that can be remembered with zero errors. This remains 2 bits per parameter, proven mathematically.

The capacity definition in this recent paper is completely different: it is based on the Kolmogorov complexity of predicting a memorized sequence, or in layman's terms, how well known sequences can be compressed. This allows for some bit "errors", i.e. some symbols with a bad compression ratio; only the total compression ratio of the sequence is measured.

This is somewhat parallel to the classical ECC limits (strict Hamming distance constraints) vs. modern probabilistic ECC limits.

TL;DR: when you allow a small number of errors, the capacity increases from 2 bits to 3.6 bits per parameter.
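
To put the two per-parameter figures in concrete terms, here is a rough sketch. The 1-billion-parameter model size is a hypothetical picked for illustration, not a number from either paper, and the function name is made up:

```python
def capacity_bytes(num_params: int, bits_per_param: float) -> float:
    """Total memorization capacity in bytes under a bits-per-parameter rule of thumb."""
    return num_params * bits_per_param / 8


NUM_PARAMS = 1_000_000_000  # hypothetical model size, for illustration only

# The two rules of thumb quoted in this thread.
for bits in (2.0, 3.6):
    megabytes = capacity_bytes(NUM_PARAMS, bits) / 1e6
    print(f"{bits} bits/param -> roughly {megabytes:.0f} MB memorized")
```

Under these assumptions the gap between the two definitions is 250 MB vs. 450 MB of memorized content for the same model.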

2 bits out of FP8 would be 25%; 2 bits out of FP16 would be 12.5%.
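
The same arithmetic as a small sketch, extended to the 3.6-bit figure mentioned upthread. The function name is made up, and treating "fraction of stored bits" as a meaningful utilization measure is an assumption of this comment, not a claim from the papers:

```python
def memorization_fraction(bits_memorized: float, bits_stored: int) -> float:
    """Share of a parameter's stored bits that the rule of thumb attributes to memorization."""
    return bits_memorized / bits_stored


for fmt, width in [("FP8", 8), ("FP16", 16)]:
    for bits in (2.0, 3.6):
        share = memorization_fraction(bits, width)
        print(f"{bits} bits out of {fmt}: {share:.1%}")
```

With 3.6 bits the shares come out to 45% for FP8 and 22.5% for FP16.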

I've seen recent work that claimed 70% of the params are used for memorization.