by red75prime
521 days ago
The remarkable thing about this compression method is that stochastic gradient descent, for some reason, creates algorithms in the network. Not Turing-complete algorithms, of course, but algorithms nevertheless. An LLM trained on addition and multiplication data develops circuits for addition and multiplication[1]. It stands to reason that LLMs trained on human-produced data develop algorithms that try to approximate the data production process (within their computational limits).

[1] https://arxiv.org/abs/2308.01154