by whazor 521 days ago
You could consider an LLM a very lossy compression artifact: terabytes of input data go in, and a model under 100 gigabytes comes out. It is quite remarkable what such a model can do, even fabricating new output that was not in the input data.

However, in my naïvety, I wonder whether vastly simpler algorithms could be used to end up with similar results. Regular compression techniques work with speeds up to 700MB/s.
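For comparison, here is a minimal sketch of what "regular" (lossless) compression looks like using Python's standard `zlib` and `lzma` modules. The sample text and the helper name `compression_ratio` are made up for illustration. Unlike an LLM, these round-trip the input exactly and cannot produce anything that was not in it.

```python
import lzma
import zlib

def compression_ratio(data: bytes, compress) -> float:
    """Original size divided by compressed size (higher means better compression)."""
    return len(data) / len(compress(data))

# Hypothetical, highly repetitive sample; real text compresses far less well.
sample = ("the quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")

zlib_ratio = compression_ratio(sample, zlib.compress)
lzma_ratio = compression_ratio(sample, lzma.compress)

# Lossless: decompressing gives back the exact original bytes.
roundtrip_ok = zlib.decompress(zlib.compress(sample)) == sample

print(f"zlib ratio: {zlib_ratio:.1f}, lzma ratio: {lzma_ratio:.1f}, lossless: {roundtrip_ok}")
```

The ratios depend entirely on the input, so treat the numbers from this toy sample as meaningless beyond showing the mechanics.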

5 comments

The remarkable thing about this compression method is that stochastic gradient descent for some reason creates algorithms in the network. Not Turing-complete algorithms, of course, but algorithms nevertheless.

An LLM trained on addition and multiplication data develops circuits for addition and multiplication[1].

It stands to reason that LLMs trained on human-produced data develop algorithms that try to approximate the data production process (within their computational limits).

[1] https://arxiv.org/abs/2308.01154

Interesting. I am not sure whether there are any 'normal' compression techniques that actually create algorithms. That might be an interesting approach for conventional data compression as well.

  > However, in my naïvety, I wonder whether vastly simpler algorithms could be used to end up with similar results.
Almost certainly. Distillation demonstrates this. The difficulty is training. It's harder to train a smaller network and harder to train with less data. But look at humans: they ingest far less data, and certainly less diverse data. We are extremely computationally efficient. I guess you have to be when you run on meat.
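For anyone unfamiliar with distillation, here is a rough sketch of the core idea in plain Python: the small "student" model is trained to match the large "teacher" model's softened output distribution. The function names, logits, and temperature are illustrative assumptions, and this leaves out the usual hard-label term and temperature-squared scaling used in practice.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
student_far = [0.5, 1.0, 4.0]
student_close = [3.8, 1.1, 0.4]

loss_far = distillation_loss(teacher, student_far)
loss_close = distillation_loss(teacher, student_close)
print(f"far student: {loss_far:.4f}, close student: {loss_close:.4f}")
```

Minimizing this loss pushes the student toward the teacher's full output distribution rather than just its top answer, which is part of why a smaller network can recover much of the behavior.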
> they ingest far less data

True in terms of text, but not if you include video, audio, touch etc. Sure, one could argue that video carries much less information than its raw byte count suggests, but even so, we spend many years building a world model as we play with tools, exist in the world and go to school. I don't deny humans are more efficient learners, but people tend to forget this. Also, children are taught things in ascending order of difficulty, while with LLMs we just throw random pieces of text at them. There is sure to be a lot of progress in curriculum learning for AI models.

I'm not sure how accurate it is, but my gut feeling is that the level of meaningful compression is somehow correlated with the level of intelligence behind a model. I wouldn't be surprised if it ends up being a major focus in general intelligence.
This is the whole premise behind transformers and ChatGPT models and has been discussed by Ilya[0].

[0] https://the-decoder.com/openai-co-founder-explains-the-secre...
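A toy way to see the prediction/compression link: an ideal entropy coder spends about -log2(p) bits on a symbol the model predicted with probability p, so a better predictor means a shorter encoding. The sketch below compares two simple character models on a made-up sample. Both are fitted on the very text they encode and the cost of storing the model is ignored, so the numbers are optimistic. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def bits_order0(text: str) -> float:
    """Ideal code length in bits when each character is predicted from overall frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum(c * math.log2(c / n) for c in counts.values())

def bits_order1(text: str) -> float:
    """Ideal code length in bits when each character is predicted from the previous one.

    The first character has no context and is not charged here.
    """
    followers = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        followers[prev][cur] += 1
    total = 0.0
    for nexts in followers.values():
        n = sum(nexts.values())
        total -= sum(c * math.log2(c / n) for c in nexts.values())
    return total

# Hypothetical sample text.
sample = "the cat sat on the mat and the cat ate the rat " * 20

raw_bits = 8 * len(sample)          # one byte per character, no model at all
order0_bits = bits_order0(sample)   # knows character frequencies
order1_bits = bits_order1(sample)   # also knows what tends to follow what

print(f"raw: {raw_bits}, order-0: {order0_bits:.0f}, order-1: {order1_bits:.0f}")
```

The model that uses more context predicts better and therefore needs fewer bits. An LLM can be viewed as a far stronger version of the same kind of predictor, though that framing is a simplification.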

If they could get to a 5.2 Weissman compression score it would probably make a substantial difference.
You could consider the human mind to be a very lossy compression artifact.