| HN Mirror

Hacker News


	by plutonorm 1596 days ago
	It's recently been shown that even though the numbers are represented with different tokens, the network learns to form an internal representation that understands the progression from one token to the next.

1 comments

nikolayasdf123 1596 days ago

The idea that each number has to be inside one's brain, neural network, or token is plainly wrong.

The network has to grasp the "abstract" number, but it clearly did not grasp that concept.


Isinlor 1596 days ago

How would you test if it grasped the concept?


plutonorm 1596 days ago

https://arxiv.org/pdf/2201.02177.pdf

This paper shows fairly conclusively that the network 'groks' modular addition.


pfortuny 1596 days ago

Modulo 97.

This is what it is. Not "general arithmetic".
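For a sense of scale, here is a minimal sketch of what the complete modulo-97 addition task looks like as a dataset. This is only an illustration, not code from the paper; the function name `modular_addition_table` is made up for this example.

```python
P = 97  # the modulus mentioned above


def modular_addition_table(p=P):
    """Return every (a, b, (a + b) % p) triple for 0 <= a, b < p.

    Hypothetical helper for illustration; not taken from the paper.
    """
    return [(a, b, (a + b) % p) for a in range(p) for b in range(p)]


table = modular_addition_table()

# The whole task is a finite lookup table with p * p entries.
print(len(table))

# Every answer stays inside the same p residues the operands come from.
print(all(0 <= c < P for _, _, c in table))
```

The entire problem is a 97 x 97 table, so any split of it into training and held-out examples is drawn from the same small, closed set of operands and answers.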


catach 1596 days ago

Being able to extrapolate to numbers that were not in the training set, perhaps? At least that'd be a basic part of the requirement.
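One way to make "numbers that were not in the training set" concrete is sketched below. It compares a random split of operand pairs with a split that withholds some operands entirely. This assumes the modulo-97 addition setting discussed in this thread; all function names and the choice of held-out operands are hypothetical, and none of it comes from either paper.

```python
import random

P = 97  # assumed modulus, as in the modular addition example


def all_pairs(p=P):
    """Every operand pair (a, b) with 0 <= a, b < p."""
    return [(a, b) for a in range(p) for b in range(p)]


def random_split(pairs, train_fraction=0.5, seed=0):
    """Shuffle the pairs and cut them into train / held-out parts."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def operand_holdout_split(pairs, held_out):
    """Train only on pairs that avoid the held-out operands entirely."""
    train = [(a, b) for a, b in pairs if a not in held_out and b not in held_out]
    test = [(a, b) for a, b in pairs if a in held_out or b in held_out]
    return train, test


def unseen_operands(train, held):
    """Operands that occur in the held-out pairs but never in training."""
    seen = {x for pair in train for x in pair}
    return {x for pair in held for x in pair} - seen


pairs = all_pairs()

# Random split: held-out pairs are new, but their operands are (almost
# certainly) all present somewhere in the training pairs.
train, val = random_split(pairs)
print(len(unseen_operands(train, val)))

# Operand hold-out: the numbers 90..96 never appear during training.
ood_train, ood_test = operand_holdout_split(pairs, held_out=set(range(90, 97)))
print(sorted(unseen_operands(ood_train, ood_test)))
```

Under a random split the model is tested on unseen combinations of seen numbers; only the second kind of split tests numbers it has never encountered at all.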


Isinlor 1596 days ago

Sure:

Deep Symbolic Regression for Recurrent Sequences https://arxiv.org/abs/2201.04600

(Interactive demo: http://recur-env.eba-rm3fchmn.us-east-2.elasticbeanstalk.com... )

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets https://arxiv.org/abs/2201.02177

Both of these models can generalize to numbers they have not seen.


YeGoblynQueenne 1596 days ago

As far as I can tell from a quick heuristic perusal, the "Generalization Beyond Overfitting" paper reports "generalisation" _on the validation set_. That's not particularly impressive and it's not particularly "generalisation" either.

Actually, I really don't grok this (if I may). I often see deep learning work reporting generalisation on the validation set. What's up with that? Why is generalisation on the validation set more interesting than on the test set, let alone OOD data?


Isinlor 1596 days ago

The point of the paper is to show that an NN can keep learning long after it has fully memorized the training dataset.

This behavior goes against the current paradigm for thinking about NN training. It is very unexpected, much as double descent is unexpected from the classical statistics point of view, where more parameters lead to more over-fitting.

They could have split the validation set into separate validation and test sets, but I don't know what that would achieve in their case.

Fig. 1 (center) shows different train / validation splits. Fig. 2 shows a sweep over different optimization algorithms, if you are concerned about over-fitting the hyperparameters.

But to me the really interesting part is Fig. 3, which shows that the NN learned the structure of the problem.
