| HN Mirror

by uoaei 757 days ago

It's also obvious and it's hacky. Frankly I'm stunned this hasn't been tried yet. The people thinking this is a stepping stone to More Intelligence are missing the forest for the trees.

Deep learning is always and only ever about representing data abstractly. The more abstractions you can make irrelevant (why would you have to learn how to do math when the base-10 perspective on ASCII-digits is already provided for you?) the more you've biased your architecture to readily learn and understand the problem space.

Intelligence doesn't exist where a Divine Creator gave you access to this or that faculty. It's developing those faculties yourself by reasoning through the process of composing your own mental model of the problem.

3 comments

PeterisP 756 days ago

ASCII digits do not always imply base-10 numbers, they can also be identifiers (e.g. phone numbers), parts of words (IPv6, Log4j), and used in various 'written slang' such as g2g, 4ever, m8 for mate, etc, etc.

And, crucially, I'd argue that in "chatbot" tasks those other uses are more common than arithmetic, so an arbitrary focus on specifically optimizing arithmetic doesn't really make sense. The bitter lesson is that we don't want to bias our architecture according to our understanding of a specific problem space, but rather enable the models to learn the problem space directly from data.
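
To make the ambiguity concrete, here is a minimal sketch of a naive digit-aware preprocessor. The helper `digit_offsets` is hypothetical and is not the scheme from the work under discussion; it only assumes that a "number-aware" encoding tags each character with its offset inside a run of ASCII digits. It cannot tell an operand from slang or an identifier:

```python
import re

def digit_offsets(text):
    """Return, per character, its 1-based offset inside a run of
    ASCII digits, or 0 for characters that are not digits.

    Hypothetical helper for illustration only.
    """
    offsets = [0] * len(text)
    for match in re.finditer(r"[0-9]+", text):
        for i in range(match.start(), match.end()):
            offsets[i] = i - match.start() + 1
    return offsets

# An arithmetic expression: the offsets line up with digit significance.
arithmetic = digit_offsets("12+345")

# Slang and identifiers get exactly the same treatment,
# even though no base-10 quantity is involved.
slang = digit_offsets("m8")
identifier = digit_offsets("IPv6")
```

The "8" in "m8" and the "6" in "IPv6" are tagged as if they were the leading digit of a number, which is the kind of built-in bias the comment above objects to.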

uoaei 756 days ago

You're missing the picture again.

Stepping one level out in the metacognition hierarchy is the key. "Learning to learn," as it were. It is only the relative ease of implementing and deploying feedforward models like Transformers that makes it seem like we have reached an optimum, but we desperately need to move beyond it before it's entrenched too thoroughly.

PeterisP 756 days ago

Okay, but it does seem that this hack is in the entirely opposite direction; a pure transformer is more towards "learning to learn" than any special preprocessing to explicitly encode a different representation of numbers.

We probably do have to move beyond transformers, not in the direction of such hacks but towards even more general representations that could encode the whole class of such alternate representations and then learn from data which of them work best.

uoaei 747 days ago

You seem to be making my point just fine. What was your confusion, then?

imtringued 756 days ago

You seemingly missed the part where the next model could learn how to generate its own hierarchical position embeddings. The problem here is obviously that you want the model to look at position i in object a and object b where the position i was chosen by a previous layer. If anything, the answer is probably to just have a dynamic position input from the model into the RoPE embedding, then it can learn the ideal position encoding on its own.
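
A rough sketch of what a "dynamic position input" could look like, assuming the standard rotary formulation in which each pair of dimensions is rotated by an angle proportional to the position. The function name `rope` and the example position values are hypothetical; in a real model the positions would come from a learned layer rather than being typed in by hand:

```python
import math

def rope(vec, position, base=10000.0):
    """Apply a rotary position embedding to `vec`.

    Each consecutive pair of dimensions is rotated by an angle
    proportional to `position`. The position may be any float,
    not only an integer token index.
    """
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    dim = len(vec)
    out = []
    for k in range(0, dim, 2):
        theta = position * base ** (-k / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Hypothetical positions that an earlier layer might have emitted.
# Both pairs are 2.25 apart, so the two scores agree up to rounding.
score_a = dot(rope(q, 3.25), rope(k, 1.0))
score_b = dot(rope(q, 12.25), rope(k, 10.0))
```

Because the score depends only on the difference between the two positions, a model that controls the position input could in principle place digit i of one number and digit i of another at the same effective offset. Whether that can be learned well in practice is not something this sketch shows.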

pixl97 756 days ago

I'd rather not wait another billion or so years for computers to evolve themselves