by vessenes 757 days ago
Wow, a lot of grumpiness in here. If it's true that adding like 20 or so tokens to encode column location / decimal spot triples math performance in out of band tasks, that's a big deal. It's a simple fix, it improves performance A LOT, and they even indicate it's not just a party trick, in that the LLM can use the information to do better on related tasks like sorting and list making.

This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.

I'm more interested in the question of how we can find other useful concepts for data -> embedding space like this; can we incept our tokenization inception so it has more inception?
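For anyone who wants to see what "encode column location / decimal spot" could look like on the input side, here is a minimal sketch. It only annotates each digit with its place counted from the right; the function name `tag_digit_places` and the output format are made up for illustration and are not taken from the paper.

```python
import re
from typing import List, Optional, Tuple

# Group 1: a run of ASCII digits. Group 2: any other single character.
_PIECES = re.compile(r"(\d+)|(\D)", re.ASCII | re.DOTALL)

def tag_digit_places(text: str) -> List[Tuple[str, Optional[int]]]:
    """Return (symbol, place) pairs.

    For a digit, place is its 0-based column counted from the right end
    of the number it belongs to. For any other character, place is None.
    """
    tagged = []
    for match in _PIECES.finditer(text):
        number = match.group(1)
        if number is not None:
            width = len(number)
            for index, digit in enumerate(number):
                tagged.append((digit, width - 1 - index))
        else:
            tagged.append((match.group(2), None))
    return tagged

print(tag_digit_places("12+345"))
```

A model would presumably map each place index to its own learned embedding and add it to the digit's token embedding; that part is not shown here.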


This is cool, but special casing digits is unsatisfying.

It makes me think that the authors have correctly identified an issue (positional embeddings) but don't propose a general solution.

I'm not sure if such a thing is possible, but if it is, it would feel more complete. (Fwiw, positional embeddings have had issues for a long time! So a general solution to this would benefit more than just arithmetic. Helpfully, we now have a really good specific example to serve as a baseline for any generalization we seek)

but it makes sense to have a different encoding. Mathematics is a completely different language. Maybe we should have more than one class of encodings.
There were some recent posts (either here or on reddit) supporting the claim that different regions activate when reading programs vs. when reading text. If we take that to be true and squint just enough, one could claim that arithmetic and mathematics should be treated differently from language.
Numeracy is definitely associated with different brain regions than just reading. See, e.g. https://www.sciencedirect.com/science/article/pii/S105381191...

(Dehaene also has a book, “The Number Sense”)

I would only find that satisfying (from a snobbish and impractical perspective) if we were able to have the model decide: 1) what encoding should this section use? 2) how should I train this encoding?

A mixture of experts but for encodings is interesting, though!

Maybe there's a clean way to implement

For arbitrary documents and queries, how do we reliably segment the text between those two different languages? And if we can do that, why can't the model do it implicitly?
But I don't want tricks. I want to know that it knows so I don't have to continually guess whether it's right or not.
That's simply not possible. Human understanding is still unreliable, even for geniuses.
That’s why I am asking a computer.
I'm with you. I get that this is akin to asking a human, because we're trying to reason, so we will bring along (assumedly) unavoidable deficiencies of human reasoning. But if I were to ask a human genius this question, ne would grab a calculator and employ it as ne did the rest of ner reasoning.

So it seems like we should probably teach LLMs to "use a calculator", rather than try to get them to be more right when doing math 'in their head'.

Indeed, "use a calculator" is "just a trick"!
Solving that will be a much bigger deal but it's at odds with producing a highly accurate emulation of human thought and language. Language models can serve as tools to understand and experiment with logic formulated as natural language but it isn't their primary purpose. What you're asking is equivalent to creating an auditable trace of everything that goes into making a statement which is pretty much impossible even for the person making a statement. We can get close by limiting ourselves to narrow domains like mathematics but even then someone can come along and question the premises on which we construct such a system. I'm not saying it isn't worth pursuing, it just isn't the standard that we should hold a model to when we ourselves are incapable of it. The goal here is to create a system capable of doing the things that a human can do. If you prefer to have a system that behaves within the confines of a mathematical formalism with well defined rules then build that model instead.
It's entirely possible. Don't use LLMs for math. Use the computers we already have that have been capable of doing math accurately for a century. Right tool, right job.
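As a rough sketch of the "right tool" side of this: a tiny calculator that a model could be asked to call instead of doing arithmetic itself. The name `calculate` and the set of supported operators are assumptions for illustration; how a model would actually emit the expression and receive the result is not shown.

```python
import ast
import operator

# Only plain arithmetic is allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def calculate(expression: str):
    """Evaluate an arithmetic expression without calling eval()."""
    def evaluate(node):
        if isinstance(node, ast.Constant):
            value = node.value
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return value
        elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](evaluate(node.operand))
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)

print(calculate("123456789 * 987654321"))
```

Python integers are arbitrary precision, so the multiplication is exact no matter how many digits are involved.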
OP said they didn't want tricks from their LLM. Using a calculator, like we do, is technically a trick.
My calculator manages
Your calculator is deterministic. Humans and AI are not.
LLMs are deterministic. We just sample the results, no? Also, there's no reason AI can't be deterministic.
> LLMs are deterministic.

In theory, yes. In practice, parallelism combined with floating point math makes current implementations fundamentally non-deterministic.
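The underlying reason is that floating point addition is not associative, so the order in which partial sums get combined changes the result. A minimal illustration in plain Python (no GPU involved; the helper `add_in_order` is just for this example):

```python
def add_in_order(values):
    """Sum left to right with ordinary float addition."""
    total = 0.0
    for value in values:
        total += value
    return total

# Same three numbers, different grouping.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# Same three numbers, different order.
print(add_in_order([1e16, 1.0, -1e16]))  # 0.0
print(add_in_order([1e16, -1e16, 1.0]))  # 1.0
```

A parallel reduction that does not fix the order of its partial sums can therefore return slightly different logits from run to run.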

Temperature cannot ever reach 0 (this causes a division error), so they are not deterministic.
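For reference, a sketch of temperature sampling. The literal formula divides the logits by the temperature, which is where the division problem comes from. This sketch assumes the convention of treating temperature 0 as "pick the highest logit", which is the limit of the formula as the temperature shrinks; `sample_token` is a hypothetical name, not any particular library's API.

```python
import math
import random

def sample_token(logits, temperature, rng=random):
    """Return the index of the chosen token."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed convention: greedy decoding instead of dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

print(sample_token([1.0, 3.0, 2.0], temperature=0))  # 1
```

With a very small positive temperature the other weights underflow to zero, so the sampled path picks the same token as the greedy path.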
Exactly
The point is don't ask an LLM to do tasks that a calculator can do. Ask it to use the calculator, just like most humans would.
A basic transformer architecture performs only a bounded amount of computation per generated token, so it can never emulate a machine computing sufficiently hard problems.
Yes, because it's feed forward. It must have loops to be a Turing machine.
It does. The output is fed back in.
It indeed does, but it must generate a token per loop, and can thereby solve some linearly complex problems, but it cannot solve harder problems.
> This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.

This is muchhhhh different from how tokenization works today. Adding tokens to the vocabulary is free, everything outside that (i.e. string -> tokens) is going to be a major pain in the ass. Doable but annoying and error prone

Doesn’t seem as complicated as, say, coding a lexer for C. And why shouldn’t tokenisation use lexers or an equivalent?
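To make the lexer comparison concrete, here is a minimal sketch of a lexer-style pre-tokenisation pass that emits every digit as its own token before any subword vocabulary is applied. The token classes and names (`TOKEN_SPEC`, `lex`) are invented for illustration and do not describe any existing tokeniser.

```python
import re

# Order matters: earlier patterns win when several could match.
TOKEN_SPEC = [
    ("DIGIT", r"\d"),           # one token per digit
    ("WORD", r"[A-Za-z_]+"),    # runs of letters
    ("SPACE", r"\s+"),          # runs of whitespace
    ("OTHER", r"."),            # any other single character
]
LEXER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)

def lex(text):
    """Split text into (token_class, lexeme) pairs."""
    return [(match.lastgroup, match.group()) for match in LEXER.finditer(text)]

print(lex("x = 42;"))
```

The hard part raised above is not this pass itself but keeping it identical across every training and inference stack that loads the model.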
Good old software development. :( Recent case studies:

- llama.cpp wasn't tokenizing properly, and it came to a head with llama3. Essentially every local model before May 2024 is soft-deprecated, new ones have to indicate the proper tokenizer, and that currently only covers a small subset of popular models

- I recently had to review 41 Phi-3 and Llama 3 models, only 3 had the right tokenizer set

Not saying it's impossible, and we definitely should, and I bet it 100% happens, but...*shudders*

Meanwhile, I just wrote a custom tokeniser for my fan control experiment.

It features such amusements as:

- Tokens representing the current time of day and day of week, with half-hour granularity. [14:30][Monday], as the debugger reports.

- An entirely separate set of numeric tokens for CPU usage and such, on a logarithmic scale. Also features tokens for digit position, measured from the right.

- A hardcoded text tokeniser for executable paths. [/nix/store](..cut..)/bin/executable name. I didn't feel like using the usual approach, so I built a huffman compressor to generate the tokens for arbitrary text, because why not.

- Tokens representing program state - "just started", "long-running", etc.

- Tokens representing the fact that the following text is from `tail -f ~/.bash_history`.

- Start-of-segment tokens for each of the above, and also for GPU and CPU core complex power usage.

It's not that many tokens in total, and the input is structured data, so why not represent it as such? I still had sixty-five thousand tokens for the text tokeniser.
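A guess at what two of those token families might look like in code, for readers who want something concrete. Everything here is hypothetical: the function names, the `[CPU_LOG3]` format, and the choice of base-2 buckets are assumptions, not the actual implementation described above. Only the `[14:30][Monday]` shape comes from the comment.

```python
import math
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def time_tokens(timestamp: datetime):
    """Time of day rounded down to the half hour, plus day of week."""
    minute = 0 if timestamp.minute < 30 else 30
    return [f"[{timestamp.hour:02d}:{minute:02d}]",
            f"[{DAY_NAMES[timestamp.weekday()]}]"]

def log_bucket_token(percent: float, prefix: str = "CPU") -> str:
    """Map a usage percentage to a logarithmic bucket token (base 2, assumed)."""
    bucket = int(math.log2(max(0.0, percent) + 1.0))
    return f"[{prefix}_LOG{bucket}]"

print(time_tokens(datetime(2024, 5, 27, 14, 47)))  # ['[14:30]', '[Monday]']
print(log_bucket_token(7.0))                       # [CPU_LOG3]
```

With buckets like these, usage from 0 to 100 percent needs only seven tokens, which fits the point that structured input does not cost many vocabulary entries.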

engineering vs. science -> scientist-types find such hacks ugly whereas engineers have to pay bills and get things moving fast.
And when engineers accumulate enough related hacks, scientist-types may discover a pattern and find a proper, general solution. But they wouldn't get there without the pile of hacks that are effectively meta-level empirical evidence.
AI research has mostly progressed when there’s been enough processing power to avoid needing to use the old style of hacks rather than any sort of generalization going on.

AlphaZero vs Stockfish wasn’t some outgrowth of existing methods. They basically threw the old style away and started over.

Object recognition, LLMs, etc. all involved throwing what used to be unimaginable levels of data and compute at a problem that “suddenly” worked. Not saying the people at OpenAI aren’t clever, but instead that it wouldn’t have worked in 2000.

Yeah, sometimes that happens. But I wouldn't say it's a must. Scientists work on funding. Engineers work on real-world markets.
It's also obvious and it's hacky. Frankly I'm stunned this hasn't been tried yet. The people thinking this is a stepping stone to More Intelligence are missing the forest for the trees.

Deep learning is always and only ever about representing data abstractly. The more abstractions you can make irrelevant (why would you have to learn how to do math when the base-10 perspective on ASCII-digits is already provided for you?) the more you've biased your architecture to readily learn and understand the problem space.

Intelligence doesn't exist where Divine Creator gave you access to this or that faculty. It's developing those faculties yourself by reasoning through the process of composing your own mental model about the problem.

ASCII digits do not always imply base-10 numbers, they can also be identifiers (e.g. phone numbers), parts of words (IPv6, Log4j), and used in various 'written slang' such as g2g, 4ever, m8 for mate, etc, etc.

And, crucially, I'd argue that in "chatbot" tasks those other uses are more common than arithmetic, so an arbitrary focus on specifically optimizing arithmetic doesn't really make sense - the bitter lesson is that we don't want to bias our architecture according to our understanding of a specific problem space but rather enable the models to learn the problem space directly from data.

You're missing the picture again.

Stepping one level out in the metacognition hierarchy is the key. "Learning to learn" as it were. It is only the relative ease of implementation and deployment of feedforward models like Transformers that makes it seem like we have reached an optimum but we desperately need to move beyond it before it's entrenched too thoroughly.

Okay, but it does seem that this hack is in the entirely opposite direction; a pure transformer is more towards "learning to learn" than any special preprocessing to explicitly encode a different representation of numbers.

We probably do have to move beyond transformers, but not in the direction of such hacks, but rather towards even more general representations that could encode the whole class of all such alternate representations and then learn from data which of them work best.

You seem to be making my point just fine. What was your confusion, then?
You seemingly missed the part where the next model could learn how to generate its own hierarchical position embeddings. The problem here is obviously that you want the model to look at position i in object a and object b where the position i was chosen by a previous layer. If anything, the answer is probably to just have a dynamic position input from the model into the RoPE embedding, then it can learn the ideal position encoding on its own.
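To sketch what "a dynamic position input into the RoPE embedding" could mean mechanically: RoPE only needs a number per token to rotate by, so nothing stops that number from being a real-valued position supplied by an earlier layer instead of the token index. The code below is a plain-Python illustration under that assumption. The positions are passed in by hand, and how a model would learn to produce them is not shown.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs of a vector by position-dependent angles."""
    dim = len(vector)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    rotated = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [0.3, -1.2, 0.7, 0.5]
key = [1.1, 0.4, -0.6, 0.9]

# The attention score depends only on the difference between positions,
# and the positions do not have to be integers.
score_a = dot(rope(query, 3.5), rope(key, 1.5))
score_b = dot(rope(query, 7.0), rope(key, 5.0))
print(math.isclose(score_a, score_b))  # True
```

Because only the difference matters, a model that assigned the same position to digit i of two different numbers would make those digits attend to each other as if they sat in the same column.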
I'd rather not wait another billion or so years for computers to evolve themselves