Hacker News new | ask | show | jobs
by visarga 1179 days ago
Current LLMs decode in a greedy manner, token by token. In some cases this is good enough - namely for continuous tasks, but in other cases the end result means the model has to backtrack and try another approach, or edit the response. This doesn't work well with the way we are using LLMs now, but could be fixed. Then you'd get a model that can do discontinuous tasks as well.

>> Write a response that includes the number of words in your response.

> This response contains exactly sixteen words, including the number of words in the sentence itself.

It contains 15 words.

The model would have to plan everything before outputting the first token if it were to solve the task correctly. Works if you follow up with "Explicitly count the words", let it reply, then "Rewrite the answer".

3 comments

> but could be fixed

How? The problem is known for a while, for example this article [0] mentions it (as Chain of Thought reasoning). You could think that just having a scratchpad of tokens is enough - you can arguably plan, backtrack and rewrite there [1], right? But this doesn't really work, at least yet - maybe because it wasn't trained for that - and maybe ChatGPT massive logs (probably available only for OpenAI) can help. But the Microsoft report [2] suggests we need a different architerture and/or algorithms? They mention lack of planning and retrospective thinking as a huge problem for GPT-4. Maybe you know some articles on the ideas how to fix this? Backtracking, trying again seems to be linked to human thought - and very well can give us AGI.

[0] https://arxiv.org/abs/2201.11903

[1] https://www.reddit.com/r/ChatGPT/comments/120fi8e/chatgpt_4_...

[2] https://arxiv.org/abs/2303.12712

You may be shocked to hear this but dijkstra’s short path algorithm is the technical answer to this question. We just don’t use it because it’s expensive.
Language chains or tool use where it can also call on itself to solve subproblems. If you don't have to do just one round of LLM interaction you can do complex stuff.
Backtracking to edit the response is theoretically easily solved by training on a masked language modeling objective instead of an autoregressive one, but using it to actually generate text is a bit expensive because you can't just generate one token at a time and be done, you might have to reevaluate each output token every time another token is changed. So I expect autoregressive generation to remain the default until the recomputation effort can be significantly reduced or hardware advances make the cost bearable.
>> Backtracking to edit the response is theoretically easily solved by training on a masked language modeling objective instead of an autoregressive one, but using it to actually generate text is a bit expensive because you can't just generate one token at a time and be done, you might have to reevaluate each output token every time another token is changed.

I can't imagine how training on masked tokens can "easily" solve backtracking, even in theory. Do you have some literature I could read on this?

Discrete diffusion with rewriting can work well. It feels loosely similar to backtracking, if you assume n_steps large enough - need to be able to rewrite any non-provided position though I think (not all setups do this). Downside is the noise in discrete diffusion (in simplest case randomizing over all vocabulary space) is pretty harsh and makes things very difficult practically. Don't have an exact reference on the relationship, but feels similar to backtracking type mechanics in my experience. I found things tend to "lock in" quickly once a good path is found, which feels a lot like pathfinding to me.

Some early personal experiments with adding "prefix-style" context by a cross-attention (in the vein of PerceiverAR) seemed like it really helped things along, which would kind of point to search-like behavior as well.

Probably the closest theory I can think of is orderless NADE, which builds on the "all orders" training of https://arxiv.org/abs/1310.1757 , which in my opinion closely relates to BERT and all kinds of other masked language work. There's a lot of other NAR language work I'm skipping here that may be more relevant...

On discrete diffusion:

Continuous diffusion for categorical data shows some promise "walking the boundary" between discrete and continuous diffusion https://arxiv.org/abs/2211.15089 , personally like this direction a lot.

If you have a pre-made embedding space, SSD-LM is a straightforward method https://arxiv.org/abs/2210.17432

SUNDAE worked well for translation https://arxiv.org/abs/2112.06749 and many other tasks.

My own contribution, SUNMASK, worked reasonably well for symbolic music/small datasets (https://openreview.net/forum?id=GIZlheqznkT), but really struggled with anything text or moderately large vocabulary, maybe due to training/compute/arch issues. Personally think large vocabulary discrete diffusion (thinking of the huge vocabs in modern universal LM work) will continue to be a challenge.

Decoding strategies:

As a general aside, I still don't understand how many of the large generative tools aren't exposing more decoding strategies, or hooks to implement them. Beam search with stochastic/diverse group objectives, per-step temperature/top-k/top-p, hooks for things like COLD decoding https://arxiv.org/abs/2202.11705, minimum Bayes risk https://medium.com/mlearning-ai/mbr-decoding-get-better-resu..., check/correct systems during decode based on simple domain rules and previous outputs, etc.

These kinds of decoding tools have always been a huge boost to model performance for me, and having access to add in these hooks to "big API models" would be really nice... though I guess you would need to limit/lock compute use since a full backtracking search would pretty swiftly crash most systems. Maybe the new "plugins" access from OpenAI will allow some of this.

Backtracking is easily solved with a shortest path algorithm. I don’t see any need for masking if you are simply maximizing likelihood of the entire sequence.
I don't think humans can do this either. What's the problem with producing a result and then fixing it? It's exactly how we do it.