Hacker News new | ask | show | jobs
by Difwif 976 days ago
I've had a suspicion for a while now that this is what ChatGPT does within a conversation (chat.openai.com, not the api). I've had very long chat histories that seem to gracefully degrade instead of just forgetting everything. Maybe there's more clues in the context than I realize though.

Either way this type of idea will probably be a fundamental feature for all chat bots in the future IMO.

6 comments

Recursive summarization is a simple and popular way to provide the illusion of infinite context (when you need to free up space, just summarize the oldest N messages into 1 summary message). It's lossy and you'll inevitably lose important information, but it should degrade relatively gracefully. In MemGPT we use (implicit) recursive summarization on top of all the explicit memory management.
Would this be the same method used to assign a title to your chat based on the first prompt? It's surprisingly effective at getting the core idea most of the time.
Thanks for your interest! Question - does the title of the chat ever change after it's first assigned? If so, using a recursive summary to refresh the title sounds like a reasonable idea (especially if you're already computing a summary to extend context).

From what I remember the title in ChatGPT gets set once after a few messages, in which case I'd assume it's generated with a special "title generation" prompt (that gets the first few messages as input).

In either case since I don't work at OpenAI I can't tell you for sure ;)

This is how we do things at our work with the API and chunking since we don't have the 32k API. It works fairly well in limited windows.
There are definitely a lot more clues than you realize (plus the context window is something like 12 written pages of standard English text, without much space wasted for the system prompts). If you were doing anything interesting at all, the output is heavily biased by your prompt. You lose some bits of information in that you only have one sample (the previous output/history) rather than the soft probabilities, and you lose some bits in that multiple inputs can map to the same output (like the class of prompts "output the 2nd letter of the following phrase: ..."), but real-world prompts tend to be the easiest/shortest thing to come to mind that you think will give you the result you're looking for, so the LLM's best guess for that prompt (there are lots of ways of guessing, so suppose for the sake of argument you did something like textual inversion on the one sample) is likely to not be a half-bad interpretation of the missing context -- i.e., a lot of the seemingly missing information was retained in the LLM's output, and you don't lose too many bits at a time as the old context trails off.
ChatGPT degrades precisely because they aren't doing anything special to extend their memory beyond the context length.

There are trivial techniques to implement "lossy" memory, such as just average pooling tokens (the same approach used by sentence transformers). Not sure why it's so rare to see this used for condensing a huge amount of context into a prompt. It is effectively "medium" term memory.

https://chat.openai.com/share/e367a1de-c28b-4408-aa3d-2e4b85...

Fed chatGPT special numbers, then 3k tokens, then 2k tokens. after that, it was unable to understand any question about the special numbers provided.

At the very least I would average vectors inside single words or word compounds getting a 2-3x reduction in length without much work.
Yeah! While it’s not known what close-sourced models do, what we think is happening based on some prompt attacks, is that they also use recursive summarization (in addition to what others have mentioned in this thread).
To me it just feels like they’re trimming the min amount of oldest tokens in the conversation to stay under the token limit. Conversations don’t degrade in a way that feels like it has medium term memory.
I'm still very much learning this stuff, but I wonder if that's related to the vanishing gradient problem, which seems to be a fundamental aspect of these types of approaches. (Please don't assume that's correct)

https://en.wikipedia.org/wiki/Vanishing_gradient_problem

Vanishing gradient was an issue for non-residual deep networks and vanilla RNNs. While the long context memory issues are along sequence dimension, not network depth.

The problem could be some kind of instability of attention as it scales above 10k tokens. A recent paper suggests attention mechanism needs a default value (a "sink"), and its absence produces instability.

https://arxiv.org/abs/2309.17453

Another paper says the middle part is lossy while the beginning and end are better attended.

For anyone who's curious, the paper in question, entitled, "Lost in the Middle: How Language Models Use Long Contexts" (https://arxiv.org/abs/2307.03172)
That's a really recent paper. Do you actually keep up to date with everything? How do you find the time?
Just reading a couple papers every day, the most interesting ones, and following up on reddit and twitter to get notified what people are talking about. And I am directly interested in long-context LLMs for a work related task.

I have also been dabbling with neural nets (pre-transformer), especially LSTM which have a "residual" connection, the one I was mentioning. That makes gradients better behaved. Schmidhuber tech.

Not to denigrate the person you’re responding to, but to add some context: That paper got a decent amount of attention already. Probably one of the more notable in the literature over the last month. Plus compared to the past year everything is slow now.
Regarding the vanishing gradient problem, has anyone tried to train using only a randomly chosen set of independent parameters in each iteration? (Updating only the weights in a small random independent set).