| HN Mirror

	by Difwif 976 days ago
	I've had a suspicion for a while now that this is what ChatGPT does within a conversation (chat.openai.com, not the API). I've had very long chat histories that seem to degrade gracefully instead of just forgetting everything. Maybe there are more clues in the context than I realize, though. Either way, this type of idea will probably be a fundamental feature of all chatbots in the future, IMO.

6 comments

pacjam 976 days ago

Recursive summarization is a simple and popular way to provide the illusion of infinite context (when you need to free up space, just summarize the oldest N messages into 1 summary message). It's lossy and you'll inevitably lose important information, but it should degrade relatively gracefully. In MemGPT we use (implicit) recursive summarization on top of all the explicit memory management.
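A minimal Python sketch of that loop, for illustration only. It is not MemGPT's code: `compact_history`, `stub_summarize`, and `count_tokens` are hypothetical names, the summarizer is a stand-in for an LLM call, and tokens are approximated by whitespace-separated words.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token estimate (whitespace split); real tokenizers differ."""
    return len(text.split())


def stub_summarize(messages: List[str]) -> str:
    """Stand-in for an LLM call: keeps the first three words of each message."""
    return "[summary] " + "; ".join(" ".join(m.split()[:3]) for m in messages)


def compact_history(
    history: List[str],
    max_tokens: int,
    n_oldest: int = 4,
    summarize: Callable[[List[str]], str] = stub_summarize,
) -> List[str]:
    """Fold the oldest messages into one summary until the history fits.

    Earlier summaries sit at the front of the list, so they get summarized
    again on later passes. That is the recursive (and lossy) part.
    """
    if n_oldest < 2:
        raise ValueError("n_oldest must be at least 2 to make progress")
    history = list(history)
    while len(history) > 1 and sum(count_tokens(m) for m in history) > max_tokens:
        n = min(n_oldest, len(history))
        history = [summarize(history[:n])] + history[n:]
    return history


if __name__ == "__main__":
    chat = [f"message number {i} with some extra words" for i in range(10)]
    for line in compact_history(chat, max_tokens=40):
        print(line)
```

The most recent messages survive verbatim, while older ones are squeezed through the summarizer once per pass, so detail is lost gradually rather than all at once.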

ASalazarMX 976 days ago

Would this be the same method used to assign a title to your chat based on the first prompt? It's surprisingly effective at getting the core idea most of the time.

pacjam 976 days ago

Thanks for your interest! Question - does the title of the chat ever change after it's first assigned? If so, using a recursive summary to refresh the title sounds like a reasonable idea (especially if you're already computing a summary to extend context).

From what I remember the title in ChatGPT gets set once after a few messages, in which case I'd assume it's generated with a special "title generation" prompt (that gets the first few messages as input).

In either case since I don't work at OpenAI I can't tell you for sure ;)

icelancer 976 days ago

This is how we do things at our work with the API and chunking since we don't have the 32k API. It works fairly well in limited windows.

hansvm 976 days ago

There are definitely a lot more clues than you realize (plus the context window is something like 12 written pages of standard English text, without much space wasted on system prompts). If you were doing anything interesting at all, the output is heavily biased by your prompt. You lose some bits of information in that you only have one sample (the previous output/history) rather than the soft probabilities, and you lose some more in that multiple inputs can map to the same output (like the class of prompts "output the 2nd letter of the following phrase: ..."). But real-world prompts tend to be the easiest/shortest thing that comes to mind that you think will give you the result you're looking for. So the LLM's best guess for that prompt (there are lots of ways of guessing, so suppose for the sake of argument you did something like textual inversion on the one sample) is likely to be a not-half-bad interpretation of the missing context. In other words, a lot of the seemingly missing information was retained in the LLM's output, and you don't lose too many bits at a time as the old context trails off.

Der_Einzige 976 days ago

ChatGPT degrades precisely because they aren't doing anything special to extend their memory beyond the context length.

There are trivial techniques to implement "lossy" memory, such as just average pooling tokens (the same approach used by sentence transformers). Not sure why it's so rare to see this used for condensing a huge amount of context into a prompt. It is effectively "medium" term memory.
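A rough Python sketch of the pooling step, assuming you already have one embedding vector per token. `mean_pool` and `pool_chunks` are hypothetical helper names, and how the pooled vectors would be fed back into a model is left open here.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a run of token vectors into a single vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]


def pool_chunks(
    token_vectors: Sequence[Sequence[float]], chunk_size: int
) -> List[List[float]]:
    """Replace every `chunk_size` consecutive token vectors with their mean."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    return [
        mean_pool(token_vectors[i:i + chunk_size])
        for i in range(0, len(token_vectors), chunk_size)
    ]


if __name__ == "__main__":
    vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(mean_pool(vectors))       # one vector for the whole span
    print(pool_chunks(vectors, 2))  # one vector per chunk of two tokens
```

Pooling with a larger `chunk_size` gives a shorter but blurrier sequence, which is the "lossy" trade-off in question.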

lgats 976 days ago

https://chat.openai.com/share/e367a1de-c28b-4408-aa3d-2e4b85...

Fed ChatGPT special numbers, then 3k tokens, then 2k tokens. After that, it was unable to understand any question about the special numbers provided.

sharkjacobs 976 days ago

On the other hand

https://chat.openai.com/share/8a0675b6-2876-4606-ac79-646391...

visarga 976 days ago

At the very least, I would average the vectors inside single words or word compounds, getting a 2-3x reduction in length without much work.
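Something like this Python sketch, assuming the tokenizer can tell you which word each subword token belongs to. `pool_by_word` and the `word_ids` input are hypothetical, and the actual reduction depends on how many tokens each word splits into.

```python
from itertools import groupby
from typing import List, Sequence


def pool_by_word(
    vectors: Sequence[Sequence[float]], word_ids: Sequence[int]
) -> List[List[float]]:
    """Average consecutive token vectors that share the same word id."""
    if len(vectors) != len(word_ids):
        raise ValueError("need exactly one word id per token vector")
    pooled = []
    start = 0
    for _, group in groupby(word_ids):
        size = len(list(group))          # number of tokens in this word
        span = vectors[start:start + size]
        dim = len(span[0])
        pooled.append([sum(v[d] for v in span) / size for d in range(dim)])
        start += size
    return pooled


if __name__ == "__main__":
    # Four subword tokens: the first three form one word, the last is its own word.
    vectors = [[1.0, 1.0], [2.0, 4.0], [3.0, 7.0], [8.0, 8.0]]
    word_ids = [0, 0, 0, 1]
    print(pool_by_word(vectors, word_ids))
```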

shishirpatil 976 days ago

Yeah! While it's not known what closed-source models do, what we think is happening, based on some prompt attacks, is that they also use recursive summarization (in addition to what others have mentioned in this thread).

JCharante 976 days ago

To me it just feels like they're trimming the minimum number of the oldest tokens in the conversation to stay under the token limit. Conversations don't degrade in a way that feels like there's medium-term memory.
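In Python terms, that guess amounts to something like the sketch below. `trim_oldest` is a hypothetical name, and nothing here is confirmed to be what ChatGPT does.

```python
from typing import List, Sequence


def trim_oldest(token_ids: Sequence[int], max_tokens: int) -> List[int]:
    """Drop just enough of the oldest tokens to fit within max_tokens."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    excess = len(token_ids) - max_tokens
    if excess <= 0:
        return list(token_ids)       # already fits, nothing is dropped
    return list(token_ids[excess:])  # keep only the newest max_tokens tokens


if __name__ == "__main__":
    conversation = list(range(10))   # stand-in token ids, oldest first
    print(trim_oldest(conversation, 4))
```

With this policy anything older than the window is gone completely, which would match the "no medium-term memory" feel.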

kristopolous 976 days ago

I'm still very much learning this stuff, but I wonder if that's related to the vanishing gradient problem, which seems to be a fundamental aspect of these types of approaches. (Please don't assume that's correct)

https://en.wikipedia.org/wiki/Vanishing_gradient_problem

visarga 976 days ago

Vanishing gradients were an issue for non-residual deep networks and vanilla RNNs. The long-context memory issues, by contrast, are along the sequence dimension, not network depth.

The problem could be some kind of instability in attention as it scales above 10k tokens. A recent paper suggests the attention mechanism needs a default value (a "sink"), and that its absence produces instability.

https://arxiv.org/abs/2309.17453
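As I understand that paper, the practical upshot is a cache policy: always keep the first few tokens (the sinks) plus a sliding window of the most recent ones. A toy Python sketch of which positions survive, with `kept_positions` and its defaults being my own illustrative choices rather than the paper's code:

```python
from typing import List


def kept_positions(seq_len: int, n_sink: int = 4, window: int = 8) -> List[int]:
    """Token positions retained in the cache: the sinks plus the recent window."""
    if seq_len < 0 or n_sink < 0 or window < 0:
        raise ValueError("arguments must be non-negative")
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # everything still fits, nothing is evicted
    sinks = list(range(n_sink))                      # first n_sink tokens
    recent = list(range(seq_len - window, seq_len))  # last `window` tokens
    return sinks + recent


if __name__ == "__main__":
    print(kept_positions(20, n_sink=2, window=3))
```

A plain sliding window would be the same thing with `n_sink=0`.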

Another paper says the middle part is lossy while the beginning and end are better attended.

sandkoan 976 days ago

For anyone who's curious, the paper in question is "Lost in the Middle: How Language Models Use Long Contexts" (https://arxiv.org/abs/2307.03172).

kristopolous 976 days ago

That's a really recent paper. Do you actually keep up to date with everything? How do you find the time?

visarga 976 days ago

Just reading a couple of papers every day, the most interesting ones, and following along on Reddit and Twitter to see what people are talking about. And I am directly interested in long-context LLMs for a work-related task.

I have also been dabbling with neural nets (pre-transformer), especially LSTMs, which have a "residual" connection, the one I was mentioning. That makes gradients better behaved. Schmidhuber tech.

totoglazer 976 days ago

Not to denigrate the person you’re responding to, but to add some context: That paper got a decent amount of attention already. Probably one of the more notable in the literature over the last month. Plus compared to the past year everything is slow now.

amelius 976 days ago

Regarding the vanishing gradient problem, has anyone tried to train using only a randomly chosen set of independent parameters in each iteration? (Updating only the weights in a small random independent set).

jdthedisciple 976 days ago

Are you referring to Regularization?

https://www.kaggle.com/code/sid321axn/regularization-techniq...