by wokwokwok 977 days ago
> Recursive summarization (Wu et al., 2021b) is a simple way to address overflowing context windows, however, recursive summarization is inherently lossy and eventually leads to large holes in the memory of the system.

Yes, it does.

> In our experiments ... conversational context is read-only with a special eviction policy (if the queue reaches a certain size, a portion of the front is truncated or compressed via recursive summarization), and working context is writeable by the LLM processor via function calls.

You're doing the same thing, and you have the same problems.

You're just doing it slightly differently; in this case instead of recursively summarizing everything, you're selectively searching the history and generating it for each request. Cool idea.

...but I'm skeptical; this fundamentally relies on the assumption that the existing context is low-entropy and summarizable, and that any query relies only on a subset of the history.

This might be true for, e.g., chat, or 'answer a question about some document in this massive set of documents'.

...but both of these assumptions are false in some contexts; for example, generating code, where the context is densely packed with information that cannot be discarded (e.g. specific API definitions), and a wide context is required (i.e. many API definitions).

It is interesting how this is structured and done, and hey, the demo is cool.

I'm annoyed to see these summarization papers fail to acknowledge the fundamental limitations of the approach.


Thanks for checking out the paper! Just to clarify in case there was any misunderstanding, recursive summarization is just one part of the memory management in MemGPT: as you mentioned, in MemGPT the conversation queue is managed via recursive summarization, just like in prior work (and many chatbot implementations). However, there is also a (read/write) "pinned" section of "LLM memory" that's unrelated to recursive summarization; we call this "working context" in the paper. So MemGPT has access to both recursive summaries (generated automatically) and working context, which MemGPT actively manages to keep up to date.

These are both separate from MemGPT's external context, which is pulled into the conversation queue via function calls. In all our examples, reads from external context are uncompressed (no summarization) and paginated. MemGPT receives a system alert when the queue summarization is triggered, so if MemGPT needs to keep specific details from the conversation queue, it can write them to working context before they're erased or summarized.
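
To make the mechanics concrete, here is a rough Python sketch of a queue with an eviction alert and a pinned working context. It is only an illustration of the idea described above, not MemGPT's actual code: `ConversationQueue`, `on_alert`, `truncating_summarizer`, and `pin_names` are all hypothetical names, and the "summarizer" is a deliberately lossy stand-in for an LLM call.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

AlertHandler = Callable[[List[str], Dict[str, str]], None]


class ConversationQueue:
    """Toy FIFO queue that summarizes its oldest messages once it overflows."""

    def __init__(self, max_messages: int, evict_fraction: float,
                 summarize: Callable[[List[str]], str],
                 on_alert: AlertHandler) -> None:
        self.max_messages = max_messages
        self.evict_fraction = evict_fraction
        self.summarize = summarize      # stand-in for an LLM summarization call
        self.on_alert = on_alert        # stand-in for the "system alert"
        self.summary = ""               # running recursive summary
        self.messages: Deque[str] = deque()
        self.working_context: Dict[str, str] = {}  # pinned, never summarized

    def append(self, message: str) -> None:
        self.messages.append(message)
        if len(self.messages) > self.max_messages:
            self._evict()

    def _evict(self) -> None:
        n = max(1, int(len(self.messages) * self.evict_fraction))
        doomed = list(self.messages)[:n]
        # Alert first, so details can be pinned before they are compressed.
        self.on_alert(doomed, self.working_context)
        for _ in range(n):
            self.messages.popleft()
        # Recursive step: the previous summary is summarized again,
        # together with the newly evicted messages.
        prior = [self.summary] if self.summary else []
        self.summary = self.summarize(prior + doomed)


def truncating_summarizer(items: List[str]) -> str:
    """Deliberately lossy: keeps only the first 10 characters of each item."""
    return " | ".join(item[:10] for item in items)


def pin_names(doomed: List[str], working_context: Dict[str, str]) -> None:
    """Hypothetical policy: save the user's name before it gets summarized away."""
    prefix = "user: my name is "
    for message in doomed:
        if message.startswith(prefix):
            working_context["user_name"] = message[len(prefix):]
```

In this toy version, a name mentioned early in the chat survives in `working_context` even though the lossy summary no longer contains it. Anything the alert handler does not pin is subject to the same loss as plain recursive summarization.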

In the conversational agent examples, working context (no summarization, and separate from the conversation queue) is used to store key facts about the user and agent to maintain consistent conversation. Because the working context is always seen by the LLM, there's no need to retrieve it to see it. In doc QA, working context can be used to keep track of the current task/question and progress towards that task (for complex queries, this helps MemGPT keep track of details like the previous search, previous page request, etc.).

We took a similar approach to MemGPT (working memory: summarized conversation with eviction), but our long memory is a graph we can operate on (add/remove/edit nodes & edges). We bring the top_k nodes and their neighbors into the working memory.
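
A minimal Python sketch of that retrieval step, under stated assumptions: the `GraphMemory` class and its methods are made up for illustration, edges are treated as undirected, and relevance is scored by naive word overlap with the query (a real system would more likely use embeddings or something similar).

```python
from typing import Dict, List, Set


class GraphMemory:
    """Toy long-term memory: text nodes joined by undirected edges."""

    def __init__(self) -> None:
        self.nodes: Dict[str, str] = {}
        self.edges: Dict[str, Set[str]] = {}

    def add_node(self, node_id: str, text: str) -> None:
        # Also serves as "edit": re-adding an id overwrites its text.
        self.nodes[node_id] = text
        self.edges.setdefault(node_id, set())

    def remove_node(self, node_id: str) -> None:
        self.nodes.pop(node_id, None)
        for neighbor in self.edges.pop(node_id, set()):
            self.edges.get(neighbor, set()).discard(node_id)

    def add_edge(self, a: str, b: str) -> None:
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both endpoints must exist")
        self.edges[a].add(b)
        self.edges[b].add(a)

    def _score(self, query_words: Set[str], node_id: str) -> int:
        # Assumption: relevance = number of words shared with the query.
        return len(query_words & set(self.nodes[node_id].lower().split()))

    def retrieve(self, query: str, top_k: int) -> List[str]:
        """Return the top_k matching node ids, followed by their neighbors."""
        words = set(query.lower().split())
        ranked = sorted(self.nodes, key=lambda n: (-self._score(words, n), n))
        seeds = [n for n in ranked if self._score(words, n) > 0][:top_k]
        result = list(seeds)
        for seed in seeds:
            for neighbor in sorted(self.edges[seed]):
                if neighbor not in result:
                    result.append(neighbor)
        return result
```

The returned ids would then be rendered into the working memory; how many neighbors fit still depends on the context budget.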

> Just to clarify in case there was any misunderstanding

I am not confused.

It's good; it solves a specific set of problems with querying large datasets, the same as a vector search would.

...but the various memory zones you've created make absolutely no difference to the fundamental limitation of the LLM context length.

No matter how you swing it, this is just creative prompt engineering. You're packing the context with relevant information; but, if you have too much relevant information, it won't work.
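
A toy Python sketch of what I mean, with made-up names (`count_tokens`, `pack_context`) and a crude whitespace token count standing in for a real tokenizer:

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Crude approximation: one token per whitespace-separated word."""
    return len(text.split())


def pack_context(snippets: List[Tuple[float, str]],
                 budget: int) -> Tuple[List[str], List[str]]:
    """Greedily pack the most relevant snippets into a fixed token budget.

    Each snippet is (relevance, text). Returns (packed, dropped).
    """
    packed: List[str] = []
    dropped: List[str] = []
    used = 0
    # sorted() is stable, so equally relevant snippets keep their input order.
    for _, text in sorted(snippets, key=lambda s: -s[0]):
        cost = count_tokens(text)
        if used + cost <= budget:
            packed.append(text)
            used += cost
        else:
            dropped.append(text)
    return packed, dropped


# Three API definitions, all equally required to answer the request.
api_defs = [
    (1.0, "fn_a takes x and returns y"),
    (1.0, "fn_b takes y and returns z"),
    (1.0, "fn_c takes z and returns w"),
]
```

With `budget=12`, `fn_c` ends up in `dropped` even though it is just as relevant as the other two. No retrieval or ranking scheme changes that; only a larger budget does.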