Hacker News new | ask | show | jobs
by empath-nirvana 976 days ago
Is there any reason you're just doing everything within a single context window? I experimented with similar stuff months ago and basically parallelized everything into multiple requests to different agents in pre and post-processing steps. The main context window, for example, wasn't aware of memories being generated or retrieved. I had a post-processor just automatically generating memories and saving them, along with all the conversations being saved in a vector database, and a pre-processor that would automatically inject relevant memories and context based on the conversation, even re-writing the history so it would look to the main context window like the memory had always been there.

It saved a lot of space in the main context window for unnecessary system prompts and so on.

4 comments

These are all great points - who or what you ask to manage memory is a design decision and IMO there's two main ways to do it (in the context of chatbots):

* implicit memory management, where the "main LLM" (or for chat, the "dialogue thread") is unaware that memory is being managed in the background (by a "memory LLM", a rule-based script, a small neural network, etc.), and

* explicit memory management (MemGPT), where one LLM does everything

Prior research in multi-session / long-range chat is often implicit, with a designated memory creation process. If I had to guess, I'd say the vast majority of consumer chatbots that implement some type of memory store are also implicit. This is because getting explicit memory management to work requires a lot of complex instruction following, and in our experience this just isn't possible at the moment with most publicly available LLMs (we're actively looking into ways to fix this via eg fine-tuning open models).

The tradeoffs are as you mentioned: with implicit, you don't have to stuff all the memory management instructions into the LLM preprompt (in MemGPT, the total system message is ~1k tokens). But on the upside, explicit memory management (when the LLM works) makes the overall system a lot simpler - there's no need to manage multiple LLM models running on parallel threads, which can add a lot of overhead.

Is it fair to call “implicit”, essentially retrieval augmented generation? While “explicit” is something different?
This is a fascinating approach. I’m working on something similar but as part of the feedback loop, as you said, rewriting history with transactional data as part of the context window. I feel as though the LLM and the NLP could potentially be a more realizable interface to structured data, well, I should say, this is the idea we are exploring. For us, as data is created (within a certain context of the business) we extract the data, generate the embeddings and build out the vector database as to:

Pre and Post-Processing:

- Post-Processing: After the main model responds, a post-processor takes over, automatically generating memories from the conversation and saving them. This ensures that important context is stored without burdening the primary model with these tasks. We also execute any relevan business logic as part of the request, then feed that back to the systems…

- Pre-Processing: Before a new input is sent to the main model, a pre-processor checks saved memories and injects relevant context. * executes logic * It’s as if this pre-processor gives the main model a “refresher” on prior conversations, preparing it to provide more informed and consistent responses.

Multi Agent has several potential, I am having more confidence as there is some level of entropy on agent reply that makes it a worthwhile
Yes, I have a similar solution.