

	by usernametaken29 34 days ago
	> δ-mem compresses past information into a fixed-size state matrix updated by delta-rule learning

	This doesn’t solve the capacity problem of memory. You can cram more into one context window, but you still need to associate the stored items with input queries. That’s very hard because slight variations in input create hugely different activations. So really, it doesn’t improve caching. This paper might do a thing or two toward approximating the compression limit for context windows, but there’s a fundamental limit on how much information can go into one. What you really need is contextual search, as in, different events and objects with the same abstractions and semantics lead to the same response, so you can cache effectively. On this front, the paper does little to improve “memory” in a meaningful way.
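For readers unfamiliar with the term, here is a minimal sketch of a generic delta-rule associative memory. It is an assumption about what "fixed-size state matrix updated by delta-rule learning" looks like in the textbook form, not the paper's actual method; the names `read` and `delta_update` and all the numbers are made up for illustration. It shows that orthogonal keys are recalled exactly, while an overlapping key overwrites part of what was stored earlier.

```python
# Sketch of a generic delta-rule associative memory (NOT the paper's exact method).
# The state S is a fixed-size d_v x d_k matrix; it never grows with the sequence.

def read(S, k):
    """Return the value currently associated with key k, i.e. S @ k."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]

def delta_update(S, k, v, beta=1.0):
    """Apply S <- S + beta * (v - S k) k^T in place and return S."""
    pred = read(S, k)
    for i, row in enumerate(S):
        err = v[i] - pred[i]          # prediction error for output dimension i
        for j in range(len(row)):
            row[j] += beta * err * k[j]
    return S

S = [[0.0, 0.0], [0.0, 0.0]]
k1, v1 = [1.0, 0.0], [3.0, -1.0]
k2, v2 = [0.0, 1.0], [0.5, 2.0]
delta_update(S, k1, v1)
delta_update(S, k2, v2)
recalled_orthogonal = read(S, k1)      # orthogonal unit keys do not interfere

k3, v3 = [0.6, 0.8], [1.0, 1.0]        # unit-norm key overlapping both k1 and k2
delta_update(S, k3, v3)
recalled_after_overlap = read(S, k1)   # no longer equal to v1
recalled_new = read(S, k3)             # the newest association is recalled
```

With only two key dimensions, a third non-orthogonal key has to share space with the first two, which is the capacity limit in miniature.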

5 comments

jsemrau 34 days ago

I am currently working on deep context query, which uses dynamically generated regexes to pull only the relevant context blocks. By using lightweight regex pattern matching to detect semantic intent and filter structured context sections accordingly, you avoid the attention degradation that comes from stuffing semantically redundant information into the window.

https://jdsemrau.substack.com/p/tokenmaxxing-and-optimizing-...

structuredPizza 34 days ago

The more real-world use cases we see, the more we see a well-thought-out regex used as a bridge from probabilistic to deterministic.

pbronez 34 days ago

Interesting approach.

> Prioritize recall over precision.

Have you tried stemming your regex? That would help you catch messages where a different form of your word appeared. For example instead of “story” you look for “stor” which catches “stories” as well.
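A rough sketch of the stemming idea, assuming Python's `re` module; the stems and messages are hypothetical. Each stem is followed by `\w*` so it matches any suffix, which favors recall over precision (note the "store" false positive).

```python
import re

# Hypothetical stems; the trailing \w* lets each stem match any suffix.
STEMS = ["stor", "memor"]
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, STEMS)) + r")\w*",
    re.IGNORECASE,
)

messages = [
    "Tell me a story about dragons.",
    "She collects short stories.",
    "Clear the memory cache.",
    "The store opens at nine.",   # false positive: "stor" also matches "store"
    "Nothing relevant here.",
]

# Keep every message in which any stemmed pattern appears.
matches = [m for m in messages if pattern.search(m)]
```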

Then you might think, could we do an even better job by figuring out the general semantic intent of the query and history? Let’s project them into a semantic vector space! That’s an embedding.

Then you want to query that, which means you need a vector database. So now we can take the query, embed it, query the vector DB with that embedding and retrieve the N closest history documents. You can use that to augment the generation of the response to your prompt.

This is RAG.
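The embed, query, and augment steps above can be sketched as follows. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, a plain list stands in for the vector database, and the history documents are invented.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(
        sum(x * x for x in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, documents, n=2):
    """Return the n documents closest to the query (linear scan, no index)."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:n]

history = [
    "The quarterly revenue forecast was revised upward.",
    "We discussed the plot of the new fantasy series.",
    "Revenue growth slowed in the third quarter.",
]
question = "What was said about revenue?"
top = retrieve(question, history, n=2)

# Augment the prompt with the retrieved history before generation.
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: " + question
```

A real system would swap in a learned embedding model and an approximate nearest-neighbor index; the control flow stays the same.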

Anyway, interesting to see different degrees of sophistication here. Certainly a handful of naive regexes are very snappy.

There’s probably a hybrid approach where you use sophisticated NLP and embedding techniques to robustly define topics, then train a regex to approximate that well.

jsemrau 34 days ago

That assumes one layer of memory. In my experience you need at least 4 layers of memory for this to work well, and each has different requirements for retrieval. Everything that is in short-term memory (state of the app, current conversation, current workspace artefact) requires low latency and high precision. For example, if you want to edit a segment in a financial analysis, a blog post, or a program, you only want to edit that segment. RAG on a VectorDB is overkill in my opinion.

ogogmad 34 days ago

This is one of the most interesting comments I've read on this website.

jsemrau 33 days ago

Thank you.

in-silico 34 days ago

While there is a limit to the amount of information you can fit in a fixed-size state, the theoretical ceiling is pretty high.

A Hebbian associative matrix (one of the simplest and weakest memory constructions) can store about 0.7 bits of information per parameter. If you have a state with 300M parameters (the size of a Llama 3 8B KV cache at 10K context length), and a context with 2.1 bits of entropy per token (a reasonable estimate), then the state can encode 100M tokens worth of information.
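The arithmetic behind that estimate, as a quick sketch. The function name `token_capacity` is made up; the three inputs are the figures quoted in the comment.

```python
def token_capacity(state_params, bits_per_param, bits_per_token):
    """Tokens a state can encode, given storage per parameter and entropy per token."""
    return state_params * bits_per_param / bits_per_token

# 300M parameters * 0.7 bits each = 210M bits; at 2.1 bits per token
# that is roughly 100M tokens.
capacity = token_capacity(300e6, 0.7, 2.1)
print(f"{capacity:,.0f} tokens")
```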

Real models obviously aren't powerful enough to operate at the limit, but you can see why this is a promising research direction.

RandomBK 33 days ago

> context with 2.1 bits of entropy per token

Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter, and tokens encode a lot more than that - sometimes full words, and with multimodal even more. If KV cache embeddings are storing more than just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.

in-silico 33 days ago

> Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter

The reference I always go back to is the GPT-3 paper. The cross-entropy loss (an upper bound for entropy) got down to 1.75 nats (2.5 bits). I took 2.1 because 2.5 is an upper bound and I wanted the estimate to end up as a round number.
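For anyone checking the unit conversion: a nat is 1/ln(2) bits, so the conversion is a single division. A minimal sketch (the helper name `nats_to_bits` is made up):

```python
import math

def nats_to_bits(nats):
    """Convert an entropy or cross-entropy value from nats to bits."""
    return nats / math.log(2)

loss_bits = nats_to_bits(1.75)   # a little over 2.5 bits per token
```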

> If KV cache embeddings are storing more than just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.

Here's the thing: the concepts that the model stores in the KV cache are a deterministic function of the input tokens. Similar to the data processing inequality, this implies that no entropy is actually added.

Looking at it mechanically, a sufficiently powerful model only needs to encode the tokens and can recompute concepts later as needed.
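A toy illustration of that point, with an invented four-token distribution and an invented token-to-concept mapping: pushing a distribution through a deterministic function can merge outcomes but never split them, so the entropy of the output cannot exceed the entropy of the input.

```python
import math
from collections import defaultdict

def entropy_bits(dist):
    """Shannon entropy in bits of a {outcome: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def pushforward(dist, f):
    """Distribution of f(X) when X follows dist and f is deterministic."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[f(x)] += p
    return dict(out)

# Hypothetical uniform distribution over four tokens.
tokens = {"cat": 0.25, "dog": 0.25, "car": 0.25, "bus": 0.25}

def concept(token):
    """Hypothetical deterministic 'concept' computed from a token."""
    return "animal" if token in ("cat", "dog") else "vehicle"

h_tokens = entropy_bits(tokens)                          # 2 bits
h_concepts = entropy_bits(pushforward(tokens, concept))  # 1 bit
h_relabeled = entropy_bits(pushforward(tokens, str.upper))  # injective map: unchanged
```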

usernametaken29 34 days ago

While 100 million tokens sounds like a lot, think about it for a bit, and you’ll see why it is basically nothing. Try to cram a human lifetime of sounds, smells, video and more sensory data into 100 million tokens. Heck, try to process the video plot of a single series into that window. It just won’t work, it won’t scale, and it is laughable compared to contextual memory. I’m not saying that to belittle the authors of the paper, but the reality is that this has very little to do with transient long-term memory.

ltbarcly3 34 days ago

You don't remember a lifetime of smells. You don't have any memories from huge swaths of time. There are entire years of your life compressed down to vibes and a handful of events you largely misremember.

usernametaken29 34 days ago

That’s a very weak argument. Memories are not exact replicas of experiences. We know that many memories are retained through a lifetime, particularly the ones from early childhood. Unlike computers, we always reconstruct memories from several modalities. Even if we remember largely on vibes as you say (which is not true when you look into neuroscience), the sheer amount of information is overwhelming. Again, try to run a 90-minute movie through an LLM memory system. It won’t be able to tell you the plot. That’s before you even feed it sound. Even 100M tokens is not enough for that. You, on the other hand, will largely remember the movies you liked and their major plot lines, and from there be able to reconstruct their scenes. I think the engineers working on memory vastly underestimate the capacity problem of discrete states.

ltbarcly3 32 days ago

blah blah we know that blah neuroscience blah blah blah.

This isn't an argument you are making, it's just an assertion that you could make an argument if you were so inclined, but you won't be doing so at this time, but "science" is obviously on your side, but you can't be bothered to say how, or even give enough detail for someone to check what you are referring to. I can do that too, see my first sentence in this reply.

I don't know how LLM memory systems work. I do know that you don't have a lifetime of remembering everything with high precision. Not only do most people not remember the plot of most of the movies they have seen, they can't reliably list most of the movies they have seen. Not everyone has a good memory. My point is that it's not valid to reference a false model of how human memory works as a reason some specific LLM memory implementation isn't useful for solving some problems.

kami23 34 days ago

Exactly, and for a given task you don't need to recall what your friend's brother's name is to do a git commit and push. There's a pull for more context to make these things better, but also the pull to make these execute in such a small context effectively when appropriate.

I'm more on team small tasks because of my love of Unix piping. I keep telling folks that, as an old Linux dude, seeing subagents work together for the first time felt like learning to pipe sed and awk for the first time. I realized how powerful these could be, and we still seem to be going in that direction.

in-silico 34 days ago

I think you underestimate just how much information 100M-ish words is. It's like a 300,000 page novel. That's a 50 foot (~15 meter) thick book.
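A back-of-the-envelope check of those figures. The words-per-page and paper-thickness values are my assumptions, not numbers from the comment: about 333 words per page, pages printed on both sides, and 0.1 mm per sheet.

```python
WORDS = 100e6            # "100M-ish words", from the comment
WORDS_PER_PAGE = 333     # assumption: a fairly dense novel page
PAGES_PER_SHEET = 2      # assumption: printed on both sides
SHEET_MM = 0.1           # assumption: typical book paper thickness

pages = WORDS / WORDS_PER_PAGE
thickness_m = pages / PAGES_PER_SHEET * SHEET_MM / 1000
thickness_ft = thickness_m / 0.3048

print(f"{pages:,.0f} pages, {thickness_m:.1f} m ({thickness_ft:.0f} ft) thick")
```

Under these assumptions the result lands close to the 300,000 pages and 15 meters quoted above.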

Surely with (much less than) 300K pages you could describe every meaningful detail of a video series' plot. You don't need to remember the exact pixel values.

You can also scale the numbers up. I specifically chose a relatively small model and short context length as a reference, so 100x bigger is not out of the question. At that point, with a 10B token capacity, you are looking at all of English Wikipedia in a single state.

vdelpuerto 34 days ago

I wrote something about this, trying to look at context or memory data in models the other way around. The gravitational pull of information is still very hard to manage. I've been using "functional scars" for about 30 days now and getting good results on repeated mistakes across sessions. https://github.com/VDP89/fscars

jandrese 34 days ago

So instead of a FIFO approach to memory management, it continually degrades the existing data the more you put in? Details start getting lost or mangled more and more over time?

trollbridge 34 days ago

That’s basically what happens.

As you hit the limits and try to compact the context, etc., things get more erratic.

kordlessagain 34 days ago

Like Ferricula: https://deepbluedynamics.com/ferricula (site/docs still in progress).