Hacker News
by wavemode 930 days ago
Intriguing but understandable. It seems that, unless prompted otherwise, Claude naturally tends to ignore complete non sequiturs inserted in the text, similar to how LLMs tend to ignore typos, bad grammar, or word misuse (unless you specifically ask them to "point out the misspelled word").

Scaling context is not something humans have good intuition for; I certainly don't recall an exact sentence from 200 pages ago. This is an area where we actually want the models not to mimic us.
We'll need some kind of hybrid system to deal with this. For example, the LLM 'indexes' the text it reads and assigns importance weights to parts of it; then, as it moves to new text, it can check back to these more important parts to ensure it's not forgetting things.
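To make the 'index with importance weights' idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the `ImportanceIndex` class, the `toy_weight` scorer (which just counts digits), and the fixed capacity are hypothetical stand-ins, not how any real model handles long context. In a real system the weights would presumably come from the model itself.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class ImportanceIndex:
    """Hypothetical index that keeps only the `capacity` highest-weighted passages."""
    capacity: int = 3
    # Min-heap of (weight, position, text); the least important entry sits at the root.
    _heap: list = field(default_factory=list)

    def add(self, position: int, text: str, weight: float) -> None:
        entry = (weight, position, text)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        else:
            # Push the new entry, then drop whichever entry is now least important.
            heapq.heappushpop(self._heap, entry)

    def recall(self) -> list[str]:
        """Return the retained passages in their original reading order."""
        return [text for _, _, text in sorted(self._heap, key=lambda e: e[1])]


def toy_weight(text: str) -> float:
    # Assumed stand-in scorer: passages with more digits count as more "important".
    return float(sum(ch.isdigit() for ch in text))


if __name__ == "__main__":
    index = ImportanceIndex(capacity=2)
    passages = [
        "The meeting went fine.",
        "The access code is 4417.",
        "Lunch was served afterwards.",
        "Invoice 2023-118 is due on the 5th.",
    ]
    for pos, passage in enumerate(passages):
        index.add(pos, passage, toy_weight(passage))
    # The retained passages are what the reader would "check back" on.
    print(index.recall())
```

The min-heap makes it cheap to evict the least important passage as new text arrives, which is the "check back to the more important parts" step in miniature.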
I would think there is some benefit to synthesizing and compressing. Summarization is similar in that the more heavily weighted text remains and the rest is pruned.

If the same basic information is all over a text, combine it.

We already know LLMs are good at summarizing.

The question is how good they are at retaining minute details from extremely long context, say 200k tokens.

That’s the frontier Claude and now GPT-4 Turbo are pushing.

I guess I’m proposing a new kind of compression: new substitutions, with the LLM inventing new words to compress common ideas. A bytecode, if you will. Compiling the context down.
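As a very rough illustration of the substitution idea, here is a plain dictionary-substitution sketch in Python. To be clear, this is not an LLM inventing words: it is an ordinary word-level codebook that swaps repeated three-word phrases for short placeholder tokens. The function names and the `<0>`-style codes are made up for the example, and it assumes the input never already contains such tokens and that whitespace can be normalized to single spaces.

```python
from collections import Counter


def build_codebook(words, n=3, min_count=2, max_codes=10):
    """Map the most frequent word n-grams to short placeholder tokens."""
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    frequent = [g for g, c in grams.most_common(max_codes) if c >= min_count]
    return {gram: f"<{i}>" for i, gram in enumerate(frequent)}


def compress(text, n=3, **kwargs):
    """Replace repeated n-grams with codes; return the result and the codebook."""
    words = text.split()
    codebook = build_codebook(words, n=n, **kwargs)
    out, i = [], 0
    while i < len(words):
        gram = tuple(words[i:i + n])
        if gram in codebook:
            out.append(codebook[gram])  # substitute the whole phrase
            i += n
        else:
            out.append(words[i])  # keep the word as-is
            i += 1
    return " ".join(out), codebook


def decompress(compressed, codebook):
    """Expand each code back into the phrase it stands for."""
    reverse = {code: " ".join(gram) for gram, code in codebook.items()}
    return " ".join(reverse.get(tok, tok) for tok in compressed.split())


if __name__ == "__main__":
    text = "the quick brown fox saw the quick brown dog"
    packed, book = compress(text)
    print(packed)
    print(decompress(packed, book) == text)
```

The codebook has to travel with the compressed text, so this only pays off when phrases repeat often enough to outweigh that overhead; a learned "bytecode" would face the same trade-off.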
Interestingly, human memory works the other way.

We tend to remember out of place things more often.

E.g. if there was a kid in a pink hat and blue mustache at a suit and tie business party, everybody is going to remember the outlier.

But is it actually that useful to remember the exact words?
RLHF is probably the reason for this.