by briancleland 479 days ago
This paper highlights something that should have been obvious: prediction and retrieval are two sides of the same coin. To predict effectively, you must first identify what's relevant. What's remarkable is that a 0.5B parameter model can perform perfect retrieval over 1M tokens when its natural attention patterns are leveraged properly.

It raises an interesting question: what if we designed architectures explicitly around retrieval capabilities? Transformer architectures were designed for prediction, and retrieval emerged as a byproduct. What would an architecture optimized specifically for retrieval look like?

A lot of money has been spent on building out large-scale RAG systems. If the performance improvements promised by the paper are real, the ramifications will be huge. Exciting to see that the authors are promising to release their code - it will be fun to see how this model performs on consumer hardware.

1 comment

I think this could be expanded further. You can convert attention traces to a knowledge graph with arbitrary and/or dynamic density. Traversing it can also be exotic: zooming in or expanding details at arbitrary points during traversal. With a common format you could create topic/knowledge trace packs that can be shared, merged (subtracted?), etc.
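
To make the idea concrete, here is a rough sketch of one way it might work. Everything in it is an assumption for illustration, not anything taken from the paper: the attention trace is a plain query-by-key weight matrix, "density" is just a cutoff on edge weight, and a "pack" is a dict of weighted edges. The function names (attention_to_graph, merge, subtract) and the toy numbers are hypothetical.

```python
# Hypothetical sketch: turn an attention matrix into a weighted edge set,
# with graph density controlled by a threshold. Not the paper's method.

def attention_to_graph(tokens, attn, threshold):
    """Keep an edge (query token -> key token) when its weight >= threshold."""
    graph = {}
    for i, row in enumerate(attn):
        for j, weight in enumerate(row):
            if i != j and weight >= threshold:  # skip self-attention
                graph[(tokens[i], tokens[j])] = weight
    return graph

def merge(a, b):
    """Union of two packs; an edge present in both keeps the larger weight."""
    out = dict(a)
    for edge, weight in b.items():
        out[edge] = max(weight, out.get(edge, 0.0))
    return out

def subtract(a, b):
    """Edges of pack a that do not appear in pack b."""
    return {edge: weight for edge, weight in a.items() if edge not in b}

# Toy attention trace: row i holds the weights token i assigns to each token.
tokens = ["Paris", "capital", "France"]
attn = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.3, 0.1],
]

sparse = attention_to_graph(tokens, attn, threshold=0.4)  # "zoomed out"
dense = attention_to_graph(tokens, attn, threshold=0.2)   # "zoomed in"
detail = subtract(dense, sparse)  # edges that only appear when zooming in
```

Lowering the threshold at a given node during traversal would be one way to get the "zoom in at arbitrary points" behaviour. A real format would also need to agree on tokenization and on how weights from different layers and heads are combined, which this sketch ignores.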