by mtokarski
255 days ago
Interesting work, but I think the interpretation may be a bit overstated. The authors claim that injecting too much factual "knowledge" during pretraining causes models to collapse: performance drops below the baseline once knowledge frequency crosses a threshold. The problem is how they inject it. Their "knowledge" isn't natural language; it's templated Wikidata triples like "X is the capital of Y." That's a super low-entropy, highly repetitive distribution. When you cram enough of that into a fixed token budget, you're not really teaching the model more facts; you're crowding out linguistic diversity and skewing the token statistics. In real pretraining or domain adaptation scenarios, "knowledge" tends to appear in richer, more varied contexts. So the practical takeaway isn't "don't add too much domain data" but rather "don't overrepresent any single format or narrow syntactic pattern." The issue seems to be more about representation homogeneity than about factual density itself.
|
Essentially, they found that when the knowledge is presented in a single, fixed way, the model is trained to reproduce that exact sequence of tokens rather than "internalizing" the knowledge.
When the sentences are varied, the model instead manages to separate out the knowledge, so to speak. This in turn drastically improves how well that knowledge can be extracted later.
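To make the fixed-versus-varied point concrete, here is a rough toy sketch (not anything taken from the paper). It renders the same facts with one template and with several, then compares the share of distinct bigrams as a crude proxy for surface diversity. The triples, the templates, and the helper names `render` and `distinct_bigram_ratio` are all made up for illustration.

```python
# Toy illustration only: triples and templates are hypothetical, not from the paper.
TRIPLES = [("Paris", "France"), ("Rome", "Italy"), ("Tokyo", "Japan"), ("Cairo", "Egypt")]

FIXED = ["{x} is the capital of {y}."]
VARIED = [
    "{x} is the capital of {y}.",
    "The capital of {y} is {x}.",
    "{y}'s government sits in {x}, its capital city.",
    "If you visit {y}, the capital you will find is {x}.",
]


def render(templates, n_per_triple):
    """Render each triple n_per_triple times, cycling through the templates."""
    return [
        templates[i % len(templates)].format(x=x, y=y)
        for x, y in TRIPLES
        for i in range(n_per_triple)
    ]


def distinct_bigram_ratio(sentences):
    """Distinct word bigrams divided by total word bigrams (higher = more varied)."""
    bigrams = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        bigrams.extend(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams)


fixed_ratio = distinct_bigram_ratio(render(FIXED, 8))
varied_ratio = distinct_bigram_ratio(render(VARIED, 8))
print(f"fixed template:   {fixed_ratio:.3f}")
print(f"varied templates: {varied_ratio:.3f}")
```

The single-template corpus repeats the same few bigrams over and over, so its ratio comes out much lower. Whitespace tokenization is only a stand-in here; a real check would use the model's own tokenizer and a far larger corpus.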
[1]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5250633