by randcraw
724 days ago
This suggestion revisits the classic "formal top-down" vs "informal bottom-up" approaches to building a semantic knowledge management system. Top-down was tried extensively in the pre-big-data, pre-probabilistic-model era, but it required extensive manual human curation while being starved for knowledge. The rise of big data boded no cure for the curation problem: because the curation can't be automated, larger scale just made the problem worse. AI's transition to probability (in the ~1990s) paved the way to the associative probabilistic models in vogue today, and there's no sign that a more-curated, more-formal approach has any hope of outcompeting them. How do we extend LLMs with mechanisms for reasoning, causality, etc. (Type 2 thinking)? However that is eventually done, the implementation must continue to be probabilistic, informal, and bottom-up. Manual human curation of logical and semantic relations into knowledge models has proven itself _not_ to be sufficiently scalable or anti-brittle to do what's needed.
|
We could just use RAG to create a new dataset. Take each known concept or named entity, then (1) search for it inside the training set, (2) search for it on the web, and (3) have a bunch of models generate text about it in closed-book mode.
Now you've got three sets of text. Put all of them in a prompt and ask for a Wikipedia-style article. If the topic is controversial, note the controversy and the distribution of opinions. If it is settled, note that too.
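To make the prompt-assembly step concrete, here is a minimal sketch in Python. It only builds the prompt string; the retrieval and generation steps are assumed to have happened elsewhere. The `Evidence` class, `build_article_prompt`, and the section titles are hypothetical names made up for illustration, not part of any existing tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """Three bundles of text gathered for one concept (hypothetical structure)."""
    entity: str
    corpus_hits: List[str] = field(default_factory=list)   # (1) training-set search
    web_hits: List[str] = field(default_factory=list)      # (2) web search
    closed_book: List[str] = field(default_factory=list)   # (3) closed-book generations


def _section(title: str, passages: List[str]) -> str:
    # Render one source as a titled bullet list; mark empty sources explicitly
    # so the model can see that a source returned nothing.
    body = "\n".join(f"- {p}" for p in passages) if passages else "- (nothing found)"
    return f"## {title}\n{body}"


def build_article_prompt(ev: Evidence) -> str:
    # Instructions first, then the three evidence sets, separated by blank lines.
    parts = [
        f"Write a Wikipedia-style article about: {ev.entity}",
        "If the topic is controversial, describe the controversy and the "
        "distribution of opinions. If it is settled, say so.",
        _section("Training-set passages", ev.corpus_hits),
        _section("Web search results", ev.web_hits),
        _section("Closed-book model answers", ev.closed_book),
    ]
    return "\n\n".join(parts)
```

Keeping the three sources in separately labelled sections, rather than concatenating them, is a design choice here: it lets the model (and a human reviewer) see which claims came from where.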
By contrasting web search results with closed-book generations we can detect biases in the model and missing knowledge or skills. If the missing pieces don't appear in the training set either, you know what is needed in the next iteration. This approach combines self-testing with topic-focused research to integrate information scattered across many sources.
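A rough sketch of the contrast step, in Python. It uses crude word overlap as a stand-in for a real comparison (which would more plausibly use an LLM or embeddings), so treat it as an illustration of the bookkeeping only. The function names, the stop-word list, and the three output buckets are all assumptions of mine.

```python
import re
from typing import Dict, Set

# Tiny illustrative stop-word list; a real system would need something better.
_STOP = {"the", "a", "an", "of", "and", "in", "is", "to", "was", "by"}


def terms(text: str) -> Set[str]:
    # Lowercase alphanumeric tokens, minus stop words.
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in _STOP}


def find_gaps(web: str, closed_book: str, corpus: str) -> Dict[str, Set[str]]:
    web_t, model_t, corpus_t = terms(web), terms(closed_book), terms(corpus)
    # Terms the web mentions but the closed-book answer does not.
    missing_from_model = web_t - model_t
    return {
        # Absent from the training set too: candidates for the next iteration.
        "not_in_corpus": missing_from_model - corpus_t,
        # Present in the training set but not reproduced by the model.
        "in_corpus_but_not_learned": missing_from_model & corpus_t,
        # Said by the model but unsupported by the web text: possible bias or error.
        "model_only": model_t - web_t,
    }
```

For example, calling `find_gaps` with a web passage, a closed-book answer, and the matching training-set passages for one entity gives three term sets you can aggregate across entities to decide what to collect next.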
I think of this approach as "machine study", where AI models interact with the text corpus to synthesize new examples, doing a kind of "review paper" or "wiki" reporting. This could be scaled to billions of articles, making an AI Wikipedia 1000x larger.
Interacting with search engines is just one way to create data with LLMs. Interacting with code execution and with humans are two more. Human-AI interaction alone generates over one billion sessions per month, where LLM outputs meet implicit human feedback. Now that most organic sources of text have been used up, LLMs will learn from feedback, task outcomes and corpus study.