by Aurornis
249 days ago
I’ve been putting questions into LLM research functions, including Claude’s research mode, and letting them churn until a report appears. I’ve been starting with topics where I already know the answer but want a refresher.
So far, I’m not impressed. Sometimes the info will be correct. Most of the time it strings together a lot of words from the material it finds, but it reads like an undergrad trying to paraphrase the Wikipedia page without understanding the content. Often it will have one bullet point that is completely wrong.
The other problem I’m having is that it’s not very good at identifying poor sources. This is less of a problem with topics like math and engineering, but a big problem with topics like health and medicine, where it will pick up alternative medicine and pseudoscience pages and integrate them into the research as if they were real.
There are a lot of health and medicine topics where the way pseudoscience people talk about a subject doesn’t match the real science, but they use the same words and therefore catch the same search terms. An example is the way “dopamine” is used in casual conversation and by influencers. Concepts like “dopamine fasting” or claims that things “raise your dopamine” aren’t scientifically accurate, but they use the same words and can therefore get pulled into the training set and searches.
|
1) A response originating from LLM pre-training, in a domain where there has not been any (successful) RL-for-reasoning post-training. In this case the amount of reasoning around the raw facts "recalled" by the LLM is going to be limited by whatever reasoning is present in the training data.
2) A non-agentic response in a domain like Math Olympiad problems, where the LLM was post-trained with RL to encourage reasoning that mirrors this RL training set. This type of domain-specific reasoning training seems to have little benefit in other domains (although in the early LLM days it was said that training on computer code did provide some general benefit).
3) An agentic response, such as from one of these research systems, where it seems the agent is following some sort of generic research / summarization template with prescribed steps. I've never tried these myself, but it seems they can be quite successful at deep diving and gathering relevant source material. The ability to reason over this retrieved material, though, is going to come down to the reasoning capability of the underlying model per 1) and 2) above.
The bottom line would seem to be that with today's systems, domain-specific reasoning capability largely comes down to RL post-training for reasoning in that specific domain, resulting in what some call "jagged" performance - excellent in some areas and very poor in others. Demis Hassabis, for one, seems to be saying that this will not be fixed until architectural changes/additions are made to bring us closer to AGI.