Hacker News
by haldujai 1158 days ago
I wonder if the better question is not how we get more training data but:

If we're running out of training data while hallucinations and performance remain so inadequate (per OpenAI's whitepaper), is an autoregressive transformer the right architecture?

Perhaps ongoing work in finetuning will take these models to the next level, but ignoring the LLM hype, it really does seem like things have plateaued for a while now (with only the expected gains from scaling).

2 comments

There is still an order of magnitude more organic text. Ilya Sutskever recently said it was still ok. After that, we got to use reinforcement learning (agent GPTs with tools) to generate and self-validate more examples.

One "simple" application would be to build a full index of facts in the whole training corpus. Just pass each document to GPT and ask it to extract the facts. Then create an inverted index, with each fact and its references. This will allow us to generate a wikipedia-like corpus of exhaustive fact research. We can say if a fact is known or not, we can tell if it is settled or controversial, and if it is a preference we can tell what is the distribution. This has got to help with factuality and generate lots of text to feed the model. Basically only costs electricity and GPU. It nicely side-steps the problem of truth by simply modelling the empirical distribution in an explicit way. At least the model won't hallucinate outside the known facts.
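A rough sketch of what such an index could look like, just to make the idea concrete. Everything here is an assumption for illustration: `extract_facts` is a hypothetical stand-in for the GPT extraction call (it only parses toy `claim | stance` lines), and the share-of-mentions threshold for "settled" is a crude placeholder, not a real test of truth.

```python
from collections import Counter, defaultdict


def extract_facts(doc_text):
    """Hypothetical stand-in for an LLM extraction call.

    Each line of the form 'claim | stance' is treated as one extracted fact.
    """
    facts = []
    for line in doc_text.splitlines():
        if "|" in line:
            claim, stance = (part.strip() for part in line.split("|", 1))
            facts.append((claim, stance))
    return facts


def build_fact_index(corpus):
    """Inverted index: claim -> documents that mention it, plus stance counts."""
    index = defaultdict(lambda: {"docs": set(), "stances": Counter()})
    for doc_id, text in corpus.items():
        # De-duplicate within a document so one source counts once per stance.
        for claim, stance in set(extract_facts(text)):
            index[claim]["docs"].add(doc_id)
            index[claim]["stances"][stance] += 1
    return index


def describe(index, claim, settled_threshold=0.9):
    """Label a claim as unknown, settled, or controversial (assumed threshold)."""
    entry = index.get(claim)
    if entry is None:
        return "unknown"
    total = sum(entry["stances"].values())
    top_share = entry["stances"].most_common(1)[0][1] / total
    return "settled" if top_share >= settled_threshold else "controversial"


corpus = {
    "doc1": "water boils at 100 C at sea level | supports\ncoffee is healthy | supports",
    "doc2": "water boils at 100 C at sea level | supports\ncoffee is healthy | disputes",
}
index = build_fact_index(corpus)
print(describe(index, "water boils at 100 C at sea level"))
print(describe(index, "coffee is healthy"))
print(describe(index, "the moon is made of cheese"))
```

The hard part is obviously hidden inside `extract_facts`; the index itself is the easy bit.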

After the low-hanging fruit (the high-quality data such as scientific papers, libgen, stackexchange, wikipedia, etc.) has been exhausted, that’s it. There’s no more data of that kind. There’s not 9 other wikipedias or 9 other libgens. There is only a certain quantity of high-quality codified knowledge in existence and models need to be able to deal with that constraint. Feeding them more and more lower-quality text is not going to improve performance because we already fed it all the text that we use. There’s a reason that PhDs don’t involve reading tumblr all day.

> There’s not 9 other wikipedias

By the way, I wonder how much you could get from "history" data: Wikipedia history pages, talk pages, commit diffs on GitHub, pull request discussions, etc.

AFAIK so far we've only been using the finished code "artifacts", but if we're desperate for more tokens to train on, we might get a lot of mileage from just "all different versions of this dataset over time".

There's a reason there are so many review papers, which are just syntheses of a topic over a certain period of time. Second-order analysis is useful content, not junk. It can cross-reference facts and detect inconsistencies. Combining multiple sources can lead to new insights and to learning the trends.

> After that, we got to use reinforcement learning (agent GPTs with tools) to generate and self-validate more examples.

How would you "self-validate" against hallucinated facts?

What makes self-validation possible are hard external rules that can be evaluated independently and automatically. Like the rules of Chess or Go.

We don't have anything like that for LLMs and what people want to use them for.
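To illustrate what a "hard external rule" buys you, here is a toy sketch. The domain (integer factorization) and the candidate list are assumptions chosen only because the check is cheap, automatic, and independent of whoever produced the answer; nothing comparable exists for a free-text factual claim.

```python
def verify_factorization(n, factors):
    """Hard external rule: at least two factors, each > 1, multiplying back to n."""
    if len(factors) < 2:
        return False
    product = 1
    for f in factors:
        if f <= 1:
            return False
        product *= f
    return product == n


# Hypothetical model outputs: (number, claimed factors). Some are hallucinated.
candidates = [
    (91, [7, 13]),
    (91, [3, 31]),
    (221, [13, 17]),
    (97, [1, 97]),
]

# Only candidates that pass the independent check are kept as new training data.
verified = [c for c in candidates if verify_factorization(*c)]
print(verified)
```

The generator can be arbitrarily wrong and the filter still only lets correct examples through, which is exactly the property missing for open-ended text.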

RLHF seems to suggest that human feedback to tune the model after plain textual data pretraining is quite potent per sample. There might be some optimal ratio of (data + model size) to RLHF size that works quite favorably for us in getting hallucinations to a minimum. Furthermore there might be some “there” there, in the hallucinations, that has yet to be identified as valuable in itself. Either way it seems like our ability to wrangle these models is getting better.

> There is still an order of magnitude more organic text.

Posing this as a thought experiment; agree we still have more data to go. That we are wondering about this suggests that the current approach may be inadequate, i.e. it should not take petabytes of data for an LLM to match the performance of a high school student (for the LLM = AGI folks).

> One "simple" application would be to build a full index of facts in the whole training corpus. Just pass each document to GPT and ask it to extract the facts.

Agree, KG+LLM is a good next step to explore and should address some hallucination issues (see DRAGON from Leskovec and Liang groups). But we're already now talking about architectural changes as I posited.

In any case, where do we get such knowledge graphs (or index of facts)? Some already exist (e.g. Wiki, UMLS) and were created by humans but are clearly inadequate in coverage.

The proposition of using GPT-like models to generate these (i.e. GraphGPT) seems conceptually flawed, as GPT does not itself know whether a statement is factual or not, which is problematic even for humans.

Settled vs controversial is orders of magnitude more complex. How on earth do we do this without human annotation? You can't rely on frequency (i.e. some things were facts for 100 years but all of a sudden they're not anymore, and this is not controversial by definition).

The only reason LLMs work as well as they do now is that the sheer volume of data (and next-token prediction, NTP) makes the noise seem hidden, and by definition an autoregressive model should be somewhat impervious to singular factoids (vs a model being grounded by the garbage dump that is CommonCrawl/the internet).

> At least the model won't hallucinate outside the known facts.

Not sure this is a given. Even if a model acts as a natural language database of factoids, it is probable that it will hallucinate links unless you're strictly grounding output, in which case we've just built a colossally over-engineered IR/STS tool.

> One "simple" application

I think what you've posited is actually harder to build than anything that's been achieved thus far with LLMs.

I think the model is almost ready for real-world learning, for example in programming - sure you can have a knowledge base built on documentation, public code, etc.

But at some point you can just give the model access to tools, tell it to solve some problems, build plans, generate logs of each approach and train on those outputs. Programming is ripe for this: all the tools are easily accessible to a digital actor, everything is suited to a text-based model, and there's plenty of tooling to provide feedback and explanations for errors geared towards humans.
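A minimal sketch of that loop, under loud assumptions: the candidate programs are hard-coded strings standing in for model outputs, `evaluate_candidate` and the `solve` entry point are hypothetical names, and `exec` is used without any sandboxing, which a real setup would need.

```python
def evaluate_candidate(source, tests, func_name="solve"):
    """Run one candidate program against the tests and return a log record."""
    record = {"source": source, "passed": False, "feedback": ""}
    namespace = {}
    try:
        # NOTE: unsandboxed execution, acceptable only for a toy example.
        exec(source, namespace)
        func = namespace[func_name]
        for args, expected in tests:
            got = func(*args)
            if got != expected:
                record["feedback"] = (
                    f"{func_name}{args} returned {got!r}, expected {expected!r}"
                )
                return record
        record["passed"] = True
        record["feedback"] = "all tests passed"
    except Exception as exc:
        # Compiler and runtime errors become textual feedback for the next attempt.
        record["feedback"] = f"{type(exc).__name__}: {exc}"
    return record


tests = [((2, 3), 5), ((-1, 1), 0)]

# Hypothetical model attempts at "add two numbers".
candidates = [
    "def solve(a, b):\n    return a - b\n",  # wrong logic
    "def solve(a, b)\n    return a + b\n",   # syntax error
    "def solve(a, b):\n    return a + b\n",  # correct
]

logs = [evaluate_candidate(src, tests) for src in candidates]
training_examples = [log for log in logs if log["passed"]]

for log in logs:
    print(log["passed"], "-", log["feedback"])
```

The failed attempts are kept in `logs` with their error text; whether to train on those too (as negative examples or as repair traces) is a separate design choice.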

No need to fumble with robotics and the physical world - you can create a superhuman programmer. Then make it build out the infrastructure for physical world learning. AGI apocalypse here we come!

We also used to train neural networks over multiple 'epochs' of the same data.

Can't we keep doing that again?

We had techniques like drop-out and data augmentation to help.
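For anyone who hasn't seen it, a toy sketch of why dropout helps when you revisit the same data: each epoch sees a differently masked version of the same example. This is plain-Python inverted dropout on a list of floats, written for illustration only; the feature values and the rate are made up.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # Survivors are divided by `keep` so the expected value stays unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
features = [1.0, 2.0, 3.0, 4.0]

# Same example, three epochs, three different noisy views of it.
for epoch in range(3):
    print(epoch, dropout(features, rate=0.5, rng=rng))
```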

I don't think that the hallucinations have anything to do with the architecture, rather they come from optimizing a cost function where saying "I don't know" is as bad as being wrong. I do not think that RLHF as currently understood can fix this, since the reward model would struggle to distinguish fact from fiction.

I think you are mixing up layers of abstraction.

The network is most likely trained with something like a categorical cross entropy loss function. Such losses punish being confidently wrong a lot more than saying "I don't know". See https://www.v7labs.com/blog/cross-entropy-loss-guide

It's just that saying "I don't know" means that your model is spreading the probability of what the next token in the text stream might be over many different outcomes. A very 'uniform' probability distribution, instead of a sharp prediction.

That looks very different to GPT literally outputting the words "I don't know".
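A quick numeric check of the cross-entropy point, using a made-up 4-token vocabulary (the probabilities are illustrative assumptions, not taken from any model): the loss on the true token is small for a confident correct prediction, moderate for a uniform "I don't know" distribution, and large for a confident wrong one.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability the model assigned to the true next token."""
    return -math.log(probs[target])


vocab_size = 4
target = 0  # index of the true next token

confident_right = [0.97, 0.01, 0.01, 0.01]
uniform = [1 / vocab_size] * vocab_size
confident_wrong = [0.01, 0.97, 0.01, 0.01]

for name, probs in [
    ("confident and right", confident_right),
    ("uniform ('I don't know')", uniform),
    ("confident and wrong", confident_wrong),
]:
    print(f"{name}: {cross_entropy(probs, target):.3f}")

# confident and right: 0.030
# uniform ('I don't know'): 1.386
# confident and wrong: 4.605
```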

Sorry if I was unclear. I know that the model is incentivised to accurately predict the probability distribution of the next token. I mean that the model is not being incentivised to literally produce the output tokens corresponding to "I don't know" when asked a question where it is uncertain.

Yes, exactly.

What I wanted to emphasize is that the training _does_ actually incentivize the model to say "I don't know" but on a lower level.

If only the OpenAI API gave us the token probabilities like it used to.
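If you did have per-token log-probabilities, one rough way to surface the low-level "I don't know" would be to look at the entropy of the next-token distribution. The sketch below assumes you are handed a plain list of log-probabilities for the top candidates; that input format, the renormalisation over a truncated list, and the abstention threshold are all assumptions for illustration, not any specific API's response schema.

```python
import math


def entropy_from_logprobs(logprobs):
    """Shannon entropy (in nats) after renormalising the given log-probabilities."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_abstain(logprobs, threshold=1.0):
    """Flag a prediction as uncertain when entropy exceeds an assumed threshold."""
    return entropy_from_logprobs(logprobs) > threshold


sharp = [math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.01)]
flat = [math.log(0.25)] * 4

print(should_abstain(sharp))  # sharp prediction, low entropy
print(should_abstain(flat))   # near-uniform prediction, high entropy
```

Token-level entropy is only a crude proxy: a model can be uncertain about phrasing while being certain about the fact, and vice versa.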