by haldujai
1158 days ago
I wonder if the better question is not how we get more training data but this: if we're running out of training data while hallucinations and performance remain so inadequate (per OpenAI's whitepaper), is an autoregressive transformer the right architecture? Perhaps ongoing work in fine-tuning will take these models to the next level, but ignoring the LLM hype, it really does seem like things have plateaued for a while now (aside from the expected gains from scaling).
One "simple" application would be to build a full index of the facts in the whole training corpus. Just pass each document to GPT and ask it to extract the facts. Then create an inverted index mapping each fact to the documents that reference it. This would let us generate a Wikipedia-like corpus of exhaustive fact research. We could say whether a fact is known or not, whether it is settled or controversial, and, if it is a preference, what the distribution is. This has got to help with factuality, and it generates lots of text to feed the model. It basically only costs electricity and GPU time. It nicely side-steps the problem of truth by simply modelling the empirical distribution in an explicit way. At least the model won't hallucinate outside the known facts.
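
To make the inverted-index idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: `extract_facts` is a hypothetical stand-in for the GPT extraction step (it just parses toy `claim: answer` lines rather than calling any model), and the 0.9 cutoff for "settled" is an arbitrary choice, not something from the proposal above.

```python
from collections import defaultdict


def extract_facts(document):
    """Hypothetical stand-in for the LLM extraction step.

    Assumes a toy format where each line reads "claim: answer".
    A real pipeline would prompt a model here instead.
    """
    facts = []
    for line in document.splitlines():
        if ":" in line:
            claim, answer = line.split(":", 1)
            facts.append((claim.strip().lower(), answer.strip().lower()))
    return facts


def build_fact_index(corpus):
    """Inverted index: claim -> answer -> set of document ids citing it."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in corpus.items():
        for claim, answer in extract_facts(text):
            index[claim][answer].add(doc_id)
    return index


def answer_distribution(index, claim):
    """Empirical share of documents backing each answer to a claim."""
    answers = index.get(claim)  # .get avoids creating empty entries
    if not answers:
        return {}
    total = sum(len(refs) for refs in answers.values())
    return {answer: len(refs) / total for answer, refs in answers.items()}


def status(index, claim, threshold=0.9):
    """Label a claim as unknown, settled, or controversial.

    The threshold is an illustrative assumption.
    """
    dist = answer_distribution(index, claim)
    if not dist:
        return "unknown"
    return "settled" if max(dist.values()) >= threshold else "controversial"


# Toy corpus, invented purely for the example.
corpus = {
    "doc1": "boiling point of water at sea level: 100 C\nbest editor: vim",
    "doc2": "boiling point of water at sea level: 100 C\nbest editor: emacs",
    "doc3": "best editor: vim",
}
index = build_fact_index(corpus)

status(index, "boiling point of water at sea level")  # "settled"
status(index, "best editor")                          # "controversial"
status(index, "capital of atlantis")                  # "unknown"
```

The hard part is hidden inside `extract_facts`: a real extractor would have to normalise paraphrases of the same claim to one key, which this sketch sidesteps by lowercasing exact strings.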