Hmmm....

> As the above samples show, our model is capable of generating samples from a variety of prompts that feel close to human quality and show coherence over a page or more of text.

OK, let's look at the sample that's displaying by default:

> System Prompt (human-written): Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.

> Model Completion (machine-written, first try):

> “You are in good hands, dwarf,” said Gimli, who had been among the first to charge at the orcs; it took only two words before their opponents were reduced to a blood-soaked quagmire, and the dwarf took his first kill of the night. The battle lasted for hours until two of the largest Orcs attempted to overwhelm Aragorn. When they finally stopped, they lay defeated and lifeless for miles and miles.

> [Aragorn says something]

> “I’ll never forget it!” cried Gimli, who had been in the thick of the battle but hadn’t taken part in it.

This is not "close to human quality". It's terrible. Gimli kills an orc in battle... without taking part in the battle. It takes two words before the opponents (as opposed to, say, the battlefield) are reduced to a "blood-soaked quagmire", but the battle lasts for hours after that. After which two orcs lay defeated and lifeless for miles and miles.

This isn't even coherent from one sentence to the next. And paragraph three directly contradicts paragraph one. And Gimli calls Legolas a dwarf!
> As the above samples show, our model is capable of generating samples from a variety of prompts that feel close to human quality and show coherence over a page or more of text. Nevertheless, we have observed various failure modes, such as repetitive text, world modeling failures (e.g. the model sometimes writes about fires happening under water), and unnatural topic switching. Exploring these types of weaknesses of language models is an active area of research in the natural language processing community.
The authors go on to discuss more limitations (for example, the dataset doesn’t contain much outside of LotR and some celebrities). I imagine that what the authors call “coherence” is weaker than what you are referring to: the AI is not necessarily telling a story, but it stays on the same topic and characters.
I still think that the result is incredibly impressive and powerful. You could start with this as a sort of English “noise”, and then run the result through a parser. This would allow you to add some “hard coded” world modeling or constraints. For example, you could mix in sentiment analysis and reject some sentences to roughly control the narrative.
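To make the sentiment-based rejection idea a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the tiny word lists stand in for a real sentiment model, the candidate sentences stand in for model output, and the names `sentiment_score` and `filter_by_sentiment` are made up.

```python
import re

# Toy sentiment lexicon (hypothetical; a real system would use a trained classifier).
POSITIVE = {"victory", "brave", "joy", "triumph", "laughed", "hope"}
NEGATIVE = {"defeated", "lifeless", "blood", "fear", "died", "grim"}


def sentiment_score(sentence: str) -> int:
    """Return (# positive words) - (# negative words) as a crude sentiment signal."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def filter_by_sentiment(sentences, min_score):
    """Reject every sentence whose score falls below min_score; keep the rest in order."""
    return [s for s in sentences if sentiment_score(s) >= min_score]


# Stand-ins for sentences sampled from a language model.
candidates = [
    "Gimli laughed with joy at the victory.",
    "The orcs lay defeated and lifeless.",
    "Legolas drew his bow.",
]

# Steer toward a neutral-or-upbeat passage by dropping negative sentences.
kept = filter_by_sentiment(candidates, min_score=0)
print(kept)
```

Raising or lowering `min_score` over the course of a story would be one rough way to shape its arc, though a filter like this does nothing for the contradictions between sentences that the parent comment points out.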