Hacker News
by hooande 1842 days ago
These results are terrible. Almost all of the generated text posted here is nonsensical, and playing with it online is just confusing.

There really needs to be a better method of evaluation, an MNIST for transformer text generation: a list of predefined prompts that every GPT-X has to use to generate text, which can then be scored against a list of "correct" answers in a variety of ways.

I have no way of knowing if the output of this flavor of transformer is good or not, whatever that would mean. It is very difficult to see how it compares to similar models. Setting up a model with this number of parameters and their reported training times is impressive. But I have no idea if this particular number of parameters makes a difference, or what that difference is supposed to look like.
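
To make the "fixed prompts, scored against correct answers" idea concrete, here is a rough sketch of what such a benchmark could look like. Everything in it is an assumption for illustration: the two prompts, the `BENCHMARK` list, the `toy_model` stand-in, and the choice of exact match and token-overlap F1 as the "variety of ways" to score. It is not how any existing evaluation suite is implemented.

```python
from collections import Counter

# Hypothetical fixed prompt set with reference answers (illustrative only).
BENCHMARK = [
    {"prompt": "The capital of France is", "reference": "Paris"},
    {"prompt": "Two plus two equals", "reference": "four"},
]

def exact_match(prediction, reference):
    """1.0 if the texts match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(generate, benchmark=BENCHMARK):
    """Run `generate` on every prompt and average each metric."""
    metrics = {"exact_match": exact_match, "token_f1": token_f1}
    totals = {name: 0.0 for name in metrics}
    for item in benchmark:
        output = generate(item["prompt"])
        for name, metric in metrics.items():
            totals[name] += metric(output, item["reference"])
    return {name: total / len(benchmark) for name, total in totals.items()}

def toy_model(prompt):
    # Stand-in for a real text generator; any callable str -> str works.
    return "Paris" if "France" in prompt else "five"

scores = evaluate(toy_model)
print(scores)
```

Because every model is run on the same prompts and scored by the same functions, two models could be compared by their averaged numbers rather than by eyeballing samples. Whether short reference answers like these say anything about long-form generation quality is a separate question.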

2 comments

Thankfully, there already exist evaluation tasks like that, and Eleuther actually has a project collecting a handful of them together; see https://github.com/EleutherAI/lm-evaluation-harness/
You have an idea of what's sensical and not. It would seem that long-range correlations in the text break down. This is true of all of the popular models, even the trillion-parameter ones. They just break down at longer distances.
I suspect this is at least partly because of the way they are used (and maybe trained?)

Always, "given a prompt, keep talking." No instructions to go anywhere, so it's no surprise that they do not.

I think, "start with this idea, end with this one" should give much more interesting results. Telling it to start with a premise and come up with the filler needed to draw some conclusion. It would give it more of a target for making long-distance connections.

Otherwise you just get this open-loop blabbering, which I agree seems really useless. With a more "directed" model I can see this having actual applications, like with story writing or interactive video games. But as it stands this seems totally uninteresting from an applications point of view.