The model needs to be retrained from scratch for different types of texts. One can release a model trained to generate Trump tweets, but it's not much use for generating fake news on a specific topic.
Most of the examples don't rhyme. It's unclear to me if this is because most of the original poetry doesn't rhyme so it's just faithfully replicating the lack of rhyme, or if it only partially and accidentally grasps the idea of rhyme.
Some of the ones I like are 'We never say "Thank you"', 'Thy soul, thy very soul is burning!', '"It is morn!" said the clover-bush', 'And they have seen the last light fail', 'There comes a murmur low and sweet'.
Probably the best IMO is 'The sun is gone, and the night is late', but of course everyone will have a different favorite.
Yes, "The sun is gone..." starts out amazingly well. But later it fixates on tides for some reason :)
Everything is generated by the 117M model, correct? If so, do you expect the quality to improve for larger models, or is there not enough poetry to train them on? I wonder how much of all poetry is contained in the Gutenberg poetry corpus...
It's a mix of OA 117M and 345M at the moment. I haven't observed too much in the way of overfitting yet, so there should still be benefits to going up another 4.4x in model size to 1.5B. My guess is that at 1.5B, it'll start being more important to improve the poetry corpus, since you can already start to see problems with it - the Alexander Pope brokenness and the occasional prose generation of footnotes/commentary are definitely undesirable, and I suspect there would be less 'run on' effect in samples if the original corpus actually properly marked '<|endoftext|>' for each poem...
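To make the '<|endoftext|>' point concrete, here is a minimal sketch of marking poem boundaries before training. It assumes the corpus has already been split into one string per poem, which the Gutenberg poetry corpus does not give you for free; `join_poems` and the sample poems are made up for illustration.

```python
# Hypothetical helper: assumes poems are already separated into individual strings.
END_OF_TEXT = "<|endoftext|>"

def join_poems(poems):
    """Build one training string, closing every poem with the delimiter."""
    cleaned = [p.strip() for p in poems if p.strip()]  # drop empty entries
    return "".join(f"{p}\n{END_OF_TEXT}\n" for p in cleaned)

# Placeholder poems, not taken from any real corpus.
poems = [
    "The sun is gone,\nand the night is late",
    "   ",
    "There comes a murmur\nlow and sweet",
]
corpus = join_poems(poems)
print(corpus)
```

With the boundaries marked, the model at least has a chance to learn where a poem stops instead of running into the next one.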
Maybe I don't understand something about these models. If the model was trained to mimic Trump tweets, it means that someone spent days of GPU time to find the weights of the model. Now if we want it to mimic HN comments, we'd need to spend the same amount of GPU time to find different weights. This is what I meant by "from scratch".
> ... if we want it to mimic HN comments, we'd need to spend the same amount of GPU time ...
These models are often much more general than you seem to be thinking. There's a base model which is incredibly computationally expensive to create from scratch. It is trained on a very large, very general set of data. Then there are specialized versions which are much cheaper to create - you start from the base model that you already have, and you train (much more briefly) on a specific set of data in order to tailor the output.
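A toy sketch of the "start from the base model, then train briefly" idea. This is not GPT-2 or any real library: a two-parameter linear model stands in for the language model, and the datasets, step counts and learning rate are made-up assumptions. It only shows that a short training run goes much further from pretrained weights than from zero.

```python
# Toy illustration only: a linear model y = w*x + b stands in for a language model.

def train(w, b, data, steps, lr=0.1):
    """Run `steps` rounds of full-batch gradient descent on mean squared error."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(w, b, data):
    """Mean squared error of the model on `data`."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

xs = [i / 10 for i in range(-10, 11)]
general_data = [(x, 2.0 * x + 1.0) for x in xs]   # stand-in for the big general corpus
specific_data = [(x, 2.2 * x + 0.8) for x in xs]  # stand-in for tweets / HN comments

# Expensive step, done once: long training from zero on the general data.
base_w, base_b = train(0.0, 0.0, general_data, steps=2000)

# Cheap step: a short run on the specific data, starting from the base weights.
tuned_w, tuned_b = train(base_w, base_b, specific_data, steps=20)

# Same short budget, but starting from zero instead of the base weights.
scratch_w, scratch_b = train(0.0, 0.0, specific_data, steps=20)

tuned_loss = mse(tuned_w, tuned_b, specific_data)
scratch_loss = mse(scratch_w, scratch_b, specific_data)
print(f"fine-tuned: {tuned_loss:.5f}  from scratch: {scratch_loss:.5f}")
```

The 2000-step run is the part you pay for once; each new specialization only costs the 20-step run on top of it.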
> Modern image recognition models have millions of parameters. Training them from scratch requires a lot of labeled training data and a lot of computing power (hundreds of GPU-hours or more). Transfer learning is a technique that shortcuts much of this by taking a piece of a model that has already been trained on a related task and reusing it in a new model.