Hacker News
by minimaxir 1976 days ago
As someone who maintains a package that makes it easy to either fine-tune GPT-2 or create your own model from scratch (https://github.com/minimaxir/aitextgen), I think this submission is a good run-through of the technical considerations involved in building a GPT-2 model.

It's both substantially easier and faster than it was when OpenAI released their paper in 2019, thanks to Huggingface Transformers and Tokenizers making the architectures more efficient, and to other companies streamlining the training process and making every part of the pipeline more efficient.

You don't need a TPU cluster to train a working GPT-2 model, although it helps (unfortunately, TPU support for PyTorch-based training tools like aitextgen is fussier). A free GPU on Colab gets you most of the way, especially since you can now get a T4 or a V100, which lets you use FP16.


Yep, I started off trying to get it to work with PyTorch (https://github.com/bkkaggle/lm-training-research-project/blo...), then with pt-lightning, but the whole one-user-VM-per-TPU-board limitation in pytorch-xla 7-8 months ago made me switch over to TF.
Just as Google wants you to do. Within 3-5 years you will probably see a steep price increase and have nowhere to go.
Heh. I've been using JAX for a couple of months and it's been a pretty nice replacement for both PT and TF. It feels like what an ML framework would look like if it were built around easy scaling and dev friendliness.
> You don't need a TPU cluster to train a working GPT-2 model [...] A free GPU on Colab gets you most of the way

I have a hard time believing you can really train it with one V100, unless you are talking about an extremely scaled-down version of GPT-2 (large).

If you can train it at all it would be with a batch size so small (probably 1?) that it would hurt the performance and it would take months.

Am I out of the loop somehow?

Edit: I was thinking about reproducing the training that OpenAI did in their paper, so redoing all the pre-training, but I realized you might have been talking about training on a smaller custom dataset.
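
For a rough sense of why the full model is a stretch on one card, here is a back-of-the-envelope sketch. Everything in it is an assumption rather than a measurement: plain FP32 training with Adam (about 16 bytes per parameter for weights, gradients and the two optimizer moments), approximate parameter counts for the four released GPT-2 sizes, and a 16 GB V100. Activations are ignored, so real usage would be higher and would grow with batch size.

```python
# Rough estimate only: weights + gradients + Adam state, no activations.
def training_memory_gb(n_params, bytes_weights=4, bytes_grads=4, bytes_optimizer=8):
    """Approximate training memory in GB for FP32 weights with Adam."""
    return n_params * (bytes_weights + bytes_grads + bytes_optimizer) / 1e9


# Approximate parameter counts (assumption: rounded public figures).
GPT2_SIZES = {"124M": 124e6, "355M": 355e6, "774M": 774e6, "1.5B": 1.5e9}
V100_MEMORY_GB = 16  # assumption: the 16 GB variant of the card

for name, n_params in GPT2_SIZES.items():
    needed = training_memory_gb(n_params)
    verdict = "fits" if needed < V100_MEMORY_GB else "does not fit"
    print(f"GPT-2 {name}: ~{needed:.1f} GB before activations ({verdict} in {V100_MEMORY_GB} GB)")
```

Under these assumptions the 1.5B model needs more memory than the card has before a single activation is stored, while the smallest model leaves plenty of headroom, which is consistent with the guess that a smaller model or fine-tuning was meant.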

Also, he must be talking about training a much smaller model than the 1.5B one, because otherwise that would maybe take years.
What do you think would be necessary to generate rhyming text with a particular phrasing / rhythm?

e.g. in the style of a particular rapper?

If you just fine-tune on a corpus of their lyrics, you might miss the underlying poetic constraints.

If there were an additional prior (a "poetry / assonance / rhyme" model), what is the easiest way to constrain generation to respect this prior?

Thanks!

I wrote "Stylistic Rhyme-bound Poetry Generation or: How You Too Can Generate Sonnets in the Style of Kanye West" [1] back in 2017 for an easy DIY introduction to this topic. You specify the rhyming scheme (ABAB CDCD etc) and it forces end-line rhymes around it.

It uses Markov chains instead of GPT-2, but the approach should work with prompt-based things like GPT-2 also: for lines that are "free" (e.g. no specific word you need to rhyme with), you can generate the line normally -- but for lines you need to rhyme with a specific word, you can just generate last-word-first and generate backwards. For a strictly LTR prompt like GPT-2, you could probably just reverse your corpus word order, generate "reverse" lines with GPT-2 given the previous line + word you need to rhyme with as the prompt, and then reverse it back to "normal" in postprocessing.
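
A minimal sketch of the generate-backwards idea, using a toy Markov chain as a stand-in for the generator (this is not the code from the linked post, and the function names and the three-line corpus are made up for illustration). The chain is built over reversed lines, generation starts from the word that has to rhyme, and the result is flipped back at the end.

```python
import random
from collections import defaultdict


def reverse_words(line):
    """Reverse the word order of a line: 'a b c' -> 'c b a'."""
    return " ".join(reversed(line.split()))


def build_backward_chain(corpus_lines):
    """Map each word to the list of words that precede it in the corpus."""
    chain = defaultdict(list)
    for line in corpus_lines:
        words = reverse_words(line).split()
        for current, previous in zip(words, words[1:]):
            chain[current].append(previous)
    return chain


def generate_line_ending_with(chain, rhyme_word, max_words=8, rng=None):
    """Start at the rhyme word, walk the chain backwards, then un-reverse."""
    rng = rng or random.Random()
    reversed_line = [rhyme_word]
    # Stop at the length cap or when the last word has no known predecessor.
    while len(reversed_line) < max_words and chain.get(reversed_line[-1]):
        reversed_line.append(rng.choice(chain[reversed_line[-1]]))
    return " ".join(reversed(reversed_line))


# Hypothetical toy corpus, only here to make the sketch runnable.
corpus = [
    "the night is young and so are we",
    "you and me beneath the tree",
    "nothing good is ever free",
]
chain = build_backward_chain(corpus)
line = generate_line_ending_with(chain, "tree", rng=random.Random(0))
print(line)
```

With GPT-2 the chain would presumably be replaced by a model fine-tuned on the word-reversed corpus, prompted with the previous line plus the rhyme word; the final reversal step stays the same.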

[1] https://festivalpeak.com/stylistic-rhyme-bound-poetry-genera...

Some examples of the output of this approach:

[2] https://medium.com/words-of-mimicry/kanye-west-ballade-1-a6f...

[3] https://medium.com/words-of-mimicry/me-you-and-slow-sip-slow...

I'd expect the output to be better with something like GPT-2/3, since Markov chains are so twentieth-century, but I was pretty happy with the output quality, even though it often rhymed the same word repeatedly. You could improve it by down-weighting previously used words, removing them from the pool of rhyming words, and/or backtracking to previous lines when you run out of words to rhyme.
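
One of those improvements, removing already-used words from the rhyme pool, could look roughly like this. It is a sketch with made-up names: `rhyme_pool` is assumed to be a list of candidate words that rhyme with the target (however you obtain it), and returning `None` is the signal for the caller to backtrack.

```python
import random


def pick_rhyme(rhyme_pool, used, rng=None):
    """Pick a rhyme word that has not been used yet; None means 'backtrack'."""
    rng = rng or random.Random()
    candidates = [word for word in rhyme_pool if word not in used]
    if not candidates:
        return None  # pool exhausted: caller should backtrack to an earlier line
    word = rng.choice(candidates)
    used.add(word)  # remember it so the same word is not rhymed twice
    return word


# Hypothetical rhyme pool for illustration.
pool = ["day", "way", "say"]
used_words = set()
picks = [pick_rhyme(pool, used_words, rng=random.Random(1)) for _ in range(4)]
print(picks)
```

The first three picks are distinct words from the pool and the fourth is `None`, which is where backtracking would kick in.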

A paper was recently released for that particular use case (https://github.com/markriedl/weirdai); it describes a number of technical caveats (and it's technically not using GPT-2).

I do think it's possible to train a GPT-2-esque network to do something similar, albeit with some text encoding shenanigans.

As far as I know to get a V100 you need Colab Pro? Did this change recently?
It's unclear. I've heard of people getting the V100 without Colab Pro, although I do use Colab Pro and get a V100 almost every time.

As an aside, if you do get a V100, Colab Pro is by far the cheapest way to train an AI model. ($10/mo is much, much cheaper than the usual $2.48+/hr on GCP!) You do need to sync checkpoints to external storage, though, in case the notebook dies.
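
A minimal sketch of that checkpoint syncing, assuming the external storage shows up as an ordinary directory (for example a mounted drive); the function name and any paths you pass in are placeholders, not part of Colab or aitextgen. It copies to a temporary name first and then renames, so a notebook that dies mid-copy does not leave a truncated file under the real name.

```python
import os
import shutil


def sync_checkpoint(checkpoint_path, backup_dir):
    """Copy a checkpoint file into backup_dir and return the backup path."""
    os.makedirs(backup_dir, exist_ok=True)
    final_path = os.path.join(backup_dir, os.path.basename(checkpoint_path))
    tmp_path = final_path + ".tmp"
    shutil.copy2(checkpoint_path, tmp_path)  # copy contents and metadata
    os.replace(tmp_path, final_path)  # rename over any older backup
    return final_path
```

You would call something like this every N training steps, right after the checkpoint is written, with `backup_dir` pointing at whatever persistent location you have.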

> As an aside, if you do get a V100, Colab Pro is by far the cheapest way to train an AI model.

But others should be aware that you get what you pay for. Google still rate limited me when I used Colab Pro, and I ran into a myriad of other small problems. If that's all one is willing to spend to play with AI, 100% go for it. It's a great place to start. But if you're at all serious and can afford it, I think a local machine with a modest GPU is worth every penny.

Curious: is it better to train locally on something like a 2080 Ti 11GB or go for Colab and offload checkpoints to S3?

Asking because it seems V100 performance (or the other colab paid GPU) is worth the occasional instability if you’ve set up checkpoints.

Look under "FP16 16-bit (Half Precision) Floating Point Calculations" on https://www.microway.com/knowledge-center-articles/compariso...

These raw numbers don't tell the whole story, of course. But IMHO, the convenience of a local 2080 Ti outweighs the speed benefits of a _somewhat flaky_ V100 via Colab for day-to-day use (unless memory size is an issue, which you can't really get around).

OTOH, for just trying out stuff / one-offs, Colab is perfect - and bonus points if you score a V100.

Alas, only if you live in the US.

Colab Pro isn't available outside the US (without breaking Google's terms).

US and Canada.