
by minimaxir 1976 days ago

As someone who maintains a package that makes it easy to either fine-tune GPT-2 or create your own model from scratch (https://github.com/minimaxir/aitextgen), I think this submission is a good run-through of the technical considerations in building a GPT-2 model.

It's both substantially easier and faster than it was when OpenAI released their paper in 2019, thanks to Huggingface Transformers and Tokenizers making the architectures more efficient, and to other companies streamlining every part of the training pipeline.

You don't need a TPU cluster to train a working GPT-2 model, although it helps (unfortunately, TPU support for PyTorch-based training like aitextgen is fussier). A free GPU on Colab gets you most of the way, especially since you can now get a T4 or a V100, which lets you use FP16.
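To make the FP16 point concrete, here is a minimal sketch using only Python's standard `struct` module. It shows that a half-precision value takes half the storage of a single-precision one, at the cost of precision. The helper name `to_fp16` is made up for this illustration; real training code would rely on the framework's mixed-precision support rather than anything like this.

```python
import struct

# Bytes needed to store one value in each IEEE 754 format.
FP32_BYTES = struct.calcsize("<f")  # single precision
FP16_BYTES = struct.calcsize("<e")  # half precision


def to_fp16(value: float) -> float:
    """Round-trip a Python float through half precision (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


if __name__ == "__main__":
    print(f"FP32: {FP32_BYTES} bytes per value, FP16: {FP16_BYTES} bytes per value")
    # 0.1 cannot be represented exactly in half precision, so it comes back slightly off.
    print(f"0.1 stored as FP16 reads back as {to_fp16(0.1)!r}")
```

Halving the bytes per value is what frees up GPU memory for larger batches; the rounding shown here is why frameworks usually keep some values in FP32.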

4 comments

bkkaggle 1976 days ago

Yep, I started off trying to get it to work with PyTorch (https://github.com/bkkaggle/lm-training-research-project/blo...), then with pt-lightning, but the whole one-user-VM-per-TPU-board limitation in pytorch-xla 7-8 months ago made me switch over to TF.

punnerud 1976 days ago

Just as Google wants you to do. Within 3-5 years you will probably see a steep price increase and have nowhere to go.

bkkaggle 1976 days ago

Heh. I've been using JAX for a couple of months and it's been a pretty nice replacement for both PT and TF. It feels like what an ML framework would look like if it were built around easy scaling and dev friendliness.

sailingparrot 1976 days ago

> You don't need a TPU cluster to train a working GPT-2 model [...] A free GPU on Colab gets you most of the way

I have a hard time believing you can really train it with a single V100, unless you are talking about an extremely scaled-down version of GPT-2 (large).

If you can train it at all, it would be with a batch size so small (probably 1?) that it would hurt performance, and it would take months.

Am I out of the loop somehow?

Edit: I was thinking about reproducing the training that OpenAI did in their paper, so redoing all the pre-training, but I realized you might have been talking about training on a smaller custom dataset.

make3 1976 days ago

Also, he must be talking about training a much smaller model than the 1.5B one, because otherwise that would take years, maybe.
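A back-of-the-envelope memory estimate shows why model size matters so much on a single GPU. The sketch below assumes plain FP32 training with Adam (weights, gradients, and two optimizer moments, so 16 bytes per parameter), ignores activations entirely, and assumes a 16 GB card. The 124M figure for the smallest GPT-2 and all names here are assumptions for illustration, not numbers from this thread.

```python
# Assumption: FP32 weights (4 bytes) + FP32 gradients (4 bytes)
# + two FP32 Adam moment buffers (8 bytes) per parameter.
BYTES_PER_PARAM = 4 + 4 + 8

# Assumption: a 16 GB GPU; activations are ignored, so real usage is higher.
GPU_MEMORY_GB = 16.0


def training_state_gb(num_params: float) -> float:
    """Memory for weights, gradients, and Adam state, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM / 1e9


if __name__ == "__main__":
    for name, params in [("124M (smallest GPT-2)", 124e6), ("1.5B", 1.5e9)]:
        needed = training_state_gb(params)
        verdict = "fits" if needed < GPU_MEMORY_GB else "does not fit"
        print(f"{name}: ~{needed:.1f} GB of training state, {verdict} in {GPU_MEMORY_GB:.0f} GB")
```

Under these assumptions the 1.5B model needs about 24 GB before a single activation is stored, while the smallest model needs about 2 GB, which is consistent with the point that single-GPU training implies a much smaller model.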

bravura 1976 days ago

What do you think would be necessary to generate rhyming text with a particular phrasing / rhythm?

e.g. in the style of a particular rapper?

If you just fine-tune on a corpus of their lyrics, you might miss the underlying poetic constraints.

If there were an additional prior (a "poetry / assonance / rhyme" model), what is the easiest way to constrain generation to respect this prior?

Thanks!

drusepth 1976 days ago

I wrote "Stylistic Rhyme-bound Poetry Generation or: How You Too Can Generate Sonnets in the Style of Kanye West" [1] back in 2017 for an easy DIY introduction to this topic. You specify the rhyming scheme (ABAB CDCD etc) and it forces end-line rhymes around it.

It uses Markov chains instead of GPT-2, but the approach should work with prompt-based things like GPT-2 also: for lines that are "free" (e.g. no specific word you need to rhyme with), you can generate the line normally -- but for lines you need to rhyme with a specific word, you can just generate last-word-first and generate backwards. For a strictly LTR prompt like GPT-2, you could probably just reverse your corpus word order, generate "reverse" lines with GPT-2 given the previous line + word you need to rhyme with as the prompt, and then reverse it back to "normal" in postprocessing.
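The last-word-first idea can be sketched with a tiny Markov chain built over reversed lines. This is only an illustration under stated assumptions, not the code from the linked article: the function names, the `END` marker, and the sample corpus are all made up here.

```python
import random
from collections import defaultdict

END = None  # marks the start of the original line (the end of the reversed one)


def build_reverse_chain(lines):
    """Map each word to the words that directly precede it in the corpus."""
    chain = defaultdict(list)
    for line in lines:
        words = line.split()[::-1]  # reverse the word order of the line
        for current, previous in zip(words, words[1:] + [END]):
            chain[current].append(previous)
    return chain


def generate_line_ending_with(chain, last_word, rng, max_words=12):
    """Generate backwards from last_word, then flip the result to reading order."""
    words = [last_word]
    while len(words) < max_words:
        candidates = chain.get(words[-1])
        if not candidates:
            break  # word never seen in the corpus
        previous = rng.choice(candidates)
        if previous is END:
            break  # reached the start of a line
        words.append(previous)
    return " ".join(reversed(words))


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "a dog ran past the cat"]  # placeholder corpus
    chain = build_reverse_chain(corpus)
    print(generate_line_ending_with(chain, "mat", random.Random(0)))
```

Because generation starts at the rhyme word and walks backwards, the line is guaranteed to end on it; the same trick applied to GPT-2 would mean training on reversed text and un-reversing the output.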

[1] https://festivalpeak.com/stylistic-rhyme-bound-poetry-genera...

Some examples of the output of this approach:

[2] https://medium.com/words-of-mimicry/kanye-west-ballade-1-a6f...

[3] https://medium.com/words-of-mimicry/me-you-and-slow-sip-slow...

I'd expect the output to be better with something like GPT-2/3, since Markov chains are so twentieth-century, but I was pretty happy with the output quality even though it often rhymed the same word repeatedly; you could improve it by weighting previously-used words, removing them from the pool of rhyming words, and/or backtracking to previous lines when you find yourself without other words to rhyme.
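One of those improvements, removing already-used words from the rhyme pool and signalling when to backtrack, might look like the sketch below. The spelling-based `rhymes` test is a deliberately crude stand-in (a real system would compare phonemes), and every name here is hypothetical.

```python
import random


def rhymes(a: str, b: str, suffix_len: int = 2) -> bool:
    """Crude rhyme test: same trailing letters, different word (assumption)."""
    a, b = a.lower(), b.lower()
    return a != b and a[-suffix_len:] == b[-suffix_len:]


def pick_rhyme(target, vocabulary, used, rng):
    """Return an unused word rhyming with target, or None to signal backtracking."""
    pool = sorted(w for w in vocabulary if rhymes(w, target) and w not in used)
    if not pool:
        return None  # caller should backtrack to an earlier line
    choice = rng.choice(pool)
    used.add(choice)  # never offer this word again
    return choice


if __name__ == "__main__":
    vocabulary = {"night", "light", "sight", "day"}
    used = set()
    rng = random.Random(0)
    for _ in range(3):
        print(pick_rhyme("night", vocabulary, used, rng))
```

With only two rhymes for "night" in this toy vocabulary, the third call returns `None`, which is the point at which a generator would backtrack and pick a different end word for an earlier line.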

minimaxir 1976 days ago

A paper was recently released for that particular use case (https://github.com/markriedl/weirdai); it describes a number of technical caveats (and it's technically not using GPT-2).

I do think it's possible to train a GPT-2-esque network to do something similar, albeit with some text encoding shenanigans.

FL33TW00D 1976 days ago

As far as I know to get a V100 you need Colab Pro? Did this change recently?

minimaxir 1976 days ago

It's unclear. I've heard of people getting the V100 without Colab Pro, although I do use Colab Pro and get a V100 almost every time.

As an aside, if you do get a V100, Colab Pro is by far the cheapest way to train an AI model. ($10/mo is much, much cheaper than $2.48+/hr on GCP normally!) You do need to sync checkpoints to external storage in case the notebook dies, though.
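The price comparison is easy to check with a little arithmetic. The sketch below uses only the two prices quoted above, treats $2.48/hr as a lower bound (hence the "+"), and assumes the flat fee covers however many hours you actually manage to get; the function names are made up for the example.

```python
COLAB_PRO_PER_MONTH = 10.00  # USD, flat fee quoted above
GCP_V100_PER_HOUR = 2.48     # USD, lower bound quoted above


def break_even_hours(flat_monthly: float, hourly: float) -> float:
    """Hours of hourly-billed use that cost the same as the flat monthly fee."""
    return flat_monthly / hourly


def hourly_cost(hours: float, hourly: float = GCP_V100_PER_HOUR) -> float:
    """Total cost of a given number of hours at the hourly rate."""
    return hours * hourly


if __name__ == "__main__":
    hours = break_even_hours(COLAB_PRO_PER_MONTH, GCP_V100_PER_HOUR)
    print(f"Break-even: about {hours:.1f} V100-hours per month")
    print(f"40 hours at the hourly rate: ${hourly_cost(40):.2f}")
```

At these prices the flat fee pays for itself after roughly four hours of V100 time in a month.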

fpgaminer 1976 days ago

> As an aside, if you do get a V100, Colab Pro is by far the cheapest way to train an AI model.

But others should be aware that you get what you pay for. Google still rate limited me when I used Colab Pro, and I ran into a myriad of other small problems. If that's all one is willing to spend to play with AI, 100% go for it. It's a great place to start. But if you're at all serious and can afford it, I think a local machine with a modest GPU is worth every penny.

nsomaru 1976 days ago

Curious; is it better to train locally on something like a 2080ti 11G or go for colab and offload checkpoints to S3?

Asking because it seems V100 performance (or the other colab paid GPU) is worth the occasional instability if you’ve set up checkpoints.
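For reference, the checkpoint offloading being discussed can be as simple as periodically copying new files to durable storage. The sketch below uses only the standard library and copies to a directory, on the assumption that the remote storage is mounted as a folder; pushing to S3 would need an S3 client instead, which is not shown. The function name and file names are hypothetical.

```python
import shutil
from pathlib import Path


def sync_checkpoints(local_dir, remote_dir):
    """Copy checkpoint files that are missing or changed remotely; return their names."""
    local_dir, remote_dir = Path(local_dir), Path(remote_dir)
    remote_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(local_dir.glob("*")):
        if not src.is_file():
            continue  # skip subdirectories in this simple sketch
        dst = remote_dir / src.name
        changed = (
            not dst.exists()
            or src.stat().st_size != dst.stat().st_size
            or int(src.stat().st_mtime) > int(dst.stat().st_mtime)
        )
        if changed:
            shutil.copy2(src, dst)  # copy2 also preserves the modification time
            copied.append(src.name)
    return copied


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        local, remote = Path(tmp) / "local", Path(tmp) / "remote"
        local.mkdir()
        (local / "step_100.ckpt").write_bytes(b"weights")
        print(sync_checkpoints(local, remote))  # first run copies the file
        print(sync_checkpoints(local, remote))  # second run finds nothing new
```

Calling something like this every few hundred training steps means a dead notebook costs at most the work since the last sync.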

mdda 1976 days ago

Look under "FP16 16-bit (Half Precision) Floating Point Calculations" on https://www.microway.com/knowledge-center-articles/compariso...

These raw numbers don't tell the whole story, of course. But IMHO, the convenience of a local 2080Ti outweighs the speed benefits of a _somewhat flaky_ V100 via Colab for day-to-day use (unless memory size is an issue, which you can't really get around).

OTOH, for just trying out stuff / one-offs, Colab is perfect - and bonus points if you score a V100.

byefruit 1976 days ago

Alas, only if you live in the US.

Colab Pro isn't available outside the US (without breaking Google's terms).

infinite8s 1975 days ago

US and Canada.