Hacker News new | ask | show | jobs
by minimaxir 1939 days ago
Note that the encoded dataset (through the GPT-2 BPE tokenizer) will be much, much less than 17GB, both on disk and in memory (in my experience it can be anywhere from 1/3rd to 1/2 the size)

If finetuning an existing GPT-2 model, you must use that BPE tokenizer; you could theoretically use your own but that wouldn't make a difference performance-wise and you'd just have a lot of wasted tokenspace.

The efficiencies of using your own tokenizer for bespoke, esoteric content that does not match typical internet speak (like this) are why I recommend training your own tokenizer and GPT-2 from scratch if possible.

1 comments

Yeah, I trained my own BPE tokenizer for this and it results in pretty good compression. From 1024 BPE tokens you can generate anywhere from 2000-6000 actual characters of text. My guess is that it's a bit more efficient than English-BPE because there's a lot of repetitive stuff in source code (think spaces for indentation, or "if("/"while("/"for (int").