|
|
|
|
|
by moyix
1939 days ago
|
|
Yeah, I trained my own BPE tokenizer for this and it results in pretty good compression. From 1024 BPE tokens you can generate anywhere from 2000-6000 actual characters of text. My guess is that it's a bit more efficient than English-BPE because there's a lot of repetitive stuff in source code (think spaces for indentation, or "if("/"while("/"for (int"). |
|