Hacker News
by moyix 1939 days ago
Yeah, I trained my own BPE tokenizer for this and it results in pretty good compression. From 1024 BPE tokens you can generate anywhere from 2000 to 6000 actual characters of text, so roughly 2 to 6 characters per token. My guess is that it's a bit more efficient than English-BPE because there's a lot of repetitive stuff in source code (think spaces for indentation, or "if("/"while("/"for (int").
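To make the compression point concrete, here is a minimal toy sketch of BPE training in Python. It is not the tokenizer described above: the C-like `sample` string, the merge count of 100, and the helper names `train_bpe`, `apply_merge`, and `chars_per_token` are all assumptions made up for illustration. It only shows how repeated patterns such as indentation and loop headers get merged into single tokens, which raises the characters-per-token ratio.

```python
from collections import Counter


def apply_merge(tokens, a, b):
    """Replace each adjacent (a, b) pair with the single token a + b."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Toy BPE: start from characters, repeatedly merge the most frequent pair."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more, so further merges gain nothing
        merges.append((a, b))
        tokens = apply_merge(tokens, a, b)
    return merges, tokens


def chars_per_token(text, tokens):
    """Average number of characters covered by one token."""
    return len(text) / len(tokens)


# Hypothetical sample: repetitive C-like code, invented for this sketch.
sample = (
    "for (int i = 0; i < n; i++) {\n"
    "    if (a[i] > 0) {\n"
    "        while (a[i] > 0) {\n"
    "            a[i]--;\n"
    "        }\n"
    "    }\n"
    "}\n"
) * 10

merges, tokens = train_bpe(sample, 100)
print(f"{len(sample)} chars -> {len(tokens)} tokens")
print(f"chars per token: {chars_per_token(sample, tokens):.2f}")
```

A real tokenizer is trained on a large corpus with a fixed vocabulary size, so the ratio printed here says nothing about the 2000 to 6000 character figure; the sketch only illustrates the mechanism.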