| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by moyix 1939 days ago
	Yeah, I trained my own BPE tokenizer for this and it results in pretty good compression. From 1024 BPE tokens you can generate anywhere from 2000-6000 actual characters of text. My guess is that it's a bit more efficient than English-BPE because there's a lot of repetitive stuff in source code (think spaces for indentation, or "if("/"while("/"for (int").