Hacker News
by justinhj 352 days ago
I've been playing with tokenization too. Starting from Karpathy's Python minbpe, I set myself the task of training a tokenizer on wikitext (500 MB) in a reasonable time. I got the C++ version down to about 50 minutes, compared to an estimated several months for the original Python code.

Haven't really spent much time looking at encode and decode, but I plan to incorporate these regex modifications when I do!

https://github.com/justinhj/minbpe-cc