by justinhj
352 days ago
I've been playing with tokenization too. Starting from Karpathy's Python minbpe, I set myself the task of training a tokenizer on wikitext (500 MB) in a reasonable time.
I got the C++ version down to about 50 minutes, compared to an estimated several months for the original Python code. I haven't really spent much time
looking at encode and decode yet, but I plan to incorporate these regex modifications when I do! https://github.com/justinhj/minbpe-cc