by kevmo314
358 days ago
Nice work! I tried something similar a while back (https://github.com/kevmo314/tokie). My takeaway was also that the running cost is dominated by pretokenization (the regex). It's cool that you found a faster way to run the regex, but have you tried swapping out just the regex engine, leaving the actual BPE to tiktoken, and comparing the performance? I wonder if that is upstreamable.
I've reached out to the guy who maintains Tiktoken to talk about this.