by kevmo314 358 days ago
Nice work! I tried something similar a while back: https://github.com/kevmo314/tokie

My takeaway was the same: the running time was dominated by pretokenization (the regex). It's cool to see that you found a faster way to run the regex, but have you tried swapping out just the regex engine, leaving the actual BPE to tiktoken, and comparing the performance? I wonder if that change is upstreamable.
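For anyone who wants to check the split between the two stages themselves, here is a rough sketch of how the timing could be separated. Everything in it is an assumption for illustration: the pattern is a simplified, ASCII-only stand-in for a GPT-2-style pretokenizer (not tiktoken's actual regex), and `bpe`, `ranks`, and `time_stages` are hypothetical names, not tiktoken APIs. Timings from a pure-Python toy like this say nothing about tiktoken itself; it only shows the shape of the measurement.

```python
import re
import time

# Assumption: simplified ASCII-only stand-in for a GPT-2-style pattern.
# The real patterns rely on Unicode classes that stdlib `re` lacks.
PRETOKENIZE = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)


def pretokenize(text):
    """Stage 1: split text into pieces with the regex."""
    return PRETOKENIZE.findall(text)


def bpe(piece, ranks):
    """Stage 2 (toy): repeatedly merge the lowest-ranked adjacent pair.

    `ranks` is a hypothetical table mapping a merged string to its priority.
    """
    parts = list(piece)
    while len(parts) > 1:
        candidates = [
            (ranks[parts[i] + parts[i + 1]], i)
            for i in range(len(parts) - 1)
            if parts[i] + parts[i + 1] in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts


def time_stages(text, ranks, repeats=5):
    """Return (seconds in pretokenization, seconds in BPE) over `repeats` runs."""
    t0 = time.perf_counter()
    for _ in range(repeats):
        pieces = pretokenize(text)
    t1 = time.perf_counter()
    for _ in range(repeats):
        for piece in pieces:
            bpe(piece, ranks)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    toy_ranks = {"he": 0, "ll": 1, "hell": 2, "hello": 3}
    regex_s, bpe_s = time_stages("hello world, it's 2024! " * 1000, toy_ranks)
    print(f"pretokenization: {regex_s:.4f}s, BPE: {bpe_s:.4f}s")
```

Swapping the regex engine while leaving the BPE alone would then amount to replacing only `pretokenize` and re-running the same measurement.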

2 comments

Cool!

I've reached out to the guy who maintains tiktoken to talk about this.

There is at least some awareness already when it comes to the performance of the regex engine:

https://github.com/openai/tiktoken/blob/main/src/lib.rs#L95-...