by wofo 1269 days ago
I don't know exactly how the word-breaking algorithm works under the hood, but the Rust library I am using seems to have linear complexity, so it should be possible to improve unicode_string (if someone is willing to dedicate the time).
2 comments

As the author of bstr, and also of the regex implementation that bstr uses for word breaking, I can confirm that it is linear time. It handles invalid UTF-8 as it encounters it: invalid UTF-8 is treated as if it were the replacement codepoint (U+FFFD), via the "substitution of maximal subparts" strategy. See: https://docs.rs/bstr/latest/bstr/#handling-of-invalid-utf-8

NSFL regex that implements word breaking: https://github.com/BurntSushi/bstr/blob/86947727666d7b21c97e...
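
To make the "substitution of maximal subparts" idea concrete, here is a rough sketch. It does not use bstr at all: it relies on the standard library's `String::from_utf8_lossy`, which (as far as the std test suite shows) replaces each maximal invalid subpart with a single U+FFFD. The helper names `lossy_decode` and `replacement_count` are made up for illustration.

```rust
/// Decode `bytes` as UTF-8, substituting U+FFFD for invalid input.
/// Assumption: this goes through std's lossy decoder, not bstr; it is
/// only meant to illustrate the replacement strategy.
fn lossy_decode(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Count the U+FFFD characters in the lossy decoding of `bytes`.
/// Note: a U+FFFD that was validly encoded in the input is counted too.
fn replacement_count(bytes: &[u8]) -> usize {
    lossy_decode(bytes)
        .chars()
        .filter(|&c| c == '\u{FFFD}')
        .count()
}
```

For example, `lossy_decode(b"a\xF1\x80\x80b")` gives `"a\u{FFFD}b"`: the three bytes `F1 80 80` are a valid prefix of a four-byte sequence that got cut short, so they collapse into one replacement character. By contrast, `b"\xF0\x80\x80\x80"` yields four replacement characters, because `F0 80` can never begin a valid sequence and each byte is replaced separately.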

Thanks for clarifying. And thanks for the article, it was a great read!