Hacker News new | ask | show | jobs
by landhar 1263 days ago
> During development, I noticed that the unicode_string library raises an exception when trying to get the next word if there is invalid UTF-8 anywhere in the string. That seemed weird to me, because I would expect the library to only scan the part of the string that is relevant to get the next word. If the library is checking the whole string for UTF-8 validity every time you want to get the length of the next word, that will lead to non-linear algorithmic complexity.

I wonder: could the algorithm implemented in unicode_string be improved? Or were the algorithms implemented as Rust NIFs also prone to this non-linear complexity?

2 comments

I think the answer is yes. See my comment here: https://news.ycombinator.com/item?id=34253876

They are repeatedly calling Regex.split on the trailing part of the string as they move across the string. Regex.split finds all the matches in the string so runs in linear time so you get a quadratic algorithm. The elixir/erlang regex implementation does not check if the string is valid utf8 before running the regex. They were getting the error because the whole string was being evaluated by the regex engine. It probably won't be as fast as the rust solution but it won't be horribly slow on large strings.

I don't know exactly how the word breaking algorithm works under the hood, but the Rust library I am using seems to have linear complexity, so it should be possible to improve unicode_string (if someone is willing to dedicate the time)
As the author of bstr and also the regex implementation that bstr uses to implement word breaking, it is linear time. It deals with invalid UTF-8 as it sees it. When invalid UTF-8 is encountered, it is treated as if it were the replacement codepoint via the "substitution of maximal subparts" strategy. See: https://docs.rs/bstr/latest/bstr/#handling-of-invalid-utf-8

NSFL regex that implements word breaking: https://github.com/BurntSushi/bstr/blob/86947727666d7b21c97e...

Thanks for clarifying. And thanks for the article, it was a great read!