| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by landhar 1263 days ago
	> During development, I noticed that the unicode_string library raises an exception when trying to get the next word if there is invalid UTF-8 anywhere in the string. That seemed weird to me, because I would expect the library to only scan the part of the string that is relevant to get the next word. If the library is checking the whole string for UTF-8 validity every time you want to get the length of the next word, that will lead to non-linear algorithmic complexity. I wonder: could the algorithm implemented in unicode_string be improved? Or were the algorithms implemented as Rust NIFs also prone to this non-linear complexity?

2 comments

benmmurphy 1262 days ago

I think the answer is yes. See my comment here: https://news.ycombinator.com/item?id=34253876

They are repeatedly calling Regex.split on the trailing part of the string as they move across the string. Regex.split finds all the matches in the string so runs in linear time so you get a quadratic algorithm. The elixir/erlang regex implementation does not check if the string is valid utf8 before running the regex. They were getting the error because the whole string was being evaluated by the regex engine. It probably won't be as fast as the rust solution but it won't be horribly slow on large strings.

link

wofo 1263 days ago

I don't know exactly how the word breaking algorithm works under the hood, but the Rust library I am using seems to have linear complexity, so it should be possible to improve unicode_string (if someone is willing to dedicate the time)

link

burntsushi 1263 days ago

As the author of bstr and also the regex implementation that bstr uses to implement word breaking, it is linear time. It deals with invalid UTF-8 as it sees it. When invalid UTF-8 is encountered, it is treated as if it were the replacement codepoint via the "substitution of maximal subparts" strategy. See: https://docs.rs/bstr/latest/bstr/#handling-of-invalid-utf-8

NSFL regex that implements word breaking: https://github.com/BurntSushi/bstr/blob/86947727666d7b21c97e...

link

landhar 1263 days ago

Thanks for clarifying. And thanks for the article, it was a great read!

link