|
|
|
|
|
by landhar
1263 days ago
|
|
> During development, I noticed that the unicode_string library raises an exception when trying to get the next word if there is invalid UTF-8 anywhere in the string. That seemed weird to me, because I would expect the library to only scan the part of the string that is relevant to get the next word. If the library is checking the whole string for UTF-8 validity every time you want to get the length of the next word, that will lead to non-linear algorithmic complexity. I wonder: could the algorithm implemented in unicode_string be improved? Or were the algorithms implemented as Rust NIFs also prone to this non-linear complexity? |
|
They are repeatedly calling Regex.split on the trailing part of the string as they move across the string. Regex.split finds all the matches in the string so runs in linear time so you get a quadratic algorithm. The elixir/erlang regex implementation does not check if the string is valid utf8 before running the regex. They were getting the error because the whole string was being evaluated by the regex engine. It probably won't be as fast as the rust solution but it won't be horribly slow on large strings.