| HN Mirror

As the author of bstr and also the regex implementation that bstr uses to implement word breaking, it is linear time. It deals with invalid UTF-8 as it sees it. When invalid UTF-8 is encountered, it is treated as if it were the replacement codepoint via the "substitution of maximal subparts" strategy. See: https://docs.rs/bstr/latest/bstr/#handling-of-invalid-utf-8

NSFL regex that implements word breaking: https://github.com/BurntSushi/bstr/blob/86947727666d7b21c97e...