Hacker News new | ask | show | jobs
by shellac 635 days ago
The (standard) ripgrep regex engine has full unicode support. My reading of that is that it should handle such equivalences like matching the decomposed version.
1 comments

It does not. Almost no regex engine does that.

To add more color to this, the precise details of what "Unicode support" means are documented here: https://github.com/rust-lang/regex/blob/master/UNICODE.md

In effect, all of UTS#18 Level 1 is covered with a couple caveats. This is already a far cry better than most regex engines, like PCRE2, which has limited support for properties and no way to do subtraction or intersection of character classes. Other regex engines, like Javascript, are catching up. While UTS#18 Level 1 make ripgrep's Unicode support better than most, it does not make it the best. The third party Python `regex` library, for example, has very good support, although it is not especially fast[1].

Short of building UTS#18 2.1[2] support into the regex engine (unlikely to ever happen), it's likely ripgrep could offer some sort of escape hatch. Perhaps, for example, an option to normalize all text searched to whatever form you want (nfc, nfd, nfkc or nfkd). The onus would still be on you to write the corresponding regex pattern though. You can technically do this today with ripgrep's `--pre` flag, but having something built-in might be nice. Indeed, if you read UTS#18 2.1, you'll note that it is self-aware about how difficult matching canonical equivalents is, and essentially suggests this exact work-around instead. The problem is that it would need to be opt-in and the user would need to be aware of the problem in the first place. That's... a stretch, but probably better than nothing.

[1]: https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa...

[2]: https://unicode.org/reports/tr18/#Canonical_Equivalents

Thanks very much for clarifying that. It did seem unlikely: I remember NSString (ask you parents...) supported this level of unicode equivalence, and it was quite a burden. Normalising does feel like the only tractable method here, and if you have an extraction pipeline anyway (in rga) maybe it's not so bad.
Yes, rga could support this in a more streamlined manner than rg, since rga has that extraction pipeline with caching. ripgrep just has a hook for an extraction pipeline.
For the purpose of searching, wouldn’t it be sufficient to do NFC normalization for text? Could hide that behind a command line flag even…
Can you say how that differs from what I suggested in my last paragraph? I legitimately can't tell if you're trying to suggest something different or not.

As UTS#18 2.1 says, it isn't sufficient to just normalize the text you're searching. It also means the user has to craft their regex appropriately. If you normalize to NFC but your regex uses NFD, oops. So it's probably best to expose a flag that lets you pick the normalization form.

And yes, it would have to be behind a CLI flag. Always doing normalization would likely make ripgrep slower than a naive grep written in Python. Yes. That bad. Yes. Really. And it might now become clear why a lot of tools don't do this.