by umanwizard 1392 days ago
By implementing Unicode's case folding specification in detail, as described in section 5.18 of the Unicode standard.

Or, more likely, by using a library like ICU.
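
For illustration only, here is a minimal Python sketch of a locale-independent caseless comparison. It uses the standard library's `str.casefold()` and `unicodedata.normalize()` rather than ICU, which is a swap made for brevity. The helper names `caseless_key` and `caseless_equal` are made up for this example, and the sketch covers default case folding only, with no locale tailoring.

```python
import unicodedata


def caseless_key(s: str) -> str:
    """Build a comparison key: normalize, case fold, normalize again.

    Hypothetical helper for illustration. It uses default (untailored)
    Unicode case folding, so the result does not depend on host locale.
    """
    decomposed = unicodedata.normalize("NFD", s)
    # Case folding can produce text that is no longer normalized,
    # so normalize once more after folding.
    return unicodedata.normalize("NFD", decomposed.casefold())


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings without regard to case."""
    return caseless_key(a) == caseless_key(b)


# Full case folding maps the German sharp s to "ss".
print(caseless_equal("Straße", "STRASSE"))
# Precomposed and decomposed forms of e-acute compare equal.
print(caseless_equal("\u00e9", "E\u0301"))
```

Whether default folding is the right choice depends on the application; the point is only that the comparison never consults the host's locale.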


it is host specific.
Unicode does not depend on a host. You probably meant something else, and the expression came out wrong.

Care to explain in more detail?

As the Unicode standard describes (e.g. in the same section 5.18 mentioned above), case mapping depends on locale. Lowercasing the same string may therefore give different results on different hosts, so whether lowercase(x)==lowercase(y) holds is not universal; it depends on the host locale.

See the standard (https://www.unicode.org/versions/Unicode11.0.0/ch05.pdf) for the most commonly cited example, the Turkish i; there are others.
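
To make the Turkish i case concrete, here is a rough Python sketch. Python's built-in `str.lower()` ignores the host locale, so the Turkic behavior is imitated by hand. The function `lowercase` and its `lang` parameter are hypothetical, and the tailoring is deliberately simplified: it handles only the two capital letters I and İ and ignores the combining-dot cases that the real rules also cover.

```python
DOTLESS_SMALL_I = "\u0131"   # ı
DOTTED_CAPITAL_I = "\u0130"  # İ


def lowercase(s: str, lang: str) -> str:
    """Lowercase s for an explicitly supplied language tag.

    Hypothetical helper: the language is an argument, never read
    from the host environment.
    """
    if lang in ("tr", "az"):
        # Simplified Turkic tailoring: İ -> i and I -> ı.
        s = s.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return s.lower()


x, y = "TITLE", "title"
# The same comparison gives different answers under different rules.
print(lowercase(x, "en") == lowercase(y, "en"))  # True
print(lowercase(x, "tr") == lowercase(y, "tr"))  # False
```

Under the Turkic rules "TITLE" lowercases to "tıtle", with a dotless ı, which is why the second comparison fails.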

If you're setting up proper case folding, part of your job is not leaving locale up to the host.
Indeed, a fundamental part of the problem is that most Unicode text doesn't actually carry the relevant locale information... (Of course, one probably wouldn't want to rely on sender-specified locales for email addresses when deciding address equality; that would open one up to all sorts of weird scenarios, i.e. a security nightmare.)