| HN Mirror


	by landofredwater 1392 days ago
	What would the mechanism behind case-insensitive string comparison be? How would you program out all the edge cases?

3 comments

umanwizard 1392 days ago

By implementing Unicode’s case folding specification in detail, as described in section 5.18 of the Unicode standard.

Or, more likely, by using a library like ICU.

silon42 1392 days ago

It is host-specific.

bmn__ 1392 days ago

Unicode does not depend on a host. You probably meant something else, and the expression came out wrong.

Care to explain in more detail?

PeterisP 1392 days ago

As the Unicode standard describes (e.g. the same section 5.18 mentioned above), case mapping depends on locale, so lowercasing the same string may give different results on different hosts. Whether lowercase(x)==lowercase(y) holds is therefore not universal and depends on the host locale.

See the standard https://www.unicode.org/versions/Unicode11.0.0/ch05.pdf for the most commonly used example of Turkish i, but there are others.
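To make the Turkish i case concrete, here is a minimal Python sketch. Python's `str.lower()` always applies the locale-independent default mapping, so the Turkish behaviour is imitated by `turkish_lower`, a hypothetical and deliberately simplified helper that is not part of any library and only handles the two letters in question.

```python
def default_lower(s: str) -> str:
    # Python's str.lower() uses the default, locale-independent case mapping.
    return s.lower()


def turkish_lower(s: str) -> str:
    # Hypothetical, simplified Turkish tailoring (illustration only):
    # dotted capital I (U+0130) -> "i", plain capital "I" -> dotless i (U+0131),
    # then the default mapping for every other character.
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


x, y = "TITLE", "title"
print(default_lower(x) == default_lower(y))  # True
print(turkish_lower(x) == turkish_lower(y))  # False: "t\u0131tle" vs "title"
```

The same pair of strings compares equal under one mapping and unequal under the other, which is the host-dependence being described.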

Dylan16807 1392 days ago

If you're setting up proper case folding, part of your job is not leaving locale up to the host.

Quekid5 1392 days ago

Indeed, a fundamental part of the problem is that most Unicode text doesn't actually carry the relevant locale information... (Of course, one probably wouldn't want to rely on sender-specified locales for email addresses when deciding address equality -- that would open one up to all sorts of potential weird scenarios, i.e. a nightmare for security).

connicpu 1392 days ago

Presumably the case-insensitive version is also doing Unicode normalization, which is what a byte-level comparison of tolower versions would miss.
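A rough Python sketch of that difference, using only the standard library. The helper names `nfd` and `caseless_equal` are made up for this example. The comparison follows the canonical caseless matching recipe (normalize, case fold, normalize again), and it is not a claim about what any particular implementation does internally.

```python
import unicodedata


def nfd(s: str) -> str:
    # Canonical decomposition (NFD) of the input string.
    return unicodedata.normalize("NFD", s)


def caseless_equal(a: str, b: str) -> bool:
    # Normalize, apply full case folding, then normalize again,
    # since case folding can produce non-normalized output.
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())


precomposed = "\u00c9"   # capital E with acute as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

print(precomposed.lower() == decomposed.lower())  # False
print(caseless_equal(precomposed, decomposed))    # True
```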

kubanczyk 1392 days ago

A primer (taken straight from GP's first link):

> the full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.
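The quoted example can be checked in Python, where `str.casefold()` applies full case folding while `str.lower()` does not expand "ß". This is just a quick illustration of that one pair, not a complete comparison routine.

```python
a, b = "MASSE", "Ma\u00dfe"  # the second string is "Maße"

# Plain lowercasing keeps the sharp s, so the strings differ.
print(a.lower() == b.lower())        # False

# Full case folding maps the sharp s to "ss", so the strings match.
print(a.casefold() == b.casefold())  # True
```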

skupig 1392 days ago

This sounds like a fun vulnerability to find in a password reset flow

thrdbndndn 1392 days ago

Yeah, but aren't email addresses in ASCII? I still have no idea why it would be different.

umanwizard 1392 days ago