Hacker News new | ask | show | jobs
by nine_k 2723 days ago
I'd posit that this is not universally true. It may be true for English names.

It may be implementable for e.g. French names: a name like "François" may be stored as UTF-8 with "ç" always represented as a composite pair (or always a single character), and the application layer knows and uses this.

It must be a dubious idea for German names when you may need to see "Müller" and "Mueller" as the same name, but also keep e.g. only "Mueller" as the form referenced in legal papers.

Once you start considering anything non-ASCII, things become hairy. (Of course, if you only work for the US market and don't care about any international travelers as customers, you can continue using these introductory courses as a guidance.)

1 comments

Of course it's hairy - that's kind of the point of this whole discussion!

People who use databases naturally end up wanting to store Unicode strings, and people naturally want to query them in locale-sensitive ways. Your Mueller example is a fine example of this. Databases have deep support for this, and you can also build your own compromise, so to speak, on top of the database with ICU or another library if it doesn't meet your needs.

Storing is fine!

Indexing is full of caveats, though.

Indexing is always full of caveats. :-)