Hacker News new | ask | show | jobs
by rraval 2720 days ago
> the solution for which is to use icu

We ran headfirst into this issue at my company and we've actually been recommending the opposite (use the "C" locale on the database, treat collation as a render level concern).

I have a whole write up explaining the technical motivations behind that recommendation: https://gist.github.com/rraval/ef4e4bdc63e68fe3e83c9f98f56af...

4 comments

It's easy to say "treat collation as a render level concern", but this doesn't really work efficiently when the rendering component wants to query the database using this index, does it?

That is, to do anything you want to do a locale sensitive way, such as querying the database for a given case insensitive string, or pagination, you'll need to have the DB index be locale/collation aware or else return every possible value and the renderer sort it out.

As an example, how you would you repeatedly return a range of results based on a string in a locale-aware way, e.g., to display paged results, if you defer the work to the renderer?

The only general solution I'm aware of which lets you "bypass" the DB collation is to use a locale-aware collation library like ICU to generate a binary sort key, which can be compared using plain binary comparison and storing those in the DB. This still means the overall index is locale aware, but the DB doesn't need to be aware of any collation rules: only the code that generates queries and handles the results needs to do the sort key transformation.

It means that you have a single library that does all the conversion, which you can probably control more easily, rather than delegating this to the database, where you might need to support several vendors or at least various versions (and problems can arise even within a single DB version as this postgres issue shows).

If you build a plain index over Unicode strings, chances are you're doing it wrong.

Normal DB indexes are mostly for numbers and (short) ASCII strings. (Something like canonicalized UTF-8 is an edge case.) For strings that have encodings and locales, you likely need a full-text index, provided by your DB or by something like Solr / ElasticSearch. It addresses the oddities of human-oriented texts better.

I'm sorry but no.

It's a pretty common use case (so common that it is usually taught in an introduction to databases) that you might want an index that allows you to efficiently search by non-ASCII strings like say, a person's last name.

I'd posit that this is not universally true. It may be true for English names.

It may be implementable for e.g. French names: a name like "François" may be stored as UTF-8 with "ç" always represented as a composite pair (or always a single character), and the application layer knows and uses this.

It must be a dubious idea for German names when you may need to see "Müller" and "Mueller" as the same name, but also keep e.g. only "Mueller" as the form referenced in legal papers.

Once you start considering anything non-ASCII, things become hairy. (Of course, if you only work for the US market and don't care about any international travelers as customers, you can continue using these introductory courses as a guidance.)

Of course it's hairy - that's kind of the point of this whole discussion!

People who use databases naturally end up wanting to store Unicode strings, and people naturally want to query them in locale-sensitive ways. Your Mueller example is a fine example of this. Databases have deep support for this, and you can also build your own compromise, so to speak, on top of the database with ICU or another library if it doesn't meet your needs.

Storing is fine!

Indexing is full of caveats, though.

It's wrong to claim indexes are "for" numbers and ASCII. They're for names of things, and names fill in the propositions the relational model is built on.

If you're using surrogate keys for everything, it means your tools are broken or you don't trust them. It may be the least worst available option to plaster over them, but you shouldn't accept that as a best practice.

> For strings that have encodings and locales, you likely need a full-text index, provided by your DB or by something like Solr / ElasticSearch.

Really, a reverse index, which, as its name implies, does the opposite of what you want if you're trying to index something.

Those products "work" by doing text preprocessing (stemming and such), which anyone can do, and then they toss a bunch of garbage back and let the user sort through it.

That doesn't solve the problem a database index is trying to address.

Indexes are not "for", they will happily index anything. But they are "best used for" certain things.

Tools are not broken; reality is. So you have to work around assumptions that are pertinent to blind usage of technology.

You need to think how to use a tool (such as a Unicode-capable DB text type) to get a correct result. And first you need to understand what constitutes the correct result. Else you may end up with strings that are equally correct and nominally equal, but bytewise different, so search by index fails.

So in summary it's a hard problem and requires some smarts to get right? Then we agree.

Retreating to ASCII is not the answer, and "full text search engines" don't even solve the same class of problem, such as say paginating a list of local towns by name.

Um, no?

The most important job of a DB is store records which largely consist of numeric (including boolean) values and strings. Strings often need to be outside the ASCII range, and even in existing cases where they aren't supported, it's a limitation of the application and the users would really prefer unicode strings.

Even if you just target English speaking people from the US or whatever, there are a variety of non-ASCII characters that come up all the time, including accented characters, various currency signs, etc.

The motivations are strong here. You want your database to be as sane as possible.

But this will make some features really hard, if you need them. Pagination based on sorted values subject to collation wouldn't be queryable, you would need to either get all the data and sort it, or query for a key and the sorted column, sort and then query for details on the displayed items.

Selecting a range would also be potentially difficult.

This has the advantage that your database operations run a heck of a lot faster, but has the potential disadvantage that primary key uniqueness may not be maintained, if you think that alternate ways of writing the same characters in unicode matters for that.
As you say, I'd question the technical motivation for enforcing uniqueness on unicode data in the first place (and as a primary key on top of that???)

However, if someone really wanted to accomplish this, they could probably use PostgreSQL's functional indices and unicode normalization to do it.

Normalization is a separate issue, you can normalize and then use the C collation order.
Sure but either you are talking about a fixed normalization algorithm which is not locale aware, in which case it doesn't solve the locale-specific unique key issue, or it is locale-aware and hence suffers the same problem with time-varying behavior.
You are using string/text values as a pk and trying to sort on em? I'd say this is another reason not to do that.
Well I am not doing any of the things in the comment chain leading up to this, but probably the mention of primary key was a red herring. The GP's [1] point was that some index constraint (they mentioned PK uniqueness, but it could really be any constraint) might not be correctly maintained if the DB was not aware of the correct collation order. So from the point of view of the renderer, which is locale aware and uses locale-based collation, the DB is violating the constraints.

---

[1] The GP relative to my reply

Using the C locale certainly helps, but do watch out: you still need to normalize. That means you still need something like ICU.