Hacker News new | ask | show | jobs
by nine_k 2720 days ago
If you build a plain index over Unicode strings, chances are you're doing it wrong.

Normal DB indexes are mostly for numbers and (short) ASCII strings. (Something like canonicalized UTF-8 is an edge case.) For strings that have encodings and locales, you likely need a full-text index, provided by your DB or by something like Solr / ElasticSearch. It addresses the oddities of human-oriented texts better.

3 comments

I'm sorry but no.

It's a pretty common use case (so common that it is usually taught in an introduction to databases) that you might want an index that allows you to efficiently search by non-ASCII strings like say, a person's last name.

I'd posit that this is not universally true. It may be true for English names.

It may be implementable for e.g. French names: a name like "François" may be stored as UTF-8 with "ç" always represented as a composite pair (or always a single character), and the application layer knows and uses this.

It must be a dubious idea for German names when you may need to see "Müller" and "Mueller" as the same name, but also keep e.g. only "Mueller" as the form referenced in legal papers.

Once you start considering anything non-ASCII, things become hairy. (Of course, if you only work for the US market and don't care about any international travelers as customers, you can continue using these introductory courses as a guidance.)

Of course it's hairy - that's kind of the point of this whole discussion!

People who use databases naturally end up wanting to store Unicode strings, and people naturally want to query them in locale-sensitive ways. Your Mueller example is a fine example of this. Databases have deep support for this, and you can also build your own compromise, so to speak, on top of the database with ICU or another library if it doesn't meet your needs.

Storing is fine!

Indexing is full of caveats, though.

Indexing is always full of caveats. :-)
It's wrong to claim indexes are "for" numbers and ASCII. They're for names of things, and names fill in the propositions the relational model is built on.

If you're using surrogate keys for everything, it means your tools are broken or you don't trust them. It may be the least worst available option to plaster over them, but you shouldn't accept that as a best practice.

> For strings that have encodings and locales, you likely need a full-text index, provided by your DB or by something like Solr / ElasticSearch.

Really, a reverse index, which, as its name implies, does the opposite of what you want if you're trying to index something.

Those products "work" by doing text preprocessing (stemming and such), which anyone can do, and then they toss a bunch of garbage back and let the user sort through it.

That doesn't solve the problem a database index is trying to address.

Indexes are not "for", they will happily index anything. But they are "best used for" certain things.

Tools are not broken; reality is. So you have to work around assumptions that are pertinent to blind usage of technology.

You need to think how to use a tool (such as a Unicode-capable DB text type) to get a correct result. And first you need to understand what constitutes the correct result. Else you may end up with strings that are equally correct and nominally equal, but bytewise different, so search by index fails.

So in summary it's a hard problem and requires some smarts to get right? Then we agree.

Retreating to ASCII is not the answer, and "full text search engines" don't even solve the same class of problem, such as say paginating a list of local towns by name.

Um, no?

The most important job of a DB is store records which largely consist of numeric (including boolean) values and strings. Strings often need to be outside the ASCII range, and even in existing cases where they aren't supported, it's a limitation of the application and the users would really prefer unicode strings.

Even if you just target English speaking people from the US or whatever, there are a variety of non-ASCII characters that come up all the time, including accented characters, various currency signs, etc.