| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by BeeOnRope 2720 days ago

It's easy to say "treat collation as a render level concern", but this doesn't really work efficiently when the rendering component wants to query the database using this index, does it?

That is, to do anything you want to do a locale sensitive way, such as querying the database for a given case insensitive string, or pagination, you'll need to have the DB index be locale/collation aware or else return every possible value and the renderer sort it out.

As an example, how you would you repeatedly return a range of results based on a string in a locale-aware way, e.g., to display paged results, if you defer the work to the renderer?

The only general solution I'm aware of which lets you "bypass" the DB collation is to use a locale-aware collation library like ICU to generate a binary sort key, which can be compared using plain binary comparison and storing those in the DB. This still means the overall index is locale aware, but the DB doesn't need to be aware of any collation rules: only the code that generates queries and handles the results needs to do the sort key transformation.

It means that you have a single library that does all the conversion, which you can probably control more easily, rather than delegating this to the database, where you might need to support several vendors or at least various versions (and problems can arise even within a single DB version as this postgres issue shows).

1 comments

nine_k 2720 days ago

If you build a plain index over Unicode strings, chances are you're doing it wrong.

Normal DB indexes are mostly for numbers and (short) ASCII strings. (Something like canonicalized UTF-8 is an edge case.) For strings that have encodings and locales, you likely need a full-text index, provided by your DB or by something like Solr / ElasticSearch. It addresses the oddities of human-oriented texts better.

link

cbsmith 2720 days ago

I'm sorry but no.

It's a pretty common use case (so common that it is usually taught in an introduction to databases) that you might want an index that allows you to efficiently search by non-ASCII strings like say, a person's last name.

link

nine_k 2720 days ago

I'd posit that this is not universally true. It may be true for English names.

It may be implementable for e.g. French names: a name like "François" may be stored as UTF-8 with "ç" always represented as a composite pair (or always a single character), and the application layer knows and uses this.

It must be a dubious idea for German names when you may need to see "Müller" and "Mueller" as the same name, but also keep e.g. only "Mueller" as the form referenced in legal papers.

Once you start considering anything non-ASCII, things become hairy. (Of course, if you only work for the US market and don't care about any international travelers as customers, you can continue using these introductory courses as a guidance.)

link

BeeOnRope 2719 days ago

Of course it's hairy - that's kind of the point of this whole discussion!

People who use databases naturally end up wanting to store Unicode strings, and people naturally want to query them in locale-sensitive ways. Your Mueller example is a fine example of this. Databases have deep support for this, and you can also build your own compromise, so to speak, on top of the database with ICU or another library if it doesn't meet your needs.

link

nine_k 2719 days ago

Storing is fine!

Indexing is full of caveats, though.

link

cbsmith 2719 days ago

Indexing is always full of caveats. :-)

link

ben509 2720 days ago

It's wrong to claim indexes are "for" numbers and ASCII. They're for names of things, and names fill in the propositions the relational model is built on.

If you're using surrogate keys for everything, it means your tools are broken or you don't trust them. It may be the least worst available option to plaster over them, but you shouldn't accept that as a best practice.

> For strings that have encodings and locales, you likely need a full-text index, provided by your DB or by something like Solr / ElasticSearch.

Really, a reverse index, which, as its name implies, does the opposite of what you want if you're trying to index something.

Those products "work" by doing text preprocessing (stemming and such), which anyone can do, and then they toss a bunch of garbage back and let the user sort through it.

That doesn't solve the problem a database index is trying to address.

link

nine_k 2720 days ago

Indexes are not "for", they will happily index anything. But they are "best used for" certain things.

Tools are not broken; reality is. So you have to work around assumptions that are pertinent to blind usage of technology.

You need to think how to use a tool (such as a Unicode-capable DB text type) to get a correct result. And first you need to understand what constitutes the correct result. Else you may end up with strings that are equally correct and nominally equal, but bytewise different, so search by index fails.

link

BeeOnRope 2719 days ago

So in summary it's a hard problem and requires some smarts to get right? Then we agree.

Retreating to ASCII is not the answer, and "full text search engines" don't even solve the same class of problem, such as say paginating a list of local towns by name.

link

BeeOnRope 2719 days ago

Um, no?

The most important job of a DB is store records which largely consist of numeric (including boolean) values and strings. Strings often need to be outside the ASCII range, and even in existing cases where they aren't supported, it's a limitation of the application and the users would really prefer unicode strings.

Even if you just target English speaking people from the US or whatever, there are a variety of non-ASCII characters that come up all the time, including accented characters, various currency signs, etc.

link