Hacker News new | ask | show | jobs
by ben509 2722 days ago
It's wrong to claim indexes are "for" numbers and ASCII. They're for names of things, and names fill in the propositions the relational model is built on.

If you're using surrogate keys for everything, it means your tools are broken or you don't trust them. It may be the least worst available option to plaster over them, but you shouldn't accept that as a best practice.

> For strings that have encodings and locales, you likely need a full-text index, provided by your DB or by something like Solr / ElasticSearch.

Really, a reverse index, which, as its name implies, does the opposite of what you want if you're trying to index something.

Those products "work" by doing text preprocessing (stemming and such), which anyone can do, and then they toss a bunch of garbage back and let the user sort through it.

That doesn't solve the problem a database index is trying to address.

1 comments

Indexes are not "for", they will happily index anything. But they are "best used for" certain things.

Tools are not broken; reality is. So you have to work around assumptions that are pertinent to blind usage of technology.

You need to think how to use a tool (such as a Unicode-capable DB text type) to get a correct result. And first you need to understand what constitutes the correct result. Else you may end up with strings that are equally correct and nominally equal, but bytewise different, so search by index fails.

So in summary it's a hard problem and requires some smarts to get right? Then we agree.

Retreating to ASCII is not the answer, and "full text search engines" don't even solve the same class of problem, such as say paginating a list of local towns by name.