Hacker News new | ask | show | jobs
MySQL (InnoDB) clustered and non-clustered indexes (n3n.in)
23 points by skipass 4404 days ago
4 comments

I saw a presentation on an interesting hack done with InnoDB primary keys used to cluster similar data.

A hotel booking company created a primary key which would put geographically similar entries into primary keys which were close together. i.e. there wouldn't be an entry for a hotel in Saigon ordered between two entries for San Francisco.

I don't recall the specifics of how they did it, but due to how InnoDB orders pages on disk according to their primary key, it meant that all of the results for a particular location could be accessed by fetching one to two pages of data from disk.

Resulted in some huge performance gains for them, both by reducing the number of disk IOs required for your average query, and by reducing the on-the-fly geographical calculations they had to do in the DB. They could do a range query on the primary key in the database and then do fine-grained geolocation on a less contended resource against a small number of entries.

The presentation was probably this one: http://www.percona.com/live/mysql-conference-2014/sites/defa...

Slides 16-18.

A Z-order curve maps from two (or more) dimensions to one while preserving locality: http://en.wikipedia.org/wiki/Z-order_curve
Depending on your application you might want to pick a different space filling curve (http://en.wikipedia.org/wiki/Space-filling_curve). Different curves lead to different characteristics for nearness. You can also use multiple types of curves or two of the same curve oriented differently and then use the minimum distance of the curves as your distance estimate for some applications.

If you just want to generate keys you should look at Locality-Sensitive Hashing (http://en.wikipedia.org/wiki/Locality-sensitive_hashing).

That is most likely it, I was guessing they used the interleaving method presented there, but didn't know the formal term.
A nice little article. I've always felt that database indexes are one of those fringe topics that's just esoteric enough that people acquire a basic knowledge of them and know they need them, yet understanding indexes and using them correctly leads amazing performance.

On clustered indexes in particular, I try to follow these principles:

Narrow - in terms of a data type's byte length, so that more keys can be packed into each level of the B-tree. You can end up having to traverse fewer intermediate levels to reach the leaf level, where the data resides.

Unique - This ties into the point above, there will be no need to add a "uniquifier" to the key, helping to save space and the overhead of managing that extra tid-bit of data

Static - ideally, never changing. By it's nature, a clustered index is ordered, and updating/changing the keys will lead to data that's in the wrong place. You can kind of get around this by rebuilding the index after every update, but that just adds another task you'll need to manage more often.

Increasing - this can lead to faster inserts; in a sense, the DB is just filling up the last page of the index and then adding another when it needs to. (I think after a certain amount of inserts, eventually you'll have to re-build the index (to add more intermediate level nodes) but I can't recall the specifics, and you're going to have to do rebuild indexes anyway with all indexing strategies.)

If, like me, you found the idea interesting but the article a bit confused there's another good discussion on it here: http://use-the-index-luke.com/blog/2014-01/unreasonable-defa...
I think it would have been easier for me to understand if they put everything in terms of a B+ tree, where the leaf nodes contain keys and values.
InnoDB does indeed use a B+ tree. A more thorough article on the topic is http://blog.jcole.us/2013/01/10/btree-index-structures-in-in... or the book High Performance MySQL.
I remember when I was learning SQL, I thought "Why doesn't the database just look up the records faster by storing them in a B+ tr.... Oh, wait, it does that already." SQL is FAST!
I know it uses a B+ tree. I wanted to mention that the article doesn't mention them at all.

I'm familiar with the High Performance MySQL book -- I work with one of the authors :).

Cool. I was more taking the opportunity in replying to you to point out that this article isn't very good :)