Hacker News new | ask | show | jobs
by elbee 4687 days ago
That looks extremely interesting! The idea of amortizing the cost of inserts is fascinating. Looking at the design you sketched a few questions come to mind:

1) Multi-threading: suppose I seek down the B-tree for key K. Most B-tree implementations use the latch on the node containing K as the final arbiter of concurrency. For example, if I'm looking at K and then I want the next row (perhaps because I'm using the new Index Condition Pushdown optimization in MySQL 5.6, or I'm doing an online index build and need to scan all the rows) then I can simply look at the next row on the page I currently have (read) latched. With a fractal tree it looks like I have to worry about someone inserting a row immediately after the current row because that insert could have been cached at a higher level. Does this mean I need to keep some sort of latch/lock on the entire b-tree path down to the page I'm reading, instead of using latch coupling to work my way down? Alternatively do I have to work my way down from the top of the tree every time I want the next key?

2) How can you check for uniqueness? Suppose I create a table like this:

CREATE TABLE t1 (id NUMBER PRIMARY KEY, val1 VARCHAR2(30));

Amortizing the inserts seems to imply that primary key uniqueness violations can't be discovered until the inserts are pushed all the way down to the leaf?! In general uniqueness is an important part of data normalization, a good input for query optimization and normally required for foreign key constraints...

2 comments

It is highly concurrent but you need some tricks beyond that simple description. Here's how we did it: http://www.tokutek.com/2013/01/concurrency-improvements-in-t... http://www.tokutek.com/2013/02/concurrency-improvements-in-t...

Yes, unique checks are bad. They make it perform as badly as a B-tree for unique inserts. There are sometimes ways around that but at some level if you aren't reading in the leaf node, you're going to be at a loss for some information. B-trees seem to have spoiled users into thinking uniqueness checks don't make inserts any more expensive, when in fact that's just because B-tree inserts are already that slow.

So out of the main b-tree operations (Insert/Replace/Delete/Seek/Next/Prev) you make a convincing argument that Tokutek can be faster than b-trees for inserts, if you use non-unique indexes (which automatically disqualifies the primary index in most cases, and makes foreign keys hard).

That is good but not quite "beat B-trees pretty handily across the board" :-)

Many primary keys in the wild are sequential, which makes the unique check hit cache and not require any I/O.

Let me be clear: in the worst case with a unique index, Fractal Tree indexes are the same speed as B-tree indexes. In most cases, they're far better. When you add in compression and agility, it really is "beat B-trees across the board."

To be honest, I have no idea what "agility" is. Perhaps I am too cautious but I'll want to see a lot more data than some hand-picked benchmarks and a mystical "unique constraints aren't interesting" before I'm completely convinced.

[Random aside: I greatly dislike block (i.e. InnoDB-style) compression, which is what Tokutek seems to use. Decompressing the entire block just to retrieve one column of one row is crazily expensive, especially as the decompression cost goes up as the block size goes up. There is also the problem of deciding whether to store compressed blocks in the buffer cache (slow), decompressed blocks in the buffer cache (wastes ram) or to have a cache of uncompressed blocks (the InnoDB approach, which kind of combines both problems). I think that format-aware page or column compression (like Oracle, SQL Server or ESENT) is far more effective.]

I guess that this has gotten way off topic but I am glad there are people out there doing exciting new things in the B-tree space. I will keep a closer eye on Tokutek in the future.

Agility refers to our online schema change abilities in mysql (hot column add/delete, hot indexing).

Our compression technique is naive, you're right (but we don't decompress the whole 4MB for one single point query, as you may be thinking, we're a bit more sophisticated). We have some more ideas in the pipeline if we need them (basically a bunch of tricks you can play when you know the schema), but the fact remains that what we have right now is extremely effective and there isn't much incentive yet to finish off our smarter prototypes.

(Don't worry about off topic I love deep conversations like this. Come to our mailing list if you want to continue more though, we may be exhausting HN)
I now realize I didn't answer how we check uniqueness. We just do a query like any other normal point query, using a serializable transaction, and if we pass the check, then we do the insert with the same transaction (which took a row lock when we did the query because it was serializable, so nothing could have violated the uniqueness between the query and the insert).
Also, MVCC snapshots mean that if you're doing a range query, it doesn't matter if someone comes in and injects a message above you while you're scanning a node.