| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by elchief 4993 days ago

When you have so many writes that sharding isn't enough.

When you make changes so fast you need a liquid schema.

When you want to make your boss learn map-reduce so he can query the data.

When the application can take care of integration and not the database itself.

2 comments

leif 4993 days ago

This will be obvious, but I work at Tokutek.

> When you have so many writes that sharding isn't enough.

TokuDB does indexed insertions very fast. [1] We've even plugged ourselves in underneath MongoDB (just for fun) and we beat them too. [2]

> When you make changes so fast you need a liquid schema.

TokuDB supports lots of schema changes with zero downtime. [3]

> When you want to make your boss learn map-reduce so he can query the data.

Can't help you there, but I can make you not need to torture your boss that way.

> When the application can care of integration and not the database.

What if it didn't need to?

We also get fantastic compression [4], retain full transactional semantics, and lots of other fun stuff.

Email us if you're curious!

[1]: http://www.tokutek.com/resources/benchmark-results/benchmark...

[2]: http://www.tokutek.com/2012/08/10x-insertion-performance-inc...

[3]: http://www.tokutek.com/resources/benchmark-results/benchmark...

[4]: http://www.tokutek.com/resources/benchmark-results/benchmark...

link

DanWaterworth 4992 days ago

In my experience, it's not just throughput that's important, but also 99th percentile latency. If I understand fractal trees correctly, you sometimes need to rewrite all of your elements on disk. How do you do this without causing lag?

link

leif 4992 days ago

That's happily not how fractal trees work, at all. We have a few talks online describing how they work. Zardosht has one here http://vimeo.com/m/26471692 . I thought Bradley had a more detailed one at MIT but I can't find it right now. (EDIT: found it! http://video.mit.edu/watch/lecture-19-how-tokudb-fractal-tre...)

Basically, I think you're thinking of the COLA. What we implement does have a literal tree structure, with nodes and children and the whole thing, so at any point you're just writing out new copies of individual nodes, which are on the order of a megabyte. At no point do we have to rewrite a large portion of the tree, so there aren't any latency issues.

link

DanWaterworth 4992 days ago

Thank you very much for the links. Is my understanding correct?

Essentially, a fractal tree is a B-Tree (or perhaps a B+Tree?) with buffers on each branch (per child). Operations get added to these buffers and when one becomes full, the operations get passed to the corresponding child node. Operations are applied when they reach the node that is responsible for the data concerned.

link

leif 4992 days ago

That's about it! It's like a B+Tree in that the data resides in the leaves (well, and in the buffers), except that the fanout is a lot lower because we save room in the internal nodes for the buffers.

link

alex137 4991 days ago

Hi. I thought the block size is much bigger in fractal trees (like 4MiB instead of 4KiB) than in B-trees hence the fanout would be about the same?

I'm trying to experiment with these ideas on my side, but can't quite grok how large the buffers must be at each level of the tree.

Let's take a concrete example like: 2^40 (1T) records with an 8-byte key and an 8-byte value. In a traditional B-tree with a 8KiB block size and assuming links take 8 bytes too and assume for a while that blocks are completely full. So, that's 1G leaf blocks, a fanout of about 1K and hence & full 4-level trees: 1 root block, 1K level-2 blocks, 1M level-3 blocks, 1G leaf blocks.

In this setting, I understand that a fractal tree will attach a buffer to each internal node. How large will they be at level 1 (root), level 2, and level 3 ?

link

DanWaterworth 4992 days ago

I'm a little confused. I thought fractal trees worked cache obliviously or am I mistaken?

link

stinkypete 4990 days ago

Sounds like a Y-Tree to me: "A Novel Index Supporting High Volume Data Warehouse Insertions," C. Jermaine et. al, VLDB '99.

How are fractal trees different?

link

leif 4988 days ago

Is that the same as Willard's Y-fast tree? If so, then no they're not very similar. If not then I'm unfamiliar and I should read that paper.

link

stinkypete 4986 days ago

Nope, not the same. From the descriptions you give above it is very similar (a modified B-tree with heaps of delayed insertions in internal nodes), though the analysis in the paper does not use CO techniques.

This tweak to B-trees to support fast insertions works so well I am really surprised it has not become more common in practice. The only downside I can think of is that flushes down the tree when internal buffers get large may cause lower-level pages to become overfull, but if you allow more than one flush operations at each level you can avoid this problem. Of course, that's only important if you are paranoid about internal nodes actually fitting in a page; if you don't care about that, then the worst-case analysis picture looks much better since a flush operation causes at most one flush in the pages immediately below it in the tree structure.

link

lwat 4993 days ago

> When you have so many writes that sharding isn't enough.

Does Mongo still have a global write lock?

link

elchief 4993 days ago

no longer (but only a few months)

link

mgummelt 4993 days ago

Though I believe it still has a per-collection lock.

link

Jare 4992 days ago

It has per-database locking. Per-collection is being worked on but no planned date yet. [1]

[1] http://www.mongodb.org/display/DOCS/How+does+concurrency+wor...

link