by barrkel 3357 days ago

What's "the database" that you have in mind?

Start out with the idea that you have hundreds of machines in your cluster, with 1000s of TB of data. Suppose the current data efficiency is on the order of 80% - that is, 80% of the 1000s of TB is the actual bytes of the data fields. What database do you have in mind to store this data, still on the order of 1000s of TB?

You say: a couple of joined tables. So you have hundreds of machines, and neither table is going to fit on one machine; each table is going to be scattered across hundreds of machines. How do you efficiently do a join across two distributed tables?

It's no picnic.
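
To make the cost concrete, here's a rough sketch of a repartition (shuffle) join, which is one common way to join two tables that are both scattered across nodes. Everything here is made up for illustration: the nodes are just Python lists, `owner` is a hypothetical partitioning rule, and `moved` counts rows that would have to cross the network in a real cluster.

```python
from collections import defaultdict

NUM_NODES = 4  # stand-in for "hundreds of machines"

def owner(key):
    # Hypothetical partitioning rule: the node responsible for a join key.
    return key % NUM_NODES

def shuffle_join(left_parts, right_parts):
    """left_parts / right_parts hold one list of (key, value) rows per node,
    placed without regard to the join key."""
    inbox_left = [defaultdict(list) for _ in range(NUM_NODES)]
    inbox_right = [defaultdict(list) for _ in range(NUM_NODES)]
    moved = 0  # rows that had to leave the node they started on

    # Shuffle phase: send every row of both tables to the node owning its key.
    for parts, inboxes in ((left_parts, inbox_left), (right_parts, inbox_right)):
        for node, rows in enumerate(parts):
            for key, value in rows:
                dest = owner(key)
                if dest != node:
                    moved += 1
                inboxes[dest][key].append(value)

    # Join phase: matching keys are now on the same node, so join locally.
    joined = []
    for node in range(NUM_NODES):
        for key, left_values in inbox_left[node].items():
            for lv in left_values:
                for rv in inbox_right[node].get(key, []):
                    joined.append((key, lv, rv))
    return joined, moved
```

With rows placed arbitrarily, most of both tables ends up moving before a single match can be produced, and that's the part that hurts at 1000s of TB.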

If each row in one table only has a few related rows in the other table, it's much, much better to store the related data inline. Locality is key; you want data in memory right now, not somewhere on disk across the network.
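
As a minimal sketch of what "inline" means here (the record layout and the name `inline_related` are my own assumptions, not any particular database's format): fold the few related rows into the parent record once, at write time, so a read touches one record on one machine and there's no join at all.

```python
def inline_related(parents, children):
    """parents: iterable of (parent_id, fields).
    children: iterable of (parent_id, child_fields).
    Assumes every child refers to an existing parent.
    Returns one self-contained record per parent."""
    records = {pid: {"fields": fields, "children": []} for pid, fields in parents}
    for pid, child in children:
        # The related row lives inside the parent record, not in a second table.
        records[pid]["children"].append(child)
    return records

records = inline_related(
    [(1, "order-1"), (2, "order-2")],
    [(1, "item-a"), (1, "item-b"), (2, "item-c")],
)
# Reading a parent together with its related rows is a single local lookup.
order = records[1]
```

The trade-off is that this only works well when the related rows are few and mostly read together with the parent, which is exactly the case described above.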