| HN Mirror


by AtlasBarfed 597 days ago

Probably related to the fundamental problem of joining distributed data within CAP constraints. Virtually all distributed databases offering full SQL are CP (that is, they assume no nodes are down; if the nodes holding the data are unreachable, the query simply won't return).

If you have distributed data, the join will get calculated by SOME node in the network, and the data will have to be streamed in and joined by that central processor. Even with modern meganodes, for BigData marketing you have to handle arbitrarily sized datasets, and that means streaming data into the processing nodes' working memory.
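A minimal sketch of that single-node case, assuming a plain in-memory hash join; `stream_from_node` is a hypothetical stand-in for a remote node sending rows in batches, not any real database's API. The smaller input is loaded into the coordinator's memory and the larger one is streamed past it:

```python
from collections import defaultdict
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]  # (join key, payload)


def stream_from_node(rows: List[Row], batch_size: int = 2) -> Iterator[List[Row]]:
    """Hypothetical stand-in for a remote node sending its rows in batches."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]


def streaming_hash_join(
    build_batches: Iterable[List[Row]],
    probe_batches: Iterable[List[Row]],
) -> Iterator[Tuple[int, str, str]]:
    # Build phase: the smaller input has to fit in the coordinator's memory.
    table = defaultdict(list)
    for batch in build_batches:
        for key, payload in batch:
            table[key].append(payload)

    # Probe phase: the larger input is streamed through batch by batch
    # and is never held in memory as a whole.
    for batch in probe_batches:
        for key, payload in batch:
            for match in table.get(key, ()):
                yield key, match, payload


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (3, "pen"), (1, "lamp"), (4, "mug")]

joined = list(streaming_hash_join(stream_from_node(users), stream_from_node(orders)))
print(joined)
```

If neither side fits in memory, the build phase is the part that breaks, which is where spilling to disk or partitioning across nodes comes in.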

Of course there are ways to distribute the join calculation (sometimes) as well, but you're still talking about merging streams of data coming into processing nodes.
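One of the "sometimes" cases is an equi-join on a key you can hash-partition. A rough sketch, assuming both tables are partitioned by `key % n_workers`, with the workers simulated as ordinary function calls (all names here are made up for illustration). Each worker joins its own partition, and the coordinator still ends up merging the sorted result streams:

```python
import heapq
from collections import defaultdict


def partition(rows, n_workers):
    """Route each row to a worker by hashing its join key (here: key % n_workers)."""
    parts = [[] for _ in range(n_workers)]
    for key, payload in rows:
        parts[key % n_workers].append((key, payload))
    return parts


def local_join(left, right):
    """Hash join run by one worker on its own partition; output sorted by key."""
    table = defaultdict(list)
    for key, payload in left:
        table[key].append(payload)
    out = [(key, l, r) for key, r in right for l in table.get(key, ())]
    return sorted(out)


def distributed_join(left, right, n_workers=2):
    left_parts = partition(left, n_workers)
    right_parts = partition(right, n_workers)
    # Matching keys always land on the same worker, so each join is purely local.
    streams = [local_join(l, r) for l, r in zip(left_parts, right_parts)]
    # The coordinator merges the already-sorted per-worker streams.
    return list(heapq.merge(*streams))


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (3, "pen"), (1, "lamp"), (2, "cup"), (4, "mug")]

result = distributed_join(users, orders)
print(result)
```

This only works because rows with equal keys are guaranteed to meet on one worker. A join on a range or inequality condition doesn't partition this cleanly.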

Now, if you have to handle AP/eventually consistent models, then it REALLY gets complicated. Ultimately your massive join (I'm assuming a join of whole tables of data, not just a denormalization join of a single row/primary key and its child foreign keys) is one big eventually consistent approximate view, and that's even before the issue of incoming updates/transactions mutating the underlying datasets as you stream and merge/filter them.
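A toy illustration of the "approximate view" point, assuming two replicas of one table where an update has reached only one of them. The replica names and data are invented for the example. The same join gives different answers depending on which replica it happens to read:

```python
def join_on_key(left, right):
    """Inner join of two dicts on their shared keys, sorted by key."""
    return sorted((k, left[k], right[k]) for k in left.keys() & right.keys())


# Hypothetical state: closing account 2 has reached replica A but not yet replica B.
accounts_replica_a = {1: "active", 2: "closed"}
accounts_replica_b = {1: "active", 2: "active"}
balances = {1: 100, 2: 250}

view_a = join_on_key(accounts_replica_a, balances)
view_b = join_on_key(accounts_replica_b, balances)

print(view_a)
print(view_b)
print(view_a == view_b)
```

Both results are "correct" under eventual consistency; neither is guaranteed to match a single point-in-time snapshot of the data.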