| HN Mirror

by personZ 4425 days ago

I wonder, have we really pushed relational databases to their breaking point?

The primary limitation of relational databases was traditionally that, without expert-level optimizations (which, realistically, data-focused organizations should have, but very seldom do, especially in the start-up space), many queries would generate large numbers of effectively random IOs. When you're rolling with magnetic drives, each drive offers maybe 60-150 IOPS, so this quickly becomes an enormous scaling problem. A large storage array offered maybe 2000 IOPS. Scaling becomes entirely about scaling IOPS, as CPU is seldom a limitation in databases.

Add that many firms were starting on EC2, which not only gave you minimal memory but also offered absolutely miserable IOPS performance.

Digg famously, and disastrously, solved this problem by essentially "denormalizing" every bit of data, enormously exploding the raw data they stored but allowing individual queries to be entirely localized, often served in a single, large IO. Instead of looking up all of your friends and finding the things they dug, the system would proactively push every bit of data to containers for every possible user. This is the model promoted by many advocates of alternative storage (e.g., the advantage of MongoDB is always "pull a single giant data bag versus pulling it together from various places").

If Kevin Rose dug something, it would update the "things my friends liked" containers for 40,000 or so of his friends, rather than having those 40,000 users check on-demand to see what each of their friends liked.
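To make the trade-off concrete, here's a minimal in-memory sketch of the two models. It is not Digg's actual code: the class and method names (`PullFeed`, `PushFeed`, `follow`, `digg`, `feed`) are made up for illustration, and plain dicts stand in for whatever storage was really used. `digg` returns the number of writes it performed, so you can see the fan-out.

```python
from collections import defaultdict


class PullFeed:
    """Normalized model (hypothetical): store each digg once, join at read time."""

    def __init__(self):
        self.friends = defaultdict(set)   # user -> users they follow
        self.diggs = defaultdict(list)    # user -> items that user dug

    def follow(self, user, friend):
        self.friends[user].add(friend)

    def digg(self, user, item):
        self.diggs[user].append(item)     # a single write, no duplication
        return 1

    def feed(self, user):
        # One lookup per friend: this is the scattered, random-IO read path.
        return sorted(item for f in self.friends[user] for item in self.diggs[f])


class PushFeed:
    """Denormalized model (hypothetical): copy each digg to every follower."""

    def __init__(self):
        self.followers = defaultdict(set)    # user -> users who follow them
        self.containers = defaultdict(list)  # user -> "things my friends liked"

    def follow(self, user, friend):
        self.followers[friend].add(user)

    def digg(self, user, item):
        # Fan-out on write: one copy of the item per follower.
        for follower in self.followers[user]:
            self.containers[follower].append(item)
        return len(self.followers[user])

    def feed(self, user):
        # A single localized read of one pre-built container.
        return sorted(self.containers[user])
```

With 40,000 followers, `PushFeed.digg` does 40,000 writes where `PullFeed.digg` does one; in exchange, `PushFeed.feed` reads a single container instead of touching every friend's data.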

But they did that right when flash storage was coming into the mainstream: a technology that offers, on simple, inexpensive cards, hundreds of thousands to millions of IOPS. Add that RAM capacity has exploded, such that servers with 256GB of memory are very affordable (that was enough to put the entire universe of Digg's data in memory, where of course random IO is in the tens to hundreds of millions of operations per second).
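A back-of-the-envelope version of that arithmetic, as a rough sketch: the IOPS figures are the ballpark numbers from this comment, while `IOS_PER_QUERY` is purely an assumed value for how many random IOs an unoptimized query might trigger. The function name is made up for illustration.

```python
def max_queries_per_second(device_iops, random_ios_per_query):
    """Upper bound on query throughput when random IO is the only bottleneck."""
    return device_iops / random_ios_per_query


# Assumption for illustration only: each query costs 50 random IOs.
IOS_PER_QUERY = 50

# Ballpark IOPS figures taken from the comment above.
devices = [
    ("single magnetic drive (upper end)", 150),
    ("large storage array", 2_000),
    ("flash card (lower end)", 100_000),
]

for name, iops in devices:
    qps = max_queries_per_second(iops, IOS_PER_QUERY)
    print(f"{name}: about {qps:,.0f} queries/sec")
```

Whatever per-query cost you plug in, the ratio between devices stays the same: the low end of flash is roughly 50 times a large array and several hundred times a single magnetic drive.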

So now we're in a situation where having a non-duplicated, highly relational database is often the highest-performance option, quite apart from all of its other advantages, because it fits in memory and fits on economical flash storage. It has completely flipped the equation.

http://www.commitstrip.com/en/2014/06/03/the-problem-is-not-...

1 comment

tomphoolery 4425 days ago

The Digg thing is interesting...they pretty much took the complete opposite approach of Reddit, who basically store everything in two big SQL tables.
