by ergest 4378 days ago
Is it me or are people switching to non-relational data warehouse architectures simply because it's en vogue? How many companies do you know that have enough data where a non-relational DW would actually make sense? I wonder, have we really pushed relational databases to their breaking point?
7 comments

I've looked at and avoided doing anything serious with HDFS/MapReduce for 6 years now. I'm glad some people are starting to realize that re-processing your entire dataset every single time you want to do something isn't very efficient. I'm still waiting for the lightbulb moment where the usefulness of it really makes sense to me.

Can anyone point me to a book or blog that discusses good uses of Hadoop/MapReduce?

I'm waiting for the day people realize that materialized views in databases are awesome and decide to incorporate them into a framework.
At least if you're using Oracle they are, as it supports auto-refreshing. Postgres has only had them since 9.3 (and they have to be refreshed manually). Meanwhile MySQL is still struggling with regular views.
Simplistically speaking, you don't always have to do table scans. I run into this every day: "Let's use Hadoop and keep doing full table scans! It's scalable! We just add more machines!" Yeah, except continuing to scan all of your growing data each time you need it is inherently unscalable. :(
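To make the scan-versus-lookup point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `events` table, its columns, and the index name are made up for illustration, and the exact wording of the plan text varies by SQLite version. It only shows that the same query stops scanning the whole table once an index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column in every SQLite version.
    return [row[-1] for row in rows]


# Hypothetical table: 10,000 events spread evenly over 1,000 users.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    [(i % 1000, "x") for i in range(10000)],
)

sql = "SELECT COUNT(*) FROM events WHERE user_id = ?"

# Without an index, the only option is to read every row.
plan_before = query_plan(conn, sql, (42,))

# With an index on user_id, the engine can seek straight to the matching rows.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_after = query_plan(conn, sql, (42,))

count = conn.execute(sql, (42,)).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("rows for user 42:", count)
```

The scan cost grows with the table; the indexed lookup grows with the number of matching rows, which is the difference being complained about above.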
It's often cheaper to use the "wrong" architecture than to optimise the right one. I know I could have a CouchDB datastore searching a few GB with an afternoon of work. I imagine I could get MySQL fast enough with a few days of optimisation. In terms of time, which is by far the biggest cost in most development, CouchDB is the better option.
In terms of time, which is by far the biggest cost in most development, CouchDB is the better option.

For a single, one-off utility, sure. For anything that you ever planned for production, that would be crazy.

Just to be clear, the mentality that onion proposes (at least from my interpretation, though I apologize if I'm misunderstanding), usually justified under a gross misinterpretation of the "premature optimization" warning, is exactly how disaster implementations come about: ones that end up failing or requiring enormous amounts of engineering time to triage and bandage into something usable.

At least for the startup world, it's about prioritisation of concerns. Will that disaster implementation take me to my next (or first) round of funding? If yes, I'll happily go with it. After that, I can throw money at the problem.
Disaster recovery is easy to put off forever because you don't need it until you do. When it happens it can also kill off your company.

I've been involved in companies that went 14 years without a disaster. Another company I was involved with had 2 in a span of 2 months, each taking between 2 and 3 days to recover from.

Regardless of whether I need it or not, I sleep better at night knowing a decent plan is in place. Which means I can perform better during the day.

Yeah, but was it your choice of DB that killed you or something else? That something else is always more likely to happen and more dangerous than 'oh noes all my data is gone stupid mongo/couch!' as if that ever really happens.
Granted, I was making some assumptions about the original post. For me it's not about the choice of DB, it's about having the proper knowledge, time, and team to be able to set up a production environment that isn't seriously flawed in one or more ways.

I would need a hell of a lot more than a time savings of 1-2 days to add a whole new database technology to my production environment. Even if I know the tech, the installation, configuration, automated backups, and automated validation of backups will likely consume more than 1-2 days to get set up. Then add on the learning curve aspect if I've never used it in a real production environment. Then add on the learning curve for any team members who might not be familiar with it.

Weeks, maybe. A day or two: not worth it.

That's just like your opinion though. You ever used CouchDB in production before?
I passed zero judgment on CouchDB, but was responding specifically to the notion that doing something "wrong" if it saves you a small amount of development time at the outset is fine. When these are foundational things like your data tier, such an attitude is a primary ingredient in project failure.
Nah, dude, you're saying that it's ok for one-off projects, but you'd be crazy to use it in production because it's gonna blow up on you eventually. That's not true at all. Plus, y'all are arguing about conjectures with that catastrophic failure stuff.
We switched to Redshift for our data warehouse because it was MUCH cheaper, declarative, allowed us to retain the relational model, and abstracted away most of the admin. Very happy so far. ~5,000 tables, 7-10TB all maintained by one guy in his spare time.

The workbench options weren't amazing so we made this for our BI team: https://github.com/zalora/redsift/

A lot are. I used to want to just because it was cool. Now that I actually do though my main use case is sifting through hundreds of GB of unstructured data. I use Hadoop to get the data into a structured form that I can then load into Redshift. It's awesome knowing that just about anything I throw at Redshift, it can handle.
I have another question: Why are people still doing joins in this day and age? Big data + joins = teh suck.

I'm a big fan of compressed denormalized data.

I wonder, have we really pushed relational databases to their breaking point?

The primary limitation of relational databases was traditionally that, without expert-level optimizations (which, realistically, data-focused organizations should have, but very seldom do, especially in the start-up space), many queries would generate large numbers of effectively random IOs. When you're rolling with magnetic drives, each drive offers maybe 60-150 IOPS, so this quickly becomes an enormous scaling problem. A large storage array offered maybe 2000 IOPS. Scaling becomes entirely about scaling IOPS, as CPU is seldom a limitation in databases.

Add that many firms were starting on EC2 which not only gave you minimal memory, it offered absolutely miserable IOPS performance.

Digg famously, and disastrously, solved this problem by essentially "denormalizing" every bit of data, enormously exploding the raw data they stored, but allowing for individual queries to be entirely localized, often served in a single, large IO: Instead of looking up all of your friends and finding the things they dug, the system would push every bit of data proactively to containers for every possible user. This is the model promoted by many advocates of alternative storage (e.g., the advantage of MongoDB is always the "pull a single giant data bag versus pulling it together from various places").

If Kevin Rose dug something, it would update the "things my friends liked" containers for 40,000 or so of his friends, rather than having those 40,000 users check on-demand to see what each of their friends liked.
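A rough sketch of the two models described above, in plain Python with in-memory dicts. This is not Digg's actual code; the class names, the `user0`..`user39999` ids, and `story-1` are all hypothetical. It only illustrates where the work lands: the pull model does one write and assembles the feed at read time, while the push model does one write per follower so that a read touches a single container.

```python
from collections import defaultdict


class PullModel:
    """Normalized: store each digg once, assemble the feed at read time."""

    def __init__(self, friends):
        self.friends = friends          # user -> set of users they follow
        self.diggs = defaultdict(list)  # user -> items that user dug

    def digg(self, user, item):
        self.diggs[user].append(item)
        return 1                        # number of writes performed

    def feed(self, user):
        # One lookup per friend: many small, scattered reads.
        return sorted(
            item
            for friend in self.friends.get(user, ())
            for item in self.diggs[friend]
        )


class PushModel:
    """Denormalized: copy each digg into every follower's container."""

    def __init__(self, followers):
        self.followers = followers      # user -> set of users following them
        self.feeds = defaultdict(list)  # user -> precomputed feed container

    def digg(self, user, item):
        targets = self.followers.get(user, ())
        for follower in targets:
            self.feeds[follower].append(item)
        return len(targets)             # number of writes performed

    def feed(self, user):
        # A single container read, at the cost of heavily duplicated data.
        return sorted(self.feeds[user])


# Hypothetical data: one popular user followed by 40,000 others.
fans = {f"user{i}" for i in range(40_000)}
pull = PullModel(friends={fan: {"kevin"} for fan in fans})
push = PushModel(followers={"kevin": fans})

pull_writes = pull.digg("kevin", "story-1")
push_writes = push.digg("kevin", "story-1")

print("pull model writes:", pull_writes)
print("push model writes:", push_writes)
print("feeds match:", pull.feed("user0") == push.feed("user0"))
```

Both models return the same feed; the trade is write amplification and storage against read locality.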

But they did that right when flash storage was coming into the mainstream: a technology that offers, on simple, inexpensive cards, hundreds of thousands to millions of IOPS. Add that RAM capacity has exploded, such that servers with 256GB of memory are very affordable (that was enough to put the entire universe of Digg's data in memory, where of course random IO runs in the tens to hundreds of millions of operations per second).
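For a feel of what those IOPS figures mean, here is a back-of-the-envelope calculation. The IOPS numbers are the rough ones quoted in this comment (upper end for a magnetic drive, lower end for flash); the workload of one million random reads is an arbitrary assumption, and the model ignores caching, queueing, and parallelism entirely.

```python
def seconds_for_random_reads(num_reads: int, iops: float) -> float:
    """Time to serve num_reads random IOs on a device sustaining `iops`."""
    return num_reads / iops


# Hypothetical workload: one million effectively random reads.
WORKLOAD_READS = 1_000_000

# Rough figures from the comment above, not measurements.
devices = {
    "single magnetic drive": 150,
    "large storage array": 2_000,
    "flash card": 100_000,
}

for name, iops in devices.items():
    secs = seconds_for_random_reads(WORKLOAD_READS, iops)
    print(f"{name:>22}: {secs:>10.1f} s")
```

Under these assumptions the same workload drops from roughly two hours on one drive to about ten seconds on flash, which is the flip being described.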

So now we're in a situation where having a non-duplicated, highly relational database is often the highest-performance option, outside of all of its other advantages, because it fits in memory, and fits on economical flash storage. It has completely flipped the equation.

http://www.commitstrip.com/en/2014/06/03/the-problem-is-not-...

The Digg thing is interesting...they pretty much took the complete opposite approach of Reddit, who basically store everything in two big SQL tables.
If you optimize for latency relational databases won't cut it.
I'm the author of the article. At Monetate, we've chosen our data warehouses to maximize throughput, rather than minimize latency. That's where something like Redshift really shines: it's great at large bulk ingests and running large queries relatively quickly, but awful at running lots of small queries quickly.

On our busiest day last year, we ingested over a quarter billion page views across all of our clients' websites. I'm sure someone has made MySQL scale to that volume, but for us Redshift has been working great for a relatively low price point.

Thank you for sharing your experience! It's always inspiring to read well-written articles such as yours!
This might be the most inaccurate statement on the internet.
OK, I should have written "if you optimize for read-access latency". Better?