Hacker News new | ask | show | jobs
by moonpolysoft 5713 days ago
Wherein Stonebraker misunderstands distributed systems engineering completely.
1 comments

I understood what he said as:

Solving the problems of distributed systems is incredibly hard and not necessary if all you need is scalability and high availability. Better think of your problems in terms of engineering trade-offs and not distributed theory and understand what you are giving up and what this gains you.

When you decide to give up consistency, your application can no longer assume consistency ever. Giving up availability in case of network partition means few extra minutes of downtime a year.

I don't think he completely misunderstood distributed systems - I think he decided to completely side-step the entire field.

> Solving the problems of distributed systems is incredibly hard and not necessary if all you need is scalability and high availability.

Once one scales beyond a single node the system becomes distributed. Then by definition one must deal with distributed systems problems in order to achieve scale beyond the capabilities of a single node.

> When you decide to give up consistency, your application can no longer assume consistency ever. Giving up availability in case of network partition means few extra minutes of downtime a year.

Depends on the application. Giving up availability might mean cascading failures throughout your entire application. For instance if the datastore is unavailable for writes then any kind of queueing systems built around the DB (a common design pattern) run the risk of overflow during the downtime.

And I would make the argument that once an application scales beyond a single datacenter it cannot help but give up strict consistency under error conditions.

> I don't think he completely misunderstood distributed systems - I think he decided to completely side-step the entire field.

If he didn't misunderstand them then he is purposefully ignoring the hard problems. Which is worse?

> Once one scales beyond a single node the system becomes distributed. Then by definition one must deal with distributed systems problems in order to achieve scale beyond the capabilities of a single node

I meant that one doesn't need to solve the general problem of distributed systems. Sharding is a common way to scale avoiding most of the problems generally associated with distributed systems. Scaling within LAN is easier than across data centers. You can assume no malicious traffic between your servers and suddenly solving the byzantine generals problem is far easier.

Purposefully ignoring really hard problems can be a very good engineering practice.

I purposefully ignored the tornado and so it did not hit my datacenter, tear off a section of the roof, kill all power sources, and drench my servers. Hard problems: solved.
Thanks! My app was in that datacenter too. I mean, it had replicated mongodb instances, and well balanced app servers, and nodes going away have no discernable affect on users. Turns out tho, that with all that distributed engineering, I didn't find out that the hosting company doesn't put your nodes in various data-centers. That tornado would have taken down my service during peak hours.

I know you will try to write that off as a "you get what you diserve" but I challenge you to go ask people if their apps would survive a tornado to the data-center. Many of them will say "sure its in the cloud!" Then drop the killer question on them... "How many different data centers are your nodes running on right now". Most will say "i dont know". Some will say "My host has many data centers" (note this doesn't answer the question). A few will actually have done the footwork.

Also, the scenario you describe is as easily mitigated with hot failovers and offsite backups. This probably qualifies as distributed engineering, but only is only the same as the above discussions in the most pedantic senses.

As MongoDB did not exist at the time, it seems unlikely. Such things happen more than we might like, of course!

"Also, the scenario you describe is as easily mitigated with hot failovers and offsite backups."

This is a sadly wrong, though common, belief. There is exactly one way to know that a component in your infrastructure is working: you are using it. There is no such thing as a "hot failover": there are powered on machines you hope will work when you need them. Off-site backups? Definitely a good idea. Ever tested restore of a petabyte?

Here's a simple test. If you believe either of the following are true, with extremely high probability you have never done large-scale operations:

1) There exists a simple solution to a distributed systems problem.

2) "Failover", "standby", etc. infrastructure can be relied upon to work as expected.

Extreme suffering for extended periods burns away any trust one might ever have had in either of those two notions.

Cool story, bro.
Sharding puts you in the same risk category as any other distributed database. Just ask the engineers and ops people at twitter, foursquare, and any number of other web companies that have dealt with sharded databases at scale. Also: sharding is eventually consistent, except you don't get any distributed counter primitives to help figure out which replica is telling the truth.
There's a difference between what ops people see and what developers see. If the ops people have the same headache no matter what, why force the developers to think about consistency?
Good question! The answer is that the "If..." evaluates to false: ops (and by that I hope you mean something more than facilities staff) has a far simpler set of problems with which to deal when the software is designed, implemented, and tested in accordance with physical constraints. I, too, would like a gold-plated, unicorn pony that can fly faster than light, but in the mean time, I'm writing and using software that produces fewer operational and business headaches. Some of that includes eventually consistent databases.
Nice summary.