Hacker News
by mstump 3416 days ago
Part of the problem is that you're using databases that can't cope with failure. In large scale production systems things fail all the time. If you've got tech that can cope with failure it's not an issue.

Additionally, Docker is pretty handy when you're attempting to manage clusters consisting of thousands of nodes. In that instance enforcing best practices, automating workflows, scaling teams, auditing and preventing configuration drift are much bigger problems than a single server failing.


There is no tech in the universe that can cope with cascading failures, like ALL instances of a Docker container crashing on ALL hosts one by one in quick succession. This usually happens because an app hits an unexpected bug in the Docker disk or the Docker network stack, and this is the major source of concern I have with Docker.
Some systems cope with failure better than others. Everything you've said is also true of a DB running on top of a uniform Linux stack. From my experience (500+ large scale production deployments) this doesn't happen very often.

Does it solve all problems? No. Does it make the world a little better and is it better than monolithic single points of failure? Yes.

>> 500+ large scale production deployments

This needs to be qualified....

Did you deploy a single system 500 times, or 500 different systems? Or some combination thereof?

It's a mix. I'm a consultant who specializes in large scale distributed systems. I have some customers that have >100k production database nodes. I manage probably >50PB of data. I have designed large distributed systems for more than 100 customers.
consultant = charge > £600 a day to bring Docker to the company. Yet doesn't care when shit hits the fan 3 months later because he's already gone. In fact, he will never know about it.

By the way, how to have 100 customers => leave right after the design phase every single time. Clients add up quickly.

I do a mix of pure consulting and managed services. I typically have a 12 hour SLA for issues, and a 1 hour SLA for some customers. 24/7 support for mission critical, revenue generating systems. So no, I'm not just a talking head. It's usually me in the NOC on the hook in case things go wrong. I'm the world expert in this field; if you want things to work at scale, people call me.
Individually I charge several orders of magnitude greater than what you're quoting. I'll advise and design, do deep troubleshooting, etc. Consultants that work for me (I'm the CEO) or a large SI will do the implementation.
Nice. 100K database nodes! Is that like, Facebook or Twitter?

I hope you write about that somewhere.

Very large companies, mostly banks, retailers and telecom. Some industrial IoT and a couple governments.
Some systems fail more often than others.
Sure, but if it fails often enough that you need to prepare to deal with failure, then the number of times you invoke CleanupAfterFailure() doesn't matter so much.
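
To make that concrete, here is a minimal sketch in Python, not anything from a real deployment. The names cleanup_after_failure and run_with_recovery, the node names, and the "replacement" scheme are all made up for illustration. The point is only that once the recovery path exists, the same code handles one failure or many; it says nothing about cascading failures, where the replacements fail too.

```python
def cleanup_after_failure(node, log):
    """Hypothetical recovery hook: record the failed node and hand back a replacement."""
    log.append(node)
    return f"{node}-replacement"


def run_with_recovery(nodes, fails, log):
    """Walk the nodes; every node for which fails(node) is true takes the same cleanup path."""
    healthy = []
    for node in nodes:
        if fails(node):
            node = cleanup_after_failure(node, log)
        healthy.append(node)
    return healthy


nodes = [f"node{i}" for i in range(5)]

# Rare failures: only one node fails.
rare_log = []
rare = run_with_recovery(nodes, lambda n: n == "node3", rare_log)

# Frequent failures: every node fails, and the code path is unchanged.
frequent_log = []
frequent = run_with_recovery(nodes, lambda n: True, frequent_log)

print(len(rare_log), len(frequent_log))  # cleanup invoked 1 time vs 5 times
print(len(rare), len(frequent))          # both runs end with 5 usable nodes
```

Either way the cluster ends up with the same number of usable nodes; only the invocation count differs.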