| HN Mirror



	by mstump 3462 days ago
	Part of the problem is that you're using databases that can't cope with failure. In large-scale production systems, things fail all the time. If you've got tech that can cope with failure, it's not an issue. Additionally, Docker is pretty handy when you're managing clusters of thousands of nodes. At that scale, enforcing best practices, automating workflows, scaling teams, auditing, and preventing configuration drift are much bigger problems than a single server failing.


user5994461 3462 days ago

There is no tech in the universe that can cope with cascading failures, like ALL instances of a Docker container crashing on ALL hosts one by one in quick succession. This usually happens because an app hits an unexpected bug in the Docker disk or network stack, and this is my major concern with Docker.


mstump 3462 days ago

Some systems cope with failure better than others. Everything you've said is also true of a DB running on top of a uniform Linux stack. In my experience (500+ large-scale production deployments), this doesn't happen very often.

Does it solve all problems? No. Does it make the world a little better and is it better than monolithic single points of failure? Yes.


carterehsmith 3462 days ago

>> 500+ large scale production deployments

This needs to be qualified....

Did you deploy a single system 500 times, or 500 different systems, or some combination thereof?


mstump 3461 days ago

It's a mix. I'm a consultant who specializes in large-scale distributed systems. I have some customers that have >100k production database nodes. I manage probably >50PB of data. I have designed large distributed systems for more than 100 customers.


user5994461 3461 days ago

consultant = charge > £600 a day to bring Docker to the company. Yet doesn't care when shit hits the fan 3 months later because he's already gone. In fact, he will never know about it.

By the way, how to have 100 customers => leave right after the design phase every single time. Clients add up quickly.


mstump 3455 days ago

I do a mix of pure consulting and managed services. I typically have a 12 hour SLA for issues, and a 1 hour SLA for some customers. 24/7 support for mission-critical, revenue-generating systems. So no, I'm not just a talking head. It's usually me in the NOC on the hook in case things go wrong. I'm the world expert in this field; if you want things to work at scale, people call me.


mstump 3455 days ago

Individually, I charge several orders of magnitude more than what you're quoting. I'll advise and design, do deep troubleshooting, etc. Consultants that work for me (I'm the CEO) or a large SI will do the implementation.


carterehsmith 3460 days ago

Nice. 100K database nodes! Is that like, Facebook or Twitter?

I hope you write about that somewhere.


mstump 3455 days ago

Very large companies, mostly banks, retailers and telecom. Some industrial IoT and a couple of governments.


user5994461 3462 days ago

Some systems fail more often than others.


closeparen 3462 days ago

Sure, but if it fails often enough that you need to prepare to deal with failure, then the number of times you invoke CleanupAfterFailure() doesn't matter so much.
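
To make that concrete, here is a rough sketch, not anyone's real system. The names `cleanup_after_failure`, `run_pass`, and the node numbering are all made up for illustration. The point is only that the recovery path is the same code whether it runs once or fifty times, so the engineering cost is in having it at all:

```python
def cleanup_after_failure(node, recovered):
    """Hypothetical recovery hook: the same code path for every failure."""
    recovered.append(node)


def run_pass(nodes, failing, recovered):
    """Walk the nodes once and invoke the hook for each node that failed."""
    for node in nodes:
        if node in failing:
            cleanup_after_failure(node, recovered)
    return recovered


nodes = range(1000)

# A stack that fails rarely: one node out of 1000.
rare = run_pass(nodes, {7}, [])

# A stack that fails more often: every 20th node.
frequent = run_pass(nodes, set(range(0, 1000, 20)), [])

print(len(rare), len(frequent))
```

Both runs go through the identical hook; only the call count differs. This says nothing about correlated or cascading failures, where the hook itself may not get a chance to run.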
