by dmattia 867 days ago

In general, one of the goals of microservices should be that if one of the five services goes down, the other four should be able to operate in some capacity still.

In practice, this can make the math quite a bit messier, but I don't think it necessarily has been worse overall from my perspective.

So instead of a monolith that is either up (99% of the time) or down, you'll have a system that is fully up 95% of the time (using your numbers). But that remaining 5% is mostly not a full outage: for 20% of it one of your products will be running slowly, for 10% of it some new feature you launched won't work for specific customers in some specific region, etc.

At my company it makes things like SLA/SLO guarantees for "our services" pretty complicated, in that it's hard to define what uptime truly means. But overall I think the five-microservice approach, when done well, should have less than 1% of complete downtime, at the cost of more partial downtime.
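To make the arithmetic concrete, here is a rough sketch of where the 95% figure comes from. It assumes each of the five services is up 99% of the time and that failures are independent; both are simplifying assumptions, and real outages are often correlated. The function name is made up for illustration.

```python
def availability_breakdown(per_service_uptime: float, n_services: int) -> dict:
    """Split time into fully up, partially degraded, and fully down.

    Assumes every service has the same uptime and fails independently
    of the others (an idealization, not a measured property).
    """
    fully_up = per_service_uptime ** n_services
    fully_down = (1 - per_service_uptime) ** n_services
    partially_degraded = 1 - fully_up - fully_down
    return {
        "fully_up": fully_up,
        "partially_degraded": partially_degraded,
        "fully_down": fully_down,
    }


# Five services at 99% each, matching the numbers used in this thread.
breakdown = availability_breakdown(0.99, 5)
for state, probability in breakdown.items():
    print(f"{state}: {probability:.4%}")
```

Under those assumptions almost all of the roughly 5% of non-perfect time is partial degradation, and complete downtime is far below 1%. That only holds if the services really can operate without each other.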

VyseofArcadia 867 days ago

> In general, one of the goals of microservices should be that if one of the five services goes down, the other four should be able to operate in some capacity still.

This is an excellent point, but what brought this to my mind was that I don't think the microservices in the Netflix article have this property. It looks to me if any of the VIS, CAS, LGS, or VES go down, then the whole service is effectively down.

Indeed, in my own career what I've seen is that if one microservice goes down the user won't be seeing 500 errors or friends, but the service will be completely useless to the user. You've just gone from a hard error to a spinning load icon, which might in fact be an even worse user experience.

It could be argued that this is just "you're doing microservices wrong", but then we start getting into no true Scotsman territory.
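One way to picture the "hard error vs. spinning load icon" difference is a caller that puts a deadline on each downstream call and returns an explicit degraded response when the deadline passes. This is only a sketch: `call_with_deadline`, `hung_dependency`, and the response strings are hypothetical names, not anything from the Netflix article.

```python
import concurrent.futures
import time

# Shared worker pool for downstream calls (size chosen arbitrarily).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_deadline(func, timeout_s, degraded_response):
    """Run func in a worker thread; give up after timeout_s seconds.

    Returns degraded_response if func times out or raises, so the
    caller can render something instead of waiting indefinitely.
    """
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; an already running call is not interrupted
        return degraded_response
    except Exception:
        return degraded_response


def hung_dependency():
    # Hypothetical stand-in for a microservice that is slow to respond.
    time.sleep(0.5)
    return "recommendations"


response = call_with_deadline(hung_dependency, 0.05, "recommendations unavailable")
print(response)
```

Whether the degraded response is actually useful to the user is the real question here; the deadline only turns an indefinite wait into a fast, explicit answer.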

geodel 867 days ago

> Indeed, in my own career what I've seen is that if one microservice goes down the user won't be seeing 500 errors or friends

Exactly. What it does is that the first few hours of the triage call go to people claiming "well, my service is up and the issue is somewhere else". So just finding which service failed takes crucial hours that should go to fixing the failing service.

But in a world where Micro Service Incident Commanders can pinpoint a failing service among 1000 microservices within seconds on their vast 80-inch monitoring consoles and direct resolution admirals to fix it in the next 15 minutes, it might just all work fine.

fragmede 867 days ago

The problem comes when it's a distributed system and it's the interaction between multiple systems that's causing the problem, not a specific microservice being down. Something got upgraded and the message size changed in an unexpected and incompatible way that worked fine in testing.

threeseed 867 days ago

> It looks to me if any of the VIS, CAS, LGS, or VES go down,

But the whole point is that by splitting it into microservices you can efficiently and optimally scale each component individually. So it's extremely rare that VIS, for example, would go down entirely. And because Netflix has tools like Hystrix, if one instance is unavailable it will seamlessly route to another one.

And even if you push bad code, there are techniques like blue/green and canary releases which can be used.
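For readers who haven't used Hystrix: it is a Java fault-tolerance library built around the circuit-breaker pattern with fallbacks. The sketch below shows that pattern in Python. It is not Hystrix's API, and the class, parameter names, and thresholds are invented for illustration. Routing to a different healthy instance is usually a separate load-balancing concern and is not shown.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: fail fast to a fallback while a dependency is unhealthy."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: skip the dependency entirely
            # Otherwise half-open: let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def unavailable_instance():
    # Hypothetical dependency that is currently failing.
    raise ConnectionError("instance unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
results = [breaker.call(unavailable_instance, lambda: "cached response") for _ in range(3)]
print(results)
```

After the second failure the breaker opens, so the third call returns the fallback without touching the failing dependency at all.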