by VyseofArcadia 867 days ago
> In general, one of the goals of microservices should be that if one of the five services goes down, the other four should be able to operate in some capacity still.

This is an excellent point, but what brought it to mind is that I don't think the microservices in the Netflix article have this property. It looks to me if any of the VIS, CAS, LGS, or VES go down, then the whole service is effectively down.

Indeed, in my own career what I've seen is that if one microservice goes down the user won't be seeing 500 errors or friends, but the service will be completely useless to the user. You've just gone from a hard error to a spinning load icon, which might in fact be an even worse user experience.

It could be argued that this is just "you're doing microservices wrong", but then we start getting into no true Scotsman territory.
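
To make the "spinning load icon" failure mode concrete, here is a minimal Python sketch of bounding a downstream call with a deadline and returning a degraded answer instead of hanging. Everything in it (`call_with_deadline`, `slow_recommendations`, the timeout values) is a hypothetical name chosen for illustration, not something taken from the Netflix article.

```python
import concurrent.futures
import time

# Shared worker pool for outbound calls (size is an arbitrary assumption).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_deadline(func, timeout_s, degraded):
    """Run func() but give up after timeout_s seconds.

    Returns func's result on success, or `degraded` if the call
    times out or raises, so the caller never waits indefinitely.
    """
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the task has not started yet;
        # a running call keeps going in the background.
        future.cancel()
        return degraded
    except Exception:
        # The dependency answered, but with an error.
        return degraded


def slow_recommendations():
    """Hypothetical dependency that is up but far too slow."""
    time.sleep(0.3)
    return ["personalised", "list"]


# The page renders with a generic list instead of spinning.
rows = call_with_deadline(slow_recommendations, 0.05, ["generic", "list"])
```

Whether a degraded answer is actually useful to the user is the real question; the sketch only shows that a slow dependency does not have to turn into an endless wait.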

2 comments

> Indeed, in my own career what I've seen is that if one microservice goes down the user won't be seeing 500 errors or friends

Exactly. What happens is that the first few hours of the triage call go to people claiming "well, my service is up and the issue is somewhere else". So just finding which service failed takes crucial hours that should have gone to fixing it.

But in a world where Micro Service Incident Commanders can pinpoint a failing service among 1000 microservices within seconds on their vast 80-inch monitoring consoles and direct resolution admirals to fix it in the next 15 minutes, it might just all work fine.

The problem comes when it's a distributed system, and it's the interaction between multiple systems that's causing the problem, not a specific microservice being down. Something got upgraded and the message size changed in an unexpected and incompatible way that worked fine in testing.

> It looks to me if any of the VIS, CAS, LGS, or VES go down,

But the whole point is that by splitting it into micro-services you can efficiently and optimally scale each component individually. So it's extremely rare that VIS, for example, would go down entirely. And because Netflix has tools like Hystrix, if one instance is unavailable it will seamlessly route to another one.
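
For readers who haven't used it: Hystrix is a Java library built around circuit breaking and fallbacks, while picking a different healthy instance is usually the load balancer's job. The following is a rough Python sketch of the circuit-breaker-with-fallback pattern, not Hystrix's actual API. The class name, the thresholds, and the injected `clock` parameter are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after consecutive failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open
        self.clock = clock              # injectable for testing
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the remote call and fail fast.
                return fallback()
            # Cool-down elapsed: allow one trial call (half-open).
            # A single further failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit fully.
        self.failures = 0
        return result
```

A caller would wrap each outbound request, e.g. `breaker.call(fetch_from_vis, use_cached_result)`, where both function names are placeholders. Note that the fallback still has to return something the user can live with, which is the original objection.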

And even if you push bad code, there are techniques like blue/green and canary releases that can be used.
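
As a rough illustration of the canary idea, here is a small Python sketch that sends a stable slice of users to the new build by hashing their ID. The function name, the version labels, and the percentage are hypothetical; real deployments usually do this in the load balancer or service mesh rather than in application code.

```python
import hashlib


def pick_version(user_id: str, canary_percent: int) -> str:
    """Route roughly canary_percent% of users to the canary build.

    Hashing the user ID keeps each user on the same version
    across requests, unlike a random choice per request.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # 0..99
    return "canary" if bucket < canary_percent else "stable"
```

Starting with a small `canary_percent` and raising it only while error rates stay flat limits how many users a bad push can reach.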