by oceanplexian 967 days ago
All it takes is one microservice that starts hanging on a gRPC request, server hardware that stops doing some fundamental thing correctly, or some weird network quirk that 10x’s latency on half the switch ports in a rack, and you end up with insane, sophisticated cascade failures.

Because engineers don’t have to understand infra, what they build often spans geographies and failure domains in unanticipated, undetectable ways. In my opinion, the only antidote is a thorough understanding of your stack, down to the metal it’s running on.

1 comment

A single engineer can’t understand everything at scale.

Even at a 100-person startup I worked for, where I designed the infrastructure, set the best practices, and wrote the initial proof-of-concept code for about 15 microservices, it got to the point where I couldn’t understand everything and had to hire people to split up the responsibilities.

We sold access to microservices to large health care organizations for their websites and mobile apps. We aggregated publicly available data on providers, such as licenses and education.

Our scaling held up as we added clients that could increase demand by 20% overnight, and when a little worldwide pandemic happened in 2020, causing our traffic to spike