by ertian 1026 days ago
Oof, yes. I used to be an SRE at Google, with oncall responsibility for dozens of servers maintained by a dozen or so dev teams.

Trying to track down issues with requests that crossed or interacted with 10-15 services, when _all_ those services had logs full of 'normal' errors (that the devs had learned to ignore) was...pretty brutal. I don't know how many hours I wasted chasing red herrings while debugging ongoing prod issues.


We're using AWS X-Ray for this: every service passes on and logs the X-Ray trace identifier generated at first entry into the system, which is pretty helpful. And yes, there should be consistent log handling / monitoring. Depending on the service, we distinguish between the error log level (= expected user errors) and the critical log level (makes our monitor go red).
It often isn't as simple as using a correlation identifier and looking at logs across the service infrastructure. If you have a misconfiguration or hardware issue, it may well be intermittent and only visible as an error in a log before or after the request. The response itself just has incorrect data inside a properly formatted envelope.
I guess that's one of the advantages of serverless: by definition there can be no unrelated error in state beyond the request (because there is no such state), except for the infrastructure definition itself. But a misconfig there always shows up as an error when calling the particular resource; at least I haven't seen anything else yet.
That's assuming your "serverless" runtime is actually the problem.