by jadams3 2405 days ago
This. I have a hard time measuring it, but ever since we put real work into error reporting, our weekend sleep factor has greatly improved.

For a complex system, though, don't underestimate how hard this is to do:
- Every cloud service needs to be routed to a common service.
- All of your software, in every language, even that cool Go experiment.
- All of the third-party software.
- Logs all have to agree on a format, and JSON is not always an option.
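To make the "agree on a format" point concrete, here is a minimal sketch of a normalizer that accepts either JSON lines or a plain-text layout and maps both to one record shape. The field names (`level`, `msg`, `time`), the plain-text layout, and the `normalize` helper are all assumptions for illustration, not anything from a real setup.

```python
import json
import re

# Hypothetical plain-text layout: "<timestamp> <LEVEL> <message>"
PLAIN = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def normalize(line: str, source: str) -> dict:
    """Map one raw log line to a common record, always keeping the raw text."""
    record = {
        "source": source,
        "level": "UNKNOWN",
        "ts": None,
        "message": line,
        "raw": line,
    }
    try:
        parsed = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        parsed = None

    if isinstance(parsed, dict):
        # JSON logs: field names differ per service, so try a few (assumed) keys.
        record["level"] = str(parsed.get("level", "UNKNOWN")).upper()
        record["ts"] = parsed.get("ts") or parsed.get("time")
        record["message"] = str(parsed.get("msg") or parsed.get("message") or line)
        return record

    match = PLAIN.match(line)
    if match:
        record.update(ts=match["ts"], level=match["level"], message=match["msg"])
    # Anything unrecognized falls through with level UNKNOWN and the raw line intact.
    return record
```

Keeping `raw` on every record means a line that fits neither layout still reaches the common service instead of being dropped.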

Finally, there's justifying the time spent fixing things that have no observable side effects. Most cloud stuff is reliable against first-order failures and so is tolerant of a lot; it's designed that way. But once the wheels come off, and they will come off, buckle up if you haven't been fixing those errors. If you aren't clean on second-order failures, you're in for a rough ride.

1 comment

We use AWS, and one benefit of their hosted Elasticsearch is that they can build you a Lambda that syncs CloudWatch logs to ES, handling a variety of different formats. So we have our Beanstalk web requests, some Lambda infra, our main web backend, etc. all synced to ES with very little effort.
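For anyone curious what such a sync function has to do, here is a rough sketch. It is not the code AWS generates, just an illustration. It assumes the standard CloudWatch Logs subscription payload (base64-encoded, gzip-compressed JSON under `awslogs.data`) and builds an Elasticsearch bulk request body. The function names, document field names, and index name are made up, and actually sending the body to ES is left out.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> dict:
    """Decode the base64 + gzip payload CloudWatch Logs hands to a subscriber."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def to_bulk_body(payload: dict, index: str) -> str:
    """Build a newline-delimited bulk body: one action line plus one document per log event."""
    # Control messages carry no log data, so there is nothing to index.
    if payload.get("messageType") != "DATA_MESSAGE":
        return ""
    lines = []
    for log_event in payload["logEvents"]:
        lines.append(json.dumps({"index": {"_index": index, "_id": log_event["id"]}}))
        lines.append(json.dumps({
            "@timestamp": log_event["timestamp"],   # epoch milliseconds
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
            "message": log_event["message"],        # left unprocessed on purpose
        }))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

Using the log event `id` as the document `_id` means a retried delivery overwrites the same document rather than duplicating it.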

You do have the downside that the logs don’t end up with, e.g., a nicely unified structure, but that also has the upside that the structure is closer to what the dev is used to, so nobody ever needs to go back to CloudWatch or any other logs to get more details or a less processed message. The other downside is that you have to write a different monitor for each index, though this has the upside that you can also have different triggers per index. In our small team we just message different Slack channels, which makes for a nice lightweight opt in/out for each error type.
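The per-index monitor idea can be sketched roughly like this. The index names, channel names, and trigger conditions are hypothetical, and posting to Slack is left out, so the function only decides which channel each matching message would go to.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Monitor:
    index: str                       # which index this monitor watches
    channel: str                     # Slack channel to notify (hypothetical names)
    trigger: Callable[[dict], bool]  # per-index rule for what counts as an error


# One monitor per index, each with its own trigger and its own channel.
MONITORS = [
    Monitor("beanstalk-web", "#errors-web", lambda doc: " 500 " in doc["message"]),
    Monitor("lambda-infra", "#errors-lambda", lambda doc: "Task timed out" in doc["message"]),
]


def alerts_for(index: str, docs: List[dict]) -> List[Tuple[str, str]]:
    """Return (channel, message) pairs for documents that trip this index's monitors."""
    alerts = []
    for monitor in MONITORS:
        if monitor.index != index:
            continue
        for doc in docs:
            if monitor.trigger(doc):
                alerts.append((monitor.channel, doc["message"]))
    return alerts
```

Since each error type lands in its own channel, joining or muting a channel is the whole opt in/out mechanism.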

It’d definitely be tricky to get everything aligned in, e.g., the same JSON format, but this sort of middle ground isn’t too hard and still has benefits. You just need to already be syncing to CloudWatch in any format, which if you’re on AWS you probably are.