Hacker News new | ask | show | jobs
by divxflounder 2918 days ago
Great article! I'm definitely taking an action item to look into Prometheus. I own DevOps/Monitoring and Alerting my org and it's really cool to see how other companies skin this cat.

I saw Cloudwatch in the pipeline, which is an Amazon product. I know I'm going to make a very controversial statement here, but - why Amazon? With volumes like yours, your scale will eventually hit the point where your cost skyrockets.

Regarding the metrics themselves, you might already do this, but I highly recommend splitting your metrics into a 50th, 95th, and 99th percentile in your Grafana graphs. This will give you a solid idea of not only what your customers experience on average, but edge cases as well.

Do you have a regular forum with how you are reviewing said metrics and pre-solving problems? We're still trying to solve this in multiple teams where I work and have noticed that some teams are great at it and other teams are a little more reactive.

Love to see this stuff :)

1 comments

One of the authors here. Thanks for enjoying the article!

Re: AWS. We're not at a point where we are overburdened by the AWS spending. Many things are more efficient with AWS, as we have a fairly small engineering team. We use various different AWS products (Aurora, Kinesis, to name a few) that we are utilizing.

Regarding metrics & percentiles - Yes I agree. 99th percentile is what we try to look at the most, as most other metrics tend to be deceiving.

Regular forums - This is something that we need to improve on as we move forward. The blog post mostly describes the infrastructure we've built, but it takes time and effort to become a metric-driven organization.

Pretty unofficial here, but I prefer engineering channel to biz dev channel... Drop me a note, loop in whoever would be interested? I’ve been meaning to get our companies better acquainted — your fantastic write up reminded me.