by divxflounder
2918 days ago
Great article! I'm definitely taking an action item to look into Prometheus. I own DevOps/Monitoring and Alerting for my org, and it's really cool to see how other companies skin this cat. I saw CloudWatch in the pipeline, which is an Amazon product. I know I'm going to make a very controversial statement here, but - why Amazon? At volumes like yours, you will eventually hit the point where costs skyrocket. Regarding the metrics themselves, you might already do this, but I highly recommend splitting your metrics into 50th, 95th, and 99th percentiles in your Grafana graphs. This will give you a solid idea of not only what your customers typically experience, but the edge cases as well. Do you have a regular forum for reviewing these metrics and pre-solving problems? We're still trying to solve this across multiple teams where I work, and have noticed that some teams are great at it while others are a little more reactive. Love to see this stuff :)
Re: AWS. We're not at a point where we are overburdened by AWS spending. Many things are more efficient with AWS, as we have a fairly small engineering team. We use a variety of AWS products (Aurora and Kinesis, to name a few).
Regarding metrics & percentiles - Yes, I agree. The 99th percentile is what we try to look at the most, as most other metrics tend to be deceiving.
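To illustrate the percentile point, here is a minimal sketch in plain Python. It is not the Prometheus or Grafana query behind our dashboards. The latency numbers are made up for illustration, and the `percentile` helper is a hypothetical nearest-rank implementation.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds:
# 97 fast requests and 3 very slow ones.
latencies_ms = [20] * 97 + [2000] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}

print(f"mean: {mean_ms} ms")
for p, value in summary.items():
    print(f"p{p}: {value} ms")
```

With this made-up data the mean, p50, and p95 all look healthy, and only p99 shows the 2-second requests that a few users are hitting.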
Regular forums - This is something that we need to improve on as we move forward. The blog post mostly describes the infrastructure we've built, but it takes time and effort to become a metric-driven organization.