by detkin 1795 days ago
That's a great question. Change failure rate and MTTR are the two metrics teams have the most trouble getting a real handle on. We've found that different teams define change failure and MTTR in widely differing ways. Some customers just want to track incidents, while others use team KPIs as their definition of failure.

Today, you can manually update the status of a deploy to incident, rollback, unhealthy, or ailing. This lets you "correct" data that Sleuth may have gotten wrong via its integrations with Datadog or your incident management system. Right now the correction is at the deploy level, but we have more control coming soon so that you can override any period of time as having been in a specific state.
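
To make the deploy-level correction concrete, here is a rough sketch of how a manual override could feed into a change failure rate. This is not Sleuth's API or data model: the `Deploy` class, its field names, the "healthy" status, and the choice of which statuses count as failures are all assumptions for illustration. The failure set is a parameter because, as noted above, teams define failure differently.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: every non-healthy status named in the comment counts as a failure.
# "healthy" itself is an assumed name for the default state.
FAILURE_STATUSES = {"incident", "rollback", "unhealthy", "ailing"}


@dataclass
class Deploy:
    """Hypothetical deploy record; not Sleuth's actual schema."""
    name: str
    detected_status: str                  # what the integrations reported
    manual_status: Optional[str] = None   # deploy-level correction, if any

    @property
    def status(self) -> str:
        # A manual correction takes precedence over the detected value.
        if self.manual_status is not None:
            return self.manual_status
        return self.detected_status


def change_failure_rate(deploys, failure_statuses=FAILURE_STATUSES):
    """Fraction of deploys whose effective status is in failure_statuses."""
    if not deploys:
        return 0.0
    failed = sum(1 for d in deploys if d.status in failure_statuses)
    return failed / len(deploys)


deploys = [
    Deploy("d1", "healthy"),
    Deploy("d2", "incident", manual_status="healthy"),   # false alarm, corrected
    Deploy("d3", "healthy", manual_status="rollback"),   # missed failure, corrected
    Deploy("d4", "ailing"),
]

print(change_failure_rate(deploys))                          # all four statuses
print(change_failure_rate(deploys, {"incident", "rollback"}))  # narrower definition
```

A team that only cares about incidents and rollbacks would pass the narrower set, so the same deploy history yields a different rate depending on the definition chosen.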