by lawrjone 1089 days ago
I work at https://incident.io/ where we build a product that aims to help with this.

I'll call out some of the points you make that we can help with:

1. Our incidents are created automatically from triggers like PagerDuty incidents, OpsGenie alerts, etc., and we pull any information we find into the Slack channel and make it easy to get at (pinning it to the channel, setting it automatically as a channel bookmark, etc.). That tends to help when you jump into an incident fresh, because everything is already in one place.

2. We don't do much matching against previous incidents (yet) but it's easy to search for similar incidents in our dashboard. Unlike alerts, incidents have a history of updates and curated detail about how they were resolved, so a history of similar incidents is genuinely useful to you if you're facing a similar problem.

5. We have an in-product catalog where you can store features, services, etc. and who owns what. Most customers ask responders 'what is affected?' and have us automatically page the owner or say who owns the feature, which really helps speed up response. Some of our customers have 5k+ services; there's no way humans can remember who owns what at that scale.

6. This is our bread and butter: we plug into everything (status pages, including a native status page we offer ourselves, Jira, GitHub, whatever) to make sure incident updates are pushed everywhere. The idea is that responders update the incident and we share that update everywhere it needs to go, instead of asking people to remember when they're busy responding.

7. Our incidents help massively with this. Provided responders are pushing updates to their incidents, an on-call handover turns into a super-quick review of the incidents in our dashboard and a review of the updates/outstanding actions.
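To make point 5 concrete, here's a rough sketch of the idea behind an ownership catalog: map each feature or service to an owning team, then turn a "what is affected?" answer into the list of teams to page. This is purely illustrative and not incident.io's actual API or data model; the names `CatalogEntry`, `CATALOG` and `owners_for` and the sample entries are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One catalog row: a feature or service and the team that owns it."""
    name: str
    owner_team: str


# Hypothetical catalog contents; a real catalog would hold thousands of entries.
CATALOG = {
    "checkout": CatalogEntry("checkout", "payments"),
    "login": CatalogEntry("login", "identity"),
    "search": CatalogEntry("search", "discovery"),
}


def owners_for(affected, catalog=CATALOG):
    """Return (teams_to_page, unknown_names) for the affected features.

    Teams are de-duplicated and kept in first-seen order. Names that are
    not in the catalog are returned separately so a human can triage them.
    """
    teams = []
    unknown = []
    for name in affected:
        entry = catalog.get(name.strip().lower())
        if entry is None:
            unknown.append(name)
        elif entry.owner_team not in teams:
            teams.append(entry.owner_team)
    return teams, unknown


if __name__ == "__main__":
    teams, unknown = owners_for(["Checkout", "login", "billing"])
    print("page:", teams)
    print("not in catalog:", unknown)
```

The point is just that the lookup is mechanical once ownership is written down somewhere, so nobody has to remember it mid-incident.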

tl;dr: a lot of what you've described can be fixed or helped massively by good tooling. Even stuff like runbooks being out-of-date is improved by tools that connect people to runbooks more often: the more reliably useful runbooks are, the more incentivised people are to update them.

It won't solve everything, but if you're on the lookout for solutions you should definitely check out incident.io and similar tools.