by mind-blight 1265 days ago

I really appreciate the context. We have an SPA with the frontend deployed on vercel and a GraphQL backend hosted on fly. The outage yesterday manifested as 502 errors being delivered to users on the frontend. We had another outage alert at 08:00 PST this morning that lasted about 5-10 minutes. It seemed like the same issue, so we didn't report another incident.

I really like fly, and I think you all are building a great product, but it's looking likely that we're going to migrate off of it. The biggest driver of that has been communication and issues with the status page. Specifically,

- When an incident occurs, we're often among the first to report it on the forum. Over the last month, the status page has lagged pretty significantly behind the incidents. This makes it feel like we're discovering the issue before fly (I don't know if that's true, but that's the perception). Given that our automated tools are alerting us, it's disconcerting to feel like we're keeping a closer eye on our box's health than our cloud provider is (again, this is perception based on communication lag, not necessarily reality).

- We have had multiple outages over the last month. In the middle of an outage, while there is an incident banner displayed at the top of the page, all systems show green with 99.98% or 99.99% uptime. That makes us distrust the numbers on the status page, and it reinforces the above perception that fly's systems aren't being accurately monitored. Even now, the status page shows 100% uptime for all systems yesterday and today, which is not true.

- We emailed yesterday about our frustrations and concerns - specifically talking about the disconnect between fly's status page and the multiple outages. We explicitly called out the two points above, and how the communication up to this point has been "We've implemented a fix and are monitoring it". We asked for more details about what occurred, and what was being done to mitigate it in the future. The response was pretty boilerplate: "We're sorry you're frustrated. Here are some credits. We've implemented a fix and are monitoring it. Please let us know if you are still encountering issues."

The incidents were a problem, but the disconnect between what was communicated and what occurred, across multiple channels, is what's driving us to leave. Here's what likely would have convinced us to stay:

- Over-communicate during the incident. I'd prefer to see more status updates rather than fewer.

- Provide clear, proactive incident notification. Even with automated monitoring, things will slip through the cracks, but everything over the last month has felt reactive.

- Make sure the status page clearly reflects reality. If the system is down and everything shows green, then I'm 1) frustrated, and 2) wondering what else is slipping through the cracks.

- Publish retro docs or incident reports after an incident. Specifically, report what changes are being made to prevent an outage going forward.

- Train the support staff to communicate directly with developers. Boilerplate emails that focus on empathizing rather than informing are generally frustrating, especially if they don't actually answer the questions being asked. I get that it's not reasonable to expect a support person to have an in-depth technical conversation, but this is where public incident reports (or live incident pages) can be really helpful.

I think you all are making a great product, but the issues with alerting, monitoring, and communication are too impactful for our production application. I'm confident you'll figure it out, but it's unlikely that we're going to wait.