| HN Mirror


by onion2k 1535 days ago

I don't think you can say that. They might know what happened, but it could still be hard to recover from.

Catastrophes happen. If you deploy something with a destructive migration that you can't easily roll back without restoring from a backup, and you hit a problem you didn't see in QA, then you're in for a bad time. This is compounded if you also discover that your backup process hasn't worked properly for a while. At that point you're facing some serious downtime and a dilemma: try to fix the problem in place, or roll back to the last working backup.

There's a good reason why grumpy old devs like me insist on writing docs, having playbooks, testing everything including non-code stuff, and we still fear major deploys. I have scars from exactly those sorts of disasters.

Hopefully the devs at Circle get past this with as little stress as possible, and they learn from what went wrong.

1 comments

ismayilzadan 1535 days ago

You are totally right: catastrophes happen, and I also hope they get past this with as little stress as possible. My assumption came from the lack of detail in their updates for an incident that has been going on for 6 hours. Maybe a little more detail would have given me a hint that everything is under control, but I didn't get that feeling when I read their updates.


plumefar 1535 days ago

When facing such large-scale issues, communicating properly is very hard: several teams might be investigating several possible root causes in parallel, and you might change your mind over time about which root cause is the most probable.

So you might end up communicating something ("we think it comes from X, we're fixing it that way"), only to find yourself changing your mind a few minutes later.

Changing your message is usually not well perceived, even though that's actually normal during an investigation.

I would not like to be in charge of the communication. Finding the balance between saying too much or too little is tricky.


a1445c8b 1535 days ago

It’s probably just because they had the choice of either focusing all their energy on fixing the problem asap or setting aside some of it to write a more detailed description that’s also fit for public consumption. Given the severity, they probably chose the former, since whatever detailed, reassuring statement they put out there isn’t going to be actionable anyway.


z00b 1535 days ago

Hi folks. As the CircleCI CTO, I appreciate your patience here and all the feedback. It's true that we are focused on getting customers moving again over sharing more detailed information, but we will aim to do better at providing a bit more in our updates. status.circleci.com provides both real-time updates on how we're tackling outages and more detailed incident reports. We will post more information there about this incident once we are on the other side and have comprehensive detail.
