by john_cogs 1603 days ago
GitLab team member here. We're aware of the incident and the status page has been updated. We will provide further updates on the status page as they become available.

(Edited now that the status page has been updated).

3 comments

Awesome, thank you! Godspeed!
Thanks. Seems to be back for me.
It took (by my measure) 13 minutes for a full outage to be represented on the status page.

I was under the impression that GitLab uses gitlab.com for its own work. Surely someone would have noticed within seconds that it was down?

Why have the misleading "updated a few seconds ago" text if it doesn't update on complete failure? :)

Your impression is correct. We use GitLab.com and notice these incidents as they happen.

The delay in updating status is a result of our Incident Management process [0]. We have a Communications Manager on Call (CMOC) who leads communication throughout an incident. One of their responsibilities is updating the status page. The slight delay between noticing the issue and updating the status page is the time it takes for the CMOC to get alerted, assess the situation, and write the communication that is shared on the status page.

I'm not sure how the "updated a few seconds ago" messages are generated but I'll try to find out once the incident has been resolved.

0 - https://about.gitlab.com/handbook/engineering/infrastructure...

Why are "Active Incident" and "System Wide Outage" shown on the status page with a green background? Why not red?

At first glance it looks like everything is operational with no issues.

The color is green because GitLab.com is accessible again.

"Active Incident" remains because our team is still working towards full recovery.

"System Wide Outage" is the description of the incident at its onset.

I noticed that too. Pretty confusing to read "Operational" with a green background and "System Wide Outage" on the left side.
Not a "status page" then, but merely "a page where the Communications Manager posts messages after assessing the situation and consulting/getting permission from management".
Why? Because there's a human decision involved?
After you notice, I assume you have to declare an incident, get a call going, assess the extent of the issues, get the needed people involved, and only then announce on the status page. 13 minutes isn't amazing, but it also isn't terrible. Perhaps you have better ways of updating status pages much faster without also posting more false positives.
13 minutes is pretty solid compared to any of the recent AWS outages.
It doesn't matter if each individual detects the outage, because they'll start by blaming the local source and move further up the tree rather than blaming a full system failure right off the bat. 99.9% of the time it's going to be a local failure affecting the individual.

Also, most alerting systems check multiple times before declaring a public outage; often 2 to 3 failures some seconds apart are needed.
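
To make the repeated-check idea concrete, here is a minimal Python sketch of that kind of rule. It is an illustration only, not how GitLab or any particular alerting product works: the function name `confirmed_outage` and the `failures_required` parameter are made up for this example, and the probe results are assumed to arrive as a sequence of booleans (True meaning the check passed), spaced some seconds apart by whatever runs the probes.

```python
from typing import Iterable


def confirmed_outage(probe_results: Iterable[bool], failures_required: int = 3) -> bool:
    """Return True once `failures_required` probes in a row have failed.

    `probe_results` is a hypothetical stream of health-check outcomes,
    oldest first, where True means the probe succeeded.
    """
    consecutive_failures = 0
    for ok in probe_results:
        if ok:
            # Any success resets the streak, so a single blip never alerts.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_required:
                return True
    return False


if __name__ == "__main__":
    # One isolated failure, then recovery: not declared an outage.
    print(confirmed_outage([True, False, True, True]))
    # Three failures in a row: declared an outage.
    print(confirmed_outage([True, False, False, False]))
```

The trade-off is the one discussed in this thread: a higher `failures_required` means fewer false positives but a longer delay before anything is declared.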

To add onto this, in my experience you never want a fully automated status page, for two more reasons:

1. External engineers will start to automate recovery/mitigation processes around your status page if it has real-time status.

2. You now need to bug test your status page thoroughly because of #1. It basically becomes an actual API.

That sounds like a problem for the external engineers, not for GitLab.
Users make their problems yours.