	by john_cogs 1603 days ago
	GitLab team member here. We're aware of the incident and the status page has been updated. We will provide further updates on the status page as they become available. (Edited now that the status page has been updated).

3 comments

totaldude87 1603 days ago

Awesome, thank you! Godspeed!

hackerlytest 1603 days ago

Thanks. Seems to be back for me.

Dayshine 1603 days ago

It took (by my measure) 13 minutes for a full outage to be represented on the status page.

I was under the impression that GitLab uses gitlab.com for their own work. Surely someone would have noticed within seconds that it was down?

Why have the misleading "updated a few seconds ago" text if it doesn't update on complete failure? :)

john_cogs 1603 days ago

Your impression is correct. We use GitLab.com and notice these incidents as they happen.

The delay in updating status is a result of our Incident Management process [0]. We have a Communications Manager on Call (CMOC) who leads communication throughout an incident. One of their responsibilities includes updating the status page. The slight delay between noticing the issue and updating the status page is a result of the time it takes for the CMOC to get alerted, assess the situation, and write the communication that is shared on the status page.

I'm not sure how the "updated a few seconds ago" messages are generated but I'll try to find out once the incident has been resolved.

0 - https://about.gitlab.com/handbook/engineering/infrastructure...

linuxdeveloper 1603 days ago

Why are "Active Incident" and "System Wide Outage" shown on the status page with a green background? Why not red?

At first glance it looks like everything is operational with no issues.

john_cogs 1603 days ago

The color is green because GitLab.com is accessible again.

"Active Incident" remains because our team is still working towards full recovery.

"System Wide Outage" is the description of the incident at its onset.

caioariede 1603 days ago

I noticed that too. Pretty confusing to read "Operational" with a green background and "System Wide Outage" on the left side.

rasz 1603 days ago

Not a "status page" then, but merely "a page where the Communications Manager posts messages after assessing the situation and consulting/getting permission from management".

Lutger 1599 days ago

Why? Because there's a human decision involved?

vasco 1603 days ago

After you notice, I assume you have to declare an incident, get a call going, assess the extent of the issues, get the needed people involved, and only then announce on the status page. 13 minutes isn't amazing but it also isn't terrible. Perhaps you have better ways of keeping status pages updated much faster without also posting more false positives.

aejnsn 1603 days ago

13 minutes is pretty solid compared to any of the recent AWS outages.

pixl97 1603 days ago

It doesn't matter if each individual detects the outage, because they'll start by blaming the local source and move further up the tree rather than blame a full system failure right off the bat. 99.9% of the time it's going to be a local failure affecting the individual.

Also, most alerting systems check multiple times before declaring a public outage; often 2 to 3 failures some seconds apart are needed.
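
For illustration, the "N failures in a row" rule could be sketched roughly like this in Python. The names `OutageDetector` and `first_alert_index` and the default threshold of 3 are made up for the example; this is not how any particular alerting product or GitLab's monitoring works.

```python
from dataclasses import dataclass


@dataclass
class OutageDetector:
    """Declares an outage only after `threshold` consecutive failed checks.

    Hypothetical sketch: real alerting systems also space checks out in
    time and often probe from several locations.
    """
    threshold: int = 3  # assumed value, based on "2 to 3 failures" above
    consecutive_failures: int = 0

    def record(self, check_ok: bool) -> bool:
        """Record one probe result; return True while an outage is declared."""
        if check_ok:
            # Any success resets the streak, so isolated blips never alert.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


def first_alert_index(results, threshold=3):
    """Return the index of the probe that first triggers an alert, or None."""
    detector = OutageDetector(threshold=threshold)
    for i, ok in enumerate(results):
        if detector.record(ok):
            return i
    return None
```

With probes a few seconds apart, a single failed check between successes never alerts; only an unbroken run of failures does, which is one reason an outage is not flagged the instant it starts.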

voidfunc 1603 days ago

To add to this, my experience is that you never want a fully automated status page, for two more reasons:

1. External engineers will start to automate recovery/mitigation processes around your status page if it has real time status.

2. You now need to bug test your status page thoroughly because of #1. It basically becomes an actual API.

10000truths 1603 days ago

That sounds like a problem for the external engineers, not for GitLab.

pixl97 1603 days ago

Users make their problems yours.
