Hacker News
by tacLog 1614 days ago
Wow, that makes complete sense for something that is impacting this many people and by extension lots of money.

Thanks for the answer, I have only ever worked with such a small team that we are all on a call every day.

I imagine it can get a little hectic in large group calls? On the engineering side, is there a command structure? Say the root cause was found and the RC team is rushing to fix it, but another team wants to mitigate in the meantime in a slightly risky way. Would their manager make a case to leadership? Or would the proposed plan just be put out for general comment as a response to that main ticket?

3 comments

It depends. I’ve managed major incidents with hundreds of participants.

Our major incident process generally had a “suit” call with non-technical executives and people who would be coordinating customer triage, outreach, etc. Then we would have a tech bridge where the key stakeholders did their thing.

We used the Federal incident command system as a model. It’s a great reference point to use as an inspiration.

Are there any guides on the "Federal incident command system" you would recommend reading (i.e. so I'm not blindly googling for it)? Thanks.
In addition, you can look into ITIL/ITSM Incident Management plans. They have a well-developed process structure to work from as a guideline.

I have also seen organizations recommend Kepner-Tregoe method training for real-time, high-pressure problem solving, based off NASA Mission Control systems.

https://training.fema.gov/nims/ is a great entry point.
Each company is different. From my experience it would depend on the severity of the fix and the severity of the issue. The problem would get resolved by any means, i.e. a temporary sticking plaster if necessary.

Another team would then analyse the root cause from a company-wide perspective, assess the risks, costs and impact, and then make any modifications (possibly redoing the temporary fix and fixing it properly).

A real example: a call center's main telephony system and one of the management servers kept crashing, causing over 1400 call center people to stop working. The temporary fix was to reboot the servers every 4 hours, causing minor pain, but the call staff were up and running.

After a whole stupid week of the engineers not being able to find the root cause, it was escalated extremely high and our team was brought in, and we found the root cause in seconds (literally). The servers were VMs and the engineers hadn't checked the physical ESX server they were hosted on. Another VM on the box caused the server to go unstable (ESX not configured correctly).

A BAU project was then set up to audit, report on, and fix all the ESX servers in the company for other stupid config issues.

The person you're responding to is not exactly wrong. But since the users dropped to 0 pretty quickly, it's likely that every team with any monitoring at all got paged. At least that's what would happen at the moderately large company I work for.
I'm giving a much broader example of what a large company might do for high impact events. I have no idea what the insides of Roblox look like specifically.
Not to mention a VP or three. A well-led company is going to have management in the line of fire, so to speak, so an outage of this scale would wake them as well.