Hacker News new | ask | show | jobs
by 1970-01-01 238 days ago
I remember Facebook had a similar story when they botched their BGP update and couldn't even access the vault. If you have circular auth, you don't have anything when somebody breaks DNS.
2 comments

Wasn't there an issue where they required physical access to the data center to fix the network, which meant having to tap in with a keycard to get in, which didn't work because the keycard server was down, due to the network being down?
Wishful thinking, but I hope an engineer somewhere got to ram a door down to fix a global outage. For the stories.
Way back when I worked at eBay, we once had a major outage and needed datacenter access. The datacenter process normally took about 5 minutes per person to verify identity and employment, and then scan past the biometric scanners.

On that day, the VP showed up and told the security staff, "just open all the doors!". So they did. If you knew where the datacenter was, you could just walk-in in mess with eBay servers. But since we were still a small ops team, we pretty much knew everyone who was supposed to be there. So security was basically "does someone else recognize you?".

> So security was basically "does someone else recognize you?"

I actually can't think of a more secure protocol. Doesn't scale, though.

Well, you put a lot of trust in the individuals in this case. A disgruntled employee can just let the bad guys in on purpose, saying "Yes they belong here".
That works until they run into a second person. In a big corp where people don't recognize each other you can also let the bad guys in, and once they're in nobody thinks twice about it.
Vulnerable to byzantine fault.
or it could be some troy maybe.
I would imagine this is how it works for the President and Cabinet
way back when DC's were secure but not _that secure_ i social engineered my way close enough to our rack without ID to hit a reset button before getting thrown out.

/those were the days

Oh I've definitely done that. They had remote hands but we were over our rack limit and we didn't want them to see inside.

The early oughts were a different time.

Just to test the security, or...?
late reply but, no, i really needed to hit the button but didn't have valid ID at the time. My driver's license was expired and i couldn't get it renewed because of a outstanding tickets iirc. I was able to talk my way in and had been there many times before so knew my way around and what words to say. I was able to do what i needed before another admin came up and told me that without valid ID they have no choice but to ask me to leave (probably like an insurance thing). I was being a bit dramatic when i said "getting thrown out" the datacenter guys were very nice and almost apologetic about asking me to leave.
I was in a datacenter when the fire alarm went off and all door locks were automatically disabled.
Most modern commercial buildings in Australia unlock doors when the fire alarm goes off.
Lmao, so unathorized access on demand by pulling the fire alarm?
There's some computer lore out there about someone tripping a fire alarm by accident or some other event that triggered a gas system used to put out fires without water but isn't exactly compatible with life. The story goes some poor sys admin had to stand there with their finger on like a pause button until the fire department showed up to disarm the system. If they released the button the gas would flood the whole DC.
Essentially yes. They should really divide data centers into zones and only unlock doors inside a zone where smoke is detected.
Don't ask about fire power switch
That sounds like an Equinix datacenter. They were painfully slow at 350 E. Cermak.
It wasn't Equinix, but I think the vendor was acquired by them. I don't actually blame them, I appreciated their security procedures. The five minutes usually didn't matter.
The story was that they had to use an angle grinder to get in.
I remember hearing Google early in it's history had some sort of emergency back up codes that they encased in concrete to prevent them becoming a casual part of the process and they needed a jack hammer and a couple hours when the supposedly impossible happened after only a couple years.
> To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.

Classic.

In my first job I worked on ATM software, and we had a big basement room full of ATMs for test purposes. The part the money is stored in is a modified safe, usually with a traditional dial lock. On the inside of one of them I saw the instructions on how to change the combination. The final instruction was: "Write down the combination and store it safely", then printed in bold: "Not inside the safe!"

> It took an additional hour for the team to realize that the green light on the smart card reader did not, in fact, indicate that the card had been inserted correctly. When the engineers flipped the card over, the service restarted and the outage ended.

awesome !

That's a wonderful read, thanks for that.
This is how John Wick did it. He buried his gold and weapons in his garage and poured concrete over it.
It only worked for Wick because he is a man of focus, commitment, and sheer will.
This is the way.

There is a video from the lock pick lawyer where he receives a padlock in the mail with so much tape that it takes him whole minutes to unpack.

Concrete is nice, other options are piles of soil or brick in front of the door. There probably is a sweet spot where enough concrete slows down an excavator and enough bricks mixed in the soil slows down the shovel. Extra points if there is no place nearby to dump the rubble.

Probably one of those lost in translation or gradual exaggeration stories.

If you just wanted recovery keys that were secure from being used in an ordinary way you can use Shamir to split the key over a couple hard copies stored in safety deposit boxes a couple different locations.

Louvre gang decides they can make more money contracting to AWS.
The Data center I’m familiar with uses cards and biometrics but every door also has a standard key override. Not sure who opens the safe with the keys but that’s the fallback in case the electronic locks fail.
I prefer to use a sawzall and just go through the wall.
The memory is hazy since it was 15+ years ago, but I'm fairly sure I knew someone who worked at a company whose servers were stolen this way.

The thieves had access to the office building but not the server room. They realized the server room shared a wall with a room that they did have access to, so they just used a sawzall to make an additional entrance.

my across the street neighbor had some expensive bikes stolen this way. The thieves just cut a hole in the side of their garage from the alley, security cameras were facing the driveway and with nothing on the alley side. We (the neighborhood) think they were targeted specifically for the bikes as nothing else was stolen and your average crack head isn't going to make that level of effort.
That would be a sawswall, in that case.
I assume they needed their own air supply because the automatic poison gas system was activating. Then they had to dodge lazers to get to the one button that would stop the nuclear missle launch.

add a bunch of other poinless scifi and evil villan lair tropes in as well...

Most datacenters are fairly boring to be honest. The most exciting thing likely to happen is some sheet metal ripping your hand open because you didn't wear gloves.

Still have my "my other datacenter is made of razorblades and hate" sticker. \o/

They do commonly have poisonous gas though.
I had a summer job at a hospital one year in the data center when an electrician managed to trigger the halon system and we all had to evacuate and wait for the process to finish and the gas to vent. The four firetrucks and station master who shoved up was both annoyed and relieved it was not real.
Not sure if you’re joking but a relatively small datacenter I’m familiar with has reduced oxygen in it to prevent fires. If you were to break in unannounced you would faint or maybe worse (?).
Halon was used back in the day for fire suppression but I thought it was only dangerous at high enough concentrations to suffocate you by displacing oxygen.
No FM200 isn't poisonous.
And lasers come to think of it
there are datacentres not made of razorblades and hate?
Not an active datacenter, but I did get to use a fire extinguisher to knock out a metal-mesh-reinforced window in a secure building once because no one knew where the keys were for an important room.

Management was not happy, but I didn’t get in trouble for it. And yes, it was awesome. Surprisingly easy, especially since the fire extinguisher was literally right next to it.

Sometimes a little good old fashioned mayhem is good for employee morale
Every good firefighter knows this feeling.

Nothing says ‘go ahead, destroy that shit’ like money going up in smoke if you don’t.

P.S. don’t park in front of fire hydrants, because they will have a shit eating grin on their face when they destroy your car- ahem - clear the obstacle - when they need to use it to stop a fire.

Not to speak for the other poster, but yes, they had people experiencing difficulties getting into the data centers to fix the problems.

I remember seeing a meme for a cover of "Meta Data Center Simulator 2021" where hands were holding an angle grinder with rows of server racks in the background.

"Meta Data Center Simulator 2021: As Real As It Gets (TM)"

Yes for some insane reason facebook had EVERYTHING on a single network. The door access not working when you lose BGP routes is especially bad because normal door access systems cache access rules on the local door controllers and thus still work when they lose connectivity to the central server.
Depends. Some have a paranoid mode without caching, because then a physical attacker cannot snip a cable and then use a stolen keycard as easily or something. We had an audit force us to disable caching, which promptly went south at a power outage 2 months later where the electricians couldn't get into the switch room anymore. The door was easy to overcome, however, just a little fiddling with a credit card, no heroic hydraulic press story ;)
Auditors made you disable credential caching but missed the door that could be shimmed open..
Sounds like they earned their fee!
If you aren't going to cache locally than you need redundant access to the server like LTE access and plan for needing to unlock the doors if you lose access to the server.
This sounds similar to AWS services depending on DynamoDB, which sounds like what happened here. Even if under the hood parts of AWS depend on Dynamo, it should be a walled-off instance separate from Dynamo available via us-east-1.
There should be many more smaller instances with smaller blast radius.
Yep. And their internal comms were on the same server if memory serves. They were also down.
I was there at the time, for anyone outside of the core networking teams it was functionally a snow day. I had my manager's phone number, and basically established that everyone was in the same boat and went to the park.

Core services teams had backup communication systems in place prior to that though. IIRC it was a private IRC on separate infra specifically for that type of scenario.

I remember working for a company who insisted all teams had to usr whatever corp instant messaging/chat app but our sysadmin+network team maintained a jabber server + a bunch of core documentation synchronized on a vps in a totally different infrastructure just in case and sure enough there was that a day it came handy.
AWS, for the ultimate backup, relies on a phone call bridge on the public phone network.
Ah, but have they verified how far down the turtles go, and has that changed since they verified it?

In the mid-2000s most of the conference call traffic started leaving copper T1s and going onto fiber and/or SIP switches managed by Level3, Global Crossing, Qwest, etc. Those companies combined over time into Century Link which was then rebranded Lumen.

As of last October, Lumen is now starting to integrate more closely with AWS, managing their network with AWS's AI: https://convergedigest.com/lumen-expands-fiber-network-to-su...

"Oh what a tangled web we weave..."

I once suggested at work that we list diesel distributors using payment infra not on on us near our datacenters.
Thanks for the correction, that sounds right. I thought I had remembered IRC but wasn't sure.
That's similar to the total outage of all Rogers services in Canada back on July 7th 2022. It was compounded by the fact that the outage took out all Rogers cell phone service, making it impossible for Rogers employees to communicate with each other during the outage. A unified network means a unified failure mode.

Thankfully none of my 10 Gbps wavelengths were impacted. Oh did I appreciate my aversion to >= layer 2 services in my transport network!

That's kind of a weird ops story, since SRE 101 for oncall is to not rely on the system you're oncall for to resolve outages in it. This means if you're oncall for communications of some kind, you must have some other independent means of reaching eachother (even if it's a competitor phone network)
That is heavily contingent on the assumption that the dependencies between services are well documented and understood by the people building the systems.
Are you asserting that Rogers employees needed documentation to know that Rogers Wireless runs on Rogers systems?
Rogers is perhaps best described as a confederacy of independent acquisitions. In working with their sales team, I have had to tell them where there facilities are as the sales engineers don't always know about all of the assets that Rogers owns.

There's also the insistence that Rogers employees should use Rogers services. Paying for every Rogers employee to have Bell cell phone would not sit well with their executives.

That the risk assessments of the changes being made to the router configuration were incorrect also contributed to the outage.