| HN Mirror


	by 1970-01-01 238 days ago
	I remember Facebook had a similar story when they botched their BGP update and couldn't even access the vault. If you have circular auth, you don't have anything when somebody breaks DNS.

2 comments

crote 238 days ago

Wasn't there an issue where they required physical access to the data center to fix the network, which meant having to tap in with a keycard to get in, which didn't work because the keycard server was down, due to the network being down?

link

kylecazar 238 days ago

Wishful thinking, but I hope an engineer somewhere got to ram a door down to fix a global outage. For the stories.

link

jedberg 238 days ago

Way back when I worked at eBay, we once had a major outage and needed datacenter access. The datacenter process normally took about 5 minutes per person to verify identity and employment, and then scan past the biometric scanners.

On that day, the VP showed up and told the security staff, "just open all the doors!". So they did. If you knew where the datacenter was, you could just walk in and mess with eBay servers. But since we were still a small ops team, we pretty much knew everyone who was supposed to be there. So security was basically "does someone else recognize you?".

link

terminalshort 238 days ago

> So security was basically "does someone else recognize you?"

I actually can't think of a more secure protocol. Doesn't scale, though.

link

goatking 237 days ago

Well, you put a lot of trust in the individuals in this case. A disgruntled employee can just let the bad guys in on purpose, saying "Yes they belong here".

link

terminalshort 237 days ago

That works until they run into a second person. In a big corp where people don't recognize each other you can also let the bad guys in, and once they're in nobody thinks twice about it.

link

0x457 237 days ago

Vulnerable to byzantine fault.

link

ermis 237 days ago

or it could be some troy maybe.

link

peterbecich 237 days ago

I would imagine this is how it works for the President and Cabinet

link

chasd00 238 days ago

way back when DC's were secure but not _that secure_ i social engineered my way close enough to our rack without ID to hit a reset button before getting thrown out.

/those were the days

link

bandrami 237 days ago

Oh I've definitely done that. They had remote hands but we were over our rack limit and we didn't want them to see inside.

The early oughts were a different time.

link

jedberg 238 days ago

Just to test the security, or...?

link

chasd00 237 days ago

late reply but, no, i really needed to hit the button but didn't have valid ID at the time. My driver's license was expired and i couldn't get it renewed because of outstanding tickets iirc. I was able to talk my way in and had been there many times before so knew my way around and what words to say. I was able to do what i needed before another admin came up and told me that without valid ID they have no choice but to ask me to leave (probably like an insurance thing). I was being a bit dramatic when i said "getting thrown out"; the datacenter guys were very nice and almost apologetic about asking me to leave.

link

UltraSane 238 days ago

I was in a datacenter when the fire alarm went off and all door locks were automatically disabled.

link

anthonyeden 237 days ago

Most modern commercial buildings in Australia unlock doors when the fire alarm goes off.

link

johnisgood 237 days ago

Lmao, so unauthorized access on demand by pulling the fire alarm?

link

chasd00 237 days ago

There's some computer lore out there about someone tripping a fire alarm by accident or some other event that triggered a gas system used to put out fires without water but isn't exactly compatible with life. The story goes some poor sys admin had to stand there with their finger on like a pause button until the fire department showed up to disarm the system. If they released the button the gas would flood the whole DC.

link

UltraSane 237 days ago

Essentially yes. They should really divide data centers into zones and only unlock doors inside a zone where smoke is detected.

link

folmar 237 days ago

Don't ask about fire power switch

link

E39M5S62 238 days ago

That sounds like an Equinix datacenter. They were painfully slow at 350 E. Cermak.

link

jedberg 238 days ago

It wasn't Equinix, but I think the vendor was acquired by them. I don't actually blame them, I appreciated their security procedures. The five minutes usually didn't matter.

link

wolpoli 238 days ago

The story was that they had to use an angle grinder to get in.

link

jonbiggums22 238 days ago

I remember hearing that Google, early in its history, had some sort of emergency backup codes that they encased in concrete to prevent them from becoming a casual part of the process, and that they needed a jackhammer and a couple of hours when the supposedly impossible happened after only a couple of years.

link

dgl 238 days ago

Not quite; you're probably thinking of: https://google.github.io/building-secure-and-reliable-system...

link

brazzy 237 days ago

> To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.

Classic.

In my first job I worked on ATM software, and we had a big basement room full of ATMs for test purposes. The part the money is stored in is a modified safe, usually with a traditional dial lock. On the inside of one of them I saw the instructions on how to change the combination. The final instruction was: "Write down the combination and store it safely", then printed in bold: "Not inside the safe!"

link

gofreddygo 233 days ago

> It took an additional hour for the team to realize that the green light on the smart card reader did not, in fact, indicate that the card had been inserted correctly. When the engineers flipped the card over, the service restarted and the outage ended.

awesome !

link

paranoidrobot 238 days ago

That's a wonderful read, thanks for that.

link

prepend 238 days ago

This is how John Wick did it. He buried his gold and weapons in his garage and poured concrete over it.

link

selcuka 238 days ago

It only worked for Wick because he is a man of focus, commitment, and sheer will.

link

6510 237 days ago

This is the way.

There is a video from the lock pick lawyer where he receives a padlock in the mail with so much tape that it takes him whole minutes to unpack.

Concrete is nice, other options are piles of soil or brick in front of the door. There probably is a sweet spot where enough concrete slows down an excavator and enough bricks mixed in the soil slows down the shovel. Extra points if there is no place nearby to dump the rubble.

link

jasonwatkinspdx 237 days ago

Probably one of those lost in translation or gradual exaggeration stories.

If you just wanted recovery keys that were secure from being used in an ordinary way, you could use Shamir's secret sharing to split the key over a couple of hard copies stored in safety deposit boxes in a couple of different locations.
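
For anyone who hasn't seen it, here is a minimal sketch of the idea in Python. It is only an illustration: the prime, the function names, and the 3-of-5 split in the usage note are my own assumptions, and real key material should go through an audited library rather than something like this.

```python
import secrets

# Assumption: a 127-bit Mersenne prime as the field modulus.
# The secret must be an integer smaller than this.
PRIME = 2**127 - 1


def split_secret(secret, threshold, num_shares):
    """Split `secret` into `num_shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in the range [0, PRIME)")
    if not 1 <= threshold <= num_shares:
        raise ValueError("need 1 <= threshold <= num_shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0 to get the secret back."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With something like `shares = split_secret(key, 3, 5)`, each share would go on its own hard copy, and any three of the five are enough for `recover_secret`. Fewer than three reveal nothing about the key.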

link

hshdhdhehd 238 days ago

Louvre gang decides they can make more money contracting to AWS.

link

SoftTalker 238 days ago

The Data center I’m familiar with uses cards and biometrics but every door also has a standard key override. Not sure who opens the safe with the keys but that’s the fallback in case the electronic locks fail.

link

bombcar 238 days ago

I prefer to use a sawzall and just go through the wall.

link

adrianmonk 238 days ago

The memory is hazy since it was 15+ years ago, but I'm fairly sure I knew someone who worked at a company whose servers were stolen this way.

The thieves had access to the office building but not the server room. They realized the server room shared a wall with a room that they did have access to, so they just used a sawzall to make an additional entrance.

link

chasd00 237 days ago

my across the street neighbor had some expensive bikes stolen this way. The thieves just cut a hole in the side of their garage from the alley, security cameras were facing the driveway and with nothing on the alley side. We (the neighborhood) think they were targeted specifically for the bikes as nothing else was stolen and your average crack head isn't going to make that level of effort.

link

oblio 237 days ago

That would be a sawswall, in that case.

link

bluGill 238 days ago

I assume they needed their own air supply because the automatic poison gas system was activating. Then they had to dodge lasers to get to the one button that would stop the nuclear missile launch.

add a bunch of other pointless scifi and evil villain lair tropes in as well...

link

donalhunt 238 days ago

Most datacenters are fairly boring to be honest. The most exciting thing likely to happen is some sheet metal ripping your hand open because you didn't wear gloves.

Still have my "my other datacenter is made of razorblades and hate" sticker. \o/

link

formerly_proven 238 days ago

They do commonly have poisonous gas though.

link

christkv 237 days ago

I had a summer job at a hospital one year in the data center when an electrician managed to trigger the halon system and we all had to evacuate and wait for the process to finish and the gas to vent. The four firetrucks and station master who showed up were both annoyed and relieved it was not real.

link

maaaaattttt 238 days ago

Not sure if you’re joking but a relatively small datacenter I’m familiar with has reduced oxygen in it to prevent fires. If you were to break in unannounced you would faint or maybe worse (?).

link

mrgoldenbrown 238 days ago

Halon was used back in the day for fire suppression but I thought it was only dangerous at high enough concentrations to suffocate you by displacing oxygen.

link

UltraSane 238 days ago

No, FM200 isn't poisonous.

link

ArnoVW 238 days ago

And lasers come to think of it

link

tacticus 238 days ago

there are datacentres not made of razorblades and hate?

link

lazide 237 days ago

Not an active datacenter, but I did get to use a fire extinguisher to knock out a metal-mesh-reinforced window in a secure building once because no one knew where the keys were for an important room.

Management was not happy, but I didn’t get in trouble for it. And yes, it was awesome. Surprisingly easy, especially since the fire extinguisher was literally right next to it.

link

geephroh 237 days ago

Sometimes a little good old fashioned mayhem is good for employee morale

link

lazide 237 days ago

Every good firefighter knows this feeling.

Nothing says ‘go ahead, destroy that shit’ like money going up in smoke if you don’t.

P.S. don’t park in front of fire hydrants, because they will have a shit eating grin on their face when they destroy your car- ahem - clear the obstacle - when they need to use it to stop a fire.

link

lenerdenator 238 days ago

Not to speak for the other poster, but yes, they had people experiencing difficulties getting into the data centers to fix the problems.

I remember seeing a meme for a cover of "Meta Data Center Simulator 2021" where hands were holding an angle grinder with rows of server racks in the background.

"Meta Data Center Simulator 2021: As Real As It Gets (TM)"

link

UltraSane 238 days ago

Yes, for some insane reason Facebook had EVERYTHING on a single network. The door access not working when you lose BGP routes is especially bad because normal door access systems cache access rules on the local door controllers and thus still work when they lose connectivity to the central server.
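
A rough sketch of what that local caching amounts to, in Python. The class and function names are hypothetical and not taken from any real access-control product; a real controller would also expire old entries and log every decision.

```python
class ServerUnreachable(Exception):
    """Raised by the (hypothetical) fetch function when the central server is down."""


class DoorController:
    """Keeps the last known set of allowed card IDs so the door works offline."""

    def __init__(self, fetch_allowed_cards):
        # fetch_allowed_cards: callable returning an iterable of card IDs,
        # or raising ServerUnreachable when the network is gone.
        self._fetch_allowed_cards = fetch_allowed_cards
        self._cached_cards = frozenset()

    def refresh(self):
        """Try to update the cache; return True on success, False if offline."""
        try:
            self._cached_cards = frozenset(self._fetch_allowed_cards())
            return True
        except ServerUnreachable:
            # Keep the previous rules instead of locking everyone out.
            return False

    def is_allowed(self, card_id):
        """Decide from fresh rules when possible, otherwise from the cache."""
        self.refresh()
        return card_id in self._cached_cards
```

The tradeoff is that a card revoked during an outage keeps working until the controller can reach the server again.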

link

holowoodman 238 days ago

Depends. Some have a paranoid mode without caching, because then a physical attacker cannot snip a cable and then use a stolen keycard as easily or something. We had an audit force us to disable caching, which promptly went south at a power outage 2 months later where the electricians couldn't get into the switch room anymore. The door was easy to overcome, however, just a little fiddling with a credit card, no heroic hydraulic press story ;)

link

jordanb 237 days ago

Auditors made you disable credential caching but missed the door that could be shimmed open..

link

AbstractH24 237 days ago

Sounds like they earned their fee!

link

UltraSane 237 days ago

If you aren't going to cache locally then you need redundant access to the server, like LTE access, and a plan for unlocking the doors if you lose access to the server.

link

avidphantasm 238 days ago

This sounds similar to AWS services depending on DynamoDB, which sounds like what happened here. Even if under the hood parts of AWS depend on Dynamo, it should be a walled-off instance separate from Dynamo available via us-east-1.

link

UltraSane 238 days ago

There should be many more smaller instances with smaller blast radius.

link

junon 238 days ago

Yep. And their internal comms were on the same server if memory serves. They were also down.

link

simplyluke 238 days ago

I was there at the time, for anyone outside of the core networking teams it was functionally a snow day. I had my manager's phone number, and basically established that everyone was in the same boat and went to the park.

Core services teams had backup communication systems in place prior to that though. IIRC it was a private IRC on separate infra specifically for that type of scenario.

link

prmoustache 238 days ago

I remember working for a company that insisted all teams had to use whatever corp instant messaging/chat app, but our sysadmin+network team maintained a jabber server + a bunch of core documentation synchronized on a vps in a totally different infrastructure just in case, and sure enough there was a day it came in handy.

link

DevelopingElk 238 days ago

AWS, for the ultimate backup, relies on a phone call bridge on the public phone network.

link

gregw2 238 days ago

Ah, but have they verified how far down the turtles go, and has that changed since they verified it?

In the mid-2000s most of the conference call traffic started leaving copper T1s and going onto fiber and/or SIP switches managed by Level3, Global Crossing, Qwest, etc. Those companies combined over time into CenturyLink, which was then rebranded Lumen.

As of last October, Lumen is now starting to integrate more closely with AWS, managing their network with AWS's AI: https://convergedigest.com/lumen-expands-fiber-network-to-su...

"Oh what a tangled web we weave..."

link

wbl 237 days ago

I once suggested at work that we list diesel distributors using payment infra not on us near our datacenters.

link

junon 238 days ago

Thanks for the correction, that sounds right. I thought I had remembered IRC but wasn't sure.

link

bcrl 238 days ago

That's similar to the total outage of all Rogers services in Canada back on July 8th 2022. It was compounded by the fact that the outage took out all Rogers cell phone service, making it impossible for Rogers employees to communicate with each other during the outage. A unified network means a unified failure mode.

Thankfully none of my 10 Gbps wavelengths were impacted. Oh did I appreciate my aversion to >= layer 2 services in my transport network!

link

YokoZar 238 days ago

That's kind of a weird ops story, since SRE 101 for oncall is to not rely on the system you're oncall for to resolve outages in it. This means if you're oncall for communications of some kind, you must have some other independent means of reaching each other (even if it's a competitor phone network).

link

bcrl 238 days ago

That is heavily contingent on the assumption that the dependencies between services are well documented and understood by the people building the systems.
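
If the dependencies were written down, even as a plain map, the check itself would be small. A sketch in Python, where the service names and the graph are made up for illustration and do not describe any real network:

```python
def transitive_deps(graph, start):
    """Return everything `start` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def unsafe_recovery_tools(graph, system, recovery_tools):
    """List the recovery tools that stop working when `system` is down."""
    return sorted(
        tool
        for tool in recovery_tools
        if tool == system or system in transitive_deps(graph, tool)
    )


# Hypothetical dependency map: each service lists what it needs to function.
EXAMPLE_GRAPH = {
    "oncall_phone": ["cell_network"],
    "cell_network": ["core_router"],
    "backup_irc": ["external_vps"],
    "core_router": [],
    "external_vps": [],
}
```

Here `unsafe_recovery_tools(EXAMPLE_GRAPH, "core_router", ["oncall_phone", "backup_irc"])` returns `["oncall_phone"]`, because the phone rides on the network being repaired. The hard part is keeping the map accurate, not running the check.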

link

YokoZar 236 days ago

Are you asserting that Rogers employees needed documentation to know that Rogers Wireless runs on Rogers systems?

link

bcrl 236 days ago

Rogers is perhaps best described as a confederacy of independent acquisitions. In working with their sales team, I have had to tell them where their facilities are, as the sales engineers don't always know about all of the assets that Rogers owns.

There's also the insistence that Rogers employees should use Rogers services. Paying for every Rogers employee to have a Bell cell phone would not sit well with their executives.

That the risk assessments of the changes being made to the router configuration were incorrect also contributed to the outage.

link