HN Mirror


by guidopallemans 1726 days ago

He just deleted all his updates.

user:

https://old.reddit.com/user/ramenporn

some messages:

* This is a global outage for all FB-related services/infra (source: I'm currently on the recovery/investigation team).

* Will try to provide any important/interesting bits as I see them. There is a ton of stuff flying around right now and like 7 separate discussion channels and video calls.

* Update 1440 UTC:

    As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC).

    There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access are separate from the people with knowledge of how to actually authenticate to the systems and the people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified.

    Part of this is also due to lower staffing in data centers due to pandemic measures.

5 comments

Narushia 1726 days ago

The 1440 UTC update is also archived on the Wayback Machine: https://web.archive.org/web/20211004171424/https://old.reddi...

And archive.today: https://archive.ph/sMgCi

link

yholio 1726 days ago

Essentially, they locked themselves out with an uninspired command line at the exact moment the datacenter was being hijacked by ape-people.

Yup, corporate comms won't love these status updates.

link

wtf-is-ur-prblm 1726 days ago

Sorry, are you referring to data center technicians as “ape people”?

link

z-nexx 1726 days ago

As a former data center technician, I wouldn't say it's too far off

link

ticklemyelmo 1726 days ago

But we're all ape people.

link

samstave 1726 days ago

https://i.imgur.com/O4yEget.png

link

korethr 1726 days ago

I mean, when I last worked in a NOC, we used to call ourselves "NOC monkeys", so yeah. If you're in the NOC, you're a NOC monkey; if you're on the floor, you're a floor monkey. And so on.

link

meowface 1724 days ago

Same with "SOC monkeys". (Which carries the additional pun of sounding like the "sock monkey" toy.)

link

samstave 1726 days ago

Are you fucking kidding me?

We even had a site and operation for a long while called:

"NOC MONKEY .DOT ORG"

We called all of ourselves NOC MONKEYS. [[Remote Hands]]

Yeah, that was a term used widely.

I'm 46. I assume you are < #

---

Where were you in 1997 building out the very first XML implementations to replace EDI from AS400s to FTP EDI file retrievals via some of the first Linux FTP servers based in SV?

I was there? Remember LinuxCare?

link

eska 1726 days ago

Are you ok, Sir?

link

ShamelessC 1725 days ago

Weren't able to get their ego-fill on facebook like normally.

link

Ueland 1726 days ago

And there his account went poof, thanks for archiving.

link

treesknees 1726 days ago

They were quoted on multiple news sites including Ars Technica. I would imagine they were not authorized to post that information. I hope they don't lose their job.

Shareholders and other business leaders I'm sure are much happier reporting this as a series of unfortunate technical failures (which I'm sure is part of it) rather than a company-wide organizational failure. The fact they can't physically badge in the people who know the router configuration speaks to an organization that hasn't actually thought through all its failure modes. People aren't going to like that. It's not uncommon to have the datacenter techs with access and the actual software folks restricted, but that being the reason one of the most popular services in the world has been down for nearly 3 hours now will raise a lot of questions.

Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.

link

jart 1726 days ago

Facebook should have had a panic room.

Operations teams normally have a special room with a secure connection for situations like this, so that production can be controlled in the event of bgp failure, nuclear war, etc. I could see physical presence being an issue if their bgp router depends on something like a crypto module in a locked cage, in which case there's always helicopters.

So if anything, Facebook's labor policies are about to become cooler.

link

samhw 1726 days ago

Yup, it's terrifying how much is ultimately, ultimately dependent on dongles and trust. I used to work at a company with a billion or so in a bank account (obviously a rather special type of account), which was ultimately authorised by three very trusted people who were given dongles.

link

cyberpunk 1726 days ago

What did the dongles do?

link

samhw 1726 days ago

Sorry, I should have been clearer - the dongles controlled access to that bank account. It was a bank account for banks to hold funds in. (Not our real capital reserves, but sort of like a current account / checking account for banks.)

I was friends with one of those people, and I remember a major panic one time when 2 out of 3 dongles went missing. I'm not sure if we ever found out whether it was some kind of physical pen test, or an astonishingly well-planned heist which almost succeeded - or else a genuine, wildly improbable accident.

link

mike_d 1726 days ago

I would be absolutely shocked if they didn't.

The problem is when your networking core goes down, even if you get in via a backup DSL connection or something to the datacenter, you can't get from your jump host to anything else.

link

jart 1726 days ago

It helps if your DSL line is bridging at layer 2 in the OSI model using rotated PSKs, so it won't be impacted by DNS/BGP/auth/routing failures. That's why you need to put it in a panic room.

link

sulam 1726 days ago

That model works great, until you need to ask for permission to go into the office, and the way to get permission is to use internal email and ticketing systems, which are also down.

link

jart 1725 days ago

Operations teams don't need permission from some apparatchik to enter the office when production goes down. If they can't get in, they drill.

link

Sebb767 1726 days ago

> nuclear war

I think you need some convincing to keep your SREs on-site in case of a nuclear war ;)

link

cyberpunk 1726 days ago

Hey, if I can take the kids and there’s food for a decade and a bunker I’m probably in ;)

link

legitster 1726 days ago

I'm not sure why shareholders are lumped in here. A lot of reasons companies do the secret squirrel routine is to hide their incompetence from the shareholders.

link

treesknees 1726 days ago

That is what I meant, although you have lots of executives and chiefs who are also shareholders.

link

polote 1726 days ago

> an organization that hasn't actually thought through all its failure modes

Thinking about any potential things that can happen is impossible

link

depereo 1726 days ago

You don't need to consider 'what if a meteor hit the data centre and also it was made of cocaine'. You do need to think through "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."

link

fragmede 1726 days ago

In a company the size of Facebook, "everything is turned off" has never happened since before the company was founded 17 years ago. This makes it very hard to be sure you can bring it all back online! Every time you try it, there are going to be additional issues that crop up, and even when you think you've found them all, a new team that you've never heard of before has wedged themselves into the data-center boot-up flow.

The meteor isn't made of cocaine, but four of them hitting at exactly the same time is freakishly improbable. There are other, bigger fish to fry, so we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.

link

depereo 1725 days ago

> we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.

I think that suggests that there were not bigger fish to fry :)

I take your point on priorities, but in a company the size of facebook perhaps a team dedicated to understanding the challenges around 'from scratch' kickstarting of the infrastructure could be funded and part of the BCP planning - this is a good time to have a binder with, if not perfectly up-to-date data, pretty damned good indications of a process to get things working.

link

cesarb 1726 days ago

> "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."

The electricity people have a name for that: black start (https://en.wikipedia.org/wiki/Black_start). It's something they actively plan for, regularly test, and once in a while, have to use in anger.

link

depereo 1725 days ago

It's a process I'm familiar with gaming out. For our infrastructure, we need to discuss and update our plan for this from time to time, from 'getting the generator up and running' through to 'accessing credentials when the secret server is not online' and 'configuring network equipment from scratch'.
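The "gaming out" described above can be sketched as a dependency-ordering exercise. This is only an illustration under stated assumptions: the system names and their dependencies below are hypothetical, not taken from any real runbook or from Facebook's infrastructure. It uses Python's standard `graphlib` module to produce one valid bring-up order and to flag a circular dependency, which is the situation where nothing can be started first.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical cold-start dependencies: each key needs every item in its set
# to be running first. None of these names come from a real runbook.
COLD_START_DEPS = {
    "generator": set(),
    "network_core": {"generator"},
    "offline_credentials": set(),  # e.g. a sealed envelope in a safe
    "network_config": {"network_core", "offline_credentials"},
    "dns": {"network_config"},
    "secret_server": {"dns"},
    "applications": {"dns", "secret_server"},
}


def bring_up_order(deps):
    """Return one valid start-up order, or raise ValueError on a circular dependency."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # The second element of exc.args is the list of nodes forming the cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


order = bring_up_order(COLD_START_DEPS)
print(order)
```

If the offline credentials were instead stored only on the secret server, the same function would raise, because the secret server itself cannot come up until the network has been configured with those credentials.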

link

JabavuAdams 1726 days ago

I love that when you had to think of a random improbable event, you thought of a cocaine meteor. But ... hell YES!

link

radicalbyte 1726 days ago

Luckily you don't need to do that exhaustively: all you have to do is cover the general failure case. What happens when communications fail?

This is something that most people aren't good at naturally, it tends to come from experience.

link

jnwatson 1726 days ago

Right, but imagining that DNS goes down doesn’t take a science fiction author.

link

mynameisvlad 1726 days ago

Of course you can’t think of every potential scenario possible, but an incorrect configuration and rollback should be pretty high in any team’s risk/disaster recovery/failure scenario documentation.

link

philwelch 1726 days ago

This is true, but it's not an excuse for not preparing for the contingencies you can anticipate. You're still going to be clobbered by an unanticipated contingency sooner or later, but when that happens, you don't want to feel like a complete idiot for failing to anticipate a contingency that was obvious even without the benefit of hindsight.

link

deanCommie 1726 days ago

> I hope they don't lose their job.

I hope they do.

#1 it's a clear breach of corporate confidentiality policies. I can say that without knowing anything about Facebook's employment contracts. Posting insider information about internal company technical difficulties is going to be against employment guidelines at any Big Co.

In a situation like this that might seem petty and cagey. But zooming out and looking at the bigger picture, it's first and foremost a SECURITY issue. Revealing internal technical and status updates needs to go through high-level management, security, and LEGAL approvals, lest you expose the company to increased security risk by revealing gaps that do not need to be publicized.

(Aside: This is where someone clever might say "Security by obscurity is not a strategy". It's not the ONLY strategy, but it absolutely is PART of an overall security strategy.)

#2 just purely from a prioritization/management perspective, if this was my employee, I would want them spending their time helping resolve the problem not post about it on reddit. This one is petty, but if you're close enough to the issue to help, then help. And if you're not, don't spread gossip - see #1.

link

samhw 1726 days ago

You're very, very right - and insightful - about the consequences of sharing this information. I agree with you on that. I don't think you're right that firing people is the best approach.

Irrespective of the question of how bad this was, you don't fix things by firing Guy A and hoping that the new hire Guy B will do it better. You fix it by training people. This employee has just undergone some very expensive training, as the old meme goes.

link

deanCommie 1726 days ago

I feel this way about mistakes, and fuckups.

Whoever is responsible for the BGP misconfiguration that caused this should absolutely not be fired, for example.

But training about security, about not revealing confidential information publicly, etc is ubiquitous and frequent at big co's. Of course, everyone daydreams through them and doesn't take it seriously. I think the only way to make people treat it seriously is through enforcement.

link

unethical_ban 1726 days ago

I feel you're thinking through this with a "purely logical" standpoint and not a "reality" standpoint. You're thinking worst case scenario for the CYA management, having more sympathy for the executive managers than for the engineer providing insight to the tech public.

It seems like a fundamental difference of "who gives a shit about corporate" from my side. The level of detail provided isn't going to get nationstates anything they didn't already know.

link

deanCommie 1726 days ago

Yeah but what is the tech public going to do with these insights?

It's not actionable, it's not whistleblowing, it's not triggering civic action, or offering a possible timeline for recovery.

It's pure idle chitchatter.

So yeah, I do give a shit about corporate here.

Disclosure: While I'm an engineer too, I'm also high enough in the ladder that at this point I am more corporate than not. So maybe I'm a stooge and don't even realize it.

link

fragmede 1722 days ago

Facebook, the social media website, is used almost exclusively for 'idle chitchatter', so you may want to avoid working there if your opinion of the user is so low. (Actually, you'll probably fit right in at Facebook.)

It's unclear to me how a 'high enough in the ladder' manager doesn't realize that there are easily a dozen people who know the situation intimately but who can't do anything until a system they depend on is up. "Get back to work" is... the system is down, what do you want them to do, code with a pencil and paper?

ramenporn violated the corporate communication policy, obviously, but a good manager's tone and approach toward an IC who was doing this online isn't to make it about corporate vs. them/the team; it is, in fact, to encourage them to do more such communication, just internally. (I'm sure there was a ton of internal communication; the point is to note where ramenporn's communicative energy was coming from, and nurture that, and not destroy that in the process of chiding them for breaking policy.)

link

jfrunyon 1726 days ago

> Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.

You're conflating working remotely ("a plane ride away") and working from home.

You're also conflating the people who are responsible for network configuration and for coming up with a plan to fix this with the people who are responsible for physically interacting with systems. Regardless of WFH, those two sets likely have no overlap at a company the size of Facebook.

link

_joel 1726 days ago

There could be something in the contract that requires all community interaction to go via PR official channels.

It's innocuous enough, but leaking info, no matter what, will be a problem if it's stated in their contract.

link

htrp 1726 days ago

100%! comms will want to proof any statement made by anybody along with legal to ensure that there is no D&O liability for sec fraud.

link

rusk 1726 days ago

> an organization that hasn't actually thought through all its failure modes

Move Fast and Break Things!

link

keithnoizu 1726 days ago

I came here to move fast and break things, and I'm all out of move fast.

link

avs733 1725 days ago

In their defense they really lived up to their mission statement today.

link

projectazorian 1726 days ago

I doubt WFH will be impacted by this - not an insider but seems unlikely that the relevant people were on-site at data centers before COVID

link

vineyardmike 1726 days ago

> I doubt WFH will be impacted by this - not an insider but seems unlikely that the relevant people were on-site at data centers before COVID

I think the issue is less "were the right people in the data center" and more "we have no way to contact our co-workers once the internal infrastructure goes down". In non-WFH you physically walk to your co-worker's desk and say "hey, fb messenger is down and we should chat, what's your number?". This proves that self-hosting your infra (1) is dangerous and (2) makes you susceptible to super-failures if comms go down during WFH.

Major tech companies (GAFAM+) all self-host and use internal tools, so they're all at risk of this sort of comms breakdown. I know I don't have any co-worker's number (except one from WhatsApp, which if I worked at FB wouldn't be useful now).

link

saagarjha 1725 days ago

Apple is all on Slack.

link

vineyardmike 1725 days ago

But is it a publicly hosted slack, or does apple host it themselves?

link

practice9 1726 days ago

Most of the stuff was probably implemented before COVID anyways.

They will fix the issue and add more redundant communication channels, which is either an improvement or a non-event for WFH.

And Zuck is slowly moving (dogfooding) company culture to remote too with their Quest work app experiments

link

fanbelt 1726 days ago

They must have been moving very fast!

link

rStar 1726 days ago

shoestring budget on a billion dollar product. you get what you deserve.

link

swayson 1726 days ago

> I hope they don't lose their job.

FB has such poor integrity, I'd not be surprised if they take such extreme measures.

link

kukx 1726 days ago

It is a matter of preparation. You can make sure there are KVMoIPs or other OOB technologies available on site to allow direct access from a remote location. In the worst case, a technician has to know how to connect the OOB device or press a power button ;)

link

treesknees 1726 days ago

I'm not disagreeing with you, however clearly (if the reddit posts were legitimate) some portion of their OOB/DR procedure depended on a system that's down. From old coworkers who are at FB, their internal DNS and logins are down. It's possible that the username/password/IP of an OOB KVM device is stored in some database that they can't login to. And the fact FB has been down for nearly 4 hours now suggests it's not as simple as plugging in a KVM.
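The hidden-dependency problem described above (an OOB device whose credentials live in a database that is itself unreachable) can be checked mechanically. The following is a minimal sketch with an entirely hypothetical dependency map; the names are illustrative and say nothing about how Facebook's systems are actually wired. Given a set of systems that are down, it computes everything that becomes unusable as a consequence.

```python
# Hypothetical map: each recovery tool or system -> what it needs in order to work.
NEEDS = {
    "internal_dns": {"backbone_routing"},
    "login_service": {"internal_dns"},
    "credential_db": {"login_service"},
    "oob_kvm": {"credential_db"},  # assumption: the KVM password lives in the database
    "badge_access": {"internal_dns"},
    "printed_runbook": set(),  # depends on nothing
}


def unusable(needs, initially_down):
    """Return everything that is down or transitively depends on something down."""
    down = set(initially_down)
    changed = True
    while changed:  # repeat until no new item gets marked as down
        changed = False
        for item, deps in needs.items():
            if item not in down and deps & down:
                down.add(item)
                changed = True
    return down


blocked = unusable(NEEDS, {"backbone_routing"})
usable = sorted(set(NEEDS) - blocked)
print(usable)
```

In this made-up map, a routing failure leaves only the printed runbook usable, which is the kind of result a DR review would want to surface before an outage rather than during one.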

link

kukx 1726 days ago

I was referring to the WFH aspect the parent post mentioned. My point was that the admins could get the same level of access as if they were physically on site, assuming the correct setup.

link

harias 1726 days ago

Pushshift maintains archives of Reddit. You can use camas reddit search to view them.

Comments by u/ramenporn: https://camas.github.io/reddit-search/#{%22author%22:%22rame...

link

tornato7 1726 days ago

PushShift is one of the most amazing resources out there for social media data, and more people should know about it.

link

madars 1726 days ago

Can you recommend similar others (or maybe how to find them)? I learned of PushShift because snew, an alternative reddit frontend showing deleted comments, was making fetch requests and I had to whitelist it in uMatrix. Did not know about Camas until today.

link

rodgerd 1726 days ago

If it was actually someone in Facebook, their job is gone by now, too.

link

dschiavu 1726 days ago

It's time to decentralize and open up the Internet again, as it once was (ie. IRC, NNTP and other open protocols) instead of relying on commercial entities (Google, Facebook, Amazon) to control our data and access to it.

link

mlindner 1726 days ago

I'll throw in Discord into that mix, the thing that basically mostly killed IRC. Which is yet again centralized despite pretending that it is not centralized.

link

meragrin_ 1726 days ago

The account has been deleted as well.

link

DaiPlusPlus 1726 days ago

What are they afraid of? While they are sharing information that's internal/proprietary to the company, it isn't anything particularly sensitive and having some transparency into the problem is good for everyone.

Who'd want to work for a company that might take disciplinary action because an SRE posted a reddit comment to basically say "BGP's down lol" - If I was in charge I'd give them a modest EOY bonus for being helpful in their outreach to my users in the wider community.

link

handmodel 1726 days ago

Seems reasonable that at a company of 60k, with hundreds who specialize in PR, you do not want a random engineer making the choice himself to be the first to talk to the press by giving a PR conference on a random forum.

link

OskarS 1726 days ago

Honestly, from a PR perspective, I’m not sure it’s so bad. Giving honest updates showing Facebook hard at work is certainly better PR for our kind of crowd than whatever actual Facebook PR is doing.

link

ALittleLight 1726 days ago

That one guy's comments seem fine from a PR perspective, apart from it not being his role to communicate for the company.

I still think he should be fired for this kind of communication though. One reason is, imagine Facebook didn't punish breaches of this type. Every other employee is going to be thinking "Cool, I could be in a Wired article" or whatever. All they have to do is give sensitive company information to reporters.

Either you take corporate confidentiality seriously or you don't. Posting details of a crisis in progress on your Reddit account is not taking corporate confidentiality seriously. If the Facebook corporation lightly punishes, scolds, or ignores this person then the corporation isn't taking confidentiality seriously either.

link

confiq 1726 days ago

I agree, but try to explain that to PR people...

link

ballenf 1726 days ago

It's terrible PR for the FB PR team's performance.

link

staticassertion 1726 days ago

Reporters are going to opportunistically start writing about those comments vs having to wait for a controlled message from a communications team. So the reddit posts might not be "so bad", but they're also early and preempting any narrative they may want to control.

link

mike_d 1726 days ago

You falsely assume Hacker News is even remotely what Facebook PR gives a shit about.

link

orangepanda 1726 days ago

That was their best PR in years

link

Animats 1726 days ago

Compare Facebook's official tweet: "We’re aware that some people are having trouble accessing our apps and products. We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience."

That's the PR team, clueless.

link

HelloNurse 1726 days ago

I don't think Facebook could actually say anything more accurate or more honest. "Everything is dead, we are unable to recover, and we are violently ashamed" would be a more fun statement, but not a more useful one.

There will be plenty of time to blame someone, share technical lessons, fire a few departments, attempt to convince the public it won't happen again, and so on.

link

tornato7 1726 days ago

Facebook has never been open and honest about anything, no reason to think they would start now.

link

tornato7 1726 days ago

To be fair, Facebook has never been open and honest about anything.

link

ric2b 1726 days ago

Facebook is well known for having really good PR, if they go after this guy for sharing such basic info that's yet another example of their great PR teams.

link

no_time 1726 days ago

These few sentences were a better and more meaningful read than what hundreds of PR people could ever come up with

link

ptero 1726 days ago

A few random guesses (I am not in any way affiliated with FB); just my 2c:

Sharing status of an active event may complicate recovery, especially if they suspect adversarial actions: such public real-time reports can explain to the red team what the blue team is doing and, especially important, what the blue team is unable to do at the moment.

Potentially exposing the dirty laundry. While a postmortem should be done within the company (and as much as possible is published publicly) after the event, such early blurbs may expose many non-public things, usually unrelated to the issue.

link

treesknees 1726 days ago

Mentioned in another reply

link

birdman3131 1726 days ago

I did not read it as they can't get them on site, but rather that it takes travel to get them on site. Travel takes time, which they desperately do not want to spend.

link

kelnos 1726 days ago

> If I was in charge I'd give them a modest EOY bonus for being helpful in their outreach to my users in the wider community.

That seems pretty unlikely at any but the smallest of companies. Most companies unify all external communications through some kind of PR department. In those cases usually employees are expressly prohibited from making any public comments about the company without approval.

link

cronix 1726 days ago

> What are they afraid of?

Zuckerberg Loses $7 Billion in Hours as Facebook Plunges

https://finance.yahoo.com/news/zuckerberg-loses-7-billion-ho...

Stop the hemorrhaging. Too much bad press for FB lately and it all adds up.

link

pythonaut_16 1726 days ago

Unrelated to the outage, but I hate headlines like this.

Facebook is down ~5% today. That's a huge plunge to be sure, but Zuckerberg hasn't "lost" anything. He owns the same number of shares today as he did yesterday. And in all likelihood, unless something truly catastrophic happens the share price will bounce back fairly quickly. The only reason he even appears to have lost $7 billion is because he owns so much Facebook stock.
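The arithmetic behind that last point can be made explicit. Using only the two approximate figures quoted in this thread (a roughly 5% drop and a $7 billion headline number), the implied size of the holding falls out directly. Both inputs are rounded, so the result is a rough estimate rather than an actual valuation.

```python
# Figures taken from the comment and headline above; both are approximate.
paper_loss_usd = 7e9  # "$7 billion" from the headline
price_drop = 0.05     # "~5%" share-price decline

# A paper loss is just (value of holding) * (fractional drop),
# so the holding's prior value is the loss divided by the drop.
implied_stake_usd = paper_loss_usd / price_drop
print(f"Implied holding before the drop: ~${implied_stake_usd / 1e9:.0f} billion")
```

The same 5% move on a small holding would produce a correspondingly small number, which is why the headline figure says more about the size of the stake than about the outage.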

These types of alarmist headlines are inane.

link

Denvercoder9 1726 days ago

Unlikely to be related. FB's losses today already happened before FB went down, and are most likely related to the general negative sentiment in the market today, and the whistleblower documents. It's actually kind of remarkable how little impact the outage had on the stock.

link

robjan 1725 days ago

There was no permanent damage to Facebook as a result of the outage so it's understandable that the stock price wasn't really affected by it

link

motoxpro 1726 days ago

I was thinking the same...

link

jaywalk 1726 days ago

As much as all of the curious techies here would love transparency into the problem, that doesn't actually do any good for Facebook (or anyone else) at the moment. Once everything is back online, making a full RCA available would do actual good for everyone. But I wouldn't hold my breath for that.

link

projectazorian 1726 days ago

FB takes confidentiality very seriously. He crossed a major red line.

link

KaiserPro 1725 days ago

They got told explicitly, in the outage meeting, that they shouldn't be sharing updates from the outage meeting.

link

minusSeven 1726 days ago

Do we even know if someone had the account deleted? I think Facebook might have their hands full right now solving the issue rather than looking at social media posts that discuss the issue.

link

kelnos 1726 days ago

There are a lot of people who work at Facebook, and I'm sure the people responsible for policing external comms do not have the skills or access to fix what's wrong right now.

link

_kst_ 1726 days ago

Assuming that Facebook forced the account to be deleted, it wouldn't have been done by anyone who's working on fixing the problem.

link