by nemothekid 2693 days ago
Apparently there was a fire in a data center - there's a thread on /r/sysadmin by an insider.

https://np.reddit.com/r/sysadmin/comments/ao4g2y/wells_fargo...

4 comments

Incredible. They had all their mission-critical infrastructure in a single data center. How many billions of dollars do they make per year? And they can't afford even a tiny bit of redundancy?

If I were a customer, I'd use this as a sign that this company is not technically competent enough to manage my money.

If you're waiting for this to be the straw to move away from Wells Fargo after the last set of scandals, you may have some issues...
Yep, the California AG basically said "Wells Fargo is a criminal enterprise masquerading as a corporation" and I'm inclined to believe him.
I can't find a quote like this -- is there any chance you have the source?
“Wells Fargo customers entrusted their bank with their livelihood, their dreams, and their savings for the future,” said Attorney General Becerra. “Instead of safeguarding its customers, Wells Fargo exploited them, signing them up for products - from bank accounts to insurance - that they never wanted. This is an incredible breach of trust that threatens not only the customers who depended on Wells Fargo, but confidence in our banking system. As our investigation found, Wells Fargo’s conduct was unlawful and disgraceful.”

Source: https://oag.ca.gov/news/press-releases/attorney-general-bece...

I heard it on an NPR interview with him mentioning this, but I can't find the transcript.
Unfortunately they're the holders of my mortgage (it was sold twice and they ended up with it). So I don't even have a choice of being their customer for the next ~25 years.
A bank without redundancy is the best place to have your mortgage. Just hope that one day they forget about your debt.
More likely they will forget you paid them.
Always keep your receipts.
Real talk... they own my mortgage, I can't just 'forget about it' if they forget about it. I want to refinance it. If they lose all data on my debt like that ending to Fight Club... is the debt just... gone?
Nope. Liens are recorded with the county/state as part of property law, as are any other encumbrances. So that would actually probably make your situation worse, because the lien holder would have no clue about the debt and its claim on the property, and the lien could be hell for you to remove even if you kept paying and eventually paid everything off. Could potentially be a nightmarish scenario and an exercise in pure bureaucracy. Probably some Brazil or Catch-22 style shit.
Nice
This is why I like the Canadian system of mortgages.

I have a 30 year mortgage, but every 5 years I have to "renew" it. At that time, I have to renegotiate the rate for the next 5 years. As part of this negotiation, I can just switch banks if I want. Or to a private lender. Or to anyone really.

It seems weird to me that you are beholden to an entity you've never signed any contract with.

This is not always a good thing. Inability to lock in a reasonable fixed rate for a 15, 20 or 25 year term in Canada meant that in the early 1980s, for example, when benchmark interest rates were 15%+, people whose 5 year terms came up for renewal had no choice but to renew at 17% interest rates. Ask an older person from AB who lived through the first oil economy bust in Edmonton or Calgary.
Is there something to prevent the rate from skyrocketing when you renew the mortgage? Doesn't this have all the issues people occasionally have with ARMs?
Nope. Rates are based off a prime lending rate which is equal amongst all the big banks. But like they said, you are free to shop around.

Credit unions usually have good offers.

Ideally you’re not buying a home that you can’t afford if rates go up too much.

Assuming Canada is like the UK, competition in the mortgage market means a bank hiking the rates will lose your business. Of course you can be unlucky if interest rates are unusually high at the time your fixed rate deal is up for renewal.
The contract you entered into was "assignable".
Refi that bad-boy to your local credit union
Can't. It's a fixed 30 year mortgage at a 3.375% rate. Refinancing it will increase that by at least a point. Plus, I originally did get the loan through a local credit union, who then sold it to a consolidator who sold it to a big national bank (which happened to be Wells Fargo). Local credit unions generally don't hold onto mortgages. They don't have the economies of scale to service them efficiently.
My 'local' credit union does. They don't sell mortgages/loans to anyone and keep them 100% in house. It's why even after moving to Germany I keep that bank for US assets. They just don't seem as criminally motivated as other banks.
It doesn't work that way. Mortgages get resold between banks all the time. You can walk into your local bank and get a house loan, and five months from now you'll get a business envelope from Wells Fargo informing you that you should send your checks to them now.

I would expect if WF decided your first loan met their purchasing criteria, your refinance would get the same treatment.

I moved loans to First Tech Federal Credit Union, which used to (and still probably do) state in the loan contract that they will service the loan for its life. Very happy with them so far.
That's good language to look for. I didn't check closely enough and my mortgage documents stated that the loan issuer (my bank) would service the loan, but without any guarantee of how long. They sold the loan shortly after issuing the mortgage.
As edoceo says below, I recommend refinancing with a credit union of your choice, preferably one that will pledge in writing to service the loan for its life.
Rates have gone up since, unfortunately. I'd be paying thousands in refi fees for the privilege of paying more each month in interest. No thanks.
If you care that much, you can refinance.
I just started a new job, so I'm just going to set up an account at my local CU and have my direct deposits go there. Guess I'll have to close my WF accounts once they get this mess sorted out!
This is one of the best financial decisions you will ever make.
At what interest rate? I have mine at 3.35% 30y fixed. Is that remotely possible today?
He's talking about a work account, not really a mortgage.
Mine is at 1% 15 year fixed. I believe the 30 year fixed rate is currently at 1.5%. The 1 and 3 year renewed are actually at negative rates.

I'm not in the US, though.

They had redundancy, the redundancy failed.

I mean, truthfully, when do you get to test your redundancy against a true disaster? It was a mess. WF is 20 companies rolled into one, so the fact that the disparate systems from 10 different banks work at all is kind of a miracle.

I can't recall who it was, but there was a big outage due to a data center running a monthly generator test and having a major customer go dark. The data center had 7 generators for 4 server rooms: 1 per room, a backup for each pair of rooms, and a backup for the backups. The primary and then both backups failed, so out went the lights.

You're pretty much damned if you do and damned if you don't. If you touch things that are working you could break them. If you don't touch things you never know what'll happen and you get fewer opportunities to learn. Move your servers around geographically and you might improve the odds that anything is working by reducing the odds that everything is working.

I don't think we're quite to a place yet where having servers down can be characterized as a non-event. Even if the customer can't see a behavioral difference, business units still tend to get quite anxious, and sometimes their theatrics put the whole process in jeopardy (not unlike trying to rescue a drowning man). It just hasn't been normalized yet.
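A rough sketch of the arithmetic behind the geographic trade-off mentioned above, assuming each site is up independently with the same probability (real failures are often correlated, so treat this as an idealization). The function names and the 99% figure are illustrative, not from any real deployment.

```python
def any_site_up(p_up: float, n_sites: int) -> float:
    """Probability that at least one of n independent sites is up."""
    return 1.0 - (1.0 - p_up) ** n_sites


def all_sites_up(p_up: float, n_sites: int) -> float:
    """Probability that every one of n independent sites is up."""
    return p_up ** n_sites


if __name__ == "__main__":
    # Hypothetical per-site availability of 99%.
    for n in (1, 2, 3):
        print(n, any_site_up(0.99, n), all_sites_up(0.99, n))
```

Adding sites pushes the chance that something is working toward 1 while the chance that everything is working drops, which is the point being made: more locations means more partial outages but fewer total ones.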

Look at Netflix with chaos monkey and simian army. Netflix routinely does catastrophic destructive testing on their production systems, sometimes evacuating an AWS region entirely. To them, a server going down is a non-event, because they designed their systems with the premise that servers go down.
Netflix is a great example of how to create robust systems. Keep in mind, though, that they have a very different risk profile than a large bank. No one is going to lose their life’s savings if Netflix’s entire infrastructure crumbles. That might happen at a bank with a single server outage. Don’t get me wrong, if I start a bank today, I use the Chaos Monkey strategy. But if I take over a bank’s infrastructure, with COBOL still representing a huge percentage of mission-critical code that everyone is scared to touch... no Chaos Monkey. I might deliberately turn off a server for an hour to see what chaos ensues, but it’ll be after 3 months of analysis, longer if I begin to suspect the system does more than anyone can remember.
Not to mention adhering to the Fed's directives on tech-readiness, on which the bank's license hinges.
If a customer can't watch a film they get unhappy.

If a customer can't pay a fine - can't use their bank account - they go to jail. https://www.telegraph.co.uk/finance/personalfinance/bank-acc...

These are pretty different outcomes.

Facebook regularly takes down multiple data-centers at a time to stress test this. Its users rarely notice (which I think is the point).
There is a different level of criticality for "my post didn't go through, hit refresh" and "my transaction didn't go through - the restaurant said my card isn't working."

Would you honestly want to go to a bank and say "if we unplug this, we can find out what fails."

Facebook handles money too, though. Also I think the parent was making the point that Facebook builds software with resiliency in mind so when a failure does happen, the software deals with it gracefully.
They can have (and did have, at least the long time ago when I still used it) weird cache persistency errors and "please just refresh to fix that" type of workflows if you have bad luck. That sort of behaviour is simply not acceptable for a bank.
They're <10 years old. Their DCs better be cleaner than my kitchen.

I've worked for banks here in Australia. Everything is 30+ years old. It's a shambles.

There was this company that had a grid power issue: batteries kicked in, batteries ran low, showtime for the diesel generator, and the diesel generator didn’t kick in. People scrambled to shut down before the batteries ran out. So ITIL stuff happened, and when it came time to test it, guess what? The diesel again didn’t kick in (I don’t recall the consequences, though).
Please expand ITIL, I don't know what it means.
https://en.m.wikipedia.org/wiki/ITIL Article explains it all
Then you’re an ITIL expert! Welcome to mid level IT management.

I’m serious.

Bloomberg does it once a year. That said, I have yet to encounter a company more obsessed with business continuity. I don't doubt that their failover systems and testing of them are well beyond the typical.
I think Bloomberg can stand to be down for 8 hours to simulate a disaster. Banks with legacy systems and people constantly dependent on them to conduct business can't risk an actual incident happening because they were testing what would happen if an incident happened.

Netflix designed their stuff from the ground up to fail over. Large monolithic corporations that have inherited systems from companies they've bought or merged with face challenges you won't see at many places that have benefited from the 30 years of lessons learned at these companies.

> I think Bloomberg can stand to be down for 8 hours to simulate a disaster. Banks with legacy systems and people constantly dependent on them to conduct business can't risk an actual incident happening because they were testing what would happen if an incident happened.

No, it can't. Any loss of customer-facing functionality is a critical outage ("World Problem" in company terminology). There are a relatively small number of customers, but the terminal is critical to the operations of those who buy it. The terminal going down for eight hours would be a world-wide headline in the financial press.

A Tier 1 test that simulates loss of a datacenter takes a cluster in one DC virtually offline. This puts an entire subset of services in that DC offline. The test is coordinated with the teams who own the services to ensure their services fail over correctly. Any service disruption during the failover is a test failure. If it passes, the customers don't even know it happened. The goal is to be able to lose an entire DC and have the terminal customers not realize it until they hear about it on the news.

> I think Bloomberg can stand to be down for 8 hours to simulate a disaster.

Do you know what Bloomberg does? It powers equities trading markets around the world, 24/7. It isn't just news.

Well that's not true.

Chaos engineering and AWS weren't real things when they started building the company. And the system they have now doesn't much resemble what it once was.

Truth of the matter is they invested more in their infrastructure, but that's because their business plan required them to grow on the back of technological advances. Banks, it seems, do not. Or maybe they do, and some of these startup banks will usurp them.

Standard good practice should be to have redundancy in place and test it at a regular interval. It should be part of periodic maintenance - fail over to the backup so updates/upgrades can be applied to production, and fail back to production once done.

But I’m guessing Wells Fargo just doesn’t have a reason to care.

Business critical systems can’t afford not to test failover.

You can bail out of a test at the first sign of trouble. When a real outage hits, there’s no telling how long it will take to recover.

We run that drill every Wednesday morning.
Every major financial regulator has business continuity and disaster recovery requirements but the standards are woefully outdated. A plan that gets the bank back to full functionality within 24 hours would be acceptable to most regulators and even considered speedy to some.

Tangentially related, I highly recommend the movie "Out of the Clear Blue Sky." Cantor Fitzgerald was a bond trading firm at the top of one of the twin towers and lost every employee who was in the office on 9/11. Incredibly, despite losing the majority of their employees and despite losing almost all of their trading infrastructure, they managed to resume operations in time for the bond market's reopening 48 hours later.

What kind of redundancy are we talking about here?

You can't really roll back say 10 minutes of transactions, so are you maintaining 2 parallel systems? How do you keep them perfectly in sync?

This isn't my area of expertise by a long shot, but it occurs to me this is probably hard, especially when your codebase started in the 60s, and has been accreting ever since.

You have a primary and a backup with a synchronous commit protocol. When a commit request is made on the primary, the primary writes to its transaction log and the backup’s log. If the backup does not acknowledge, the commit fails.

The backup doesn’t need to be in the same exact state as the primary (it’s not meant to service requests), it just needs to have a persistent log of what changes were applied so that it can roll forward when needed.

Most relational DBs do something like this for their DR product offering. Oracle has Active Data Guard. DB2 has HADR.
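A minimal sketch of the synchronous commit idea described above, assuming an in-memory log and a flag that simulates an unreachable backup. The class and function names are hypothetical and do not reflect how Data Guard or HADR are actually implemented. One simplification: the record is shipped to the backup before the local write, so a missing acknowledgement never leaves a local entry to undo.

```python
class LogStore:
    """Append-only transaction log; `available` simulates reachability."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.available = True

    def append(self, record):
        # Returning False stands in for a missing acknowledgement.
        if not self.available:
            return False
        self.log.append(record)
        return True


class Primary:
    """Primary node that only commits once the backup has logged the change."""

    def __init__(self, backup):
        self.local = LogStore("primary")
        self.backup = backup
        self.state = {}

    def commit(self, key, value):
        record = (key, value)
        # No acknowledgement from the backup means the commit fails.
        if not self.backup.append(record):
            return False
        self.local.append(record)
        self.state[key] = value
        return True


def roll_forward(log):
    """Rebuild state by replaying a persistent log in order."""
    state = {}
    for key, value in log:
        state[key] = value
    return state


if __name__ == "__main__":
    backup = LogStore("backup")
    primary = Primary(backup)
    print(primary.commit("acct-1", 100))  # backup reachable
    backup.available = False
    print(primary.commit("acct-1", 50))   # backup unreachable
    print(roll_forward(backup.log))
```

The backup never serves requests here; it only holds the log, and `roll_forward` is what it would run when it has to take over.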

Yes it’s hard but there are mature solutions. See Google Spanner for example.
Spanner is slow as hell
Okay, so say you switch banks. How do you have ANY idea if your new bank is better or worse? Are you going to ask customer support? They barely know how to access your account let alone the technical layout of their data centers.
I know that both Switzerland and Singapore have at least "soft" guidelines (no direct retaliation if you don't adhere to them, but it's frowned upon if you don't) about the HW (including buildings & workplaces), SW and personnel setup required to recover from a potential catastrophic failure of the data centers and/or the key employees' office building.

Example of high-level guidelines (Singapore): http://www.mas.gov.sg/~/media/resource/legislation_guideline...

I think that in Switzerland all major banks test their disaster-readiness (by switching everything to their secondary datacenters & working locations) of all critical applications/software-layers and employees at least once every 3 years - reaction/recovery times depend on the criticality of the service provided by the person/application.

In 2017 Bank of America had a similar outage.
Everyone assumes that the banking industry and the financial industry in general somehow have a magic touch when it comes to technology. All those digital balance sheets, they're pristine. All those networks, they're unpenetrated. All those databases, they're filled with the purest divine truth.

And if anyone ever figures out that isn't the way it is, and that the numbers are not representative of anything of substance? If nations refuse to honor the claimed 'transfers' done through these rickety electronic systems? It would make for an interesting few days.

I was shocked to learn the Fortune-500 bank I was working for had only a single data centre. Should they ever go down, much of the world would be disrupted. I think this is not unique.
Wells Fargo is an awful company. Hopefully this outage will be the final straw for many people.
The comment says that they have failovers in place, they just didn’t work correctly.
It's not lost, I know exactly where it is...at the bottom of the ocean!
Maybe this had something to do with it... https://searchdatacenter.techtarget.com/news/4500243727/Well...
Badly formatted paste of the content of the article: https://8cf91ddf-e6b2-465a-9f10-1f631650f4ed.htmlpasta.com

(Even if you can just put anything in the email form)

Lots of old companies are like this. They don't invest in critical infrastructure because it would bring them below their 20% margins.
Wells is weird though. When I started banking with them in 2009, they were ahead of the curve in online infrastructure - it was much easier to access them online than any of the banks I was used to back in Massachusetts. They also had a reputation for being both honest and conservative - they were the only AAA-rated bank in 2007, they were #1 in green rankings, and Warren Buffett had invested in them because of their sound balance sheet.

And then starting around 2010 but rapidly accelerating around 2014, everything about them went to shit.

The best explanation I can think of is that John Stumpf is a slash & burn sociopath, juicing the numbers so he can get his 473x-the-median-worker paycheck while ruining the company. He wouldn't be alone in the financial world, but it's a shame that a 150+ year old institution can so rapidly go down the toilet.

In 2009/2010 my wife accidentally made a charge to PayPal that was linked with our Wells Fargo checking account. Since we weren’t planning on that charge happening from that account, there was an NSF fee and Wells rejected the charge. Understandable, right?

Except then PayPal made a few more attempts for god knows why and each time Wells Fargo kicked an NSF fee our way.

Now, PayPal shouldn’t have repeatedly attempted a rejected charge. But, Wells Fargo shouldn’t have allowed those attempts. They just couldn’t help themselves to that $35 NSF fee though.

We fought it to no avail. With all the NSF fees and interest (and fees they added to fees while we fought it), what started as a $300 transaction ultimately cost us over $1200.

Wells Fargo is now and was in 2009/2010 a criminal enterprise.

Wells was one of the earlier companies to have the CTO report directly to the CEO.

All the crazy sales numbers and bogus account shenanigans were going on back in 2003/2004 when I worked there. I ratted out more than one professional banker to branch managers and up over that crap. A fun one was the home equity lines people would open without customer knowledge and link up to overdraft protection. The customer would never owe, nor know, anything until one day an overdraft hit their equity line, and then they got notified of late payments. I don't miss working for a bank.

I have banked with WF since about 2007 for my student and auto loans. Their online portal has always been worse than whatever my local credit union had.

The WF business is clearly set up to confuse and exploit consumers. My credit union websites have always helped me do what I want and need with my money. This includes the tiny local credit union in Idaho.

You jumped immediately to an unfounded speculation which is known to be wrong (WF maintains multiple data centers).

Also, failover is hard. Few companies outside of a few larger Internetz companies can really do it well.

I don't think you need this as a sign it's not technically competent. There are other much larger markers.
true, but you also give too much credit to the average person.

I'm employed by WF and even I'm a little bewildered by the fact there doesn't seem to be any redundancy implemented somewhere.

> And they can't afford even a tiny bit of redundancy?

Big companies tend to defer risk. Managers and project leads want to start new projects rather than upgrade existing infrastructure. Combine these forces and sometimes you get a catastrophe.

> If I were a customer

Are you sure you're not one?

What, all their scandals didn’t convince you of that?

After all, their logo is a stagecoach, aka the wild, wild West (robbers, thieves, etc). They do not hide who they really are.

The stagecoach refers to their origins as an armored car/delivery service.
Uh, that may well be true, but their behavior in recent years says stagecoach.. wild, wild West.. robbers, and thus their logo is now a punchline/comedy routine.
The local fire department commented on their Facebook page [1] that dust from construction triggered the fire suppression system, not a fire.

[1] https://www.facebook.com/LakeJoFD

The next worst thing for electronics after fire - water! Or are there non liquid fire suppression systems such as pulling all the oxygen out of a room (if there aren't any people in it)?
> are there non liquid fire suppression systems such as pulling all the oxygen out of a room

Yes. One place I worked (not a tech company, but with tons of electronics), when the fire alarms went off we had xx seconds (I don't remember the number) to get out of the building before something called Intergen was vented into the room to somehow suck all of the oxygen out, and if we were still inside we'd be dead.

It must be pretty serious stuff, because we'd have evacuation drills twice a year.

This page suggests that Inergen (unless there is a thing called Intergen that I'm confusing it with?) mixes with air to lower the oxygen content, but still remain breathable: http://www.tyco.no/products/Gaseous-Fire-Suppression/inergen...
The resultant 12% oxygen is 60% of normal atmospheric air. It's still breathable, if you're an average healthy person with healthy cardiac and pulmonary function, sitting at rest - but it's not healthy, it's enough that you're going to start seeing systemic responses. The reduction in partial pressure of oxygen being sucked into the lungs will also cause an increase in the partial pressure of carbon dioxide and inert gases.

I think the concern likely comes from:

-Folks that are of poor cardiac function are going to be evacuating, meaning increased cardiac demand under stress, while being somewhat oxygen-starved. This could tip some folks into an acute episode that otherwise wouldn't.

-Folks that are of poor oxygen function, who are borderline hypoxemic to begin with. Think folks with chronic obstructive pulmonary disease: about 5% of your employees aged 55-65 will have it.

You won't suddenly kill a building full of people. I'm guessing the evacuation rush is to make sure they're not liable for unnecessarily sending a couple to the hospital.

Even if you were healthy you’d be in bad shape. I know from flying your mind goes quick when oxygen levels go down. You feel drunk and get a bit euphoric and refuse to acknowledge you’re in any trouble at all. Low oxygen, especially if the environment is dangerous or requires a clear mind and/or quick action, is a killer.
Possibly more importantly, the "venting" is not what you'd call a gentle breeze. "Explosive atmosphere replacement" is a better description. Solid things will go airborne at speed.
Also: Folks that are already suffering from smoke inhalation due to a fire. Unlikely since the halon systems are pretty good at reacting, but still possible.
>The resultant 12% oxygen is 60% of normal atmospheric air.

To put that in perspective, that's like being sent to the top of Pikes Peak (4.3km / 14,000') in seconds. Pilots flying that high in unpressurized aircraft are required to have oxygen masks. Most people will develop altitude sickness when rapidly subjected to that.

When you consider the potential for stress or panic in this kind of scenario, hypoxia emerges as a very real threat even for the young and healthy.
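A back-of-the-envelope check of the 12% oxygen and Pikes Peak comparison above. This is a sketch that assumes dry air is about 20.9% oxygen and uses the standard-atmosphere troposphere formula to find the altitude where total pressure (and so oxygen partial pressure) drops by the same ratio. The function names are illustrative.

```python
SEA_LEVEL_O2_FRACTION = 0.209  # assumed oxygen fraction of dry air


def oxygen_ratio(o2_fraction: float) -> float:
    """Oxygen level relative to normal sea-level air."""
    return o2_fraction / SEA_LEVEL_O2_FRACTION


def equivalent_altitude_m(pressure_ratio: float) -> float:
    """Invert p/p0 = (1 - 2.25577e-5 * h) ** 5.25588 for altitude h in metres."""
    return (1.0 - pressure_ratio ** (1.0 / 5.25588)) / 2.25577e-5


if __name__ == "__main__":
    ratio = oxygen_ratio(0.12)
    altitude = equivalent_altitude_m(ratio)
    print(f"O2 relative to normal air: {ratio:.0%}")
    print(f"Equivalent altitude: {altitude:.0f} m")
```

Under these assumptions the result lands in the same ballpark as the 60% and 4.3 km figures quoted in the thread.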

It's definitely not good to breathe in though, so you can still see why, for liability reasons, the company would want all employees to evacuate if it's going to be deployed, and leave it to firefighters with SCBA equipment.
Still a good reason for everyone to GTFO. Not worth breaking your neck over, but it's not good to put humans under the stress of a low-oxygen environment when they could be evacuated.
> Not worth breaking your neck over

That's an important caveat, given how the danger was apparently greatly exaggerated.

If you've been told that the system will

> somehow suck all of the oxygen out, and if we were still inside we'd be dead

then what are you going to do in a real fire situation, when you're not in imminent danger but your escape route is blocked by flames? Better to brave the fire (or jump out a window) than submit to death by suffocation...

This escape room sounds awesome.
Yes water isn't typically used in datacenters, halon gas can be used for example. But the noise of the nozzles releasing the gas can actually damage disk drives.
There's a nozzle for that :) - http://www.fire-protection.com.au/news/hush-nozzle

From what I understand, using water/dry pipe isn't unheard of. Some prefer it over Halon - https://blog.equinix.com/blog/2014/03/26/we-must-protect-thi... .

> We think water is superior to using the firefighting chemical compound Halon, because water is less damaging to technology and Halon can destroy circuit cards.

Did not know Halon could destroy circuit cards. Apparently it also damages the Ozone.

Yeah, some of them are literally CFCs.
yeah, the data center affected apparently had a halon gas suppression system.
Halon is also not great - it can damage disks (360 psi) and even if it doesn't it voids the warranty on everything so you'll be getting new hardware
Ahh... The VAX war. http://bogpeople.com/funny/content/vaxwarstory.html

> VAXen, my children, just don't belong some places.

Suffice to say... yes. There are fire suppression systems that will pull oxygen out of the room. People are advised to leave the room before the fire suppression system takes effect.

People are also advised to leave the room before the fire takes effect.
> No power to any of the network or compute equipment and some failovers did not work as expected.

Wonder if it was the switchgear that failed. Amazon uses custom firmware in its switchgears because this happens so often (Superbowl 2013 etc.)

Why didn't it fail over to their backup datacenter in another geo?
right? DRBC trial by fire, I guess (pardon the pun).