by darkwater 94 days ago
The fact that there is not a single root cause but several makes me instinctively think this is a good report, because it's not what the "bosses" (and even less so politicians) like to hear.

Yes, a lot of modern engineering is good enough that single-cause failures are very rare indeed. That means that failures themselves are rare, but when they do happen, they're most likely to have multiple causes.

How to explain that to non-engineers is another problem.

I think a better way of explaining it to people is that we've made critical systems so reliable that, in order for them to fail, the failures have to be quite complex.
This is almost universal in aviation. They always talk about the "accident chain." Essentially, everything that can kill you with one mistake has been made illegal through training and operational requirements and through engineering and maintenance regulations.
We are not as complicated as the national grid, but I have been here for nearly 10 years now, and our outages have gone from single-cause to two-cause, and now it's nearly always three things that need to go wrong at the same time.
Frequently, when you see these massive failures, the root cause is an alignment of small weaknesses that all come together on a specific day. See, for instance, the space shuttle O-ring incident, Three-Mile Island, Fukushima, etc. These are complex systems with lots of moving parts and lots of (sometimes independent) people managing them. In a sense, the complexity is the common root cause.
This is the same thing that happened with the 35W bridge collapse in Minneapolis. The gusset plates after the disaster were examined and found to be only 1/2" thick when the original design called for them to actually be 1" thick. The bridge was a ticking time bomb since the day it was built in 1967.

As the years went on, the bridge's weight capacity was slowly eroded by subsequent construction projects like adding thicker concrete deck overlays, concrete median barriers and additional guard rail and other safety improvements. This was the second issue, lining up with the first issue of thinner gusset plates.

The third issue that lined up with the other two came on the day of the bridge's failure. There were approximately 300 tons of construction materials and heavy machinery parked on two adjacent closed lanes. Add in the additional weight of cars during rush hour, when traffic moved the slowest and the bridge was part of a bottleneck coming out of the city. That was the last straw: the gusset plates finally gave way, causing a near-instantaneous collapse.

It's like the Swiss cheese model: every system has several layers of defence, each with "holes" or vulnerabilities, and a major incident only occurs when the holes line up through all the layers.

https://en.wikipedia.org/wiki/Swiss_cheese_model
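A rough way to see why aligned holes are rare but not impossible is to treat each layer as failing independently with some probability. This is only a sketch: the per-layer probabilities below are hypothetical, and real layers are often correlated, which makes alignment more likely than this suggests.

```python
import random

def incident_probability(hole_probs):
    """Chance that every layer fails at once, assuming independent layers."""
    p = 1.0
    for q in hole_probs:
        p *= q
    return p

def simulate(hole_probs, trials=100_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    incidents = 0
    for _ in range(trials):
        # An incident needs a hole in every layer on the same trial.
        if all(rng.random() < q for q in hole_probs):
            incidents += 1
    return incidents / trials

# Hypothetical per-layer failure probabilities (not from any report).
layers = [0.1, 0.2, 0.05]
analytic = incident_probability(layers)  # product of the three values
estimate = simulate(layers)
```

With a single layer at 0.1 the failure rate is 0.1; stacking the other two layers brings the product down to roughly one in a thousand, and any incident that does occur has, by construction, three contributing causes.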

I use this model all the time. It's very helpful for explaining the multifactorial genesis of catastrophes to ordinary people.
Also perhaps worth a read:

https://devblogs.microsoft.com/oldnewthing/20080416-00/?p=22...

"You’ve all experienced the Fundamental Failure-Mode Theorem: You’re investigating a problem and along the way you find some function that never worked. A cache has a bug that results in cache misses when there should be hits. A request for an object that should be there somehow always fails. And yet the system still worked in spite of these errors. Eventually you trace the problem to a recent change that exposed all of the other bugs. Those bugs were always there, but the system kept on working because there was enough redundancy that one component was able to compensate for the failure of another component. Sometimes this chain of errors and compensation continues for several cycles, until finally the last protective layer fails and the underlying errors are exposed."

I've had that multiple times. As well as the closely related 'that can't possibly have ever worked' and sure enough it never did. Forensics in old codebases with modern tools is always fun.
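The cache case from the quoted theorem is easy to reproduce. Below is a minimal sketch with made-up class names (`Store`, `BuggyCache`): the cache writes under a different key than it reads, so it never hits, yet callers always get correct answers because the slow path compensates.

```python
class Store:
    """Hypothetical slow backing store that counts how often it is read."""
    def __init__(self):
        self.reads = 0

    def load(self, key):
        self.reads += 1
        return key.upper()

class BuggyCache:
    """Cache whose write path uses a different key than its read path."""
    def __init__(self, store):
        self.store = store
        self.data = {}
        self.hits = 0

    def get(self, key):
        if key in self.data:              # lookup uses the key as given
            self.hits += 1
            return self.data[key]
        value = self.store.load(key)      # slow path masks the bug
        self.data[key.lower()] = value    # bug: stored under lowercased key
        return value

store = Store()
cache = BuggyCache(store)
results = [cache.get("User:1") for _ in range(3)]
```

Every call returns the right value, `cache.hits` stays at zero and `store.reads` climbs with each request. Nothing looks broken until the store becomes slow or unavailable, at which point the "recent change" gets the blame.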
> As well as the closely related 'that can't possibly have ever worked' and sure enough it never did.

I had one of those, customer is adamant latest version broke some function, I check related code and it hasn't been touched for 7 years, and as written couldn't possibly work. I try and indeed, doesn't work. Yet customer persisted.

Long story short, an unrelated bug in a different module caused the old, non-functioning code to do something entirely different if you had that other module open as well, and the user had discovered this and started relying on this emergent functionality.

I had made a change to that other module in the new release and in the process returned the first module to its non-functioning state.

The reason they interacted was of course some global variables. Good times...

> See, for instance, the space shuttle O-ring incident

That wasn't really a result of an alignment of small weaknesses though. One of the reasons that whole thing was of particular interest was Feynman's withering appendix to the report where he pointed out that the management team wasn't listening to the engineering assessments of the safety of the venture and were making judgement calls like claiming that a component that had failed in testing was safe.

If a situation is being managed by people who can't assess technical risk, the failures aren't the result of many small weaknesses aligning. It wasn't an alignment of small failures as much as that a component that was well understood to be a likely point of failure had probably failed. Driven by poor management.

> Fukushima

This one too. Wasn't the reactor hit by a wave that was outside design tolerance? My memory was that they were hit by an earthquake that was outside design spec, then a tsunami that was outside design spec. That isn't a number of small weaknesses coming together. If you hit something with forces outside design spec then it might break. Not much of a mystery there. From a similar perspective if you design something for a 1:500 year storm then 1/500th of them might easily fail every year to storms. No small alignment of circumstances needed.
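The 1:500 arithmetic can be made concrete. A minimal sketch, assuming independent years and a constant annual exceedance probability of 1/T; the 40-year lifetime and the fleet of 500 sites are hypothetical numbers chosen only for illustration.

```python
def prob_at_least_one(return_period_years, lifetime_years):
    """Chance of at least one design-exceeding event during the lifetime."""
    annual = 1.0 / return_period_years
    return 1.0 - (1.0 - annual) ** lifetime_years

def expected_exceedances_per_year(n_sites, return_period_years):
    """Expected number of sites hit by an out-of-spec event in a given year."""
    return n_sites / return_period_years

# Hypothetical: one site designed for a 1-in-500-year event, run for 40 years.
p_site = prob_at_least_one(500, 40)

# Hypothetical: 500 independent sites, each built to the same standard.
fleet_rate = expected_exceedances_per_year(500, 500)
```

For a single site the lifetime risk comes out a little under 8%, and across 500 such sites the expectation is one exceedance per year, which is the point being made: no alignment of circumstances is needed for some of them to fail.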

In reality the "swiss cheese" holes for major accidents often turn out to be large holes that were thought to be small at the time.

> [Fukushima] No small alignment of circumstances needed.

The tsunami is what initiated the accident, but the consequences were so severe precisely because of decades of bad decisions, many of which would have been assumed to be minor decisions at the time they were made. E.g.

- The design earthquake and tsunami threat

- Not reassessing the design earthquake and tsunami threat in light of experience

- At a national level, not identifying that different plants were being built to different design tsunami threats (an otherwise similar plant avoided damage by virtue of its taller seawall)

- At a national level, having too much trust in nuclear power industry companies, and not reconsidering that confidence after a number of serious incidents

- Design locations of emergency equipment in the plant complex (e.g. putting pumps and generators needed for emergency cooling in areas that would flood)

- Not reassessing the locations and types of emergency equipment in the plant (i.e. identifying that a flood of the complex could disable emergency cooling systems)

- At a company and national level, not having emergency plans to provide backup power and cooling flow to a damaged power plant

- At a company and national level, not having a clear hierarchy of control and objective during serious emergencies (e.g. not making/being able to make the prompt decision to start emergency cooling with sea water)

Many or all of these failures were necessary in combination for the accident to become the disaster it was. Remove just a few of those failures and the accident is prevented entirely (e.g. a taller seawall is built or retrofitted) or greatly reduced (e.g. the plant is still rendered inoperable but without multiple meltdowns and with minimal radioactive release).

To be blunt; that isn't an appropriate application of the swiss cheese model to Fukushima. It isn't a swiss cheese failure if it was hit by an out-of-design-spec event. Risk models won't help there. Every engineered system has design tolerances. And that system will eventually be hit by a situation outside the tolerances and fail. Risk models aren't to overcome that reality - they are one of a number of tools for making sure that systems can tolerate situations that they were designed for.

If Japan gets traumatised and changes their risk tolerance in response then sure, that is something they could do. But from an engineering perspective it isn't a series of small circumstances leading to a failure - it is a single event that the design was never built to tolerate leading to a failure. There is a lot to learn, but there isn't a chain of small defence failures leading to an unexpected outcome. By choice, they never built defences against this so the defences aren't there to fail.

> Many or all of these failures were necessary in combination for the accident to become the disaster it was.

Most of those items on your list aren't even mistakes. Japan could reasonably re-do everything they did all over again in the same way that they could simply rebuild all the other buildings that were destroyed in much the same way they did the first time. They probably won't, but it is a perfectly reasonable option.

Again I'm going from memory with the numbers but doubling the cost of a rare disaster in a way that injures ... pretty much nobody ... is a great trade for cheap secure energy. It isn't a clear case that anything needs to change or even went wrong in the design process. Massive earthquakes and tsunamis aren't easy to deal with.

> It isn't a swiss cheese failure if it was hit by an out-of-design-spec event

First of all, the design basis accident is a design choice by the developers of the plant and regulators. The decision process that produced that DBA was clearly faulty - the economic and social costs of the disaster so clearly have exceeded those of building to a more serious DBA.

> Again I'm going from memory with the numbers but doubling the cost of a rare disaster in a way that injures ... pretty much nobody ... is a great trade for cheap secure energy. It isn't a clear case that anything needs to change or even went wrong in the design process. Massive earthquakes and tsunamis aren't easy to deal with.

This is absolute nonsense. For the cost of maybe tens of millions at most in additional concrete to build the seawall a few meters higher, the entire disaster would have been avoided (i.e. plant restored to operation). With backup cooling that could have survived the tsunami (a lower expense than building a higher seawall), all that would have happened at Fukushima Daiichi is what happened at its neighbor Fukushima Daini (plant rendered inoperable, no meltdown, no significant radioactive release). Instead, we are talking about a disaster that will cost a (current) estimated $180 billion USD to clean up (and there is no way this estimate is realistic, when the methods required to perform the cleanup barely exist yet).

> The decision process that produced that DBA was clearly faulty - the economic and social costs of the disaster so clearly have exceeded those of building to a more serious DBA.

That isn't clear at all. We're effectively sampling from the entire globe and we've had 2-3 bad nuclear disasters since the 70s. Our safety standards appear to be overcautious given the relatively small amount of damage done vs ... pretty much every alternative. The designs seem to be fine. I'm still waiting to see the justification for the evacuations from Fukushima; they seemed excessive. People died.

> For the cost of maybe tens of millions at most...

You haven't thought for long enough before you typed that. For this particular disaster, sure. But hardening against all the possible disasters is what needs to happen when you become less risk tolerant. It is the millions of dollars to prevent against this disaster multiplied by the number of potential disasters that you have to consider. Safety is expensive.

The numbers aren't small; safety of that magnitude might not even be economically feasible. To say nothing of whether it is actually sensible. And once you get into one in 500 or thousand year events, some really catastrophic stuff starts happening that just can't be reasonably defended against. San Francisco and its fault springs to mind; I forget what sort of event that is, but it is probably once a millennium or more often.

There was a strong corporate cultural component to Fukushima as well. Tepco had spent decades telling the Japanese public that nuclear power was completely safe. A tall order in Japan obviously, but by and large it worked.

During the operation of Fukushima Daiichi, various studies had been done that recommended upgraded safety features like enlarging the seawall, moving the emergency generators above ground so they couldn't be flooded, etc.

In every case, management rejected the recommendations because:

1. They would cost money.

2. Upgrading safety would be tantamount to admitting the reactors were less than safe before, and we can't have that.

3. See 1.

I’m not sure why you think those are not a confluence of smaller events or that something outside the design spec isn’t one of those factors. By “small,” I don’t mean trivial. I mean an event that by itself wouldn’t necessarily result in disaster. Perhaps I should have said “smaller” rather than “small.” With the O-rings, the cold and the pressure to launch on that particular day all created the confluence. With Fukushima, the earthquake knocked out main power for primary cooling. That would have been manageable except then the backup generators got destroyed by the tsunami. It was not a case of just a big earthquake, whether outside or inside the design spec, making the reactor building fall down and then radiation being released.
If Fukushima gets hit by a disaster that is outside the design spec then the engineering root cause of the failure is established. There isn't some detailed process needed to figure out how a design should tolerate out-of-design events. And there isn't a confluence of smaller events; it is a very cut-and-dried situation (well, unstable and wet situation I suppose). There was one event that caused the failure. An event on a biblical scale that was hard to miss.

If you want Fukushima to tolerate things it wasn't designed to tolerate or fail in ways it wasn't designed to fail in then the swiss cheese model isn't going to be much help. You're going to need to convince politicians and corporate entities that their risk tolerance is too high. Which in a rational world would be a debate because it isn't obvious that the risk tolerances were inappropriate.

The design-spec tsunami resistance is for getting away with just a couple of days of downtime, plus whatever concerns the grid.

What happened was a much larger, much rarer case, for which they didn't have a plan ready on hand.

Even if you treat the box as the special being they were...

It usually starts with a broken coffee machine.
When that happens, get ready.
They need more battery storage for grid health, both colocated at solar PV generators (to buffer voltage and frequency anomalies) and spread throughout the grid. This replaces inertia and other grid services provided by spinning thermal generators. There was no market mechanism to encourage the deployment of this technology in concert with Spain’s rapid deployment of solar and wind.
There are non-battery buffers available too--I recently got rooftop residential solar installed, and learned that my area is covered by a grid profile requiring that the solar system stay online through something like 60 +/- 2Hz before shutting down completely, and ramping down production linearly beyond a 1Hz deviation or so. The point is to avoid cascading shutdowns by riding through over/undersupply situations, whereas an older standard for my area would have all solar systems cut off the moment frequency exceeded 60.5Hz (which would indicate oversupply from power plant generators spinning faster via lower resistance).

In my system's case, switching to this grid profile was just a software toggle.
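The two profiles described above can be sketched as output curves. This is an illustration built only from the approximate figures in the comment (60 Hz nominal, ramp starting around 1 Hz over, shutdown at 2 Hz deviation, legacy trip at 60.5 Hz); applying the ramp only to over-frequency is an assumption of this sketch, and real grid profiles have more parameters and time delays.

```python
NOMINAL_HZ = 60.0

def legacy_output_fraction(freq_hz, trip_hz=60.5):
    """Older profile as described: disconnect once frequency exceeds the trip point."""
    return 0.0 if freq_hz > trip_hz else 1.0

def ride_through_output_fraction(freq_hz, ramp_start_dev=1.0, trip_dev=2.0):
    """Newer profile as described: ride through, ramp down linearly, then trip."""
    dev = freq_hz - NOMINAL_HZ
    if abs(dev) >= trip_dev:
        return 0.0                    # outside the +/- 2 Hz window: shut down
    if dev <= ramp_start_dev:
        return 1.0                    # within tolerance: full output
    # Over-frequency between ramp start and trip: linear curtailment.
    return (trip_dev - dev) / (trip_dev - ramp_start_dev)

# At 60.8 Hz the legacy profile drops everything; the newer one stays at full output.
legacy_at_608 = legacy_output_fraction(60.8)
modern_at_608 = ride_through_output_fraction(60.8)
modern_at_615 = ride_through_output_fraction(61.5)  # halfway down the ramp
```

The difference matters in aggregate: under the legacy curve every inverter in an area disappears at the same instant, while the ramp sheds output gradually as frequency climbs.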

This is grid following, very effective for small scale generation. It does not work for large scale generation though when the grid is relying on that voltage and frequency from the utility scale renewable generation ("grid forming"). When those large generators exceed their ride through tolerance, batteries step in to hold voltage and frequency up until the transient event ends or dispatchable generators called upon spin up (currently fossil gas primarily, but also nuclear if there is headroom to increase output). Thermal generators can take minutes to provide this support (called upon, fuel intake increased, spinning metal spins faster), batteries respond within 250-500ms.

Tesla’s Powerpack system at the Hornsdale Power Reserve in Australia was the first example of this being proven out at scale in prod. Batteries everywhere, as quickly as possible.

One problem that happened here is the _voltage_ spikes as the synchronous generation went away. Voltage _spikes_ on generation going away seem insane, but it's a real phenomenon.

The problem is that the line itself is a giant capacitor. It's charged to the maximum voltage on each cycle. Normally the grid loads immediately pull that voltage down, and rotating loads are especially useful because they "resist" the rising (or falling) voltage.

So when the rotating loads went away, nothing was preventing the voltage from rising. And it looks like the sections of the grid started working as good old boost converters on a very large scale.

Nope, they need more inertial storage to smooth things out and buy time / absorb inevitable failure bursts/cascades from inverted production means or safety disconnection events.
Battery storage provides this grid service, as mentioned in my other comments.
In this very specific case, battery storage would not have helped (in fact, it would have worsened the problem). One of the issues in the failure is renewables, but not because of intermittence. It's because of their ~infinite ramp and them being DC.

Anything that's not a spinning slug of steel produces AC through an inverter: electronics that take some DC, pass it through MOSFETs and coils, and spit out a mathematically pure sine wave on the output. They are perfectly controllable, and have no inertia: tell them to output a set power and they happily will.

However, this has a few specific issues:

- infinite ramps produce sudden influx of energy or sudden drops in energy, which can trigger oscillations and trip safety of other plants

- the sine wave being electronically generated, physics won't help you to keep it in phase with the network, and more crucially, keep it lagging/ahead of the network

The last point is the most important one, and one that is actually discussed in the report. AC works well because physics is on our side, so spinning slugs of steel will self-correct depending on the power requirements of the grid, and this includes their phase compared to the grid. How out-of-phase you are is what's commonly called the power factor. Spinning slugs have a natural power factor, but inverters don't: you can make any power factor you want.

Here in the Spanish blackout, there was an excess of reactive power (that is, a phase shift happening). Spinning slugs will fight this shift of phase to realign with the correct phase. An inverter will happily follow the sine wave measured and contribute to the excess of reactive power. The report outlines this: there was no "market incentive" for inverters to actively correct the grid's power factor (translation: there are no fines).
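For readers who want the phase/power-factor relationship in numbers: for sinusoidal voltage and current, the power factor is the cosine of the phase angle between them, and that angle splits apparent power into an active and a reactive part. A minimal sketch follows; the 230 V / 10 A figures are arbitrary illustration values.

```python
import math

def power_components(v_rms, i_rms, phase_deg):
    """Return (active W, reactive var, power factor) for sinusoidal V and I.

    phase_deg is the angle between the voltage and current waveforms.
    """
    phi = math.radians(phase_deg)
    apparent = v_rms * i_rms            # VA
    active = apparent * math.cos(phi)   # does useful work
    reactive = apparent * math.sin(phi) # sloshes back and forth
    return active, reactive, math.cos(phi)

# Arbitrary illustration values: same voltage and current, two phase angles.
in_phase = power_components(230.0, 10.0, 0.0)
shifted = power_components(230.0, 10.0, 60.0)
```

With no phase shift all 2300 VA is active power. At a 60 degree shift the same current carries only about half as much active power, and the rest shows up as reactive power that the grid has to absorb or supply somewhere.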

So really, more storage would not have helped. They would have tripped just like the other generators, and being inverter-based, they would have contributed to the issue. Not because "muh renewable" or "muh battery", but because of an inherent characteristic of how they're connected to the grid.

Can this be fixed? Of course. We've had the technology for years for inverters to better mimic spinning slugs of steel. Will it be? Of course. Spain's TSO will make it a requirement to fix this and energy producers will comply.

A few closing notes:

- this is not an anti-renewables writeup, but an explanation of the tech, and the fact that renewables are part of the issue is a coincidence on the underlying technical details

- inverters are not the reason the grid failed, but they're a part of why it had a runaway behavior

- yes, wind also runs on inverters despite being spinning things. with the wind being so variable, it's much more efficient to have all turbines be not synchronized, convert their AC to DC, aggregate the DC, and convert back to AC when injecting into the grid

I agree with your detailed assessment, but importantly, I argue more battery storage would've allowed for the grid to fail gracefully through rapid fault isolation and recovery (assuming intelligent orchestration of transmission level fault isolation). There are parallels to the black start capabilities provided by battery storage in Texas (provided by Tesla's Gambit Energy subsidiary). When faults are detected, the faster you can isolate and contain the fault, the faster you can recover before it spreads through the grid system.

The storage gives you operational and resiliency strength you cannot obtain with generators alone, because of how nimble storage is (advanced power controls), both for energy and grid services.

> Can this be fixed? Of course. We've had the technology for years for inverters to better mimic spinning slugs of steel. Will it be? Of course. Spain's TSO will make it a requirement to fix this and energy producers will comply.

This is synthetic inertia, and is a software capability on the latest battery storage systems. "There was no market mechanism to encourage the deployment of this technology in concert with Spain’s rapid deployment of solar and wind." from my top comment. This should be a hard requirement for all future battery storage systems imho.

Potential analysis of current battery storage systems for providing fast grid services like synthetic inertia – Case study on a 6 MW system - https://www.sciencedirect.com/science/article/abs/pii/S23521... | https://doi.org/10.1016/j.est.2022.106190 - Journal of Energy Storage Volume 57, January 2023, 106190

> Large-scale battery energy storage systems (BESS) already play a major role in ancillary service markets worldwide. Batteries are especially suitable for fast response times and thus focus on applications with relatively short reaction times. While existing markets mostly require reaction times of a couple of seconds, this will most likely change in the future. During the energy transition, many conventional power plants will fade out of the energy system. Thereby, the amount of rotating masses connected to the power grid will decrease, which means removing a component with quasi-instantaneous power supply to balance out frequency deviations the millisecond they occur. In general, batteries are capable of providing power just as fast but the real-world overall system response time of current BESS for future grid services has only little been studied so far. Thus, the response time of individual components such as the inverter and the interaction of the inverter and control components in the context of a BESS are not yet known. We address this issue by measurements of a 6 MW BESS's inverters for mode changes, inverter power gradients and measurements of the runtime of signals of the control system. The measurements have shown that in the analyzed BESS response times of 175 ms to 325 ms without the measurement feedback loop and 450 ms to 715 ms for the round trip with feedback measurements are possible with hardware that is about five years old. The results prove that even this older components can exceed the requirements from current standards. For even faster future grid services like synthetic inertia, hardware upgrades at the measurement device and the inverters may be necessary.
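The abstract's point about rotating masses can be put in rough numbers with the standard swing-equation approximation for the initial rate of change of frequency (ROCOF). This is a sketch only: the system size, the lost generation and the inertia constants are hypothetical, not taken from the paper or the blackout report, and the linear extrapolation ignores governor response and load damping.

```python
def initial_rocof_hz_per_s(imbalance_mw, system_mva, inertia_h_s, f0_hz=50.0):
    """Initial ROCOF after a sudden loss of generation (swing-equation approximation)."""
    return -imbalance_mw * f0_hz / (2.0 * inertia_h_s * system_mva)

def time_to_deviate_s(delta_f_hz, rocof_hz_per_s):
    """Seconds until frequency has moved by delta_f_hz at a constant ROCOF."""
    return abs(delta_f_hz / rocof_hz_per_s)

# Hypothetical system: 25,000 MVA online, 1,000 MW of generation lost.
rocof_high_inertia = initial_rocof_hz_per_s(1000.0, 25000.0, inertia_h_s=5.0)  # -0.2 Hz/s
rocof_low_inertia = initial_rocof_hz_per_s(1000.0, 25000.0, inertia_h_s=2.0)   # -0.5 Hz/s

# Time until a hypothetical 1 Hz deviation is reached in each case.
t_high = time_to_deviate_s(1.0, rocof_high_inertia)
t_low = time_to_deviate_s(1.0, rocof_low_inertia)
```

Under these assumptions, cutting inertia from 5 s to 2 s shrinks the time to a 1 Hz deviation from about 5 s to about 2 s. That is still longer than the 450 ms to 715 ms round-trip times measured in the paper, which is why fast battery response is a plausible substitute for some of the lost inertia.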

Yep, sounds like "This was bound to happen at some point"
Which on some level is exactly "what the bosses and politicians want to hear"

When it's everybody's fault it's nobody's fault.

In some ways, yes, but it's still what reality is. There was probably some last factor kicking in that triggered the cascade, but there were probably many non-happy-paths not properly covered by working backup/fallback strategies. So a report could totally still say "it's X's fault", pointing the finger there. Government would blame the owner of X, some public statement about fixing X would be made and then the ones working in the field should internally push to improve/fix their own (reduced) scope.

I don't know what will come of this report in the next months/years, I will keep an eye on it though, since I live in Spain :)

Exactly.
But EU's liberalized energy market gives us resiliency and low prices for electricity! /s
But not across the Pyrenees :_)
There are ways to aggregate these into a single resilience score for policy makers with only moderate loss of detail but it's unpopular.
It is very carefully worded, but variable renewables are holding the smoking gun here. This is why Spain now requests a better connection to French nuclear. This reckless overbuild of variable generation is a valuable negative example: wind and solar without adequate hydro or nuclear is dead.
Your statement is wrong.

The report describes that there was no mechanism to dispatch the reactive power of renewables separately from the active power.

On page 452, item numbered 1 states "RES power plants follow fixed power factor" (RES = Renewable Energy Sources). The source of this finding is in section 4.2.1.

On page 208, footnote 35, the reference is given to Royal Decree 413/2014 of 6 June, which mandates this fixed power factor. Article 7, section e), states that renewable energy sources must follow the instructions given by the operator to set power factor, and only if the distribution lines support it.

And footnote 36 describes how this worked in practice on the date of the outage: renewables were told, by email on the previous day, which fixed power factor correction to use the following day.
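To illustrate what a fixed power factor implies: once the power factor is pinned, reactive power is just a multiple of active power, so it rises and falls with the sun or wind rather than with what the grid voltage needs. A minimal sketch follows; the 0.98 power factor and the plant ratings are hypothetical, not values from the report, and sign conventions (absorbing vs injecting) are ignored.

```python
import math

def reactive_at_fixed_pf(p_mw, power_factor):
    """Reactive power magnitude (Mvar) implied by active power at a fixed power factor."""
    return p_mw * math.tan(math.acos(power_factor))

def reactive_headroom(s_mva, p_mw):
    """Reactive power (Mvar) an inverter could supply if dispatched freely,

    limited only by its apparent power rating (a simplified capability limit).
    """
    return math.sqrt(max(s_mva ** 2 - p_mw ** 2, 0.0))

# Hypothetical plant told by email to run at a 0.98 power factor tomorrow.
q_full_sun = reactive_at_fixed_pf(100.0, 0.98)   # output at 100 MW
q_half_sun = reactive_at_fixed_pf(50.0, 0.98)    # output at 50 MW

# Hypothetical 110 MVA inverter rating while producing 100 MW.
q_available = reactive_headroom(110.0, 100.0)
```

Here the reactive contribution halves when active output halves, regardless of grid conditions, while the same hypothetical plant would have more than twice as much reactive capability available if the operator could dispatch it separately.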

--

This lack of dynamic dispatch of reactive power was a known problem, already reported in 2022 [1]

[1] https://www.eldiario.es/economia/competencia-reconocio-julio...

It's lack of experience managing variability, not variability itself.

Wind and solar are very far from dead, but they do need some adjustments - as the report makes clear.

Spaniard here, I didn't hear about that.