Hacker News
by makmanalp 4005 days ago
I don't think we have the ability to 100% simulate anything - I don't doubt for a second that a lot of this stuff is being simulated already, and I think you may be severely overestimating just how useful any doable simulation is in terms of catching a bug like this one.

You have literally hundreds of systems working in concert and tied to more hundreds of physical components coming under extreme temperature and pressure conditions, some of which can interact in the weirdest and most unexpected ways - certainly not ones you'd always think to model. The chances that any one of those does something unexpected are not low, and the chances that it cascades into a much larger failure are not insignificant.

edit: It's also sometimes a human problem - thousands of people working on this together, and all sorts of different incentives. Here's a famous example of a failure, and the PR kerfuffle that ensued: https://en.wikipedia.org/wiki/Rogers_Commission_Report

Quoth Feynman:

"It appears that there are enormous differences of opinion as to the probability of a failure with loss of vehicle and of human life. The estimates range from roughly 1 in 100 to 1 in 100,000. The higher figures come from the working engineers, and the very low figures from management. What are the causes and consequences of this lack of agreement? Since 1 part in 100,000 would imply that one could put a Shuttle up each day for 300 years expecting to lose only one, we could properly ask "What is the cause of management's fantastic faith in the machinery?" ... It would appear that, for whatever purpose, be it for internal or external consumption, the management of NASA exaggerates the reliability of its product, to the point of fantasy."

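The arithmetic in that quote is easy to check. Here's a minimal sketch, assuming one launch per day and flights that fail independently with a fixed probability (both simplifications of mine, not the report's; the function names and the 25-flight figure are made up for illustration):

```python
# Rough check of the "one launch a day for 300 years" figure.
# Assumptions (not from the report): 365.25 launches per year, and every
# flight fails independently with the same probability.

DAYS_PER_YEAR = 365.25


def years_per_expected_loss(p_loss: float, launches_per_year: float = DAYS_PER_YEAR) -> float:
    """Years of launching until the expected number of losses reaches one."""
    return 1.0 / (p_loss * launches_per_year)


def prob_at_least_one_loss(p_loss: float, flights: int) -> float:
    """Chance of losing at least one vehicle over `flights` independent flights."""
    return 1.0 - (1.0 - p_loss) ** flights


management = years_per_expected_loss(1 / 100_000)  # the low (management) estimate
engineers = years_per_expected_loss(1 / 100)       # the high (engineering) estimate

print(f"1 in 100,000: one expected loss per {management:.0f} years of daily launches")
print(f"1 in 100:     one expected loss per {engineers * DAYS_PER_YEAR:.0f} days of daily launches")
print(f"P(at least one loss in 25 flights at 1 in 100): {prob_at_least_one_loss(0.01, 25):.2f}")
```

At 1 in 100,000 you get roughly 274 years of daily launches per expected loss, which is where the "300 years" comes from. At 1 in 100 it's about 100 days.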

>You have literally hundreds of systems working in concert and tied to more hundreds of physical components coming under extreme temperature and pressure conditions

This is exactly what computers are for: doing hard stuff we can't do on paper or just by real-world prototype testing. I imagine this is a hard problem, but that may be because, from a time/budget perspective, it just makes more financial sense to let stuff blow up now and again than to build out such a system.

I kinda see this as the difference between writing typical code and writing code that's deterministic. The former is cheaper and faster; the latter is safer but slower and more expensive. In growth industries, or when you have a strict schedule on your back, the slower approach often gets skipped.

>Quoth Feynman

Feynman died when the hottest CPU was the 386. We now have the capabilities, at least in hardware, for non-trivial simulation that in Feynman's time would have required CPU resources ridiculous to even speculate about. Safe assumptions in Feynman's world (1918-1988), at least with regard to technology and engineering, may not be safe assumptions in ours, the same way our assumptions today won't make much sense to our grandchildren. They might be bewildered by the idea that rocket failures were constant and common, the same way I'm bewildered by things like hot days causing vapor lock to shut down old cars or, say, occasionally tuning a carburetor. We have electric gas pumps and computer-controlled fuel injectors now.

edit: to reply to jacquesm. That's a pretty bold claim about O-rings. We fully understand the materials they're made of, their typical decays, etc. They're not magic. If someone wanted to make a top-down simulation that included, well, everything, it certainly seems possible to me, and while certainly not perfect, if done right, should provide positive outcomes. The real question is, what's the incentive? Spend billions and years doing this for one system (which may be old or even obsolete by the time the simulation is complete) or just accept the occasional preventable loss. Seems the latter approach just makes more sense financially, but that doesn't mean the former approach must be impossible. Many things are possible that just aren't incentivized.

Oh gosh, having worked on a fairly large vacuum system, I can tell you that o-rings are monsters. Very minor errors in dimensions can mess up the seal, and temperature/humidity/wear/elasticity and all that can subtly mess with the dimensions in crazy nonlinear ways. You can simulate the ever-loving garbage out of it, and an imperceptible change in composition due to an undetectable mixing error when extruding the ring can still cause a seal to leak slightly. Mayhem ensues. (And most likely any attempt to directly detect it will destroy the integrity of the o-ring or take so long as to render the test useless, since there are usually hundreds or even thousands of o-rings. They're all over the place.)

I'm not even talking about jacquesm's note about the failure mode, either. Just real insidious errors in manufacturing that can't be detected in any sort of reliable, sane way. Even the Challenger's o-ring wasn't guaranteed to fail, and indeed most didn't. In fact, most of that entire o-ring didn't fail.

I've seen some really freaky things amplify what are essentially chaotic edge cases. You can certainly figure them out, but you'd never get anything done at any affordable cost or by any ship date if you didn't just calculate the risk and go ahead.

TL;DR: risk is always there because the world's imperfect. At best you just tighten the statistical confidence, but that's super hard.

Faster computers do not equate to magically better programs and/or programmer capabilities. The computers from Feynman's day could do finite element analysis and structural simulation on grids fine enough for just about all engineering work. It's the execution details that get you (such as an O-ring...). And nobody simulates the execution in a meaningful way simply because there isn't enough data to start your simulations with. These are human failures first, process failures second.

The problem with simulating failures is that there are an infinite number of them. Should you simulate the effects of omitting each individual molecule in the whole assembly? Or adding one of every possible contaminant molecule at every possible location? What about more than one? It's a combinatorial impossibility.
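
To put a rough number on that, here's a back-of-the-envelope sketch that counts the scenarios when up to k of n possible defect sites go wrong, with only one way to go wrong per site. The site count and the simulation throughput are invented for illustration, not taken from any real vehicle:

```python
import math


def failure_scenarios(sites: int, max_simultaneous: int) -> int:
    """Ways to pick between 1 and max_simultaneous defective sites out of `sites`."""
    return sum(math.comb(sites, k) for k in range(1, max_simultaneous + 1))


SITES = 10_000            # hypothetical number of places a defect could occur
SIMS_PER_SECOND = 1_000   # hypothetical (and generous) simulation throughput
SECONDS_PER_YEAR = 365.25 * 24 * 3600

for k in range(1, 5):
    n = failure_scenarios(SITES, k)
    years = n / SIMS_PER_SECOND / SECONDS_PER_YEAR
    print(f"up to {k} simultaneous defects: {n:.3e} scenarios, ~{years:.3g} years to simulate")
```

Even with this toy model the count grows by roughly a factor of the site count for each extra simultaneous defect, and that's before allowing more than one kind of defect per site.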