Failure Prediction in Complex Systems

This is an account of why predicting failures in complex systems is very difficult, and why even extensive analysis usually underestimates the number and severity of failures.

The Minuteman system was a land-based intercontinental ballistic missile system constructed during the Cold War. The missiles sat in underground silos, as did the control bunkers. Both were hardened against nuclear attack by being underground with thick concrete walls, and they were located in remote rural areas.

All portions of the Minuteman system were built with the most reliable components available. The system was analyzed for possible failures in great detail, as were the high-reliability individual components used to build it. Millions of dollars were spent on reliability analysis to reduce the chances of a false missile launch due to some failure. Individual failures and combinations of failures were analyzed to determine what would happen.

At one time I worked next to a couple of engineers who were on call to investigate the rare Minuteman failures. We sometimes talked about their investigations. We also discussed a review of the reliability predictions after the system was in the field a number of years.

The review showed that the statistical estimates of failure rates were quite good for all the things considered in the analysis. Everything the reliability people could think of was included in the original analysis. However, this accounted for only about half the real failures! The other half were things no one had even considered. To understand why these failures were not considered, I will relate two examples.
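The gap between prediction and reality can be put in toy numerical terms (a hypothetical sketch with made-up rates, not the actual Minuteman figures): if the modeled failure modes are estimated accurately but an equal share of unmodeled modes exists in the field, the prediction captures only half of what is observed.

```python
# Toy illustration with assumed, made-up failure rates
# (failures per million operating hours). The point: even if every
# modeled rate is exactly right, unmodeled modes are simply invisible
# to the prediction.

modeled_modes = {          # modes the reliability analysis knew about
    "component_wearout": 0.4,
    "connector_corrosion": 0.3,
    "power_transient": 0.3,
}

unmodeled_modes = {        # modes discovered only in the field
    "gopher_chewed_cable": 0.5,
    "mouse_urine_short": 0.5,
}

predicted_rate = sum(modeled_modes.values())
observed_rate = predicted_rate + sum(unmodeled_modes.values())

print(f"Predicted: {predicted_rate:.1f} failures per million hours")
print(f"Observed:  {observed_rate:.1f} failures per million hours")
print(f"Predicted share of reality: {predicted_rate / observed_rate:.0%}")
```

With these assumed numbers, the analysis is perfectly accurate on everything it models yet still predicts only 50% of the failures actually seen, which mirrors the review described above.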

1. A nuclear-hardened underground cable failure.

Signals failed to go through some of the wires in a cable between two underground locations. This cable was buried underground and encased in a hardened covering designed to withstand nuclear attack. The cable was dug up to find the problem.

It was discovered that in one spot the hardened covering had worn away and several wires were cut. This was apparently done by a determined gopher or gophers. The cable intersected one of their tunnels, and they considered it an inconvenience. They nipped at it thousands of times until they wore through the hardened outer shell and into the wires, causing the failure. Gophers were not considered in the failure analysis.

2. A strange intermittent computer failure.

A computer in one of the underground control bunkers began exhibiting strange intermittent errors that the hardware failure analysis could not explain. Possible computer failures had been analyzed in great detail, yet the behavior of this computer could not be explained by any known combination of components with potentially intermittent behavior.

Upon investigation, a mouse nest was found in the computer. A mouse had found a warm place out of the North Dakota winter to raise a family. No one knew how the mouse got in through the thick concrete walls. As the baby mice urinated in various spots, the urine would sometimes short out wires on the computer board until it dried and no longer conducted electricity well enough to cause a problem.

The Lesson Learned:

In complex systems it is almost impossible to predict all of the problems that can and will occur.

Note that the Minuteman system is orders of magnitude simpler than many parts of our infrastructure.