The Stuck Bit

A number of years ago I was doing engineering design analysis on a large mainframe computer. The program I was using was complicated and required a lot of precision. The analysis needed unusually large amounts of memory, so the program was usually run at night, when it didn't have to share the computer with many other programs. Occasionally I got obviously erroneous results: the computer did not give the same result on runs with the same data. I did several test runs to verify the problem. A couple of runs reproduced the same bad results, and some others gave results that looked correct but were not quite identical to one another. All the results should have been exactly the same.

I informed the computer staff that I believed there was a problem with the computer and showed them the results of my runs. Some weeks later, and only because I knew some of the computer staff well, I found out the following. There had been several complaints about bad results over the previous year, but the staff had been unable to find a problem and assumed the complaints were, like most problems, caused by user coding errors. After our discussion they used my program as a diagnostic tool.

When my program was loaded alone in the computer, the error was reproducible. After a lot of work they identified a bad bit in the computer memory. This bit was one of the least significant bits in a variable storage location, so it usually did not create obvious errors. My program gave obvious errors because it subtracted two very large numbers, one of which had the error, and used what should have been a small difference. The error was reproducible because the program was big enough to use the bad bit location whenever it was loaded alone.
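The way a low-order bit error grows when two nearly equal large numbers are subtracted can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: it uses IEEE 754 double precision, which is probably not the number format that mainframe used, and the values, the bit position, and the helper name flip_bit are all hypothetical.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its 64-bit IEEE 754 representation flipped.

    Hypothetical helper: it imitates a faulty memory bit by toggling
    bit number `bit` (0 = least significant) of the stored value.
    """
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (result,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return result


# Two large, nearly equal values (made-up numbers, not from the original analysis).
a = 1_000_000.001
b = 1_000_000.0

# Corrupt a low-order mantissa bit of `a` (bit 20 of the 52 mantissa bits).
a_bad = flip_bit(a, 20)

# The corrupted input is wrong by a tiny fraction of its own size.
rel_err_input = abs(a_bad - a) / a

# The small difference that the calculation actually uses is wrong by a
# far larger fraction, because the leading digits cancel in the subtraction.
true_diff = a - b
bad_diff = a_bad - b
rel_err_diff = abs(bad_diff - true_diff) / abs(true_diff)

print(f"relative error in the stored value: {rel_err_input:.3e}")
print(f"relative error in the difference:   {rel_err_diff:.3e}")
print(f"amplification factor:               {rel_err_diff / rel_err_input:.3e}")
```

With these made-up values the stored number is off by roughly one part in ten billion, yet the difference is off by more than ten percent. An error of that size is invisible in the stored value and obvious in any result built on the difference.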

They fixed the problem, and I reran all my runs, but I was never officially told that there had been a problem. I will never know how many designs were based on errors caused by that bad bit.

Lessons Learned

  1. Very small errors in data or software can occasionally produce big, important errors.

  2. Embarrassing problems are sometimes kept hidden by service organizations.