Affiliation:
1. National Aeronautics and Space Administration, Hampton, Virginia 23666
Abstract
Software errors have occurred since computers were first used in spacecraft and aircraft. Their consequences range from loss of life to less catastrophic outcomes. As the demand for automation increases, software in mission- or safety-critical systems should be designed to tolerate the most likely software faults. This paper categorizes historic aerospace software errors to determine trends in how and where automation is most likely to fail. A distinction is introduced between software producing wrong (erroneous) output and software producing no output (fail-silent). Of the historical incidents analyzed, 85% involved software producing erroneous output rather than stopping. Rebooting was found to be ineffective at clearing erroneous behavior and unreliable for recovering from fail-silent software. Errors originated within the code/logic itself in 58% of cases, in configurable data in 16%, and in input sources (command or sensor) in 25%. Forty percent of unexpected software behavior was caused by the absence of software, and 16% was subjectively attributed to “unknown-unknowns.” These findings indicate that achieving software fault tolerance requires backup strategies that detect and respond to erroneous software behavior, not only fail-silent cases, and robust off-nominal testing to uncover unanticipated situations.
Publisher
American Institute of Aeronautics and Astronautics (AIAA)