Problems in software: fault, error and failure

Explosion of first Ariane 5 flight, June 4, 1996

The sad truth is that we spend a great deal of time handling negative scenarios in software execution. Programmers are excited when they start a new project: we think about all the fantastic things our software will do. But we soon discover that, in addition to the “happy path”, we need to add a lot of code to guard against unusual situations. We have to do this to avoid failure.

Failure

A situation in which software stops working according to its requirements is called a failure. Examples of failures: the software unexpectedly stops executing (e.g. a crash), it does not respond to user interaction, or it returns incorrect results. A failure is visible to observers and to the system's users. When there are no requirements, failures cannot happen.

Nobody likes failures. Users cannot complete their tasks, the business has to find an excuse for its customers, failures are reported as bugs (gentle name: tickets), and programmers need to fix them instead of adding fantastic new features to the system. In the embedded world, a software failure may turn a device into an unreachable, dead piece of metal (e.g. turn an expensive satellite into a flying brick).

Error

An error is incorrect system behavior that can lead to a failure. Errors are a normal situation in software; almost every function in every API can report errors. Errors are very important: they give programmers the chance to handle the problem and thus prevent failures. We can distinguish two kinds of errors: value and timing. A value error might be an incorrect system state or an incorrect object value. Timing errors include total non-performance of the system (e.g. an infinite loop or a permanently locked mutex).
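As an illustration of a value error, here is a minimal sketch of a function that reports the error and a caller that handles it. The names `ValueRangeError`, `to_int16` and `safe_to_int16` are made up for this example, and saturating is only one possible way to handle the error; what is correct depends on the requirements.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class ValueRangeError(Exception):
    """Hypothetical error type: the value does not fit the target type."""


def to_int16(value: float) -> int:
    """Convert to a 16-bit signed integer and report a value error on overflow."""
    result = int(value)  # truncates toward zero; NaN and infinity are not covered here
    if not INT16_MIN <= result <= INT16_MAX:
        raise ValueRangeError(f"{value} does not fit in 16 bits")
    return result


def safe_to_int16(value: float) -> int:
    """Handle the reported error by saturating instead of letting it propagate."""
    try:
        return to_int16(value)
    except ValueRangeError:
        return INT16_MAX if value > 0 else INT16_MIN
```

The point is that `to_int16` makes the error visible, so the caller can decide what to do before it turns into a failure.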

When does an error lead to a failure?

A failure may happen when the requirements omit the possibility of an error occurring. For example, a requirement says that user data must be modified in the database, but the database file is unexpectedly corrupted, so the data is not modified, which breaks the requirement. If the requirement were extended to describe what to do when the database is corrupted, the programmer could handle the error according to the specification.

Another situation is when an error is omitted or ignored by the programmer. I often see this when I do code reviews. Generally, programmers are not willing to check whether a function they invoke reports an error. I suppose this happens because we know how much code must be written in addition to the “happy path” to handle an error, and how complicated the code becomes. A frequent excuse is: “it is almost impossible for this function to report an error”. That is often true, but in the harsh space environment, where radiation flips bits, I prefer not to ignore possible errors.
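A small sketch of what "not ignoring the error" can look like in practice. The functions `save_settings` and `update_volume` are hypothetical, and the choice to return `None` on failure is an assumption made for this example, not a recommendation for every code base.

```python
import json


def save_settings(path: str, settings: dict) -> bool:
    """Write settings to path; return True on success, False on an I/O error."""
    try:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(settings, f)
        return True
    except OSError as exc:
        # The "almost impossible" case: disk full, missing directory, ...
        print(f"could not save settings: {exc}")
        return False


def update_volume(path: str, settings: dict, volume: int):
    """Return the updated settings, or None if they could not be stored."""
    updated = {**settings, "volume": volume}
    if not save_settings(path, updated):
        # Do not pretend the update succeeded; let the caller decide what to do.
        return None
    return updated
```

The extra branch is exactly the code we would rather not write, but without it the caller would continue as if the settings had been saved.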

When an error is not detected, it cannot be handled, and it leads to a failure. For example, we know that a program stuck in an infinite loop is an error situation, but if we do not detect it, we cannot prevent the problem from reaching the user.
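One simple way to detect a loop that never finishes is to give it a time budget. The sketch below assumes a polling loop; `DeadlineExceeded` and `wait_until` are hypothetical names, and the `clock` parameter exists only so that the behavior can be checked without real waiting.

```python
import time


class DeadlineExceeded(Exception):
    """Hypothetical error type raised when a loop runs past its time budget."""


def wait_until(condition, timeout_s, clock=time.monotonic):
    """Poll condition() until it is true; raise if the deadline passes first."""
    deadline = clock() + timeout_s
    while not condition():
        if clock() >= deadline:
            # The timing error is now detected and can be handled by the caller.
            raise DeadlineExceeded(f"condition not met within {timeout_s} s")
```

This only covers loops that check the deadline themselves. Code that is stuck somewhere else needs an external mechanism, such as a watchdog, to notice the problem.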

To summarize: error detection, error handling and a good specification are all required to prevent an error from turning into a system failure.

Fault

A fault is a defect in the system that can cause an error. Before an error occurs, the fault is not visible to observers and users. When a fault is present but has no visible effect, it is said to be latent. In some circumstances a latent fault causes something incorrect to happen, and an error occurs. A fault can be a physical defect in hardware, an imperfection in design or manufacturing, or a bug in software.

Kinds of faults

  • Permanent: the fault occurs and persists
  • Transient: the fault occurs at a certain time, and then the system returns to normal operation
  • Intermittent: a recurring fault that sometimes occurs and sometimes does not
  • Benign: a fault that is easy to detect and causes an obvious error (e.g. a component crashes)
  • Malicious: a fault that is very hard to detect; the component works inaccurately (e.g. memory leaks, incorrect results of computation)
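The difference between transient and permanent faults matters for error handling: retrying can help with the former but not with the latter. Here is a minimal sketch of a bounded retry, where `call_with_retry` is a hypothetical helper and the limit of three attempts is an arbitrary assumption.

```python
def call_with_retry(operation, attempts=3):
    """Run operation(); retry on an I/O error and give up after `attempts` tries."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as exc:
            # Possibly a transient fault: remember the error and try again.
            last_error = exc
    # Still failing after all attempts: treat the fault as permanent.
    raise RuntimeError(f"operation failed {attempts} times") from last_error
```

The retry limit keeps a permanent fault from turning into a timing error of its own, namely a loop that retries forever.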

Summary

Every problem starts with a fault. A fault causes an error, and if the error is not handled properly, it causes a failure:

FAULT -> ERROR -> FAILURE
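The chain can be shown with a deliberately tiny example. Both functions are hypothetical: `average` contains a latent fault, and `report` handles the resulting error so that it does not become a failure.

```python
def average(values):
    # FAULT: the empty-list case was forgotten. The defect is latent:
    # it stays invisible as long as callers pass non-empty lists.
    return sum(values) / len(values)


def report(values):
    try:
        return f"average: {average(values):.1f}"
    except ZeroDivisionError:
        # ERROR: an empty input activated the latent fault.
        # Handling it here keeps the error from becoming a FAILURE
        # (a crash visible to the user).
        return "average: n/a (no samples)"
```

Without the `except` branch, the same empty input would crash the program, and the user would see the failure.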

Most of this post was taken up by the description of errors. Indeed, errors matter most: all software developers work with errors in code, and a large part of that code is error handling.

