The sad truth is that we spend a lot of time handling negative scenarios in software execution. Programmers are excited when starting a new project: we think about all the fantastic things our software will do. Soon, however, we discover that in addition to the “happy path” we need a lot of code to guard against unusual situations, and we have to write it to avoid failure.
A failure is a situation in which the software stops behaving according to its requirements. Examples of failures: the software unexpectedly stops executing (e.g. it crashes), it stops responding to user interaction, or it returns incorrect results. A failure is visible to observers and to the system's users. Because failure is defined against requirements, where there are no requirements there can be no failures.
Nobody likes failures. Users cannot complete their tasks, the business has to find an excuse for customers, failures are reported as bugs (gentle name: tickets), and programmers need to fix them instead of adding new fantastic features. In the embedded world, a software failure may turn the device into an unreachable, dead piece of metal (e.g. turn an expensive satellite into a flying brick).
An error is incorrect system behaviour that can lead to a failure. Errors are a normal situation in software; almost every function in every API can report errors. Errors are very important: they let programmers handle the problem and thus prevent failures. We can distinguish two kinds of errors: value and timing. A value error is an incorrect system state or an incorrect object value. Timing errors include total non-performance of the system (e.g. an infinite loop or a permanent lock on a mutex).
When does an error lead to a failure?
A failure may happen when the requirements omit the possibility that an error occurs. For example, the requirement says to modify user data in the database, but the database file is unexpectedly corrupted, so the data is not modified and the requirement is broken. If the requirement were extended to describe what to do when the database is corrupted, the programmer could handle the error according to the specification.
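To make the database example concrete, here is a minimal Python sketch. It assumes a hypothetical extended requirement: "if the storage is corrupted, report it to the caller instead of crashing". The `users` table, the `rename_user` function and the `UpdateResult` values are invented for illustration. Catching every `sqlite3.DatabaseError` as "corrupted" is a deliberate simplification; a real system would distinguish more cases.

```python
import sqlite3
from enum import Enum


class UpdateResult(Enum):
    """Outcomes named by the (hypothetical) extended requirement."""
    UPDATED = "updated"
    USER_NOT_FOUND = "user_not_found"
    STORAGE_CORRUPTED = "storage_corrupted"


def rename_user(db_path, user_id, new_name):
    """Modify user data and report storage problems instead of crashing."""
    conn = sqlite3.connect(db_path)
    try:
        # "with conn" commits on success and rolls back on an exception.
        with conn:
            cursor = conn.execute(
                "UPDATE users SET name = ? WHERE id = ?",
                (new_name, user_id),
            )
        if cursor.rowcount == 1:
            return UpdateResult.UPDATED
        return UpdateResult.USER_NOT_FOUND
    except sqlite3.DatabaseError:
        # Simplification: any database error is treated as corrupted storage.
        return UpdateResult.STORAGE_CORRUPTED
    finally:
        conn.close()
```

With the requirement written this way, a corrupted file is still an error, but it is no longer a failure: the caller receives `STORAGE_CORRUPTED` and can do whatever the specification prescribes next.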
Another situation is when the error is overlooked or ignored by the programmer. I see this often when I do code reviews. Generally, programmers are reluctant to check whether the function they invoke reports an error. I suppose it happens because we know how much code must be written in addition to the “happy path” to handle the error, and how complicated the code becomes. A frequent excuse is: “it is almost impossible for this function to report an error”. That is often true, but in the harsh space environment, where radiation flips bits, I prefer not to ignore possible errors.
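The difference between ignoring and checking a reported error can be sketched in a few lines of Python. Everything here is hypothetical: `write_register` stands in for any driver call that reports a status, and the "bus" is just a dictionary of known register addresses.

```python
from enum import Enum


class Status(Enum):
    OK = 0
    BUS_ERROR = 1


def write_register(bus, address, value):
    """Hypothetical driver call that reports errors through its return value."""
    if address not in bus:
        return Status.BUS_ERROR
    bus[address] = value
    return Status.OK


def configure_ignoring_errors(bus):
    # The returned status is silently dropped: the "happy path" only.
    write_register(bus, 0x10, 1)
    write_register(bus, 0x99, 1)
    return True  # claims success no matter what happened


def configure_checked(bus):
    # Every reported error is inspected, so the caller learns about it.
    for address in (0x10, 0x99):
        if write_register(bus, address, 1) is not Status.OK:
            return False
    return True
```

On a bus that has no register `0x99`, `configure_ignoring_errors` still returns `True`, while `configure_checked` returns `False` and gives the caller a chance to react. The checked version is longer, which is exactly the cost described above.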
When an error is not detected, it cannot be handled, and it leads to a failure. For example, we know that a program stuck in an infinite loop is in an error state, but if we do not detect it, we cannot prevent the problem from reaching the user.
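One simple way to detect this kind of timing error is to give every wait a deadline. The sketch below is only an illustration in Python; the `wait_until` helper and the `TimingError` exception are names made up for this post, and a real embedded system would more likely rely on a hardware watchdog.

```python
import time


class TimingError(Exception):
    """Raised when an operation does not finish before its deadline."""


def wait_until(condition, timeout_s, poll_s=0.001):
    """Poll condition() until it is true, or raise once the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while not condition():
        if time.monotonic() >= deadline:
            # The potentially infinite loop is turned into a detectable error.
            raise TimingError(f"condition not met within {timeout_s} s")
        time.sleep(poll_s)
    return True
```

A call such as `wait_until(lambda: False, timeout_s=0.02)` now raises `TimingError` instead of spinning forever, so the caller can handle the error before it becomes a failure.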
To summarize: error detection, error handling and a good specification are all required to prevent an error from turning into a system failure.
A fault is a defect in the system that can cause an error. Until an error occurs, the fault is not visible to observers and users. A fault that is present but has no visible effect is said to be latent. Under some circumstances a latent fault causes something incorrect to happen, and an error occurs. A fault may be a physical defect in hardware, an imperfection in design or manufacturing, or a bug in software.
Kinds of faults
- Permanent: the fault occurs and persists
- Transient: the fault occurs at a certain time, and then the system returns to normal operation
- Intermittent: a recurring fault that sometimes occurs and sometimes does not
- Benign: a fault that is easy to detect and causes an obvious error (e.g. a component crashes)
- Malicious: a fault that is very hard to detect; the component works inaccurately (e.g. memory leaks, an incorrect computation result)
Every problem starts with a fault: the fault causes an error, and if the error is not handled properly, it causes a failure:
FAULT -> ERROR -> FAILURE
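The whole chain fits in a tiny Python sketch. The functions are hypothetical, and falling back to the last known value is just an assumed requirement chosen for the example.

```python
def average(samples):
    # FAULT: there is no guard for an empty list.
    # The fault stays latent as long as callers pass at least one sample.
    return sum(samples) / len(samples)


def report_temperature(samples, last_known):
    try:
        return average(samples)
    except ZeroDivisionError:
        # ERROR: the latent fault was activated by an empty input.
        # It is detected and handled here, so it does not become a FAILURE.
        return last_known
```

Calling `average([])` directly lets the error escape as an unhandled exception, which the user sees as a crash, that is, a failure. Calling `report_temperature([], 21.5)` hits the same fault and the same error, but returns `21.5`, and the system keeps working.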
Most of this post was taken up by the description of errors. Indeed, errors matter most: every software developer works with errors in code, and the biggest part of that code is error handling.
- Fault-Tolerance Techniques for Spacecraft Control Computers. Wiley, 2017.
- Patterns for Fault Tolerant Software. Wiley, 2007.