Redundancy refers to duplicating capabilities in a system, which enables faster error recovery and fault treatment. Without redundancy it is impossible to build a fault-tolerant system, so it is worth understanding how redundancy can make computer systems more resistant to faults.
In hardware redundancy, a computer system consists of two or more independent, complete computers, each with its own processor, memory and peripherals. The computers of the system can cooperate in three ways:
- Static: all the computers work in parallel at the same time. The results of the computations from each computer are compared with each other; if they are not all equal, a fault has occurred in one of the subsystems, and the correct output of the whole system is chosen by a predefined algorithm.
- Dynamic: only one of the computers is active, and if it fails, another computer is started to continue the system’s tasks.
- Hybrid: incorporates both the static and the dynamic approach.
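The "predefined algorithm" used in the static approach is often a majority vote. The following is a minimal sketch under that assumption, not a description of any particular flight computer; the function name `majority_vote` and the idea of reporting the indices of disagreeing computers are illustrative choices.

```python
from collections import Counter


def majority_vote(results):
    """Pick the output reported by a strict majority of redundant computers.

    `results` holds one (hashable) output per computer. Returns a tuple
    (value, faulty), where `faulty` lists the indices of the computers
    whose output disagrees with the majority.
    Raises ValueError if there is no strict majority.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among redundant results")
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty


# Three computers, the third one delivers a corrupted result.
value, faulty = majority_vote([42, 42, 41])
print(value, faulty)
```

With three computers this masks a single fault; if all three disagree, the voter can only detect the fault, not correct it.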
Hardware redundancy is a widely used method of ensuring the fault tolerance of computer systems, and it is the basis for the other types of redundancy. Well-known examples of hardware redundancy are the NASA Space Shuttle flight computer and the SpaceX Dragon flight computer. Both systems use the static approach to hardware redundancy.
Software redundancy means running different software in a system to perform the same task.
Building reliable software is a challenge; in general, we need to accept the fact that software is not reliable. There are many examples of large, professional software companies releasing products that failed dramatically (e.g. the failure of the Ariane 5 rocket, the Boeing 737 MAX disasters, the failure of the first orbital mission of Boeing Starliner). Since we already know that no team can release software without bugs, we can have more than one team deliver the same software (to the same specification) completely independently of each other, assuming that the same bugs won’t occur in all products. This approach is called ‘N-Version Programming’. All software versions are executed during the mission, and the results are compared with each other to detect and mitigate faults.
“N-Version Programming” may also refer to a situation in which different versions of the software performing the same task are maintained by the same team at the same time, but differ in quality and complexity: for example, one version is very complicated and very efficient (low CPU and memory usage), while the second is very simple and more reliable (fewer lines of code) but not efficient. In such a scenario, if the complicated software fails, the system can continue working with the simple software.
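The complex-plus-simple scenario can be sketched as a wrapper that tries the efficient version first and switches to the simple one when the result looks wrong or the code crashes. This is only an illustration: `run_with_fallback`, the acceptance check `looks_sorted`, and the deliberately buggy `fast_sort` are hypothetical names invented for this example.

```python
def run_with_fallback(primary, fallback, data, acceptable):
    """Run the complex version; use the simple version if it fails.

    `acceptable(data, result)` is a cheap plausibility check on the
    primary result. Returns (result, name_of_version_used).
    """
    try:
        result = primary(data)
        if acceptable(data, result):
            return result, "primary"
    except Exception:
        pass  # a crash in the complex version counts as a detected fault
    return fallback(data), "fallback"


def fast_sort(data):
    """Stand-in for the complex version; the bug drops duplicates."""
    return sorted(set(data))


def simple_sort(data):
    """Stand-in for the simple version: a plain insertion sort."""
    out = []
    for x in data:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def looks_sorted(data, result):
    """Acceptance check: same length as the input and in ascending order."""
    return len(result) == len(data) and all(
        a <= b for a, b in zip(result, result[1:])
    )


print(run_with_fallback(fast_sort, simple_sort, [3, 1, 2], looks_sorted))
print(run_with_fallback(fast_sort, simple_sort, [3, 1, 3, 2], looks_sorted))
```

The first call is served by the complex version; the second input triggers the bug, the acceptance check rejects the result, and the simple version takes over. The quality of the acceptance check limits which faults can be caught this way.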
In information redundancy, fault tolerance is achieved with extra information that is not required to carry out the system’s tasks. One example is ECC (error-correcting code): this additional value attached to the information is not needed to execute the software’s task (which uses only the data, without the additional code), but it allows the correctness of the data to be checked and may also be used to correct corrupted data. Another example of information redundancy is data backup: the same information is duplicated to reduce the possibility of data loss.
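To make the ECC idea concrete, here is a small sketch of the classic Hamming(7,4) code, which adds three parity bits to four data bits and can correct any single flipped bit. It is chosen only as a compact illustration; real ECC memory uses wider codes, and the function names are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as 7 bits laid out as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code_bits):
    """Correct up to one flipped bit and return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise the
    1-based position of the bit that was corrected.
    """
    b = list(code_bits)
    s1 = b[0] ^ b[2] ^ b[4] ^ b[6]
    s2 = b[1] ^ b[2] ^ b[5] ^ b[6]
    s3 = b[3] ^ b[4] ^ b[5] ^ b[6]
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        b[error_position - 1] ^= 1  # flip the corrupted bit back
    return [b[2], b[4], b[5], b[6]], error_position


code = hamming74_encode([1, 0, 1, 1])
code[4] ^= 1  # simulate a single-bit fault at position 5
print(hamming74_decode(code))
```

The three parity bits are never used by the application itself; they exist only so that a corrupted word can be detected and repaired.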
In time redundancy, fault tolerance is achieved in time: we spend more time than is needed to execute a task, and the additional time is used for fault detection and mitigation. For example, we can pause the task for a while to ride out a transient fault, or execute the same task several times in a row to confirm the correctness of the computation. I think the most widely known form of time redundancy is the slowing down of laptop CPUs when they overheat.
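Repeated execution can be sketched as follows: the task is run again after an optional pause, and a result is accepted only once two consecutive runs agree. This is a minimal sketch that assumes the task is deterministic and safe to repeat; `run_with_time_redundancy` and its parameters are hypothetical names.

```python
import time


def run_with_time_redundancy(task, max_attempts=3, pause_s=0.0):
    """Re-execute `task` until two consecutive runs return the same result.

    `pause_s` is an optional wait between runs, giving a transient
    fault time to disappear. Raises RuntimeError if no two consecutive
    runs agree within `max_attempts` executions.
    """
    previous = task()
    for _ in range(max_attempts - 1):
        time.sleep(pause_s)
        current = task()
        if current == previous:
            return current
        previous = current
    raise RuntimeError("results of repeated executions never agreed")


# Simulated transient fault: only the first execution is corrupted.
outputs = iter([41, 42, 42])
print(run_with_time_redundancy(lambda: next(outputs)))
```

The cost is visible directly: a fault-free run takes at least twice as long as a single execution, which is the price paid in time instead of in extra hardware.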
Redundancy is not free: it adds resources to a system to improve its reliability. Extending a system with one more computer, or doubling the available memory, may drastically complicate the system design. The performance, weight and size of the system may be affected, as well as the cost of design and implementation. An appropriate redundancy method must be selected to achieve the system’s goals.
- Fault-Tolerance Techniques for Spacecraft Control Computers. Wiley, 2017.
- Patterns for Fault Tolerant Software. Wiley, 2007.