Redundancy means duplicating capabilities in a system, which enables faster error recovery and fault treatment. Without redundancy it is impossible to build a fault-tolerant system, so it is worth understanding how redundancy can make computer systems more resistant to faults.
In hardware redundancy, a computer system consists of two or more independent, complete computers, each with its own processor, memory and peripherals. The computers in the system can cooperate in three ways:
Static approach: all the computers work in parallel at the same time. The results computed by each computer are compared with each other. If the results are not all equal, a fault has occurred in one of the subsystems, and the correct output of the whole system is chosen using a predefined algorithm.
Dynamic approach: only one of the computers is active, and if it fails, another computer is started to continue the system's tasks.
Hybrid approach: combines the static and dynamic approaches.
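The "predefined algorithm" used in the static approach is often some form of voting. The following is a minimal sketch, assuming three or more computers whose results can be compared for equality. The function name `majority_vote` is illustrative and not taken from any real flight system.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the result reported by a strict majority of the computers.

    Raises RuntimeError when no value has a strict majority, because in
    that case the fault cannot be masked by voting alone.
    """
    if not results:
        raise ValueError("at least one result is required")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among the redundant computers")
    return value


# Three computers ran the same computation; one of them is faulty.
print(majority_vote([42, 42, 7]))  # prints 42
```

With three computers this masks a single faulty result. With only two computers a disagreement can be detected but not resolved, which is why the sketch raises an error when there is no strict majority.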
Hardware redundancy is a widely used method of ensuring fault tolerance in computer systems. It is the basis for the other types of redundancy. Famous examples of hardware redundancy are the NASA Space Shuttle flight computer and the SpaceX Dragon flight computer. Both systems use the static approach to hardware redundancy.
Software redundancy means running different software implementations of the same task in one system.
Building reliable software is a challenge, and in general we need to accept that software is not reliable. There are many examples of large, professional software companies releasing products that failed dramatically (e.g. the failure of the Ariane 5 rocket, the Boeing 737 MAX disasters, the failure of the first orbital mission of Boeing Starliner). Since we already know that no team can release software without bugs, we can ask more than one team to deliver the same software (from the same specification), completely independently of each other, on the assumption that the same bugs will not occur in all of the products. This approach is called "N-Version Programming". All software versions are executed during the mission, and their results are compared with each other to detect and mitigate faults.
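As a rough illustration of comparing the results of independently written versions, the sketch below runs three implementations of an integer square root and votes on the output. Everything here is hypothetical: the three functions stand in for the work of independent teams, and `isqrt_c` contains a deliberately seeded bug so that the comparison has something to catch.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def isqrt_a(n: int) -> int:
    """Version A: relies on the standard library."""
    return math.isqrt(n)


def isqrt_b(n: int) -> int:
    """Version B: binary search for the largest r with r*r <= n."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_c(n: int) -> int:
    """Version C: seeded bug, rounds to nearest instead of rounding down."""
    return round(math.sqrt(n))


def run_n_versions(
    versions: Sequence[Callable[[int], int]], n: int
) -> Tuple[int, List[int]]:
    """Run every version, return the majority result and dissenting indexes."""
    results = [version(n) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("versions disagree and no majority exists")
    dissenters = [i for i, r in enumerate(results) if r != value]
    return value, dissenters


print(run_n_versions([isqrt_a, isqrt_b, isqrt_c], 15))  # prints (3, [2])
```

The list of dissenting indexes shows which version disagreed, so the fault is both masked and reported.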
"N-Version Programming" may also describe a situation in which different versions of the software performing the same task are maintained by the same team at the same time, but differ in quality and complexity. For example, one version is very complicated and very efficient (low CPU and memory usage), while the second version is very simple and more reliable (fewer lines of code) but not efficient. In such a scenario, if the complicated software fails, the system can continue working with the simple software.
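The switch from a complicated version to a simple one can be sketched as below. This is only an illustration under stated assumptions: a failure of the complicated version is assumed to show up as an exception, the class name `DegradableService` and both example functions are hypothetical, and the fault in `fast_sum_of_squares` is simulated.

```python
from typing import Callable


def fast_sum_of_squares(n: int) -> int:
    """Efficient closed-form version with a simulated fault for large n."""
    if n > 1000:
        raise OverflowError("simulated fault in the optimized version")
    return n * (n + 1) * (2 * n + 1) // 6


def simple_sum_of_squares(n: int) -> int:
    """Slow but simple version: add the squares one by one."""
    return sum(i * i for i in range(1, n + 1))


class DegradableService:
    """Use the primary version until it fails, then stay on the fallback."""

    def __init__(self, primary: Callable[[int], int],
                 fallback: Callable[[int], int]) -> None:
        self._primary = primary
        self._fallback = fallback
        self.degraded = False

    def run(self, n: int) -> int:
        if not self.degraded:
            try:
                return self._primary(n)
            except Exception:
                # The primary version is no longer trusted after a failure.
                self.degraded = True
        return self._fallback(n)


service = DegradableService(fast_sum_of_squares, simple_sum_of_squares)
print(service.run(10), service.degraded)  # prints 385 False
service.run(2000)                         # primary fails, fallback answers
print(service.run(10), service.degraded)  # prints 385 True
```

Staying in degraded mode after the first failure is a design choice made for this sketch. A real system might instead retry the complicated version after a restart or a health check.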
In information redundancy, fault tolerance is achieved with extra information that is not required to perform the system's tasks. One example is ECC (error-correcting code). This additional value attached to the information is not needed to execute the software's task (the task uses only the data, without the additional code), but it makes it possible to check that the data is correct and may also be used to repair corrupted data. Another example of information redundancy is a data backup: the same information is duplicated to reduce the chance of data loss.
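To make the ECC idea concrete, here is a small sketch of one classic error-correcting code, Hamming(7,4), chosen only as an illustration. It adds three parity bits to four data bits and can correct any single flipped bit. The function names are hypothetical, and real ECC memory typically uses wider codes than this.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over codeword positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> Tuple[List[int], int]:
    """Return (data bits, corrected position), where position 0 means no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit, or 0.
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single-bit fault at position 5
print(hamming74_decode(word))  # prints ([1, 0, 1, 1], 5)
```

The three parity bits carry no information the task needs, which is exactly the redundancy: they exist only so that a corrupted bit can be located and flipped back. This code cannot correct two flipped bits in the same codeword.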
In time redundancy, fault tolerance is achieved in time: we spend more time than we need to execute a task, and the additional time is used to detect faults and mitigate failures. For example, we can pause a task for a while to ride out a transient fault, or execute the same task several times in a row to confirm that the computation is correct. I think the most widely known example of time redundancy is a laptop CPU slowing down when it overheats.
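Both ideas, repeating a computation to confirm it and pausing to let a transient fault pass, can be combined in a few lines. This is a minimal sketch under the assumption that the task is deterministic and safe to repeat. The name `run_with_time_redundancy` and its parameters are illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_time_redundancy(
    task: Callable[[], T], max_attempts: int = 3, pause_s: float = 0.1
) -> T:
    """Run the task twice per attempt and accept the result only if both agree.

    On a mismatch or an exception, wait pause_s seconds so that a transient
    fault has a chance to disappear, then try again.
    """
    for _ in range(max_attempts):
        try:
            first = task()
            second = task()
        except Exception:
            pass  # treat a crash like a mismatch and retry after the pause
        else:
            if first == second:
                return first
        time.sleep(pause_s)
    raise RuntimeError("results never agreed; the fault may be permanent")


# Simulated transient fault: only the very first execution is corrupted.
outputs = iter([99, 42, 42, 42])
print(run_with_time_redundancy(lambda: next(outputs), pause_s=0.0))  # prints 42
```

The price is visible in the code: every accepted result costs at least two executions of the task, plus the pauses after each failed attempt.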
Redundancy is not free: it adds resources to a system in order to improve its reliability. Extending a system with one more computer, or doubling the available memory, may drastically complicate the system design. The performance, weight and size of the system may be affected, as well as the cost of design and implementation. An appropriate redundancy method must be selected to achieve the system's goals.
- Fault-Tolerance Techniques for Spacecraft Control Computers. Wiley, 2017.
- Patterns for Fault Tolerant Software. Wiley, 2007.