A fault-tolerant (FT) system is a software system that remains available and keeps working without failure even when faults occur in its elements. The basic elements of an FT system are units of mitigation, within which errors are mitigated.
FT systems are used, for example, in spacecraft software, where the possibility of human intervention is limited and the availability of some parts of the system is decisive for the survival of the device.
Malicious faults, fail-stop and fail-silent failures
A system or subsystem may express its failure more or less clearly. The worst situation is when the system stops working according to its specification, but this is not detected and no flag is raised. We then say that the system suffers from a malicious fault.
It is very important to immediately detect any fault that may cause a failure and to raise an error. When the error is visible to external observers, we call this situation fail-stop.
When the system detects an error but the error has no impact on its environment, the system fails silently. A fail-silent failure is one where the unit presents either correct results or no results at all. The unit does not raise an error and does not return incorrect results; it simply does not perform its service.
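The three behaviours can be told apart by what an external observer sees after a single call. The sketch below is only an illustration: the `Outcome` enum, the `classify` function and the four `double_*` units are hypothetical names, and "no result" is modelled as `None`.

```python
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    CORRECT = "correct result"
    FAIL_STOP = "error raised and visible to the observer"
    FAIL_SILENT = "no result and no error"
    MALICIOUS = "wrong result and nothing flagged"


def classify(unit: Callable[[int], Optional[int]], x: int, expected: int) -> Outcome:
    """Call a unit once and classify what an external observer sees."""
    try:
        result = unit(x)
    except Exception:
        return Outcome.FAIL_STOP      # the failure is announced
    if result is None:
        return Outcome.FAIL_SILENT    # the service was simply not performed
    if result == expected:
        return Outcome.CORRECT
    return Outcome.MALICIOUS          # wrong answer, no flag raised


# Hypothetical units that are all supposed to double their input.
def double_ok(x: int) -> Optional[int]:
    return 2 * x


def double_fail_stop(x: int) -> Optional[int]:
    raise RuntimeError("fault detected")


def double_fail_silent(x: int) -> Optional[int]:
    return None


def double_malicious(x: int) -> Optional[int]:
    return 2 * x + 1
```

Note that `classify` can only recognise the malicious case because it is given the expected value. A real observer usually has no such oracle, which is why this kind of fault is the worst one.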
What are units of mitigation?
A unit of mitigation is a part of a fault-tolerant system that can autonomously handle faults within its internals. It prevents its errors from propagating to the rest of the system and thus prevents the failure of the whole system.
A fault-tolerant system may consist of many units of mitigation, but in some cases the whole system may itself be one unit of mitigation.
Units of mitigation interact with each other through defined interfaces. When a unit fails, it starts an internal mitigation process that should be invisible to the rest of the system: the unit fails silently.
Sometimes the mitigation process fails and the unit cannot return to an operational state. Units of mitigation therefore require observers that supervise the mitigation process and take action when problems occur.
A unit of mitigation defines an error boundary that keeps errors from leaking outside. For information purposes, a unit of mitigation may generate a single indication that an error has occurred, but cooperating units should not use it to take any recovery actions.
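A minimal sketch of such an error boundary, with an observer watching it, might look as follows. Every name here (`UnitOfMitigation`, `Observer`, `double_positive`, the `recover` callback and the retry limit of three) is an assumption made for illustration, not something prescribed by the pattern.

```python
from typing import Callable, List, Optional


class UnitOfMitigation:
    """Wraps a service so that its internal errors never cross the boundary."""

    def __init__(self, service: Callable[[int], int],
                 recover: Callable[[], bool], max_attempts: int = 3) -> None:
        self._service = service
        self._recover = recover            # returns True when recovery worked
        self._max_attempts = max_attempts
        self.operational = True
        self.error_indications = 0         # informational only

    def handle(self, request: int) -> Optional[int]:
        """Return a correct result or None; exceptions stay inside the unit."""
        if not self.operational:
            return None
        try:
            return self._service(request)
        except Exception:
            self.error_indications += 1    # one indication per error
            self._mitigate()
            return None                    # fail silently towards the caller

    def _mitigate(self) -> None:
        for _ in range(self._max_attempts):
            if self._recover():
                return
        self.operational = False           # mitigation failed, observer must act


class Observer:
    """Supervises units and records those that could not recover."""

    def __init__(self) -> None:
        self.escalated: List[str] = []

    def check(self, name: str, unit: UnitOfMitigation) -> None:
        if not unit.operational and name not in self.escalated:
            self.escalated.append(name)


def double_positive(x: int) -> int:
    """Hypothetical service that has an internal error on negative input."""
    if x < 0:
        raise ValueError("negative input")
    return 2 * x
```

Callers only ever see a result or `None`. The `error_indications` counter exists for information, and only the observer reacts when a unit stays non-operational.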
It is very important to introduce units of mitigation
A system that is not divided into units of mitigation will have problems with availability. When an error occurs, either too much of the system becomes unavailable, or recovery attempts fail because the failure lies slightly outside the scope of responsibility of the failed unit.
How to choose a unit of mitigation?
The one thing every unit of mitigation absolutely must have is at least one method of recovering from its internal errors.
A unit of mitigation should be able to conduct self-checks to detect when it is not operating correctly.
Parts of the system at different levels may become units of mitigation. A subsystem with its own hardware and software may be a unit of mitigation, but so may a single software function.
Ideally, a unit of mitigation does not share CPU or memory with other units. In that case it is easy to isolate the unit and prevent its errors from spreading to other parts of the system.
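The two requirements above, a self-check and at least one recovery method, can be sketched on a very small unit. The `CounterUnit` class is hypothetical. It assumes that a redundant copy of the state is enough to detect corruption and that rolling back to a checkpoint is an acceptable recovery.

```python
class CounterUnit:
    """Hypothetical unit that keeps a running total with a redundant copy."""

    def __init__(self) -> None:
        self.total = 0
        self._shadow = 0        # redundant copy used only by the self-check
        self._checkpoint = 0    # last state known to be good

    def add(self, value: int) -> None:
        self.total += value
        self._shadow += value

    def self_check(self) -> bool:
        """The unit is healthy while both copies of the state agree."""
        return self.total == self._shadow

    def checkpoint(self) -> None:
        """Remember the current state, but only if it passes the self-check."""
        if self.self_check():
            self._checkpoint = self.total

    def recover(self) -> None:
        """Recovery method: roll back to the last checkpoint."""
        self.total = self._checkpoint
        self._shadow = self._checkpoint
```

Rolling back loses the updates made after the last checkpoint. Whether that is acceptable depends on the unit, so this is only one of many possible recovery methods.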
Asynchronous events and fail-safe
When an error occurs, a unit of mitigation becomes unavailable to the rest of the system, but the other elements of the system should not know that a problem has occurred. This can be achieved with asynchronous communication: units of mitigation communicate with each other through asynchronous events. When an error occurs in one of the units, it stops servicing incoming events and starts the mitigation process. The other elements of the system continue to work normally because they do not expect immediate responses to events. When the mitigation process finishes, the faulty unit returns to servicing events. Buffers are required so that events arriving at the unit during the mitigation process are not lost.
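The buffering idea can be sketched without threads by driving the unit step by step. The `BufferedUnit` and `FlakyDoubler` classes are hypothetical. The sketch assumes that the fault is transient, so the event that triggered mitigation can be retried afterwards.

```python
from collections import deque
from typing import Callable, Deque, List


class BufferedUnit:
    """Accepts events at any time, services them only when not mitigating."""

    def __init__(self, handler: Callable[[int], int],
                 recover: Callable[[], None]) -> None:
        self._handler = handler
        self._recover = recover
        self._inbox: Deque[int] = deque()   # buffer for incoming events
        self.mitigating = False
        self.processed: List[int] = []

    def post(self, event: int) -> None:
        """Fire and forget: the sender never waits for a response."""
        self._inbox.append(event)

    def pending(self) -> int:
        return len(self._inbox)

    def step(self) -> None:
        """Service one buffered event unless mitigation is in progress."""
        if self.mitigating or not self._inbox:
            return
        event = self._inbox[0]              # keep it buffered until it succeeds
        try:
            result = self._handler(event)
        except Exception:
            self.mitigating = True          # stop servicing, nothing is lost
            return
        self._inbox.popleft()
        self.processed.append(result)

    def mitigate(self) -> None:
        """Run the recovery method, then return to servicing events."""
        self._recover()
        self.mitigating = False


class FlakyDoubler:
    """Hypothetical service with a transient internal fault."""

    def __init__(self, broken: bool) -> None:
        self.broken = broken

    def handle(self, event: int) -> int:
        if self.broken:
            raise RuntimeError("internal fault")
        return event * 2

    def repair(self) -> None:
        self.broken = False
```

Senders only call `post`, so they behave the same whether or not the unit is mitigating. The events posted in the meantime are serviced in order once `mitigate` has finished.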
Similarity to small services architecture
Recently, small services architecture has become popular. This phenomenon is connected with the growth of virtualization technology. Software systems consist of a number of processes executed in virtual environments, in software clouds. The reliability of such systems is important. Cooperation between services looks similar to cooperation between units of mitigation. Indeed, small services architecture is a kind of implementation of a fault-tolerant system in which services are units of mitigation. Because of its popularity, many solutions have been invented for small services architecture, and these solutions can often be adopted more widely in general fault-tolerant systems. For example, the concept of a load balancer, which balances requests between redundant services, can be adapted to work with redundant units of mitigation that do not need to be serviced.
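As an illustration of the load balancer idea applied to redundant units of mitigation, here is a minimal round-robin sketch. The `Replica` and `RoundRobinBalancer` classes and the `available` flag are assumptions made for this example. A real balancer would also need health checks and concurrency control.

```python
from typing import List, Tuple


class Replica:
    """Hypothetical redundant unit of mitigation."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True   # False while the unit is mitigating an error

    def handle(self, request: int) -> Tuple[str, int]:
        return (self.name, request)


class RoundRobinBalancer:
    """Sends each request to the next available replica in turn."""

    def __init__(self, units: List[Replica]) -> None:
        self._units = list(units)
        self._next = 0

    def dispatch(self, request: int) -> Tuple[str, int]:
        count = len(self._units)
        for offset in range(count):
            index = (self._next + offset) % count
            unit = self._units[index]
            if unit.available:
                self._next = (index + 1) % count
                return unit.handle(request)
        raise RuntimeError("no redundant unit is available")
```

While one replica is busy with its mitigation process, requests simply go to the others, so the rest of the system does not notice the problem.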
- Patterns for Fault Tolerant Software. Wiley, 2007.