A fault-tolerant (FT) system is a software system that remains available and keeps working without failure even when faults occur in its elements. The basic elements of an FT system are units of mitigation, within which errors are mitigated.
FT systems are used in spacecraft software, where the possibility of human intervention is limited and the availability of some parts of the system is decisive for the survival of the craft.
Malicious faults, fail-stop and fail-silent failures
A system or subsystem may make its failure more or less apparent. The worst situation is when the system starts to work contrary to its specification, but this is not detected and no flag is raised. We then say that the system suffers from a malicious fault.
It is very important to detect immediately any fault that may cause a failure and to raise an error. When the error is visible to external observers, we call this situation fail-stop.
When the system detects an error but the error has no impact on its environment, the system fails silently. A fail-silent failure is one where the unit either presents correct results or presents no results at all. The unit does not raise an error and does not return incorrect results; it simply does not perform its service.
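The three behaviors can be contrasted in a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the faulty service `faulty_double`, its specification (return 2 * x), the `self_check` acceptance test, and the `UnitFailed` exception.

```python
from typing import Optional


class UnitFailed(Exception):
    """Visible error raised by a fail-stop unit (hypothetical name)."""


def faulty_double(x: int) -> int:
    """Hypothetical service whose specification says: return 2 * x."""
    return 2 * x + 1  # injected fault, for illustration only


def self_check(x: int, result: int) -> bool:
    """Acceptance test against the specification."""
    return result == x + x


def malicious(x: int) -> int:
    # No detection: the wrong result reaches the caller unnoticed.
    return faulty_double(x)


def fail_stop(x: int) -> int:
    # The fault is detected and made visible to external observers.
    result = faulty_double(x)
    if not self_check(x, result):
        raise UnitFailed("result violates the specification")
    return result


def fail_silent(x: int) -> Optional[int]:
    # The fault is detected, but neither an error nor a wrong result
    # leaves the unit: the service is simply not performed.
    result = faulty_double(x)
    if not self_check(x, result):
        return None
    return result
```

With the injected fault, `malicious(3)` returns a wrong value, `fail_stop(3)` raises `UnitFailed`, and `fail_silent(3)` returns nothing.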
What are units of mitigation?
A unit of mitigation is a part of a fault-tolerant system that can autonomously handle faults within its internals and prevent its errors from propagating to the rest of the system, thereby preventing failure of the whole system.
A fault-tolerant system may consist of many units of mitigation, but in some cases the whole system may itself be a single unit of mitigation.
Units of mitigation interact with each other through defined interfaces. When a unit fails, it starts an internal mitigation process that should be invisible to the rest of the system: the unit fails silently.
Sometimes the mitigation process itself fails and the unit cannot return to an operational state. Units of mitigation therefore require observers that supervise the mitigation process and take action when problems occur.
A unit of mitigation defines a boundary that keeps errors from leaking outside. For informational purposes a unit of mitigation may generate a single indication that an error has occurred, but cooperating units should not use it to take any recovery actions.
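One possible shape for such an observer is sketched below in Python. The names `Observer`, `try_recover`, `escalate` and the retry limit are assumptions made for this example; the text does not prescribe what action the observer takes, so escalation is just a callback here.

```python
from typing import Callable


class Observer:
    """Supervises the mitigation process of one unit (hypothetical sketch)."""

    def __init__(
        self,
        try_recover: Callable[[], bool],
        escalate: Callable[[], None],
        max_attempts: int = 3,
    ) -> None:
        self._try_recover = try_recover    # returns True once the unit is operational
        self._escalate = escalate          # action taken when mitigation keeps failing
        self._max_attempts = max_attempts  # assumed limit, chosen for illustration

    def supervise(self) -> bool:
        """Return True if the unit recovered, otherwise escalate and return False."""
        for _ in range(self._max_attempts):
            if self._try_recover():
                return True
        self._escalate()
        return False
```

In a real system the escalation could be, for example, a restart of the unit or a switch to a redundant one; that choice is outside this sketch.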
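The error boundary described here could look roughly like the following Python sketch. The classes `UnitOfMitigation` and `FlakyService` and the counter `error_indications` are hypothetical; the sketch assumes that any exception inside the unit counts as an internal error and that the supplied recovery routine is sufficient.

```python
from typing import Callable, Optional


class UnitOfMitigation:
    """Wraps a service so that its internal errors never cross the boundary."""

    def __init__(self, service: Callable[[int], int], recover: Callable[[], None]) -> None:
        self._service = service
        self._recover = recover
        self.error_indications = 0  # informational only, not a trigger for others

    def request(self, value: int) -> Optional[int]:
        try:
            return self._service(value)
        except Exception:
            # The error stays inside: record a single indication, mitigate,
            # and give no result for this request (fail-silent).
            self.error_indications += 1
            self._recover()
            return None


class FlakyService:
    """Hypothetical service that starts with corrupted internal state."""

    def __init__(self) -> None:
        self.corrupted = True

    def service(self, value: int) -> int:
        if self.corrupted:
            raise RuntimeError("internal state corrupted")
        return value + 1

    def recover(self) -> None:
        self.corrupted = False
```

The first request to a unit built from `FlakyService` yields no result and triggers recovery; later requests are served normally, and callers never see the `RuntimeError`.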
It is very important to introduce units of mitigation
A system that is not divided into units of mitigation will have problems with availability. When an error occurs, either too much of the system becomes unavailable, or recovery attempts fail because the failure lies slightly outside the scope of responsibility of the failed unit.
How to choose a unit of mitigation?
The one thing every unit of mitigation absolutely must have is at least one method of recovering from its internal errors.
A unit of mitigation should be able to conduct self-checks to detect when it is not operating correctly.
Parts of the system at different levels may become units of mitigation. A subsystem with its own hardware and software may be one, but so may a single software function.
Ideally, a unit of mitigation does not share CPU or memory with other units. It is then easy to isolate the unit and prevent its errors from spreading to other parts of the system.
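As one illustration of a self-check, the sketch below guards a unit's stored data with a CRC-32 checksum from Python's standard `zlib` module. The class `CheckedBuffer` is hypothetical, and a checksum is only one of many possible self-checks; the text does not prescribe a particular technique.

```python
import zlib


class CheckedBuffer:
    """Stores a payload together with a checksum used for self-checking."""

    def __init__(self, payload: bytes) -> None:
        self._payload = bytearray(payload)
        self._crc = zlib.crc32(self._payload)

    def write(self, payload: bytes) -> None:
        # Legitimate updates refresh the checksum together with the data.
        self._payload = bytearray(payload)
        self._crc = zlib.crc32(self._payload)

    def self_check(self) -> bool:
        """Return True while the stored data still matches its checksum."""
        return zlib.crc32(self._payload) == self._crc
```

A unit could run `self_check()` periodically or before serving each request and start its mitigation process as soon as the check fails.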
Asynchronous events and fail-silent behavior
When an error occurs, a unit of mitigation becomes unavailable to the rest of the system, but the other elements of the system should not know that the problem has occurred. This can be achieved with asynchronous communication: units of mitigation communicate with each other through asynchronous events. When an error occurs in one of the units, it stops servicing incoming events and starts its mitigation process. The other elements of the system continue to work normally, because they do not expect immediate responses to their events. When the mitigation process finishes, the faulty unit resumes servicing events. Buffers are required so that events arriving at the unit during the mitigation process are not lost.
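The buffering behavior can be sketched in a few lines of Python. The sketch is single-threaded and deterministic on purpose; the class `AsyncUnit` and its method names are assumptions made for this example, and a real system would more likely use threads, processes or a message broker.

```python
from collections import deque
from typing import Deque, List


class AsyncUnit:
    """Unit of mitigation that receives events through a buffer (sketch)."""

    def __init__(self) -> None:
        self.inbox: Deque[str] = deque()  # buffer for incoming events
        self.mitigating = False
        self.handled: List[str] = []

    def post(self, event: str) -> None:
        """Senders only enqueue the event and never wait for a reply."""
        self.inbox.append(event)

    def start_mitigation(self) -> None:
        self.mitigating = True

    def finish_mitigation(self) -> None:
        self.mitigating = False

    def run_once(self) -> None:
        """Service all buffered events unless the unit is mitigating."""
        if self.mitigating:
            return  # events stay in the buffer and are not lost
        while self.inbox:
            self.handled.append(self.inbox.popleft())
```

Events posted while `mitigating` is set remain in `inbox` and are serviced in arrival order on the first `run_once()` after `finish_mitigation()`.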
Similarity to microservice architecture
Recently, microservice (small-service) architecture has become popular. This is connected with the growth of virtualization technology: software systems consist of a number of processes executed in virtual environments, in software clouds. The reliability of such systems is important. Cooperation between services looks similar to cooperation between units of mitigation. Indeed, a microservice architecture is a kind of fault-tolerant system implementation in which the services are the units of mitigation. Because of its popularity, many solutions have been invented for it, and these solutions can often be adopted more widely in general fault-tolerant systems. For example, the concept of a load balancer, which distributes requests among redundant services, can be adapted to work with redundant units of mitigation that do not need to be services.
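A minimal Python sketch of that last idea follows. It assumes round-robin balancing over fail-silent units that signal unavailability by returning no reply; `RedundantUnit`, `LoadBalancer` and `dispatch` are hypothetical names, not taken from any library.

```python
from typing import List, Optional


class RedundantUnit:
    """Fail-silent unit: gives no reply while it is unavailable."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True

    def handle(self, request: str) -> Optional[str]:
        if not self.available:
            return None
        return f"{self.name}:{request}"


class LoadBalancer:
    """Round-robin balancer that skips units giving no reply."""

    def __init__(self, units: List[RedundantUnit]) -> None:
        self._units = list(units)
        self._next = 0

    def dispatch(self, request: str) -> Optional[str]:
        # Try each unit at most once, starting from the next one in turn.
        for _ in range(len(self._units)):
            unit = self._units[self._next]
            self._next = (self._next + 1) % len(self._units)
            reply = unit.handle(request)
            if reply is not None:
                return reply
        return None  # every redundant unit is currently unavailable
```

While one unit is busy with mitigation, requests are served by the remaining redundant units, so the caller does not notice the failure.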
- Patterns for Fault Tolerant Software. Wiley (2007).