Recovery blocks: fault tolerance pattern
It is hard to build a fault-tolerant computer system without hardware redundancy, but some methods improve system reliability using software redundancy alone. One of them is “recovery blocks”.
Elements of Recovery Blocks
Recovery blocks is a scheme in which several software elements that perform the same task cooperate. Each software version additionally has “Acceptance Test” code. The “Acceptance Test” evaluates the task's result and reports whether the software execution failed. A software version together with its “Acceptance Test” forms a “Recovery block”. A “Recovery block” takes an input value, performs the task, and returns an output value along with information on whether the software failed.
It is better to use “The Either Pattern”: return either the output value or an error, never both.
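A recovery block that follows the Either pattern could be sketched in Python roughly as below. The names `Ok`, `Err` and `make_recovery_block` are hypothetical and chosen only for illustration, and treating an exception raised by the software version as a failure is an assumption of this sketch, not something the pattern prescribes.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Union


@dataclass(frozen=True)
class Ok:
    """Carries an output that passed the acceptance test."""
    value: Any


@dataclass(frozen=True)
class Err:
    """Carries the reason why the block failed."""
    reason: str


Result = Union[Ok, Err]


def make_recovery_block(
    version: Callable[[Any], Any],
    acceptance_test: Callable[[Any, Any], bool],
) -> Callable[[Any], Result]:
    """Pair one software version with its acceptance test (hypothetical helper)."""

    def block(x: Any) -> Result:
        try:
            output = version(x)
        except Exception as exc:  # assumption: a crash counts as a failure
            return Err(f"version raised {exc!r}")
        if acceptance_test(x, output):
            return Ok(output)
        return Err("acceptance test rejected the output")

    return block


# Example block: square root, accepted only if squaring it gives back the input.
sqrt_block = make_recovery_block(
    math.sqrt,
    lambda x, out: abs(out * out - x) < 1e-9,
)
```

Calling `sqrt_block(9.0)` yields an `Ok` holding the result, while `sqrt_block(-1.0)` yields an `Err`, so the caller never sees a value and an error at the same time.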
The outputs of an ordered set of recovery blocks are connected to the “Selection Logic”, which drives the execution of the blocks and selects the output of the whole system.
The selection logic takes the first block's output and, if it is not an error, passes it on as the output of the whole system. When the block returns an error, the selection logic asks the second block in the set for its output; if that output is correct, it is passed on, otherwise the next block is asked for a result. If all blocks return an error, the whole system returns an error.
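The selection logic described above can be sketched as a simple loop over the ordered blocks. This is a minimal illustration under assumptions: `Ok`, `Err` and `select` are hypothetical names, and collecting the individual failure reasons into the final error is a design choice of the sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Union


@dataclass(frozen=True)
class Ok:
    value: Any


@dataclass(frozen=True)
class Err:
    reason: str


Block = Callable[[Any], Union[Ok, Err]]


def select(blocks: Sequence[Block], x: Any) -> Union[Ok, Err]:
    """Try the blocks in order and return the first non-error output."""
    reasons = []
    for index, block in enumerate(blocks):
        result = block(x)
        if isinstance(result, Ok):
            return result  # later blocks are never executed
        reasons.append(f"block {index}: {result.reason}")
    # Every block failed, so the whole system reports an error.
    return Err("all blocks failed: " + "; ".join(reasons))


# Hypothetical blocks: the primary always fails, the alternate succeeds.
def primary(x: int) -> Union[Ok, Err]:
    return Err("primary broken")


def alternate(x: int) -> Union[Ok, Err]:
    return Ok(x * 2)


system_output = select([primary, alternate], 21)
```

Because the loop returns on the first `Ok`, the redundant blocks cost time only when an earlier block has failed.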
Input values latch
There is one element left: the input values latch. Within one cycle, every block has to compute on the same set of input values. This means that the input values must be held somewhere in the system and stay unchanged until the Selection Logic passes on an output (value or error).
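In software, one possible way to get the latch behaviour is to snapshot the inputs once per cycle and hand every block its own copy. The sketch below assumes the inputs can be deep-copied; `InputLatch` and `run_cycle` are hypothetical names, and for brevity the blocks here return a plain `(ok, output)` tuple.

```python
import copy
from typing import Any, Callable, Sequence, Tuple

Block = Callable[[Any], Tuple[bool, Any]]


class InputLatch:
    """Holds a private snapshot of the inputs for one computation cycle."""

    def __init__(self, live_input: Any) -> None:
        # Copy once, so later changes to the live input are not visible.
        self._held = copy.deepcopy(live_input)

    def read(self) -> Any:
        # Each reader gets a fresh copy, so a block cannot corrupt
        # the values seen by the blocks that run after it.
        return copy.deepcopy(self._held)


def run_cycle(blocks: Sequence[Block], live_input: Any) -> Tuple[bool, Any]:
    """Run the blocks in order against the same latched input values."""
    latch = InputLatch(live_input)
    for block in blocks:
        ok, output = block(latch.read())
        if ok:
            return True, output
    return False, None


# Hypothetical blocks: the first one damages its input and fails.
def faulty_sum(samples: list) -> Tuple[bool, Any]:
    samples.clear()
    return False, None


def plain_sum(samples: list) -> Tuple[bool, Any]:
    return True, sum(samples)


readings = [1, 2, 3]
outcome = run_cycle([faulty_sum, plain_sum], readings)
```

Even though `faulty_sum` empties the list it receives, `plain_sum` still sees the original three readings, which is the property the latch is there to guarantee.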
Summary of Recovery Blocks
“Recovery blocks” is an example of a fault tolerance pattern that uses time redundancy: the additional time is spent on redundant computations. At first glance it seems that software redundancy is also present, but not necessarily, because each block may run the same version of the software with a different configuration.
The pattern itself represents the basic idea of the solution and can be extended and modified slightly to use parallel computation of outputs, N-version programming, or voting. I will explain these modifications in one of my next posts.
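To illustrate the point about configuration, here is a small sketch in which the same routine is simply re-run with a different setting when its first result is rejected. Everything in it is an assumption made for the example: the choice of Newton's method for square roots, the iteration budgets `(2, 50)`, the tolerance, and the restriction to non-negative inputs.

```python
from typing import Optional, Sequence


def newton_sqrt(x: float, iterations: int) -> float:
    """One software version; `iterations` is its configuration (x >= 0 assumed)."""
    guess = x if x > 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def acceptable(x: float, output: float, tolerance: float = 1e-6) -> bool:
    """Acceptance test: squaring the output must give back the input."""
    return abs(output * output - x) <= tolerance


def sqrt_with_recovery(x: float, budgets: Sequence[int] = (2, 50)) -> Optional[float]:
    """Re-run the same code with a larger budget if the result is rejected."""
    for iterations in budgets:
        output = newton_sqrt(x, iterations)
        if acceptable(x, output):
            return output
    return None  # every configuration failed its acceptance test


root = sqrt_with_recovery(2.0)
```

For an input of 2.0 the two-iteration run is too coarse to pass the acceptance test, so the result comes from the second, slower configuration: extra time is spent, but no second implementation was written.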
- Fault-Tolerance Techniques for Spacecraft Control Computers. Wiley, 2017.
- Patterns for Fault Tolerant Software. Wiley, 2007.