Recovery blocks: fault tolerance pattern

It is hard to build a fault-tolerant computer system without hardware redundancy, but some methods can improve system reliability with software redundancy alone. One of them is “recovery blocks”.

Elements of Recovery Blocks

Recovery Block

Recovery blocks are a set of cooperating software elements that all perform the same task. Each element additionally has “Acceptance Test” code, which evaluates the task’s result and reports whether the software execution failed. A software version together with its “Acceptance Test” forms a “Recovery block”. A “Recovery block” takes the input value, performs the task, and returns an output value plus information on whether the software failed.

Recovery Block schema
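As a rough illustration of the schema, a recovery block can be sketched as a function that wraps one software version and its acceptance test. The names `make_recovery_block`, `sqrt_version` and `sqrt_acceptance` are hypothetical, and treating an exception as a failure is an assumption of this sketch, not something the pattern prescribes.

```python
import math
from typing import Any, Callable, Tuple


def make_recovery_block(
    version: Callable[[Any], Any],
    acceptance_test: Callable[[Any, Any], bool],
) -> Callable[[Any], Tuple[Any, bool]]:
    """Pair one software version with its acceptance test."""

    def block(value: Any) -> Tuple[Any, bool]:
        try:
            output = version(value)
        except Exception:
            # Assumption of this sketch: a crash counts as a failed execution.
            return None, True
        failed = not acceptance_test(value, output)
        return output, failed

    return block


# Hypothetical task: compute a square root.
def sqrt_version(x: float) -> float:
    return math.sqrt(x)


# The acceptance test checks the result without repeating the computation.
def sqrt_acceptance(x: float, y: float) -> bool:
    return y >= 0 and math.isclose(y * y, x, rel_tol=1e-9)


sqrt_block = make_recovery_block(sqrt_version, sqrt_acceptance)
buggy_block = make_recovery_block(lambda x: x / 2, sqrt_acceptance)

print(sqrt_block(9.0))   # (3.0, False)
print(buggy_block(9.0))  # (4.5, True): output is present but flagged as failed
```

Note that the caller receives an output even when the failure flag is set, so it must remember to check the flag before using the value.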

It is better to use “The Either Pattern”, in which a block returns either an output value or an error, never both:

Recovery block with either pattern
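One possible way to express this in code is with two small result types, so that a caller cannot read an output value from a failed block. `Ok`, `Err` and `either_block` are hypothetical names chosen for this sketch, and the halving task is only a placeholder.

```python
from dataclasses import dataclass
from typing import Any, Callable, Generic, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err:
    reason: str


Either = Union[Ok[T], Err]


def either_block(
    version: Callable[[Any], Any],
    acceptance_test: Callable[[Any, Any], bool],
) -> Callable[[Any], Any]:
    """Build a recovery block that returns Ok(output) or Err(reason)."""

    def block(value: Any) -> Any:
        try:
            output = version(value)
        except Exception as exc:
            return Err(f"version raised {type(exc).__name__}")
        if not acceptance_test(value, output):
            return Err("acceptance test rejected the output")
        return Ok(output)

    return block


# Placeholder task: halve an integer; the test accepts only exact halves.
halve = either_block(lambda x: x // 2, lambda x, y: y * 2 == x)

print(halve(10))  # Ok(value=5)
print(halve(7))   # Err(reason='acceptance test rejected the output')
```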

Selection Logic

The outputs of an ordered set of recovery blocks are connected to the “Selection Logic”, which drives the execution of the blocks and selects the output of the whole system.

Recovery Blocks with Selection Logic

The selection logic takes the first block’s output and, if it is not an error, passes it on as the output of the whole system. If the block returns an error, the selection logic asks the second block in the set for its output; if that output is correct it is passed on, otherwise the next block is asked for a result. If all blocks return an error, the whole system returns an error.
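The selection procedure described above can be sketched as a simple loop over the ordered blocks. In this sketch a block signals failure by returning a hypothetical `Err` object and anything else counts as a valid output; `select`, `strict_parse` and `lenient_parse` are illustrative names only.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass(frozen=True)
class Err:
    reason: str


def select(blocks: Sequence[Callable[[Any], Any]], value: Any) -> Any:
    """Ask each block in order; return the first non-error output."""
    reasons: List[str] = []
    for index, block in enumerate(blocks):
        result = block(value)
        if not isinstance(result, Err):
            return result  # first accepted output becomes the system output
        reasons.append(f"block {index}: {result.reason}")
    # Every block failed, so the whole system reports an error.
    return Err("all blocks failed (" + "; ".join(reasons) + ")")


# Two hypothetical blocks for the same task: parse an integer from text.
def strict_parse(text: str) -> Any:
    return int(text) if text.isdigit() else Err("not plain digits")


def lenient_parse(text: str) -> Any:
    cleaned = text.strip().replace(",", "")
    return int(cleaned) if cleaned.isdigit() else Err("no digits after cleanup")


blocks = [strict_parse, lenient_parse]
print(select(blocks, "42"))       # 42, from the first block
print(select(blocks, " 1,000 "))  # 1000, from the second block
print(select(blocks, "abc"))      # an Err listing both failure reasons
```

Later blocks run only when earlier ones fail, so in the fault-free case the cost is that of a single block plus its acceptance test.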

Input values latch

One element is left: the input values latch. Each block has to perform its computation within one cycle on the same set of input values. This means the input values must be held somewhere in the system and stay unchanged until the Selection Logic passes on an output (value or error).

Full schema of recovery blocks
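In software, the latch can be approximated by taking a private snapshot of the inputs at the start of a cycle and handing every block its own copy. This is only a sketch: `InputLatch` and `run_cycle` are hypothetical names, deep copying is one of several ways to keep the inputs unchanged, and `None` stands in for an error to keep the example short.

```python
import copy
from typing import Any, Callable, Sequence


class InputLatch:
    """Holds a private snapshot of the inputs for one computation cycle."""

    def __init__(self, value: Any) -> None:
        self._snapshot = copy.deepcopy(value)

    def read(self) -> Any:
        # Each reader gets its own copy, so no block can corrupt the latch.
        return copy.deepcopy(self._snapshot)


def run_cycle(blocks: Sequence[Callable[[Any], Any]], live_input: Any) -> Any:
    """Run the blocks in order against the same latched input."""
    latch = InputLatch(live_input)
    for block in blocks:
        result = block(latch.read())
        if result is not None:  # None means "error" in this sketch
            return result
    return None


def destructive_block(samples: list) -> Any:
    samples.sort()  # mutates its argument...
    return None     # ...and then reports a failure


def first_sample_block(samples: list) -> Any:
    return samples[0]


readings = [3, 1, 2]
result = run_cycle([destructive_block, first_sample_block], readings)
print(result)    # 3: the second block still sees the original order
print(readings)  # [3, 1, 2]
```

Without the latch, the second block would see the list already sorted by the failed first block and return 1 instead of 3.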

Summary of Recovery Blocks

“Recovery blocks” is an example of a fault tolerance pattern that utilizes time redundancy: additional time is spent on redundant computations. At first glance, software redundancy seems to be present as well, but that is not necessarily so, because every block may run the same version of the software with a different configuration.
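To make the “same software, different configuration” case concrete, here is a minimal sketch in which every block runs the same routine and only the configuration changes. The Newton square-root routine, the iteration counts and the names `newton_sqrt`, `accept` and `run_with_configs` are assumptions made for illustration, and the input is assumed to be non-negative.

```python
from typing import Optional, Sequence, Tuple


def newton_sqrt(x: float, iterations: int) -> float:
    """The single software version; `iterations` is its configuration."""
    guess = x if x > 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def accept(x: float, y: float) -> bool:
    """Acceptance test: squaring the result must give back the input."""
    return abs(y * y - x) <= 1e-6 * max(x, 1.0)


def run_with_configs(
    configs: Sequence[int], x: float
) -> Optional[Tuple[float, int]]:
    """Retry the same routine with each configuration until one is accepted."""
    for iterations in configs:
        y = newton_sqrt(x, iterations)
        if accept(x, y):
            return y, iterations
    return None  # every configuration failed


outcome = run_with_configs([3, 20], 4.0)
# The 3-iteration run is rejected by the acceptance test, so extra time is
# spent on a second run with 20 iterations, which is accepted.
print(outcome)
```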

The pattern itself represents the basic idea of the solution and can be extended and modified slightly to use parallel computation of outputs, N-version programming, or voting. I will explain these modifications in one of my next posts.

References

  • Mengfei Yang, Gengxin Hua, Yanjun Feng, Jian Gong (2017) Fault-Tolerance Techniques for Spacecraft Control Computers. Wiley.
  • Robert Hanmer (2007) Patterns for Fault Tolerant Software. Wiley.
