Recovery blocks: fault tolerance pattern

It is hard to build a fault-tolerant computer system without hardware redundancy, but some methods improve system reliability using software redundancy alone. One of them is “recovery blocks”.

Elements of Recovery Blocks

Recovery Block

Recovery blocks are a set of cooperating software elements that all perform the same task. Each element additionally has “Acceptance Test” code. The “Acceptance Test” evaluates the task’s result and reports whether the execution failed or not. A software version together with its “Acceptance Test” forms a “Recovery block”. A “Recovery block” takes an input value, performs the task, and returns an output value plus information about whether the software failed.

Recovery Block schema

It is better to use “The Either Pattern”, in which a block returns only one thing: either an Output value or an Error.

Recovery block with either pattern
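
A recovery block built around the Either pattern can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the names `Ok`, `Err`, `make_recovery_block`, `isqrt_via_float` and `isqrt_acceptance` are hypothetical, and two choices are assumptions of this sketch: the acceptance test also sees the input, and an exception raised by the software is treated as a failure.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Union


@dataclass(frozen=True)
class Ok:
    """Successful result of a recovery block (hypothetical name)."""
    value: Any


@dataclass(frozen=True)
class Err:
    """Failed result of a recovery block (hypothetical name)."""
    reason: str


Either = Union[Ok, Err]


def make_recovery_block(
    variant: Callable[[Any], Any],
    acceptance_test: Callable[[Any, Any], bool],
) -> Callable[[Any], Either]:
    """Pair one software version with its acceptance test."""

    def block(x: Any) -> Either:
        try:
            output = variant(x)
        except Exception as exc:
            # Assumption: a crash of the variant counts as a failed block.
            return Err(f"variant raised {exc!r}")
        if not acceptance_test(x, output):
            return Err(f"acceptance test rejected {output!r}")
        return Ok(output)

    return block


def isqrt_via_float(x: int) -> int:
    """Example software version: integer square root via floating point."""
    return int(math.sqrt(x))


def isqrt_acceptance(x: int, r: int) -> bool:
    """Example acceptance test: r is the integer square root of x."""
    return r >= 0 and r * r <= x < (r + 1) * (r + 1)


isqrt_block = make_recovery_block(isqrt_via_float, isqrt_acceptance)
good = isqrt_block(17)   # the acceptance test passes, so this is an Ok
bad = isqrt_block(-1)    # math.sqrt raises ValueError, so this is an Err
```

The caller never has to check a separate "failed" flag next to the output value; the type of the result already says which case it is.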

Selection Logic

An ordered set of recovery blocks has its outputs connected to the “Selection Logic”, which drives the execution of the blocks and selects the output of the whole system.

Recovery Blocks with Selection Logic

Selection logic takes the first block’s output and, if it is not an error, passes it on as the output of the whole system. When the block returns an error, selection logic asks the second block in the set for its output; if that output is correct it is passed on, otherwise the next block is asked for a result. If all blocks return an error, the whole system returns an error.
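
The procedure above can be sketched as a short loop. To keep the example small, a plain tuple stands in for the Either value: `(True, output)` on success and `(False, reason)` on failure. The names `select`, `primary` and `alternate` are hypothetical, and collecting the error messages of all failed blocks is an assumption of this sketch, not something the pattern requires.

```python
from typing import Any, Callable, Sequence, Tuple

# (True, output) on success, (False, error description) on failure.
Result = Tuple[bool, Any]
Block = Callable[[Any], Result]


def select(blocks: Sequence[Block], x: Any) -> Result:
    """Ask the blocks in order and return the first accepted output."""
    errors = []
    for index, block in enumerate(blocks):
        ok, payload = block(x)
        if ok:
            # Later blocks are not executed once a block succeeds.
            return True, payload
        errors.append(f"block {index}: {payload}")
    # Every block failed (or there were none): the whole system fails.
    return False, "; ".join(errors) or "no blocks configured"


def primary(x: int) -> Result:
    """Simulated faulty block: its acceptance test always fails."""
    return False, "acceptance test failed"


def alternate(x: int) -> Result:
    """Simulated healthy block."""
    return True, x * 2


result = select([primary, alternate], 21)    # falls back to the alternate
all_failed = select([primary, primary], 21)  # every block reports an error
```

Because the loop stops at the first success, the extra blocks cost time only when an earlier block has failed.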

Input values latch

There is one element left: the input values latch. Each block has to perform its computation within one cycle, using the same set of input values. This means that the input values must be held somewhere in the system and stay unchanged until Selection Logic passes on the output (value or error).

Full schema of recovery blocks
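
In software, one simple way to get the effect of the input values latch is to copy the inputs once at the start of the cycle and hand each block its own copy. The sketch below assumes the inputs are ordinary Python objects that `copy.deepcopy` can duplicate; `InputLatch`, `faulty_block` and `healthy_block` are hypothetical names used only for illustration.

```python
import copy
from typing import Any


class InputLatch:
    """Holds a private snapshot of the inputs for one computation cycle."""

    def __init__(self, inputs: Any) -> None:
        # Snapshot taken once; later changes to the source are not seen.
        self._held = copy.deepcopy(inputs)

    def read(self) -> Any:
        # Each reader gets its own copy, so a misbehaving block
        # cannot corrupt the values seen by the next block.
        return copy.deepcopy(self._held)


def faulty_block(data: dict) -> None:
    """Simulated fault: destroys its working copy of the inputs."""
    data["samples"].clear()


def healthy_block(data: dict) -> int:
    return sum(data["samples"])


source = {"samples": [1, 2, 3]}
latch = InputLatch(source)
source["samples"].append(99)         # the live input changes mid-cycle
faulty_block(latch.read())           # damages only its own copy
total = healthy_block(latch.read())  # still computed from 1, 2, 3
```

Copying on every read costs memory and time; for large inputs an immutable data structure would serve the same purpose.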

Summary of Recovery Blocks

“Recovery blocks” is an example of a fault tolerance pattern that utilizes time redundancy: additional time is spent on redundant computations. At first glance it seems that software redundancy is also present, but that is not necessarily so, because the blocks may all run the same version of the software with different configurations.

The pattern itself represents the basic idea of the solution, and it can be extended and modified slightly to use parallel computation of outputs, N-Version programming, or voting. I will explain these modifications in one of my next posts.

References

  • Mengfei Yang, Gengxin Hua, Yanjun Feng, Jian Gong (2017) Fault-Tolerance Techniques for Spacecraft Control Computers. Wiley.
  • Robert Hanmer (2007) Patterns for Fault Tolerant Software. Wiley.
