Updated April 11, 2023

Fault Tolerance

Introduction to Fault Tolerance

Fault tolerance in software is the ability of a system to recover by itself, or to continue normal operation, when a glitch or error occurs, while still delivering the full functionality specified in its requirements documentation. A fault may originate within the software itself, in other integrated systems, in a downstream application, or in external factors such as the system hardware or the network. Fault tolerance is one of the factors by which the quality of software is judged, so it is important that every software program includes it.

Fault Tolerance Techniques

A fault in a software system usually results from gaps left unnoticed during the design phase. On this basis, fault tolerance techniques are classified into two groups: the Single Version Technique and the Multi-Version Technique. Many techniques can be implemented under each category; a few that programmers often use are:

Single Version Technique
Software Structure and Actions
Error Detection
Exception Handling
Checkpoint and Restart
Process Pairs
Data Diversity
Multi-Version Technique
Recovery Blocks
N-Version Programming
N Self-Checking Programming
Consensus Recovery Blocks

1. Software Structure and Actions

When a software system is one single block of code, it is more vulnerable to failure, because one tiny error in the program can bring the whole system down. Hence, it is crucial that the software system is structured in a modular form, with the functionality divided among separate modules. Each module should hold specific instructions on how to handle its own failures and let the other modules run as usual, instead of passing the failure on from module to module.
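As a minimal sketch of this idea, the Python snippet below runs several modules independently and contains a failure inside the module where it happens. The module names (`billing`, `reporting`, `notifications`) and the `run_modules` helper are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict


def run_modules(modules: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run each module on its own so one failure does not stop the rest."""
    results = {}
    for name, module in modules.items():
        try:
            results[name] = module()
        except Exception as exc:
            # The failure is contained and recorded here, not passed on.
            results[name] = f"failed: {exc}"
    return results


def billing() -> str:  # hypothetical module
    return "ok"


def reporting() -> str:  # hypothetical module with a simulated fault
    raise RuntimeError("report store unavailable")


def notifications() -> str:  # hypothetical module
    return "ok"


results = run_modules(
    {"billing": billing, "reporting": reporting, "notifications": notifications}
)
print(results)
```

Here the fault in `reporting` is recorded, while `billing` and `notifications` still complete.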

2. Error Detection

Error Detection is a fault tolerance technique in which the program locates each occurrence of an error in the system. In practice it relies on two attributes, namely self-protection and self-checking. Self-protection lets a module spot errors that arrive from external modules, whereas self-checking lets a module spot errors within itself.
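The two attributes can be sketched in Python as follows. The `Ledger` class is a hypothetical component, not taken from any library: its input validation plays the role of self-protection, and its internal consistency check plays the role of self-checking.

```python
class Ledger:
    """Hypothetical component that uses both kinds of error detection."""

    def __init__(self):
        self.entries = []
        self.total = 0

    def add(self, amount):
        # Self-protection: reject bad data arriving from other modules.
        if not isinstance(amount, int) or isinstance(amount, bool):
            raise TypeError("amount must be an integer")
        self.entries.append(amount)
        self.total += amount
        self.self_check()
        return self.total

    def self_check(self):
        # Self-checking: the cached total must match the recorded entries.
        if self.total != sum(self.entries):
            raise RuntimeError("internal error: total does not match entries")


ledger = Ledger()
ledger.add(5)
ledger.add(7)

ledger.total = 99  # simulate corruption of the internal state
try:
    ledger.self_check()
    detected = None
except RuntimeError as exc:
    detected = str(exc)
print(detected)
```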

3. Exception Handling

Exception Handling is a technique that redirects the execution flow onto a recovery route whenever an error interrupts the normal functional flow. As a part of fault tolerance, it deals with three classes of exception: the Interface Exception, the Local Exception, and the Failure Exception.
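One common reading of the three classes is sketched below in Python: an interface exception reports an invalid request from the caller, a local exception is raised and handled inside the module, and a failure exception tells the caller that the module could not recover. The class names and the `parse_percentage` function are assumptions made for illustration.

```python
class InterfaceException(Exception):
    """The caller sent an invalid request to the module."""


class LocalException(Exception):
    """The module's own checks detected an internal error."""


class FailureException(Exception):
    """The module tried to recover and could not."""


def _parse(text):
    try:
        return float(text)
    except ValueError as exc:
        raise LocalException(f"cannot parse {text!r}") from exc


def parse_percentage(text):
    """Hypothetical module that converts text such as '42%' to a number."""
    if not isinstance(text, str):
        raise InterfaceException("request must be a string")
    try:
        return _parse(text)
    except LocalException:
        # Recovery route: retry on a cleaned copy of the input.
        try:
            return _parse(text.strip().rstrip("%"))
        except LocalException as exc:
            raise FailureException(f"could not recover from {text!r}") from exc


print(parse_percentage("42%"))
```

The call with `"42%"` first raises a local exception, which the module handles itself by cleaning the input and retrying.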

4. Checkpoint and Restart

This is one of the commonly used recovery methods for single version software systems. The Checkpoint and Restart technique suits events such as run-time exceptions, where a malfunction takes place during execution but leaves no record of the error once execution is complete. For such cases, the programmer places checkpoints in the program that save its state, and instructs the program to restart from the most recent checkpoint as soon as an error is detected.
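A minimal Python sketch of this technique follows. The `process_with_checkpoints` function, the checkpoint interval, and the `flaky_double` step with its simulated transient fault are all assumptions made for the example.

```python
import copy


def process_with_checkpoints(items, step, checkpoint_every=2, max_restarts=3):
    """Process items in order, restarting from the last checkpoint on error."""
    state = {"index": 0, "total": 0}
    checkpoint = copy.deepcopy(state)
    restarts = 0
    while state["index"] < len(items):
        try:
            state["total"] += step(items[state["index"]])
            state["index"] += 1
            if state["index"] % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)  # save a known-good state
        except RuntimeError:
            restarts += 1
            if restarts > max_restarts:
                raise
            state = copy.deepcopy(checkpoint)  # roll back and resume
    return state["total"], restarts


failures_left = {"count": 1}


def flaky_double(x):
    """Hypothetical step that fails once on the value 30."""
    if x == 30 and failures_left["count"] > 0:
        failures_left["count"] -= 1
        raise RuntimeError("transient fault")
    return 2 * x


total, restarts = process_with_checkpoints([10, 20, 30, 40], flaky_double)
print(total, restarts)  # prints "200 1"
```

The run resumes from the checkpoint taken after the second item, so the first two items are not processed again.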

5. Process Pairs

The Process Pair technique runs the same software on two different hardware units, a primary and a backup. It functions on top of the checkpoint and restart technique: the primary passes its checkpoint information to the backup, and when an error is detected in the primary, the backup takes over from the last checkpoint.
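The snippet below simulates a process pair inside a single Python program, purely as an illustration. The `Worker` class stands in for one hardware unit, and both it and `run_pair` are hypothetical names; a real process pair would run on separate processors and exchange checkpoints over a communication link.

```python
class Worker:
    """Hypothetical running-sum service standing in for one hardware unit."""

    def __init__(self, name, fail_on=None):
        self.name = name
        self.fail_on = fail_on  # value that triggers a simulated crash
        self.state = 0

    def handle(self, value):
        if value == self.fail_on:
            raise RuntimeError(f"{self.name} crashed")
        self.state += value
        return self.state

    def load_checkpoint(self, state):
        self.state = state


def run_pair(primary, backup, requests):
    active, standby = primary, backup
    served_by = []
    for value in requests:
        try:
            active.handle(value)
        except RuntimeError:
            if standby is None:
                raise  # no backup left to take over
            # Failover: the standby already holds the last checkpoint.
            active, standby = standby, None
            active.handle(value)
        served_by.append(active.name)
        if standby is not None:
            standby.load_checkpoint(active.state)  # send checkpoint to backup
    return active.state, served_by


final_state, served_by = run_pair(
    Worker("primary", fail_on=3), Worker("backup"), [1, 2, 3, 4]
)
print(final_state, served_by)
```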

6. Data Diversity

The Data Diversity technique re-expresses the input data in different but logically equivalent forms, runs the same software on them, and checks the results to detect a faulty output. The commonly used Data Diversity models are the ‘Input Data Re-Expression’ model, the ‘Input Data Re-Expression with Post-Execution Adjustment’ model, and the ‘Re-Expression via Decomposition and Recombination’ model.
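As a sketch of input data re-expression with post-execution adjustment, the Python example below computes an integer square root. The `faulty_isqrt` routine and its simulated fault are hypothetical. When the result for `n` fails the check, the input is re-expressed as `4 * n` and the output is halved, which gives the same mathematical answer.

```python
import math


def faulty_isqrt(n):
    """Hypothetical routine with a simulated fault for one specific input."""
    if n == 49:
        return 0  # simulated wrong answer
    return math.isqrt(n)


def acceptable(n, root):
    # A correct integer square root satisfies root^2 <= n < (root + 1)^2.
    return root * root <= n < (root + 1) * (root + 1)


def isqrt_with_data_diversity(n):
    root = faulty_isqrt(n)
    if acceptable(n, root):
        return root
    # Re-express the input as 4n, then adjust the output:
    # floor(sqrt(4n)) // 2 equals floor(sqrt(n)).
    root = faulty_isqrt(4 * n) // 2
    if acceptable(n, root):
        return root
    raise RuntimeError("all re-expressions failed")


print(isqrt_with_data_diversity(49))  # prints 7
```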

7. Recovery Blocks

The Recovery Block technique for multi-version fault tolerance builds on the checkpoint and restart method. A checkpoint is placed before the point where a fault may occur, and if the current version fails, the system restores the checkpoint and moves on to the next version to continue the flow. It is carried out in three areas, that is, the main module, the acceptance tests, and the swap (alternate) module.
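A minimal Python sketch of a recovery block follows. The `recovery_block` function and the two sorting versions are hypothetical; the main module carries a simulated fault so that the swap module has to take over.

```python
import copy


def recovery_block(state, versions, acceptance_test):
    """Try each version from the same checkpoint until one passes the test."""
    checkpoint = copy.deepcopy(state)
    for version in versions:
        candidate = copy.deepcopy(checkpoint)  # restore the checkpoint
        try:
            version(candidate)
        except Exception:
            continue  # treat a crash like a failed acceptance test
        if acceptance_test(checkpoint, candidate):
            return candidate
    raise RuntimeError("all versions failed the acceptance test")


def main_sort(state):  # hypothetical main module with a simulated fault
    state["data"].sort()
    state["data"].pop()  # loses an element


def swap_sort(state):  # hypothetical swap module
    state["data"] = sorted(state["data"])


def sorted_and_complete(original, candidate):
    data = candidate["data"]
    ordered = all(a <= b for a, b in zip(data, data[1:]))
    return ordered and len(data) == len(original["data"])


state = {"data": [3, 1, 2]}
result = recovery_block(state, [main_sort, swap_sort], sorted_and_complete)
print(result)
```

Because every version works on a copy of the checkpoint, the damage done by the main module never reaches the swap module.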

8. N-Version Programming

The N-Version Programming technique for multi-version fault tolerance is commonly used when several independently developed versions of the code are available. Recovery comes from executing all the versions and comparing their outputs, typically through a decision mechanism such as a majority vote, which plays the role that the acceptance test plays in recovery blocks.
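The comparison of outputs can be sketched in Python with a simple majority vote. The `n_version_execute` function and the three versions of an absolute-value routine are hypothetical, and the sketch assumes that the outputs are hashable so they can be counted.

```python
from collections import Counter


def n_version_execute(versions, value):
    """Run every version and return the output that a majority agrees on."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(value))
        except Exception:
            pass  # a crashed version casts no vote
    if not outputs:
        raise RuntimeError("no version produced an output")
    result, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among the versions")
    return result


def abs_v1(x):  # hypothetical version 1
    return abs(x)


def abs_v2(x):  # hypothetical version 2
    return x if x >= 0 else -x


def abs_v3(x):  # hypothetical version 3 with a simulated fault
    return x


print(n_version_execute([abs_v1, abs_v2, abs_v3], -5))  # prints 5
```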

9. N Self-Checking Programming

N Self-Checking Programming combines the Recovery Block and the N-Version Programming techniques, and it also calls for acceptance test execution. It is performed through the sequential and the parallel execution of various versions of the software.
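One possible arrangement is sketched below in Python, assuming that each version is paired with its own acceptance test to form a self-checking component. The versions run in parallel, and their results are then examined in priority order. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def n_self_checking(components, value):
    """components: (version, acceptance_test) pairs in priority order."""
    with ThreadPoolExecutor(max_workers=len(components)) as pool:
        futures = [pool.submit(version, value) for version, _ in components]
    # Leaving the with-block waits until every version has finished.
    for future, (_, test) in zip(futures, components):
        if future.exception() is not None:
            continue  # this version crashed
        result = future.result()
        if test(value, result):
            return result  # first component that passes its own check
    raise RuntimeError("no component passed its own check")


def faulty_max(values):  # hypothetical version with a simulated fault
    return values[0]


def builtin_max(values):  # hypothetical second version
    return max(values)


def is_max(values, result):
    return result in values and all(result >= v for v in values)


largest = n_self_checking([(faulty_max, is_max), (builtin_max, is_max)], [2, 9, 4])
print(largest)  # prints 9
```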

10. Consensus Recovery Blocks

This technique combines the Recovery Block and the N-Version Programming techniques with a decision algorithm to detect and recover from incorrect results in the system. Combining these fault tolerance techniques gives a much more consistent method of fault tolerance.
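A minimal Python sketch of the combination follows. It assumes that agreement between at least two versions counts as consensus, and that the acceptance test is applied to the individual outputs only when no consensus is reached. The function name and the example versions are hypothetical.

```python
import math
from collections import Counter


def consensus_recovery_block(versions, acceptance_test, value):
    outputs = []
    for version in versions:
        try:
            outputs.append(version(value))
        except Exception:
            continue  # a crashed version contributes no output
    # Stage 1: decision algorithm, as in N-version programming.
    if outputs:
        result, votes = Counter(outputs).most_common(1)[0]
        if votes >= 2:
            return result, "consensus"
    # Stage 2: fall back to a recovery block over the same outputs.
    for output in outputs:
        if acceptance_test(value, output):
            return output, "acceptance test"
    raise RuntimeError("no consensus and no acceptable output")


def is_integer_root(n, root):
    return root * root <= n < (root + 1) * (root + 1)


# Hypothetical versions of an integer square root; two carry simulated faults.
versions = [lambda n: n // 2, lambda n: n - 1, math.isqrt]

print(consensus_recovery_block(versions, is_integer_root, 16))
```

For the input 16 the three versions return 8, 15, and 4, so there is no consensus and the acceptance test selects 4.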

Advantages of Fault Tolerance

The following are a few advantages of applying Fault Tolerance techniques to software systems:

Ensures a successful system Recovery route
Achieves Functional Stability
Higher System Security due to clearly set recovery plans
Very little or no Redundancy
Allows Parallel processing and Multi-tasking
Cost-Effective, as faults are identified as soon as they occur
Higher Efficacy and system Performance
Data remains accessible
Overall, the software is observed to be Robust
Modularity lowers the dependency among the code units
Prevents the software system from collapsing

Conclusion

Fault Tolerance is a crucial part of the software development process, as it identifies the imperfections in the system and keeps them from turning into failures. A build that incorporates Fault Tolerance methods is far less likely to fail or break down. Hence it is considered an essential part of developing a reliable software system, especially now that technological growth is rapid and no one wishes to lose productivity to software errors.
