

ENPM 808s
Information Systems Survivability:
10. Reliability in Perspective
- - - - - - - - - - - - - - - - - - -
Fault tolerance, other constructive uses of redundancy, robust mobile code, integration of security and reliability

Different degrees of fault tolerance: fault detection, fault avoidance, dynamic fault elimination, fault tolerance (e.g., fail-safe, fail-soft, fail-fast, fail-secure), fault recovery (forward or backward), self-checking components, ...

This lecture is in memory of my mentor, colleague, and friend, David Huffman, who died on October 7, 1999.

Fault Tolerance
- - - - - - - - - - - - - - - - - - -
There is a vast literature on techniques for fault tolerance that this report does not attempt to replicate; copious references are given in the arl-one report.

For example, it explores in general terms techniques for increasing system reliability in response to hardware faults and communications failures.

Once again demonstrating the desirability of a confluence of requirements, and of a corresponding confluence of techniques for combating security and reliability problems (along the lines of the reconvergence of availability requirements in the figure), consider the requirements for data integrity in the sense of no unintended change, shown at the nodes designated by a sharp (#) in the figure. Data integrity can be enhanced through cryptographic integrity checks (typically to protect against malicious alteration) or through error-correcting codes (typically to protect against accidental garbling).

However, an interesting recent special-purpose use of coding to detect malicious tampering as well as accidental errors on once-writable optical disks is given by Blaum et al. It takes advantage of the asymmetry inherent in certain once-writable storage media, in which writing can change the state of a bit in only one direction (e.g., from a not-previously-written zero to a written one, but never the reverse). This is another example of a cross-over implementation that can simultaneously address different sets of subrequirements stemming from otherwise independent-seeming major requirements. In such cases, considerable benefits can be obtained by recognizing the commonality among otherwise independent subrequirements and then providing a unified treatment in the design and implementation.
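
As a minimal illustrative sketch of these two families of mechanisms (Python; the key and payload below are made-up placeholders), a keyed cryptographic check guards primarily against malicious alteration, while a simple error-detecting code such as a CRC guards primarily against accidental garbling:

import hmac, hashlib, zlib

KEY = b"shared-secret-key"            # hypothetical key, distributed out of band
data = b"payload to be protected"     # hypothetical payload

# Cryptographic integrity check: detects malicious and accidental changes,
# as long as the attacker does not know the key.
tag = hmac.new(KEY, data, hashlib.sha256).digest()

# Error-detecting code (CRC-32): catches typical accidental garbling,
# but is trivially recomputed/forged by a malicious party.
crc = zlib.crc32(data)

def intact(received, received_tag, received_crc):
    ok_mac = hmac.compare_digest(
        hmac.new(KEY, received, hashlib.sha256).digest(), received_tag)
    ok_crc = zlib.crc32(received) == received_crc
    return ok_mac and ok_crc

print(intact(data, tag, crc))              # True
print(intact(b"tampered data", tag, crc))  # False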

 Error Correcting Codes

Hamming Distance represents bit changes

 0      1      2      3      4      5
 00000--00001--00011--00111--01111--11111

Minimum   Correction/
Distance  Detection
--------  -----------------------
   1      No detection
   2      SINGLE-error DETECTION
   3      SINGLE-error CORRECTION
            OR DOUBLE-error DETECTION
   4      SINGLE-error CORRECTION *AND*
          DOUBLE-error DETECTION, 
            OR TRIPLE-error DETECTION with
               NO error correction
   5      DOUBLE-error CORRECTION,
            OR SINGLE-error CORRECTION AND
               TRIPLE-error DETECTION,
            OR QUADRUPLE-error DETECTION
  etc.
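
The distance/capability table can be checked mechanically. A minimal Python sketch (function names chosen here for illustration) computes Hamming distance and enumerates the (correct t, detect e) combinations permitted by a minimum distance d, using the standard bound t + e + 1 <= d with e >= t:

def hamming_distance(a, b):
    """Number of bit positions in which two equal-length words differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def capabilities(d):
    """(correct t, detect e) pairs achievable with minimum distance d,
    using t + e + 1 <= d with e >= t (and e >= 1 to be interesting)."""
    return [(t, e) for t in range((d + 1) // 2)
                   for e in range(max(t, 1), d)
                   if t + e + 1 <= d]

print(hamming_distance("00000", "11111"))   # 5, as in the chain above
for d in range(1, 6):
    print(d, capabilities(d))
# d=2 -> [(0, 1)]                   single-error detection
# d=3 -> [(0, 1), (0, 2), (1, 1)]   double detection OR single correction
# d=4 -> includes (1, 2)            single correction AND double detection
# d=5 -> includes (2, 2), (1, 3)    double correction, or single corr. + triple det.
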
 Huffman's View of the Hamming Code
      ----------
     /          \              
    /            \             
   /            --\------            
  /      1     /   \     \     
 |            /  3  |     \    
 |     ------/--    |      \   
 |    /     /   \   |       \  
  \  /      |    \  /        | 
   \/   5   |  7  \/    2    | 
   /\       |     /\         | 
  |  \       \   /  |       /  
  |   --------\--   |      /   
  |            \  6 |     /    
   \      4     \  /     /     
    \            -/------
     \           /              
      \         /               
       ---------                        

Hamming single-error-correcting code.
Regions are bit positions ($n = 7$).
Circles are check sums ($r = 3$),
and thus $r$ check bits (bits 1, 2, 4).
Others ($k = n - r = 4$) are info bits (3, 5, 6, 7).
 

 Huffman's View of Hamming Code     
 (n, k, d) = (7, 4, 3)

Parity    Info        Error
 bits     bits        syndromes 
 1 2 4   3 5 6 7     ------------
 -----   -------     check:4 2 1
 0 0 0   0 0 0 0     ----\ 5 3 3
 1 1 1   0 0 0 1     bad | 6 6 5
 0 1 1   0 0 1 0     bit | 7 7 7
 1 0 0   0 0 1 1     -----------
 1 0 1   0 1 0 0     none| 0 0 0
 0 1 0   0 1 0 1       1 | 0 0 1
 1 1 0   0 1 1 0       2 | 0 1 0 
 0 0 1   0 1 1 1       3 | 0 1 1 
 1 1 0   1 0 0 0       4 | 1 0 0
 0 0 1   1 0 0 1       5 | 1 0 1
 1 0 1   1 0 1 0       6 | 1 1 0
 0 1 0   1 0 1 1       7 | 1 1 1
 0 1 1   1 1 0 0
 1 0 0   1 1 0 1
 0 0 0   1 1 1 0
 1 1 1   1 1 1 1
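
The codeword and syndrome tables above can be regenerated mechanically. A minimal Python sketch (illustrative, not a production decoder) encodes the four info bits, computes the syndrome as the sum of the failing checks so that it reads out the bad bit position in binary (check 4, check 2, check 1), and corrects that bit:

# Sketch: (7,4,3) Hamming code with check bits in positions 1, 2, 4.
# Each check bit covers the positions whose binary representation
# contains that check's power of two.

COVERS = {1: (1, 3, 5, 7), 2: (2, 3, 6, 7), 4: (4, 5, 6, 7)}

def encode(info):               # info = bits for positions 3, 5, 6, 7
    word = dict(zip((3, 5, 6, 7), info))
    for c, positions in COVERS.items():
        word[c] = sum(word.get(p, 0) for p in positions if p != c) % 2
    return [word[p] for p in range(1, 8)]       # positions 1..7

def decode(word):               # word = 7 received bits, positions 1..7
    bits = dict(zip(range(1, 8), word))
    syndrome = sum(c for c, ps in COVERS.items()
                   if sum(bits[p] for p in ps) % 2)   # bad position, 0 = none
    if syndrome:
        bits[syndrome] ^= 1                           # correct the single error
    return [bits[p] for p in (3, 5, 6, 7)], syndrome

codeword = encode([1, 0, 1, 1])      # info bits 3,5,6,7 = 1 0 1 1 (see table)
garbled = codeword[:]
garbled[6] ^= 1                      # flip bit position 7
print(decode(garbled))               # ([1, 0, 1, 1], 7): corrected, syndrome 7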

Huffman Graph Codes

     6        Edges are bit positions 
 :--------:     1,2,3,4,5,6 (n = 6)
 |\      /|   Redundant bits are
 | \1   / |     any max spanning tree
 |  \  /  |     such as 1,2,3 (r = 3)
5|   \/   |3  Information bits are the
 |   /\   |     others, 4,5,6 (k = n-r = 3)
 |  /  \  |   Check sums are any r INDEPENDENT 
 | /2   \ |     loops, (1,4,5), (2,5,6), (3,4,5,6)
 |/      \|   The minimum Hamming distance of the 
 :--------:     code is the MINIMUM LOOP length.
     4
    Code            Error Syndromes
--------------      ------------------
 1 2 3  4 5 6       check:   1  2  3
parity  info        ----\    4  5  4
bits    bits        bad |    5  6  5
--------------      bit |          6
 0 0 0  0 0 0       ----------------
 0 1 1  0 0 1       none|    0  0  0
 1 1 1  0 1 0        1  |    1  0  0
 1 0 0  0 1 1        2  |    0  1  0
 1 0 1  1 0 0        3  |    0  0  1
 1 1 0  1 0 1        4  |    1  0  1
 0 1 0  1 1 0        5  |    1  1  1
 0 0 1  1 1 1        6  |    0  1  1
                    ??? |    1  1  0
QUESTION: What error patterns could 
have produced the syndrome 110?
HINT: There is more than one set of
such patterns.
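
One way to answer (or to check an answer) is brute force. A minimal Python sketch, assuming the three check loops listed above, (1,4,5), (2,5,6), (3,4,5,6), enumerates the low-weight error patterns whose syndrome is 110:

from itertools import combinations

LOOPS = [(1, 4, 5), (2, 5, 6), (3, 4, 5, 6)]   # one check per independent loop

def syndrome(error_bits):
    """error_bits = positions (1..6) of the flipped bits."""
    return tuple(len(set(loop) & set(error_bits)) % 2 for loop in LOOPS)

# All error patterns of weight 1, 2, 3 whose syndrome is (1, 1, 0):
for weight in (1, 2, 3):
    hits = [e for e in combinations(range(1, 7), weight)
            if syndrome(e) == (1, 1, 0)]
    print(weight, hits)
# weight 1: []   -- no single-bit error gives 110, so 110 is uncorrectable
# weight 2: [(1, 2), (3, 5), (4, 6)], and there are weight-3 patterns as well
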
Other Types of Redundant Coding
- - - - - - - - - - - - - - - - - - -
Asymmetric errors (e.g., dropping ones only),

Correlated errors (e.g., bursts)

Arithmetic errors (e.g., in addition or multiplication)

Self-checking circuits (e.g., with syndromes for detection)

Some Other Uses of Redundancy
- - - - - - - - - - - - - - - - - - -
Alternative routing in networks

Alternative sources of replicated data, with various different forms of replication (synchronous write-through, asynchronous write-through, less recent versions, backup and restoration, etc.)

Alternative computation sites, with various forms of replication (identical replication as hot standby, or with real-time voting; nonidentical replication; degraded standby)

Majority Voting
- - - - - - - - - - - - - - - - - - -
In the SIFT architecture discussed in Lecture 8, significant increases in reliability (four orders of magnitude in mean time to failure) were achieved through majority voting. The cost of the replicated processors (and memories, buses, and power supplies; sevenfold replication) is small relative to the cost of the airplane.

However, if hardware is consistently flawed, or a software-implemented algorithm is inherently flawed, or the replicated code is buggy, then voting is worthless. Note that SIFT used formal methods to verify critical software modules.
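
The majority voter itself is conceptually tiny; the hazard lies in common-mode faults, not in the voter. A minimal Python sketch (the replica functions here are hypothetical stand-ins for redundant computations):

from collections import Counter

def vote(results):
    """Return the majority value among replica outputs, or raise if none."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among {results}")
    return value

def replica_ok(x):      return x * x
def replica_faulty(x):  return x * x + 1      # a fault in a single replica

print(vote([replica_ok(3), replica_ok(3), replica_faulty(3)]))   # 9: outvoted
# If all replicas run the same buggy algorithm (a common-mode fault),
# they agree on the same wrong answer and the voter happily accepts it.
print(vote([replica_faulty(3)] * 3))          # 10: voting is worthless here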

N-Version Programming
("Program Diversity")

- - - - - - - - - - - - - - - - - - -
Various experiments have been conducted with majority voting among separate programs all written to the same specification. This approach has some serious drawbacks:

Common-mode failures (Knight and Leveson, FTCS-16, 1986; Brilliant, Knight, and Leveson, IEEE TSE, November 1989 and February 1990; Shimeall and Leveson, IEEE TSE, February 1991)

One excellent programmer is likely to do better than three mediocre programmers.

Byzantine Faults
- - - - - - - - - - - - - - - - - - -
Byzantine fault mode: ARBITRARY behavior.

A simple example of clock synchronization: three clocks broadcast their times to one another, and each then averages the values it receives from the other two. A and B are correct clock sources (both read 1130). C is Byzantine, e.g., maliciously misleading the others by sending different values to different neighbors.

  A->B A->C B->A B->C C->A C->B  AVE
A           1130      1200       1145
B 1130                     1100  1115
C      1130      1130            1130

Note: clock-averaging algorithms are risky. The old 11-clock ARPAnet algorithm discarded the two most extreme values, but it still got into trouble occasionally.
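
The table can be reproduced in a few lines. A Python sketch, with times expressed in minutes so that 690 = 11:30 (an encoding chosen here for convenience) and message values taken from the table above:

# Sketch: three clocks, each averaging the values it receives from the
# other two.  C is Byzantine and sends different times to A and to B.
# 690 = 11:30, 720 = 12:00, 660 = 11:00.

sent = {                         # sent[src][dst] = value src tells dst
    "A": {"B": 690, "C": 690},
    "B": {"A": 690, "C": 690},
    "C": {"A": 720, "B": 660},   # two-faced: 12:00 to A, 11:00 to B
}

def received(dst):
    return [sent[src][dst] for src in sent if src != dst]

for clock in ("A", "B"):
    values = received(clock)
    avg = int(sum(values) / len(values))
    print(clock, values, f"-> {avg // 60:02d}:{avg % 60:02d}")
# A [690, 720] -> 11:45     B [690, 660] -> 11:15
# The two honest clocks, which agreed at 11:30, now differ by 30 minutes.
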
Byzantine Agreement
- - - - - - - - - - - - - - - - - - -
Simple example: four clocks (4 = 3k + 1, with k = 1)
(Lamport, Shostak, and Pease, ACM TOPLAS, July 1982, based on an earlier 1980 paper)

First round: Each processor broadcasts its own private values.

Second round: Each processor broadcasts all the private values it received in the first round.

Each processor then compares the copies it has received of each other processor's private value. If a majority of those copies agree, that value is accepted. You can work it out, or read the paper. In general, n = 3k + 1 processors require k + 1 rounds and tolerate up to k arbitrarily faulty (Byzantine) components. There are many subsequent papers on Byzantine agreement.
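
For n = 4 and k = 1 the two-round exchange can be simulated directly. A Python sketch (the traitor's two-faced behavior below is just one arbitrary choice of Byzantine misbehavior; names and values are illustrative):

# Sketch: oral-messages Byzantine agreement for n = 4, k = 1.
# Round 1: every processor sends its private value to the others.
# Round 2: every processor relays what it heard from each source.
# Decision: majority of {direct copy, two relayed copies} per source.
from collections import Counter

GOOD = {"P1": 10, "P2": 10, "P3": 10}     # honest processors and private values
TRAITOR = "P4"

def round1(src, dst):
    if src == TRAITOR:                     # Byzantine: different lies to different peers
        return 10 if dst == "P1" else 99
    return GOOD[src]

def round2(relayer, src, dst):
    """relayer tells dst what it heard from src in round 1."""
    if relayer == TRAITOR:
        return 99 if dst == "P2" else 10   # again two-faced, arbitrarily
    return round1(src, relayer)

def decide(me):
    nodes = list(GOOD) + [TRAITOR]
    decision = {}
    for src in nodes:
        if src == me:
            decision[src] = GOOD[me]
            continue
        copies = [round1(src, me)]                       # direct copy
        copies += [round2(r, src, me) for r in nodes if r not in (me, src)]
        decision[src] = Counter(copies).most_common(1)[0][0]   # 2-of-3 majority
    return decision

for p in GOOD:
    print(p, decide(p))
# All three honest processors produce identical vectors, agreeing on each
# honest source's true value -- despite P4 lying differently to each of them.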

Open Discussion Topics
- - - - - - - - - - - - - - - - - - -
We revisit the material on hierarchical layering of reliability and fault tolerance techniques (e.g., see Illustrative Reliability Threats, 61-63; Survivability Attributes at Different Layers, 74-75; Typical Defensive Measures, 97), and consider which of these techniques might be most useful. We also consider which techniques might have helped overcome some of the cases considered earlier.

Cases Discussed Previously
- - - - - - - - - - - - - - - - - - -
1980 ARPAnet collapse: error detection in memory; robust garbage-collection algorithm

Yorktown Divide-by-Zero: argument checking

Western power outage propagation: greater redundancy in resources and in control?

1997 Kansas City power outage: self-checking?

Black Hawk helicopter and many other EMI problems: shielding!

1990 AT&T collapse and 1998 AT&T packet-data frame-relay network collapse: better testing and configuration control on software upgrade?

Other cases

