
5 Systemic Inadequacies and Other Deficiencies

ENPM 808s
Information Systems Survivability:
5. Systemic Inadequacies and Other Deficiencies

- - - - - - - - - - - - - - - - - - -
We consider a wide variety of deficiencies, particularly in light of some of the more instructive cases from the Illustrative Risks compendium.
Deficiencies in Existing Systems
- - - - - - - - - - - - - - - - - - -
Requirements are often wrong, designs are flawed, code is buggy, maintenance is flaky.

Commercial products are often inadequate. Certain components are missing that are necessary for survivable systems and networks.

Further research and prototype development are needed.

Systems tend to be overly dependent on fallible human behavior.

There are many other limitations as well.

COTS, Firewalls, etc.
- - - - - - - - - - - - - - - - - - -
Existing commercial products are seriously deficient with respect to the functionality, reliability, security, and survivability of operating systems, servers, backup systems, network protocols and their implementations, authentication, cryptographic protocols and their implementations, and real-time monitoring and analysis; the same is true of ease of use, interoperability, compatibility, and so on. These deficiencies are not well understood.

Firewalls and other boundary protection techniques tend to be permeable: e-mail, PostScript, HTML, and ActiveX content pass through.

Software Development
- - - - - - - - - - - - - - - - - - -
Software development practice is abysmal. The entire development cycle is riddled with problems.

Standards, evaluation criteria, and guidelines fall far short of what is needed.

System concepts are often not defined in advance.

Requirements tend to be poorly defined, incomplete.

Designs are typically incompletely specified and frequently inherently flawed.

Implementations are buggy astoundingly often, and frequently inconsistent with their designs.

Debugging and testing are typically sloppy, and inherently incomplete. (As Dijkstra observed, testing can demonstrate only the presence of bugs, never their absence.) Quality assurance is shunned.

Good existing research and development is very slow to appear in commercial systems.

Management is often oblivious to the problems of system development.

Routine maintenance and long-term system evolution are seldom planned in advance.

Risk management is seldom adequate.

Procurement practice is full of pitfalls.

Above all, people are inherently fallible.

The Patriot air-defense system. The design assumed that the launch platforms would be moved and the computers rebooted once a day; therefore, the slowly accumulating clock drift was considered consistent with the requirements. (A back-of-the-envelope drift calculation appears at the end of this slide.)

Pacemaker deaths due to electromagnetic interference

The Year-2000 Problem!

Whose shortsightedness do we blame for these?
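
A back-of-the-envelope sketch of the Patriot clock-drift problem mentioned above, using figures from the widely cited 1992 GAO analysis (taken here as assumptions, not as the actual Patriot code): time was counted in tenths of a second and scaled by a 24-bit truncation of 1/10, losing roughly 0.000000095 second on every tick.

  #include <stdio.h>

  /* Hypothetical illustration: accumulate the per-tick truncation error
   * over 100 hours of continuous operation without a reboot. */
  int main(void)
  {
      double error_per_tick = 0.000000095;   /* seconds lost per 0.1-s tick (GAO figure) */
      double hours_up       = 100.0;         /* continuous operation, no daily reboot */
      double ticks          = hours_up * 3600.0 * 10.0;

      printf("accumulated drift after %.0f hours: %.2f seconds\n",
             hours_up, error_per_tick * ticks);   /* roughly 0.34 seconds */
      return 0;
  }

A third of a second is ample time for an incoming missile to move far outside the tracking range gate; rebooting daily keeps the accumulated error negligible.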

Examples of Design Vulnerabilities
- - - - - - - - - - - - - - - - - - -
The ARPAnet collapse of 1980 (software design weakness, lack of redundancy in hardware, dropped bits in memory)

The C-Compiler Trojan Horse, Ken Thompson, 1984 (compiler is part of the trustworthiness perimeter)

The ARPAnet collapse of 1980

In the 1970s, the ARPAnet was a network linking primarily research computers, mostly within the United States, under the auspices of the Advanced Research Projects Agency (ARPA) of the Department of Defense (DoD). (It was the precursor of the Internet, which now links together many different computer networks worldwide.) On October 27, 1980, the ARPAnet experienced an unprecedented outage of approximately 4 hours, after years of almost flawless operation. This dramatic event is discussed in detail by Eric Rosen (Rosen 1981), and is summarized briefly here.

The collapse of the network resulted from an unforeseen interaction among three different problems: (1) a hardware failure resulted in bits being dropped in memory; (2) a redundant single-error-detecting code was used for transmission, but not for storage; and (3) the garbage-collection algorithm for removing old messages was not resistant to the simultaneous existence of one message with several different time stamps. This particular combination of circumstances had not arisen previously. In normal operation, each net node broadcasts a status message to each of its neighbors once per minute; 1 minute later, that message is then rebroadcast to the iterated neighbors, and so on. In the absence of bogus status messages, the garbage-collection algorithm is relatively sound. It keeps only the most recent of the status messages received from any given node, where recency is defined as the larger of two close-together 6-bit time stamps, modulo 64. Thus, for example, a node could delete any message that it had already received via a shorter path, or a message that it had originally sent that was routed back to it. For simplicity, 32 was considered a permissible difference, with the numerically larger time stamp being arbitrarily deemed the more recent in that case. In the situation that caused the collapse, the correct version of the time stamp was 44 [101100 in binary], whereas the bit-dropped versions had time stamps 40 [101000] and 8 [001000]. The garbage-collection algorithm noted that 44 was more recent than 40, which in turn was more recent than 8, which in turn was more recent than 44 (modulo 64). Thus, all three versions of that status message had to be kept.
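A minimal sketch in C, assuming a straightforward reading of the modulo-64 rule just described (this is not the actual IMP code): under the wraparound comparison, the three time stamps 44, 40, and 8 form a cycle in which each appears more recent than the next, so none of the three status messages can ever be discarded.

  #include <stdio.h>

  /* Is t1 "more recent" than t2, for 6-bit time stamps modulo 64?
   * A difference of 1..31 counts as more recent; a difference of exactly
   * 32 is resolved in favor of the numerically larger stamp. */
  static int more_recent(unsigned t1, unsigned t2)
  {
      unsigned diff = (t1 - t2) & 0x3F;     /* wraparound difference, 0..63 */
      if (diff == 0)  return 0;             /* identical stamps */
      if (diff == 32) return t1 > t2;       /* ambiguous case: larger value wins */
      return diff < 32;
  }

  int main(void)
  {
      printf("44 beats 40? %d\n", more_recent(44, 40));   /* 1 */
      printf("40 beats  8? %d\n", more_recent(40, 8));    /* 1 */
      printf(" 8 beats 44? %d\n", more_recent(8, 44));    /* 1: the fatal cycle */
      return 0;
  }
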

From then on, the normal generation and forwarding of status messages from the particular node were such that all of those messages and their successors with newer time stamps had to be kept, thereby saturating the memory of each node. In effect, this was a naturally propagating, globally contaminating effect. Ironically, the status messages had the highest priority, and thus defeated all efforts to maintain the network nodes remotely. Every node had to be shut down manually. Only after each site administrator reported back that the local nodes were down could the network be reconstituted; otherwise, the contaminating propagation would have begun anew.

This case is considered further in Section 4.1 of Computer-Related Risks, and further explanation of the use of parity checks for detecting any arbitrary single bit in error is deferred until Section 7.7 of the book.

The C-Compiler Trojan Horse

Ken Thompson's 1983 Turing Award lecture, published in 1984, described the now-classic Trojan horse: a modification to the object code of the C compiler such that, when the login program was next compiled, a trapdoor would be placed in the login object code. No changes were made either to the source code of the compiler or to the source code of the login routine. Furthermore, the Trojan horse was persistent, in that it would survive future recompilations. Thus, it might be called a stealth Trojan horse, because there were almost no visible signs of its existence, even after the login trapdoor had been enabled. This case was, in fact, perpetrated by the developers of Unix to demonstrate its feasibility and the potential power of a compiler Trojan horse, and not as a malicious attack. But it is an extremely important case because of what it demonstrated.
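
A toy, runnable sketch of the idea (in no way Thompson's actual mechanism; the pattern tests and messages below are invented for illustration): the compiler recognizes two particular inputs and treats them specially, while everything else compiles normally.

  #include <stdio.h>
  #include <string.h>

  static void compile(const char *source)
  {
      printf("compiling: %s\n", source);
      if (strstr(source, "login"))        /* recognizes the login program */
          printf("  + inserting trapdoor that accepts a master password\n");
      if (strstr(source, "c-compiler"))   /* recognizes the compiler itself */
          printf("  + re-inserting this Trojan-horse logic into the new compiler\n");
  }

  int main(void)
  {
      compile("source of the login program");     /* acquires the trapdoor */
      compile("source of the c-compiler itself"); /* perpetuates the Trojan horse */
      compile("source of any other program");     /* compiled cleanly */
      return 0;
  }

Because the second test re-plants both tests whenever the compiler is recompiled from clean sources, inspecting the source code reveals nothing.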

Master-password trapdoor, with a buffer overflow

The AT&T long-distance collapse of 1990

Master-password trapdoor
- - - - - - - - - - - - - - - - - - -
Master-password trapdoor, with a buffer overflow; Young and McHugh, pp. 115-116 of Computer-Related Risks

Young and McHugh describe an astounding flaw in the implementation of a password-checking algorithm that permitted a bogus, overly long "master password" to break into every system that used that login program, irrespective of what the real passwords actually were and despite the fact that the passwords were stored in encrypted form. This rather subtle flaw and its exploitation are depicted in the figure. It involved the absence of strong typing and bounds checking on the field in the data structure used to store the user-provided password (field b in the figure). As a result, typing an overly long bogus password overwrote the next field (field c in the figure), which supposedly contained the encrypted form of the expected password. By choosing the second half of the overly long sequence to be the encrypted form of the first half, the attacker was able to induce the system to believe that the bogus password was correct.

Internal Data Structure
= = = = = = = = = = = = = = =
a. User login name
b. User-typed PW
c. Stored encrypted PW
d. Encrypted form of typed PW

The Password-Checking Algorithm
= = = = = = = = = = = = = = = =
1. USER TYPES login name --> a
2. Stored encrypted PW --> c
3. USER TYPES PW --> b
4. Encrypt user-typed PW --> d
5. Compare c and d

  Step:    1.      3.       2.          4.
  -----------------------------------------
  | Store | Store | Stored    | Encrypted |
  | login | typed | encrypted | typed     |
  | name  | PW    | PW        | PW        |
  = = = = = = = = = = = = = = = = = = = = =
  |   a   |   b   |     c     |     d     |
  = = = = = = = = = = = = = = = = = = = = =
        Step 5 compares c and d.

The Master-Password Attack
= = = = = = = = = = = = = = = = =
   ENEMY CHOOSES a string --- "opqrst".
   ENEMY ENCRYPTS string, gets "uvwxyz".
1. ENEMY TYPES any legitimate user name,
     which is entered into field a.
2. The stored encrypted PW goes into c.
3. ENEMY TYPES PW "opqrstuvwxyz" ---
     which is entered into b (and c!).
4. Field b is encrypted and stored in d.
5. Surprise!  Fields c and d match.

The Perfect Match!
  ------------------------------------
  |  a   |   b    |   c    |   d     |
  ------------------------------------
1.| Name |        |        |         |
2.| Name |        | ghijkl |         |
3.| Name | opqrst | uvwxyz |         |
4.| Name | opqrst | uvwxyz | uvwxyz  |
5.                  uvwxyz = uvwxyz

Master-password flaw and attack (PW = password)
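
A minimal sketch in C of the kind of layout and unchecked copy described above (hypothetical field sizes and names; not the code Young and McHugh examined). The overly long "password" deliberately overruns field b and spills into field c; compilers with bounds-checking hardening enabled may trap the copy.

  #include <stdio.h>
  #include <string.h>

  struct login_rec {
      char name[8];         /* a. user login name            */
      char typed_pw[6];     /* b. user-typed PW              */
      char stored_enc[6];   /* c. stored encrypted PW        */
      char typed_enc[6];    /* d. encrypted form of typed PW */
  };

  int main(void)
  {
      struct login_rec r;
      strcpy(r.name, "victim");
      memcpy(r.stored_enc, "ghijkl", 6);     /* step 2: fetch the real encrypted PW */

      /* Step 3: no bounds check.  The 12-character bogus password overruns
       * field b; its second half "uvwxyz" lands in field c.  (Deliberately
       * out of bounds -- this is the flaw being illustrated.) */
      strcpy(r.typed_pw, "opqrstuvwxyz");

      /* Step 4 would encrypt "opqrst"; if that yields "uvwxyz", field c now
       * holds exactly the value that step 5 will compare against. */
      printf("field c now holds: %.6s\n", r.stored_enc);   /* prints uvwxyz */
      return 0;
  }
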

AT&T long-distance collapse of 1990
- - - - - - - - - - - - - - - - - - -
In mid-December 1989, AT&T installed new software in 114 electronic switching systems (Number 4 ESS), intending to reduce the overhead required in signaling between switches by eliminating a signal indicating that a node was ready to resume receiving traffic; instead, the other nodes were expected to recognize implicitly the readiness of the previously failed node, based on its resumption of activity. Unfortunately, there was an undetected latent flaw in the recovery-recognition software in every one of those switches.

On January 15, 1990, one of the switches experienced abnormal behavior; it signaled that it could not accept further traffic, went through its recovery cycle, and then resumed sending traffic. A second switch accepted the message from the first switch and attempted to reset itself. However, a second message arrived from the first switch that could not be processed properly, because of the flaw in the software. The second switch shut itself down, recovered, and resumed sending traffic. That resulted in the same problem propagating to the neighboring switches, and then iteratively and repeatedly to all 114 switches. The hitherto undetected problem manifested itself in subsequent simulations whenever a second message arrived within too short a time. AT&T finally was able to diagnose the problem and to eliminate it by reducing the messaging load of the network, after a 9-hour nationwide blockade. With the reduced load, the erratic behavior effectively went away by itself, although the software still had to be patched correctly to prevent a recurrence. Reportedly, approximately 5 million calls were blocked.

The ultimate cause of the problem was traced to a C program that contained a break statement within an if clause nested within a switch clause. This problem can be called a programming error, or a deficiency of the C language and its compiler, depending on your taste, in that the intervening if clause was in violation of expected programming practice.
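
A hedged sketch of the general C pitfall named above (not the actual 4 ESS code; the message types and names are invented): inside a switch, break exits the switch, so a break guarded by an intervening if silently skips the rest of the case.

  #include <stdio.h>

  static void handle(int msg_type, int second_msg_too_soon)
  {
      switch (msg_type) {
      case 1:                    /* status message from a recovering switch */
          if (second_msg_too_soon)
              break;             /* exits the SWITCH statement, not just the if --
                                    the processing below is silently skipped */
          printf("normal processing\n");
          /* ... update internal state, acknowledge, etc. ... */
          break;
      default:
          printf("other message type\n");
          break;
      }
  }

  int main(void)
  {
      handle(1, 0);   /* prints "normal processing" */
      handle(1, 1);   /* does nothing -- state is left inconsistent */
      return 0;
  }
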

Some Security Deficiencies
- - - - - - - - - - - - - - - - - - -
Weak operating systems, network protocols, compilers, administrative tools

Poor authentication, poor authorization, lack of layered defenses, few denial-of-service defenses, poor monitoring and real-time analysis

Poor crypto and crypto embeddings

Poor software-engineering practice, terrible human-interface design

Poor operational/admin practice

Human Screw-ups
- - - - - - - - - - - - - - - - - - -
Shuttle Challenger loss. The decision was made to launch in cold weather despite known risks of O-ring failure. Environmental problem. Specification lacked foresight?

Patriot inaccuracies. Rebooting the platform daily would have overcome the excessive clock drift. Requirement failure? Software bug? All of the above...

Aegis system's role in U.S.S. Vincennes' shootdown of an Iranian Airbus (human interface problems, limited information on screen). Requirement specification failure? User interface design? Operator panic? All of the above.

More Human Screw-ups
- - - - - - - - - - - - - - - - - - -
Numerous airplane disasters blamed on pilot error and air-traffic-controller error: KAL 007, NW 255, British Midland, Airbus A320s.

Numerous other transportation accidents: railroad crashes, Exxon Valdez on autopilot.

Blame is Often Widely Distributable
- - - - - - - - - - - - - - - - - - -
Therac-25: omitted hardware interlock, reliance on inadequate software, operational practice, developer intransigence, etc.

Vincennes' Aegis: archaic hardware and software, bad human interface, use of system outside of its intended operation, etc.

Y2K as a long-term pervasive problem: incredible short-sightedness, even now.

Computer security vulnerabilities: endemic lack of understanding, false economies, lack of management and government pressures, etc.

Discussion Topics: Inadequacies
- - - - - - - - - - - - - - - - - - -
What are the most serious inadequacies in systems today?

On what kind of systems might you be willing to trust your life? Are you confident you could do better than the norm?

What difficulties arise in trying to allocate blame?

What good and what harm might arise from attempting to allocate blame?

Apply this discussion to the Year-2000 problem!

What have you learned thus far, and what are your preliminary conclusions?

Reading for the Next Two Class Periods
- - - - - - - - - - - - - - - - - - -
Read Chapter 5 of arl-one on approaches for overcoming such deficiencies:
http://www.csl.sri.com/neumann/arl-one.html
