Practical Architectures for
Survivable Systems and Networks
(Phase-Two Final Report)
30 June 2000
©Copyright 2000 SRI International,
and freely available for noncommercial reuse
Peter G. Neumann
Computer Science Laboratory
SRI International, Room EL-243
333 Ravenswood Avenue
Menlo Park CA 94025-3493
Telephone: 1-650-859-2375
Fax: 1-650-859-2844
E-mail: Neumann@CSL.sri.com
http://www.csl.sri.com/neumann
Acknowledgment of Support and Disclaimer: This report is based upon work
supported by the U.S. Army Research Laboratory (ARL), under Contract
DAKF11-97-C-0020. Any opinions, findings, conclusions, or recommendations
expressed herein are those of the author and do not necessarily reflect the
views of the U.S. Army Research Laboratory. The Government contact is
Anthony Barnes (BarnesA@arl.mil), 1-732-427-5099.
[NOTE: This report represents a somewhat personal view of some
potentially effective approaches toward developing, configuring, and
operating highly survivable system and network environments.
It is accessible on-line in three forms, the first two for printing,
the third for Web browsing in nicely crosslinked html:
http://www.csl.sri.com/neumann/survivability.ps
http://www.csl.sri.com/neumann/survivability.pdf
http://www.csl.sri.com/neumann/survivability.html
Constructive feedback is always welcome. Many thanks. PGN]
Abstract: This report summarizes the analysis of information system
survivability. It considers how survivability relates to other requirements
such as security, reliability, and performance. It considers a hierarchical
layering of requirements, as well as interdependencies among those
requirements. It identifies inadequacies in existing commercial systems and
the absence of components that hinder the attainment of survivability. It
recommends specific architectural structures and other approaches that can
help overcome those inadequacies, including research and development
directions for the future. It also stresses the importance of system
operations, education, and awareness as part of a balanced approach toward
attaining survivability.
The field of endeavor addressed in this report is inherently open ended.
New research results and new software components are appearing at a rapid
pace. For this reason, the report stresses fundamentals, and is intended to
be a guide to certain principles and architectural directions whose
systematic use can lead to systems that are meaningfully survivable. In
that spirit, the report is intended to serve as a coherent resource from
which many further resources can be gleaned by following the cited
references and URLs.
The report is relatively modest in its intent. It does not try to solve all
the problems of how to design, implement, administer, maintain, and use
highly survivable systems and networks. Those problems require
future research and greater discipline in system development and
operations. Nevertheless, the report represents a substantive starting
point.
The document can be useful to developers of systems with critical
requirements. It can also be useful to anyone wanting to
teach or learn the basics of system and network survivability. The Army
Research Laboratory and the Software Engineering Institute have sponsored
workshops on Information Survivability (InfoSurv). In part as a result of
Paul Walczak's efforts at ARL relating to this project, several universities
(Maryland, Pennsylvania, Tennessee-Knoxville, Georgia Tech) have had courses
using the contents of our interim first-phase report (January 1999).
Appendix A characterizes some of the curriculum issues relating to
survivability. We have intentionally not tried to spell out specific course
materials lecture by lecture, but rather have tried to provide basic
directions that such courses might address.
Printable versions of this document contain URLs for many relevant Web
resources. The browsable html version may be preferable for Web users,
because it contains hot links to those resources.
Systems and networks with critical survivability requirements are extremely
difficult to specify, develop, procure, operate, and maintain. They tend to
be subject to many threats, laden with risks, and difficult to use wisely.
The term systems here includes operating systems, dedicated application systems,
systems of systems, and networks viewed as systems.
We begin with several observations.
- Commercially available mass-market software systems tend to be very
poor with respect to security and reliability. They are even worse with
respect to overall system and network survivability, which depends
critically on security and reliability. Software components are often
incompatible with one another, even when obtained from the same developer.
Interoperability and reusability are much less than what should reasonably
be expected. Compatibility with legacy systems is driving many systems into
their lowest common denominators. Proprietary systems and proprietary
interface standards generally make integration with other system types very
difficult. They also make open analysis of those systems very difficult
(promoting security by obscurity
rather than security by in-depth analysis). Long-term compatible
evolvability is a serious problem. The reasons for this overall situation are
widespread, and the blame must be shared among developers, governments, user
communities, inadequate procurement processes, and the lack of market
incentives (for example).
- The U.S. Government and the defense establishment have become
increasingly dependent on commercially available systems -- with all their
warts, blemishes, and fundamental shortcomings. Unfortunately, many of
these systems are not advancing fast enough to meet the needs of critical
applications. A possible alternative for critical applications is gaining
some credibility: disciplined efforts to make certain
source-available software components substantially more robust, and to
generate new components -- particularly where proprietary off-the-shelf
products are not adequate. However, that approach has its own
potential risks, which deserve to be studied and overcome where possible.
- System development practice is in general abysmal. (Recent examples
of fiascos relating to U.S. Government system developments include
the cancellations of the FAA Air Route Traffic Control System contract,
the IRS Tax Systems Modernization effort, and the first FBI NCIC-2000
fingerprint system, representing the waste of billions of dollars.) A
representative example of the very bad state of practice and the great
difficulties inherent in trying to advance the state of commercial systems
is given by the mere existence and pervasiveness of the Year-2000 problem,
with the resulting enormous costs to fix billions of lines of
code in the presence of lingering doubts as to whether those attempts
would be successful up to and even after the actual event. The Y2K
problem is just one more example of short-sighted system development
practice, rather than a unique problem unto itself, and is realistically
less complex than the overall security/survivability problems.
- The range of potential threats to survivability that must be
considered is enormous, including hardware malfunctions, software flaws,
environmental hazards, and malicious and accidental human acts. The level
of awareness of threats, vulnerabilities, and risks is generally deplorable.
Offensive information warfare is becoming recognized as a serious potential
threat to survivability, but attempts to develop corresponding defenses are
lagging badly. Defensive information warfare seems to be a misnomer for
what should otherwise be considered pervasively as sensible standard
practice: if we were to develop and use systems and networks that are
meaningfully survivable, secure, reliable, and so on, it would dramatically
decrease the threats of offensive information warfare against us. Many
organizations (perhaps most visibly, the Pentagon) have been deluged with
attacks on their Web sites and computer-communication infrastructures,
including penetrations, denials of service, and Trojan horses such as
Melissa and ILOVEYOU that propagate by people reading e-mail. And yet
very few substantive actions are being taken to improve the information
infrastructures technologically.
- The 1997 report of the President's Commission on Critical
Infrastructure Protection (PCCIP) touches on the
tip of an enormous iceberg. It observes that the survivability and
integrity of the critical national infrastructures (such as
telecommunications, power, energy, transportation, and financial services)
are very much at risk, and furthermore that these national infrastructures
are highly interdependent. Whereas the report recognizes
that the critical national infrastructures depend critically on computers
and communications, its recommendations touch only lightly on what might
be done to strengthen the underlying computer-communication
infrastructures. (Although the Commission's initial focus in
telecommunications was narrowly aimed at the public switched network, the
Commission's members did ultimately attempt to broaden their concerns to
include the Internet and computer networking.) The PCCIP recommendations
are important and must be considered very carefully, although there is a
tendency toward increasing bureaucracy. (See Section 5.19 for
more on the Commission and subsequent actions, and Section 9.7
for how this report addresses the problems identified by the PCCIP
report.)
- Although significant research and prototype-development
efforts could help minimize many of the existing problems and many new
problems that need to be addressed, valuable R&D advances related to
security, reliability, and survivability have in general been exceedingly
slow in finding their way into practice. That must change.
The above observations motivate a simple statement of the goals of our
project and of this report. To surmount these realities, we seek to
- Make more explicit the requirements for survivability and its
necessary subtended properties such as security, reliability, and
performance, and characterize the interactions among the different
subrequirements
- Identify functionality whose absence currently prevents adequate
satisfaction of those requirements and recommend the development of specific
infrastructural components that are currently missing or not commercially
available
- Explore techniques for designing and developing highly survivable systems
and networks, despite the presence of untrustworthy subsystems and
untrustworthy people -- where untrustworthiness may encompass the lack of
reliability, integrity, and correctness of behavior on the part of systems
and people
- Recommend specific architectural structures and
structural architectures that can lead to survivable systems and networks
capable of either preventing or tolerating a wide range of threats
- Explore operational principles that can enhance survivability
- Recommend directions for the future, including research and
development
It is absolutely essential to realize that there are no easy answers for
achieving survivable systems and networks. This report does not pretend to
be a cookbook. Cookbook approaches are doomed to fail, because of the
intrinsic multidimensionality of the survivability problems, the inadequacies
of the existing infrastructures, the fact that the underpinnings are
continually in flux, and the fact that no one solution or small set of
solutions fits all applications. We cannot merely follow tried-and-true
recipes, because no foolproof recipes exist. For these reasons, we
emphasize here the need for in-depth understanding of the basic issues, the
recognition of and pervasive adherence to sensible principles, the fundamental
importance of insights gleaned from past experience, and the urgency of
pursuing significant R&D approaches and incorporating them into practical
systems. Thus, we include many references to primary literature sources,
with the hopes that diligent readers will pursue them. The successful
integration of the best of these concepts is absolutely fundamental to the
development, procurement, and use of systems and networks that can fulfill
requirements for high survivability.
To satisfy the goals stated above, we take a strongly system-oriented
approach. Survivability of systems and networks is not an intrinsic
low-level property of subsystems in the small. Instead, it is an emergent property --
that is, a property that has meaning primarily in the overall context to
which it relates. Emergent properties can be defined in terms of the
concepts of their own layers of abstraction, but generally not completely in
terms of individual components at lower layers. That is, an emergent
property is a property that arises as a result of the composition of
lower-layer components and that is not otherwise evident. Emergent
properties may be positive (such as human safety and system survivability)
or negative (such as unforeseen interactions among components -- for
example, covert channels that exist only when
components are combined). Simply composing a system or network out of its
components provides no certainty whatever that the resulting whole will work
as desired, even if the components themselves seem to behave properly. One
of the most important challenges confronting us is to be able to derive the
emergent properties of a system in the large from the properties of its
components and from the manner in which they are integrated.
There is an important body of work devoted to dependable systems (especially
in Europe) and to high-assurance systems (especially in the U.S.). These
are really aspects of the same thing. A system should be capable of
satisfying its requirements, dependably and with appropriate assurance,
whatever those requirements are. Survivability is an overarching
requirement that implies security, reliability, adequate performance, and
many other subrequirements.
The following recommendations are ordered roughly according to how they
appear in the development and operational cycles. Their relative importance
is considered at the end of the enumeration.
1. We must establish generic mission models
that can be readily tailored to specific systems, and develop
processes whereby those models can be used in evaluating the
adequacy of requirements.
2. We must establish fundamental requirements for survivability and
its subtended properties that can be directly applied to system
developments and procurements, sufficiently detailed but not
overly constraining.
3. We must define families of system and network architectures that
are inherently robust, and demonstrate the implementability of those
architectures.
4. We must develop new network and distributed system protocols
appropriate for the development of highly survivable, secure, and
reliable information infrastructures.
5. We must design and implement open-system architectural components
that are essential for robust architectures but not yet readily
available in the marketplace, which when composed together can satisfy
strong requirements for survivability and interoperability.
6. We must establish a library of demonstrably sound procedures
that enable trustworthy systems to be built out of less trustworthy
components. This is the concept of generalized dependence,
which we explore in this report.
7. We must establish and consistently use sound cryptographic
infrastructures for authentication, certificate authorities, and
confidentiality.
8. We must find ways to encourage commercial system developers to
increase the survivability, security, and reliability of their
standard products, including encouraging them to embrace more good
research and development results.
9. We must consider, as an alternative to proprietary closed-source
software, the development and use of source-available software and
nonproprietary interfaces. Although this approach does not
necessarily lead to survivable systems all by itself, it has enormous
potential when combined with other techniques.
10. We must provide mechanisms for the trustworthy distribution of
trustworthy code -- including robust mobile code.
11. We must refine and make practical the ongoing R&D efforts for
monitoring, analyzing, and responding to system and network anomalies,
and generalize them from merely intrusion-detection systems, so that
they address a broad range of survivability-related threats, including
reliability problems, fault-tolerance coverage failures, and classical
network management.
12. We must be able to develop systems that are more easily configured
and managed without placing excessive burdens on system administrators.
13. We must pursue realistic research and development that relates to
practical system issues such as composability, maintainability,
evolvability, and interoperability, and that is also strongly
grounded in theory.
14. We must find ways to disseminate the concepts of this report widely,
including by influencing educational processes and improving training.
It is always desirable to indicate the relative priorities with which such
recommendations need to be addressed, and their relative difficulty.
Unfortunately, survivability, security, and reliability are weak-link
phenomena that can be compromised in many different ways. Thus, all the
above recommendations can have considerable payoffs in efforts to develop
survivable systems, for many different reasons -- because of the holistic
nature of the desired requirements and the inherent complexity of their
realization.
It is difficult to pinpoint the recommendations that might provide the
greatest payoffs -- precisely because of the weak-link phenomena.
Besides, searching for easy answers is a common failing, especially in
complex situations in which there are no easy answers. However, in general
the greatest long-term benefits seem to accrue from up-front efforts, that
is, relating to establishing sound requirements, system designs, and
architectures, rather than focusing on software development, operations,
topical preventive measures, and maintenance. That is why we have chosen
the order of recommendations as above, implicitly placing emphasis on the
items in that order. Nevertheless, there would be major benefits from
almost all the items above.
In particular, the establishment of mission models (1) and fundamental
requirements (2) might have the greatest benefits of all, because it could
provide the basis for system developments and procurements.
However, past experience with the DoD Trusted Computer System Evaluation
Criteria (TCSEC) and system procurements suggests that this is not an easy
path, and that even if we had a superb set of requirements, they might be
largely ignored.
Stronger architectures, components, protocols, and cryptographic
infrastructures (3, 4, 5, 6, 7) are all potentially important to the
development process. Ideally, they need to be motivated by strong
requirements. In the absence of such explicit requirements in the past,
systems have developed according to a slow migration path that is driven
primarily by perceived market considerations, which have not converged on
what is needed. Incentivizing mainstream developers (8) and promoting
source-available software and open systems (9,10) are both vital,
particularly if the latter inspires greater advancement by the former.
Real-time analysis of system monitoring and rapid response (11) are
essential, but primarily as a last resort in the presence of vulnerable
systems. Ideally, greater emphasis on up-front requirements and
architectures would diminish the need for real-time analysis -- at least
with respect to outsider attacks. However, this is not likely to happen for
a long time.
Building systems that are more easily administered and simplifying the role
of system administration (12) would yield great savings in labor and cost,
as well as minimizing emergency remediation (especially in combination with
more intelligent real-time analysis). However, outsourcing of
administrators is a highly riskful proposition. (Recently, system
administrators in SRI's Computer Science Laboratory complained to their
counterparts at Fort Huachuca about a host within the Fort Huachuca domain
that was issuing repeated domain name service (DNS) requests to a machine
within our CSL network that is not a name server. The human response was, in
effect, "Well, it is after 3 in the afternoon on Friday, and our admin
efforts are outsourced to a contractor whose availability is uncertain.
Sorry.")
Furthermore, long-term research and development issues must not be
ignored (13). Specific directions for R&D are discussed in
Section 9.2 of this report.
When attempting to confront a complex system problem, considerable benefit
can result from considering the situation in the large (top-down), rather
than attempting to patch together a bunch of existing would-be solutions
(bottom-up). The bottom-up approach typically makes unrealistic assumptions
about the independence of subproblems. The holistic approach taken here
attempts to address the whole system, and then see what can be done to
partition the problems while also dealing with the interactions among the
components. In some cases, it is advantageous to consider a somewhat more
general problem to gain insights that cannot be seen from the more specific
problem (especially when the specific problem is not well understood). We
believe that such an approach is advantageous in developing complex systems.
It is clear that systematic use of strong authentication (including
avoidance of fixed passwords) could have an enormous impact all by itself on
system integrity. Firewalls that are secure and properly administered would
help. Highly survivable servers would be a considerable benefit. More
precise requirements would have a major influence on system procurements --
if those requirements were satisfied. Serious consideration of an
open-design policy of extensive early review and the use of source-available
software where appropriate may in the long run be essential to overcome the
limitations of proprietary closed-source systems that cannot fulfill the
desired requirements. Alternative architectures including a secure
mobile-code paradigm have considerable promise, particularly in connection
with thin-client systems and highly trustworthy servers. But the bottom
line here is that the basic computer-communication infrastructure is
fundamentally inadequate today.
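As one illustration of what avoiding fixed passwords can mean in practice,
the following minimal sketch computes counter-based one-time passwords in the
style of HOTP (RFC 4226, a later standard that is not part of this report).
The function names hotp and verify, and the choice of six-digit codes, are
assumptions made for the example; a real deployment would also need secure
storage of the shared secret, counter resynchronization, and rate limiting.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Return the HOTP value (RFC 4226) for a shared secret and a counter."""
    # HMAC-SHA-1 over the counter encoded as an 8-byte big-endian integer.
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select an offset.
    offset = mac[-1] & 0x0F
    # Take 4 bytes at that offset and clear the top bit to get 31 bits.
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify(secret: bytes, counter: int, candidate: str) -> bool:
    """Compare a submitted code with the expected one in constant time."""
    return hmac.compare_digest(hotp(secret, counter), candidate)
```

Because each code depends on a counter that advances with every use, a code
captured in transit is of no value for a later login, which is the property
that fixed passwords lack.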
The use of structure is particularly important in designing, implementing,
and maintaining systems and networks. The combination of architectural
principles and the use of good software engineering and system engineering
practice can be extremely effective. In particular, it is vital to address
the full range of survivability-relevant requirements from the outset; it is
typically very difficult to make retrofits later. The notion of generalized
dependence considered in this report permits us to avoid needing total
dependence on the correctness of certain other components -- many of which
have unknown trustworthiness, or are inherently suspect. This is the notion
of obtaining trustworthiness despite the relative untrustworthiness of
certain components. This concept is increasingly important in highly
distributed computing environments. Preventing or seriously hindering
denial-of-service attacks is a particularly important architectural issue.
The mobile-code paradigm offers many potential advantages in such
environments, but it also requires some dramatic improvements in the
security, reliability, and robustness of certain critical components.
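One classical and deliberately narrow instance of obtaining a trustworthy
result from less trustworthy parts is majority voting over replicated
components. The sketch below is an illustration only, not the report's
formulation of generalized dependence, which is considerably broader. The
names vote, square, off_by_one, and crashing are hypothetical, and the
technique assumes that replicas fail independently.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def vote(replicas: Sequence[Callable[[int], Hashable]], x: int) -> Optional[Hashable]:
    """Return the answer given by a strict majority of replicas, else None."""
    answers = []
    for replica in replicas:
        try:
            answers.append(replica(x))
        except Exception:
            # A replica that crashes simply contributes no vote.
            continue
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    # Require more than half of ALL replicas, not just of those that answered.
    return value if count > len(replicas) // 2 else None


def square(x: int) -> int:
    """A correct replica."""
    return x * x


def off_by_one(x: int) -> int:
    """A replica with a software flaw."""
    return x * x + 1


def crashing(x: int) -> int:
    """A replica that fails outright."""
    raise RuntimeError("replica unavailable")
```

With three replicas, vote masks any single flawed or crashed replica, but it
returns None rather than a guess when no strict majority exists. Note that the
voter itself becomes a component on which the whole depends.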
It is a difficult course that we must follow. It is evidently a
never-ending course, for a variety of reasons. As the requirements continue
to be better understood, more is demanded. As technical improvements are
introduced, new vulnerabilities are typically introduced. As technology
continues to offer new functional opportunities, and as systems tend to
operate closer to their technological limits, the vulnerabilities, threats,
and risks are increased accordingly, requiring much greater care.
Operational and administrative challenges are continually increasing. As
systems continue to grow in complexity and size, the risks seem to grow
accordingly. As a result, ever greater reliance is placed on the
omniscience and omnipotence of system administrators. Also, our adversaries
are becoming much more agile and are capable of becoming much more
aggressive. As a consequence, much greater discipline is required to
achieve the necessary goals. This report attempts to characterize what is
needed in terms of increased awareness and new approaches for the future.
1. Out of clutter, find simplicity.
2. From discord, find harmony.
3. In the middle of difficulty lies opportunity.
Albert Einstein,
three rules of work
The primary goal of this project is to significantly advance the state of
the art in obtaining highly survivable systems and networks, whereby
distributed systems and networks of systems are considered in their totality
as systems of systems, and as networks of networks -- in contrast to more
conventional approaches that focus only on selected properties of certain
subsystems or modules in isolation.
To accomplish that goal in this report, Chapter 2 addresses
a broad spectrum of threats to survivability. Chapter 3 considers
the overarching survivability requirements necessary to surmount those
threats, and also considers the subordinate requirements on which
survivability ultimately depends -- including reliability, availability,
security (confidentiality, integrity, defense against denials of service and
other types of misuse), and performance, in the presence of accidental and
malicious actions and malfunctions of software and hardware.
Chapter 4 then identifies fundamental deficiencies in the
technology available today, and Chapter 5 makes recommendations
for how to overcome those deficiencies. Subsequent chapters address
guidelines for developing and rapidly configuring highly survivable systems
and networks, including the presentation of generic classes of architectural
structures and some specific types of systems. Appendix A considers how
the contents of this report might find their way into an educational
curriculum.
Despite the quoted dictum of Albert Einstein at the beginning of this
chapter, we observe that general-purpose systems and networks that must be
highly survivable are not likely to be simple -- unless they are seriously
trivialized. The nature of the problem is intrinsically complex:
experience shows that many vulnerabilities are commonplace, and not easy to
avoid; the potential threats are very broadly based; complexity is often
beyond the scope of a small and closely knit development team; management is
often unaware of the complexities and their implications. Consequently, the
approach of this report is to confront the challenge in its full generality,
rather than merely to carve out a simply manageable small subset. Remember
the following quote, which is also very pithy:
Everything should be as simple as possible -- but no simpler.
Albert Einstein 1
Recognizing the complexity inherent in satisfying any realistic set of
survivability requirements, we have chosen to consider the very difficult
fully general problem of achieving highly survivable systems and networks
subject to the widest spectrum of threats. By tackling the general problem,
we believe that much greater insight can be gained and that the resulting
approaches can look farther into the future. In this sense, we believe that
there is a significant opportunity in the face of the intrinsic
difficulties.
Basic concepts used throughout the report are identified and defined here,
including survivability, security, reliability, performance,
trustworthiness, dependability, assurance, mandatory policies, composition,
and dependence. Section 1.3 introduces the notion of
compromisibility.
For the purposes of this report, survivability is the ability of a
computer-communication system-based application to satisfy and to continue
to satisfy certain critical requirements (e.g., specific requirements for
security, reliability, real-time responsiveness, and correctness) in the
face of adverse conditions. Survivability must be defined with respect to
the set of adversities that are supposed to be withstood. Types of
adversities might typically include hardware faults, software flaws, attacks
on systems and networks perpetrated by malicious users, and electromagnetic
interference.2
Thus, we are seeking systems and networks that can prevent a wide range of
systemic failures as well as penetrations and internal misuse, and can also
in some sense tolerate additional failures or misuses that cannot be
prevented.
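The point that survivability is meaningful only relative to a stated set of
adversities can be made concrete with a small sketch. The Claim record and the
uncovered function below are hypothetical names introduced for illustration,
and the adversity labels in the usage note simply echo the examples given
above; nothing here is a method prescribed by this report.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Claim:
    """A critical requirement plus the adversities it is claimed to withstand."""
    requirement: str
    withstands: FrozenSet[str]


def uncovered(claims: Iterable[Claim], adversities: Iterable[str]) -> Dict[str, List[str]]:
    """Map each requirement to the stated adversities that it does not cover."""
    stated = set(adversities)
    gaps: Dict[str, List[str]] = {}
    for claim in claims:
        missing = sorted(stated - claim.withstands)
        if missing:
            gaps[claim.requirement] = missing
    return gaps
```

For example, if the stated adversities are hardware faults, software flaws,
malicious attacks, and electromagnetic interference, a reliability claim that
covers only the first two is reported as leaving the last two unaddressed. An
empty result means only that every requirement names every stated adversity,
not that the system actually withstands them.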
As currently defined in practice, requirements in use today for survivable
systems and networks typically fall far short of what is really needed.
Even worse, the currently available operating systems and networks fall even
farther short. Consequently, before attempting to discuss survivable
systems, it is important to establish a comprehensive set of realistic
requirements for survivability (as in Chapter 3). It is also
desirable to identify fundamental gaps in what is currently available (as in
Chapter 4).
Given a well-defined set of requirements, it is then important to define a
family of reusable interoperable baseline system and network architectures
that can demonstrably attain those requirements -- with the goals of
enhancing the procurement, development, configuration, assurance,
evaluation, and operation of systems and networks with critical
survivability requirements.
A preliminary scoping of the general survivability problem was suggested by
a 1993 report written for the Army Research Laboratory (ARL), Survivable
Computer-Communication Systems: The Problem and Working Group
Recommendations [29]. That report outlines a comprehensive
multifunctional set of realistic computer-communication survivability
requirements and makes related recommendations applicable to U.S. Army and
defense systems.3 It assesses the vulnerabilities,
threats, and risks associated with applications requiring survivable
computer-communication systems. It discusses the requirements, and
identifies various obstacles that must be overcome. It presents
recommendations on specific directions for future research and development
that would significantly aid in the development and operation of systems
capable of meeting advanced requirements for survivability. It has proven
to be useful to ARL as a baseline tutorial document for bringing Army
personnel up to speed on system vulnerabilities and basic concepts of
survivability. It remains timely. Some of its recommended research and
development efforts have still not been carried out, and are revisited here.
The current technical approach is strongly motivated by a collection of
highly disciplined system-engineering and software-engineering concepts that
can add significantly to the generality and reusability of the results, as
well as having specific applicability to Army developments. Above all, our
approach here stresses the importance of sound system and network
architectures that seriously address the necessary survivability
requirements. This approach entails several basic concepts that are
considered in the following subsections.
The following three bulleted items consider three types of infrastructures:
(1) the critical national infrastructures, (2) information infrastructures
such as the Internet, or whatever it may evolve into (a National Information
Infrastructure, or a Global Information Infrastructure, or a Solar-System
Information Infrastructure, or perhaps even the Intergalactic Information
Infrastructure), and (3) underlying computer systems and networking software.
- Survivability of the critical national infrastructures. The
1997 report [194] of the President's
Commission on Critical Infrastructure Protection (PCCIP) summarizes
the PCCIP's recommendations relating to eight major critical national
infrastructures: telecommunications; generation, transmission, and
distribution of electric power; storage and distribution of gas and
oil; water supplies; transportation; banking and finance; emergency
services; and continuity of Government services. Perhaps most
important from the present perspective is the recognition that very
serious vulnerabilities and threats exist in all these critical
infrastructures. Perhaps equally important is the Commission's
recognition that these critical infrastructures are closely
interdependent and that they all depend on underlying
computer-communication infrastructures.
- Survivability of the computer-communication infrastructures on
which the national infrastructures depend, such as the Internet and
its eventual successors. A comprehensive system- and network-wide
set of realistic requirements is desired, encompassing security,
reliability, fault tolerance, real-time performance, and any other
attributes necessary for attaining adequate system and network
survivability. From this set of requirements, it is possible to
select those that are specifically relevant to any desired system.
The Internet is seriously vulnerable to denial-of-service attacks,
losses of confidentiality and integrity in transmission, and collapse
of constituent nodes. In the future, critical applications are likely
to demand alternative information infrastructures that provide greater
survivability and its subtended requirements, with particular
attention to reliability, availability, and security --
cryptographically based confidentiality and integrity, protection
against denials of service, and basically a completely new set of
protocols designed with security in mind from the very
beginning -- including secure
interoperability with other infrastructures. Such a national
information infrastructure would be of enormous value to DoD and to
the critical national infrastructures, and would also be valuable for
electronic commerce.
- Survivability of the underlying computer systems and
communication systems. Survivability of the computer-communication
infrastructures depends on dependable operating system and network
security, dependable system and network reliability, and dependable
operational performance. Much of the emphasis in this report
is on these basic information infrastructures.
System attributes that are particularly relevant to the attainment of
survivability include the following.
- Security.
Security must encompass dependable protection against all relevant
concerns, including confidentiality, integrity, availability despite
attempted compromises, preventing denials of service, preventing and
detecting misuse, providing timely responses to perceived threats, and
reducing the consequences of unforeseen threats. It includes both
system security (e.g., protecting systems and networks against
tampering and other misuse) and information security (e.g., protecting
data and programs against tampering and other misuse). It must
anticipate all realistic threats, including misuse by insiders,
penetrations by outsiders, accidental and intentional interference
(e.g., electromagnetic), emanations, covert channels, inference, and
data aggregations. There is much more to security than merely
providing confidentiality, integrity, and availability. All
components that must be trusted in order to achieve adequate system
behavior must actually be trustworthy. (The distinction between these
two concepts is discussed in Section 1.2.3.)
- Reliability and availability.
Reliability is often defined as a measure of how well a system operates
within its specifications. For example, fault tolerance can enable a
variety of alternatives, including real-time, fail-safe, fail-soft,
fail-fast, and fail-secure modes of operation. Availability despite
system failures must be tailored to a variety of specific needs, with
different techniques used for different functionality, as appropriate.
It must address unintentional and malicious changes in the operating
environment, including those that result from power outages and power
variations, earthquakes, floods, and other natural disasters. There
should be no serious weak links that are vulnerable to perceived
threats, and system design should be defensive enough that it also
addresses some of the more serious unanticipated threats.
- Performance. Particularly in real-time systems, performance tends to be a
critical requirement. In some cases, adequate performance may be critical
to the survivability of the services provided by an enterprise or an
application. On the other hand, in most cases, performance is itself
dependent on survivability and availability -- if a system is not
survivable, adequate performance cannot be achieved. To avoid this apparent
interdependence loop, it would be possible to redefine performance
requirements as being meaningfully specified only when the relevant systems
are available. However, this seems to be a cop-out, because of the need to
ensure adequate performance that is itself critical to survivability.
What is immediately obvious is that close interrelationships exist among
the various requirements. For example, consider the various forms of
availability. Availability is clearly a security requirement in defending
against malicious attacks. It is clearly a reliability requirement in
defending against hardware malfunctions, unanticipated software flaws,
environmental causes, and acts of God. It is also a performance issue, in
that adequate availability is essential to maintaining adequate performance
(and conversely, adequate performance can be essential to maintaining
adequate availability, as noted above).
Whereas it is conceptually possible to consider these different
manifestations of availability as separate requirements, this is very
misleading -- because they are closely coupled in the design and
implementation of real systems and networks. As a consequence, we stress
the notion of architectures that address these seemingly different
requirements in an integrated way that permits the realization of different
requirements within a common structure. This is pursued further in
Section 3.1.
Fundamental to this report are the notions of trustworthiness,
dependability, and assurance.
- Trustworthiness versus trust. In the present
context, trustworthiness is a measure of how extensively a given
module, system, network, or other entity deserves to be trusted to satisfy
its stated requirements when confronted with arbitrary threats. In the
security community, trustworthiness is roughly equivalent to what is called
"dependability" in the fault-tolerance community, although dependability
was originally relevant primarily to the threats intended to be covered by
fault tolerance, and did not encompass what we refer to here as
trustworthiness. (In the present context, assurance -- considered
below -- relates to the certainty with which trustworthiness or
dependability can be believed, for example, through the use of testing and
formal analyses.) In this report, the notion of trustworthiness encompasses
all aspects of survivability, including the full spectrum of threats to
survivability and its subtended requirements.
Trustworthiness is particularly relevant in situations where there are
critical requirements, that is, where dependence on the trustworthiness
of specific entities is crucial to the overall behavior of a system or
network in the large -- particularly with respect to survivability,
security, and reliability. In the fault-tolerance community, dependability
tends to be a measure of how well the specified fault-tolerance
requirements are met, although recent usage is generalizing that to other
requirements.
A careful distinction is made here between trust and trustworthiness. Trust
is something you attribute to a system entity, whether that entity is
trustworthy or not. A trustworthy entity is one that deserves to be
trusted.
In general usage in the literature, a trusted system is one that must be
trusted in order for applications using the system to behave properly.
Ideally, trusted systems should be trustworthy, although that is often not
the case. For example, the notion of trusted computing bases
(Section 7.2) is really concerned with trustworthiness of
components that, because of their functionality, have to be trusted -- and
that therefore must be trustworthy.
- Dependability and assurance. In this
report, we refer only loosely to dependability, as the extent to which
a given requirement is perceived to be satisfied, particularly by the
implementation. Assurance is then the credibility that can be given
to specific statements of dependability and trustworthiness. Thus, for
example, we might rather specifically refer to the assurance of the
dependability of trustworthiness within a particular system design or its
implementation, although in general we are primarily concerned with
trustworthiness itself and avoid the circumlocution. The reader will not
suffer by assuming that the assurance of trustworthiness and the
predictability of dependability (e.g., [313]) are indeed the same
concept (or, at least, closely enough related for present purposes).
The foregoing concepts -- survivability, security, reliability, and
performance -- need to be implemented in such a way that the desired
properties can be achieved dependably. Defensive measures include
establishment of appropriate requirements; good system design that is
consistent with the requirements; good system development and coding
practice, including the use of modern software engineering, sound
programming languages, and demonstrations that implementations are
consistent with their designs; and operational procedures that maintain
the integrity of design and implementation despite ongoing debugging and
maintenance -- and potential misuse.
- Assurance and analytic dependability. Overall system and
network survivability should be predictably demonstrable by methods
other than simply testing. For example, formal methods
could be used to demonstrate the adequacy
of the requirements, the consistency of the specifications with those
requirements, and the consistency of the implementation with the
specifications, at different layers of abstraction, and the
consistency of one layer with another. Other constructive analytic
arguments could measure the assurance with which a system architecture
might survive specific types of threats, even despite unanticipated
events. Analytic tools are also useful in uncovering flaws and
questionable coding practices in software. Various approaches to
assurance exist, including formal methods discussed in
Section 5.9.
Various other attributes are also highly desirable in ensuring dependable
survivability.
- Subsystem composability.
Subsystems and components should be designed and implemented to
enhance the ease with which they can be integrated together
without adversely affecting survivability. Composability is
considered further in Section 5.8.
- Interoperability.
Interoperability should be easily attainable, across different
networks, systems, subsystems, and application services. Various
platforms should be accommodated, including mainframes, minis,
workstations, personal computers, and combinations thereof. Firewalls
and other controls necessary for security should not unduly impede
interoperability where authorized. Relevant standards should be
respected, but inadequate standards must be replaced, upgraded, or
explicitly ignored.
- Scalability.
The above-mentioned concepts must be capable of
adapting to a range of operations, from local operation to widely
dispersed systems, from a few users to a considerable community of
users, from a closed community to an open-system environment. Where
single approaches are not applicable, the parameterized configuration
should permit adaptation according to the specific requirements.
- Abstraction.
At each of various layers of abstraction,
implementations must be properly (e.g., survivably, securely, and
reliably) encapsulated, with appropriate information hiding and
high-integrity implementation. System interfaces and programming
languages must provide suitable abstractions, and must be
nonbypassable.
- Prevention, detection, toleration, and reaction.
Neither prevention nor detection is adequate by itself. An appropriate
balance must be found between the two, utilizing constructive design
techniques as well as proactive detection of events that may be
suspicious with respect to the requirements for survivability,
security, reliability, and so forth. In addition, toleration of adverse
events that cannot be prevented must be planned in system design,
irrespective of whether detection succeeds. Traditional fault tolerance
is intended to combat reliability-related threats, whereas its
counterpart for security-related threats must withstand, to whatever
degree practicable, malicious attacks that cannot be prevented.
Finally, speedy reaction is necessary when something adverse has been
detected that cannot be tolerated.
- Dependencies and interrelationships. In any distributed system
architecture, whether hierarchically layered or highly
interrelational, it is undesirable to have to depend on inherently
less trustworthy components -- with respect to security, reliability,
and availability. If such adverse dependence is essential to minimize
or control harmful effects, it must be demonstrably not in conflict
with the desired requirements. For example, deadlocks can be caused
accidentally or intentionally, and can result in denials of service
unless the system architecture takes explicit measures to avoid them.
Devastating consequences can result from dependence on untrustworthy
components that are nevertheless ill-advisedly trusted. It is also
desirable to identify and analyze the interactions among the different
requirements -- with respect to the requirements themselves, as well
as with respect to conflicts that may arise in the design and
implementation. If potentially conflicting consequences can arise,
the priorities necessary to resolve those conflicts should be
established in advance, insofar as possible. As an example, consider
the use of alternative components or redundant communication paths to
increase availability; one undesirable consequence could be increased
exposures to security attacks resulting from the additional objects of
attack. This is just one more example of the importance of
understanding the dependencies and interrelationships throughout the
development cycle.
- Operational practice.
The best design and implementation can be
totally compromised by bad administrative and operational practice.
Sound configuration control is absolutely essential as an integral
part of survivability. (For example, see [391].)
These concepts are considered further in Sections 7.1
and 7.2.
Whereas we have chosen a framework in which survivability depends on
security, reliability, and performance attributes (for example),
manifestations of survivability, security, and reliability exist at many
different layers of abstraction. Although the survivability of an
enterprise may depend on the underlying security and reliability, the
security and reliability at a particular layer may in turn depend to some
extent on the survivability of a lower layer. For example, the
survivability of each of the eight critical national infrastructures
considered by the PCCIP depends to
some extent on the survivability and other attributes of the underlying
computer-communication infrastructures. Similarly, the survivability of a
given computer-communication infrastructure may typically depend to
considerable extent on the survivability of the electric power and
telecommunications infrastructures. In part, this is a consequence of the
fact that the definitions used here are (necessarily) somewhat overlapping;
in part, it is also a recognition of the fact that each abstract layer has
its own set of requirements that must be translated into subrequirements at
lower layers.
One of the primary goals of the present work is to identify the ways in
which the various properties and their enforcing implementations depend on
one another, at various layers of abstraction and across different
abstractions at given layers.
This report in no way attempts to be a definitive self-contained treatise on
everything that needs to be known to procurers and developers of highly
survivable systems. Rather, it attempts to identify and use constructively
some of the fundamental concepts upon which such systems can be produced.
Extensive further background on computer system trustworthiness can be found
in National Research Council reports, Computers at Risk [72]
and the more recent Trust in Cyberspace [345].
(See also [109] for a recent NRC study on research needs.)
Two valuable volumes on cryptography's role in trustworthy systems and
networks are the National Research Council CRISIS report Cryptography's
Role in Securing the Information Society [84] and Bruce Schneier's
Applied Cryptography [347]. A realistic assessment of the risks of
improperly embedded strong crypto is found in Schneier's subsequent
book [348], Secrets and Lies: Digital Security in a Networked World.
Research efforts have typically considered simple compositions of modules,
such as unidirectional serial connections or perhaps call-and-return
semantics. (Section 5.8 discusses some of these.) However,
the existing research is far from realistic.
The concept of generalized composition [251] used here
includes composition of subsystems with mutual feedback, hierarchical
layering in which a collection of modules forms a layer that can be used by
higher layers as in the Provably Secure Operating System
(PSOS) [102, 246, 247, 260],
layering achieved through program modularity [45], and networked
connections involving client-server architectures, gateways, unidirectional
and bidirectional firewalls and guards, encryption, and other components.
Relevant approaches include [371].
In this project, we consider generalized composition as it relates to the
composed subsystems. We believe that this approach to composition is more
appropriate to the intended large-scale distributed and networked
architectures than the primarily theoretical contemporary work on model
composition and policy composition (although that work is logically subsumed
under the present approach).
In 1974, Parnas [279] characterized a variety of "depends upon"
relations. An important such relation is Parnas's "depends upon for its
correctness", whereby a given component is said to depend upon another
component in the sense that if the latter component does not meet its
requirements, then the former may not meet its requirements.
Neumann [251] has revisited the notion of dependence, making a
distinction between the Parnas relation "depends upon for correctness"
and a generalized sense of dependence in which greater trustworthiness
can be achieved despite the presence of less trustworthy components,
thereby avoiding having to depend completely on components of unknown or
uncertain trustworthiness. To avoid having to say "depends upon in the
sense of generalized dependence", we abbreviate that generalized
relation as simply "depends on".
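The conventional relation can be made concrete with a small sketch. The
following Python fragment is only an illustration under stated
assumptions: the component names and the dependence edges are
hypothetical and do not come from [279] or [251]. It treats "A depends
upon B for its correctness" as a directed edge and computes which
components may fail to meet their requirements when a given component
fails to meet its own.

```python
from collections import deque

# Hypothetical components; "A": ["B"] means A depends upon B for its correctness.
DEPENDS_UPON = {
    "application": ["database", "os_kernel"],
    "database": ["os_kernel", "disk_driver"],
    "os_kernel": ["hardware"],
    "disk_driver": ["hardware"],
    "hardware": [],
}

def possibly_affected_by(failed, depends_upon):
    """Return every component that may miss its requirements if `failed` misses its own."""
    # Invert the relation: for each component, list its direct dependents.
    dependents = {name: [] for name in depends_upon}
    for component, needs in depends_upon.items():
        for needed in needs:
            dependents.setdefault(needed, []).append(component)
    # Breadth-first search over the inverted edges follows the relation transitively.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

Under the conventional relation, every component in the returned set is
at risk. Generalized dependence corresponds to mechanisms that let a
component remain trustworthy even though something it uses appears in
such a set.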
The following enumeration gives various paradigms under which
trustworthiness can actually be enhanced, providing examples of how the
generalized dependence relation "depends on" differs from the
conventional "depends upon" relation. In
each of these cases, the resulting trustworthiness tends to be greater
than that of the constituent components. The list is surprisingly long,
and may help to illustrate the power of the notion of generalized
dependence. (Although particular mechanisms may fall into multiple types,
these types are intended to represent the diverse nature of mechanisms
having the characteristics of generalized dependence.)
- The use of error-correcting codes
(e.g., [123]) that can enable correct communications despite
certain tolerable patterns of errors (e.g., random, asymmetric as in
bit-dropping only, bursty, or otherwise correlated), in block communications
or even in variable-length or sequential encoding schemes, as long as any
required redundancy does not cause the available channel capacity to be
exceeded (following the guidance of Shannon's information theory), and
in arithmetic operations (e.g., [268])
- The early work of John von Neumann [384] and of Ed Moore
and Claude Shannon [222], who showed how reliable subsystems in general (von
Neumann) and reliable relay circuits in particular (Moore-Shannon) can be
built out of unreliable components -- as long as the probability of failure
of each component is not precisely one-half and as long as those
probabilities are independent of one another; also relevant is the 1960
paper of Paul Baran [27] on making reliable communications despite
unreliable network nodes, which was influential in the early days of the
ARPAnet.
- Self-synchronizing techniques that result in rapid
resynchronization following nontolerated errors that cause loss of
synchronization, including intrinsic resynchronizability of sequentially
streamed codes -- by adding explicit framing bits, by adding redundancy to
provide implicit synchronization as in comma-free codes, or without adding
any redundancy in certain variable-length and sequential
codes [240, 241, 242] (for example, the self-resynchronizing
properties of certain variable-length codes [135] and of
information-lossless [136] sequential encoding systems) -- as
well as other self-stabilization techniques (e.g., [97])
- Robust synchronization algorithms, such as hierarchically prioritized
locking strategies [94], two-phase commitments, nonblocking atomic
commitments [315], and fulfillment transactions [205] such as
fair-exchange protocols guaranteeing that payment is made if and only
if goods have been delivered
- Traditional fault-tolerance algorithms and
system concepts that can tolerate certain specific types of component or
subsystem failures as a result of constructive use of
redundancy [18, 80, 145, 186, 206, 225, 389]
-- although failures beyond the coverage of the fault tolerance may result
in unspecified failure modes
- Alternative-computation architectural structures, which can achieve
satisfactory but nonequivalent results (with possibly degraded performance),
despite failures of hardware and software components
and failure modes that exceed planned fault coverage, such as the Newcastle
Recovery Blocks
approach [17, 18, 134]
- Alternative-routing schemes in packet-switched networks,
which can attain good performance and eventual
communications despite major outages among intermediate nodes and
disturbances in communications media (as in the ARPAnet routing protocols)
- Byzantine fault-tolerant systems that can withstand Byzantine fault
modes [164, 334, 342], whereby successful operation is possible despite
the arbitrary and completely unpredictable behavior (malicious or
accidental) of up to some fraction of its component subsystems (e.g.,
k out of 3k+1), with no assumptions regarding individual failure modes
of the component subsystems
- Byzantine network-layer protocols [295]
- Encryption applied to an open transmission medium or storage medium
that is easily intercepted or monitored, whereby the encrypted form is
significantly more inscrutable
- Use of integrity checks, such as cryptographic checksums and
proof-carrying code [235], both of which can enable
the detection of unexpected alterations to systems or data and
hinder tampering with data and programs
- Micali's fair public-key
cryptographic schemes [209], in which different parties must
cooperate with the simultaneous presentation of multiple keys -- allowing
cryptographically based operations to require the presence of multiple
authorities
- Threshold multikey-cryptography schemes, in which at least k
out of n keys are required (for conventional symmetric-key decryption, or
for authentication, or for escrowed retrieval) -- for example, a Byzantine
digital-signature system [91] and a Byzantine key-escrow system [318]
that can function successfully despite the presence of some parties that
may be untrustworthy or unavailable, as well as a signature scheme that
can function correctly despite the presence of malicious verifiers [296]
- Byzantine-style authentication protocols that can work properly
despite untrustworthy user workstations, compromised authentication servers,
and other questionable components (see
Chapter 7)
- Constructive use of kernels and "trusted" computing bases to achieve
nonsubvertible application properties, such as in SeaView, which
demonstrated how a multilevel-secure database management system can be
implemented on top of a multilevel-secure kernel -- with absolutely no
requirement for multilevel-security trustworthiness in the Oracle database
management system [88, 188, 190].
(This is the notion of balanced assurance.)
- Multilateral mechanisms enforcing policies of mutual suspicion, with
the ability to operate correctly despite a lack of trust among the various
parties [351]
- Interposition of trustworthy firewalls and
guards that mediate between regions of unequal trustworthiness
-- for example, ensuring that sensitive information does not leak out and
that Trojan horses and other harmful effects do not sneak in, despite the
presence of untrustworthy subsystems or mutually suspicious adversaries
- Use of run-time checks to prevent or mediate
execution in questionable circumstances (e.g., embedded in the base
programs or in application programs, as in the cases of bounds checks and
consistency checks)
- Addition of wrappers (without modifying the source or object
code of the wrapped module), to enhance survivability, security, or
reliability, or otherwise compensate for deficient components -- such as
adding a "trusted path" to an inherently untrustworthy system, enabling
monitoring of otherwise unmonitorable functionality, or providing
compatibility of wrapped legacy programs with other programs
- Object-oriented, domain-enforcement,
and access-control techniques that effectively mediate or otherwise modify
the intent of certain attempted operations, depending on the execution
context [102, 260, 351] -- for example, the confined
environment of the Java Virtual Machine [114, 116] and related
work on formal specification [87, 112] for the analysis of
the security of such environments
- Use of real-time analysis techniques such as anomaly and misuse
detection to diagnose live threats and respond accordingly, capable of
dynamically altering system and network configurations based on perceived
threats (e.g., [304, 305])
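The claim that reliable subsystems can be built from unreliable
components can be checked numerically for the simplest case. The sketch
below is an illustration only: it uses plain majority voting over an odd
number of replicas and assumes independent failures and a perfect voter.
It is not the construction of von Neumann [384] or of Moore and
Shannon [222], and the function name is hypothetical.

```python
from math import comb

def majority_failure_probability(n, p):
    """Probability that a majority of n independent components fail,
    when each component fails with probability p."""
    if n % 2 == 0:
        raise ValueError("use an odd number of components so a majority always exists")
    needed = n // 2 + 1  # number of failed components that outvotes the rest
    # Sum the binomial probabilities of `needed` or more simultaneous failures.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(needed, n + 1))
```

For p = 0.1, three voted replicas fail together with probability 0.028,
which is lower than the 0.1 of a single component. At p = 0.5 the voted
result is exactly 0.5, so replication gains nothing, which matches the
one-half caveat above. In this simple scheme, for p above 0.5, voting
makes matters worse.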
Each of these paradigms demonstrates techniques whereby trustworthiness can
be enhanced above what can be expected of the constituent subsystems or
transmission media. By generalizing the notions of dependence and
trustworthiness, and judicious use of some of these techniques, we seek to
provide a unifying framework for the development of survivable systems.
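As one concrete instance of enhancing trustworthiness above that of the
transmission medium, the following sketch implements the textbook
Hamming(7,4) code, which corrects any single flipped bit in a 7-bit
block. It is an illustrative example chosen here, not a code taken from
[123] or [268], and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(received):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(received)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit; 0 means none detected.
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

The coverage is deliberately limited. Two flipped bits in one block
exceed what this code tolerates and are silently miscorrected, which
illustrates the earlier caveat that failures beyond the coverage of a
mechanism may result in unspecified failure modes.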
Dependence on components and information of unknown trustworthiness
is a particularly serious potential problem.
(See Sections 2.1.1 and 2.1.2.)
Dependable clocks (Byzantine or otherwise) provide a particularly
interesting challenge. Lincoln, Rushby, and others [181] provide an
elegant detailed example of generalized dependence. They have analyzed a
three-layered model consisting of (1) clock synchronization [332],
(2) Byzantine agreement [179, 180], and (3) diagnosis and removal of
faulty components [180]. They
also exhibit formal verifications for a variety of hybrid
algorithms [180] that can greatly increase the coverage of
misbehaving components. This three-layered integration of separate models
and proofs is of considerable practical interest, as well as illustrative of
forefront uses of formal methods.
An example of generalized dependence relating to clock drift is given by
Fetzer and Cristian [104] in developing fault-tolerant
hardware clocks out of commercial off-the-shelf (COTS) components, at least
one of which is a GPS receiver. A formal analysis of a time-triggered clock
synchronization approach is given by [299].
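To give a flavor of how a dependable time value can be derived from
clocks of unknown trustworthiness, the sketch below shows a
fault-tolerant midpoint, one well-known style of convergence function.
It is offered as an assumption-laden illustration and is not claimed to
be the algorithm analyzed in [181], [332], [104], or [299]. The function
name and parameters are hypothetical, and the requirement of at least
3k+1 readings to tolerate k arbitrary ones mirrors the Byzantine ratio
cited earlier.

```python
def fault_tolerant_midpoint(readings, max_faulty):
    """Estimate a common clock value from readings of which up to
    `max_faulty` may be arbitrarily wrong."""
    if len(readings) < 3 * max_faulty + 1:
        raise ValueError("need at least 3 * max_faulty + 1 readings")
    ordered = sorted(readings)
    # Discard the max_faulty lowest and max_faulty highest readings, so that
    # any surviving extreme is bounded by readings from correct clocks.
    trimmed = ordered[max_faulty:len(ordered) - max_faulty]
    # Take the midpoint of the remaining extremes.
    return (trimmed[0] + trimmed[-1]) / 2
```

With the readings 100.0, 100.2, 99.9, and 5000.0 and one tolerated
fault, the wild value 5000.0 is discarded and the estimate stays near
100.1, so the result is more trustworthy than any single clock taken on
its own.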
The basic approach of this project considers within a common framework many
different generalized-dependence mechanisms that are capable of enhancing
trustworthiness, enabling the resulting functionality to be inherently more
trustworthy than otherwise might be warranted by consideration of only its
constituent components.
Ultimately, overall system survivability may depend on (in the sense of
generalized dependence noted above) the security, integrity, reliability,
availability, and performance characteristics of certain critical portions
of the underlying computer-communication infrastructures. In this report,
our notion of survivability explicitly includes this context of generalized
dependence.
Compromises from outside, from within, or from below (see
Section 1.3
and [250, 251, 267]), whether
malicious or not, can subvert survivability unless prevented or ameliorated
by the architecture, its implementation, and the operational practice.
Unfortunately, compromises from outside (e.g., externally, originating from
higher layers of abstraction or from other entities at the same layer of
abstraction, or from supposedly security-neutral applications) often can
lead to compromises from within (affecting the implementation of a
particular mechanism) or from below (subverting a mechanism by tampering
with its underlying dependent components). One of the fundamental
challenges addressed here is to be able to design, implement, and operate
survivable systems despite the presence of components, information, and
individuals of unknown trustworthiness -- as well as saboteurs (e.g.,
cyberterrorism [302]), and thereby to prevent, defend against, or
at least detect attempted compromises from outside, within, or below. This
is in essence what we mean by survivability -- in the context of
generalized dependence on potentially unknown entities. For example, a
particularly difficult challenge is to ensure that the embeddings of sound
cryptographic algorithms cannot be compromised because of inherent
weaknesses in the underlying computer-communication infrastructures (e.g.,
hardware, microcode, operating systems, database management, and networking)
-- as discussed in [249].
Survivability is an emergent property of the
overall systems and networks. That is, it is not definable and analyzable
in the small, because it is the consequence of the composition of the
subtended functionality; it must be considered in the large. In other
words, it is not a property that can be identified with any of the
constituent components. Ideally, it should be derivable in terms of
properties of the constituent functionality on which it depends, as
described in the 1970s work of Robinson and Levitt [322] on the SRI
Hierarchical Development Methodology (HDM) as part of the PSOS
effort.4
In practice, it may not be so derivable, as in the case of covert channels
that arise only because of module composition.
Stephanie Forrest in her introduction to the 1991 CNLS
proceedings [106], Nancy Leveson [173], Heather
Hinton [127, 128], Zakinthinos and
Lee [394], and D.K. Prasad [306] provide some
background on emergent properties; Zakinthinos and Lee define an emergent
property as one that its constituent components do not satisfy. Prasad
draws on measurement theory and decision analysis [307] to show
that such properties are not compositional and also that such properties are
not `absolute' -- different stakeholders may have different ideas about the
meaning of the property. Her thesis work also presents the method of
multi-criteria decision making (in a specific framework) as an approach for
the measurement (on a sound theoretical basis) of such properties.
Hinton [128] observes that undesirable emergent behavior is often
the result of incomplete specification, and can be formally analyzed.
The notions of multilevel
security [32, 33, 34, 35, 36], multilevel
integrity [42], and multilevel
availability [267] characterize hierarchical mandatory
policies for confidentiality, integrity, and availability, respectively. In
multilevel security (MLS),
information is not permitted to flow from one entity to another entity that
has been assigned a lower security level. In multilevel integrity
(MLI),
no entity is permitted to depend upon an entity that has been assigned a
lower integrity level. In multilevel availability
(MLA), no entity is permitted to depend on an entity that has been assigned
a lower availability level.
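The three rules above can be sketched as simple comparisons. This is a minimal illustration rather than a definitive formulation: the `Level` names and the function names are assumptions made here, and compartments are omitted.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical totally ordered levels (compartments omitted here)."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


def mls_flow_allowed(source: Level, destination: Level) -> bool:
    """MLS: information must not flow to a lower security level."""
    return destination >= source


def mli_dependence_allowed(dependent: Level, provider: Level) -> bool:
    """MLI: an entity must not depend on a lower-integrity entity."""
    return provider >= dependent


def mla_dependence_allowed(dependent: Level, provider: Level) -> bool:
    """MLA: an entity must not depend on a lower-availability entity."""
    return provider >= dependent
```

Note the difference in direction: the MLS rule constrains where information may flow, whereas the MLI and MLA rules constrain what an entity may depend upon.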
Although it has been the subject of considerable research in security
policies and kernelized system architectures,
and highly touted by the Department of Defense (see Chapter 6),
multilevel security has remained very difficult to achieve in realistic
systems and networks. This is due to many factors, including inadequacies
in the DoD criteria, an unwillingness of commercial system providers to
develop systems, and an unwillingness of non-DoD system acquirers to
consider such systems. Architectural alternatives are considered in
Chapter 7.
Strict multilevel integrity is thought to be awkward to enforce in practical
systems, because high-integrity users and processes often depend on editors,
compilers, library routines, device drivers, and so on, that are not
necessarily trustworthy and are therefore risky to depend upon.
However, that is precisely the fundamental integrity problem in most system
architectures. The implicit web of trust should force those utility
functions to be at least as trustworthy with respect to integrity, because
they must all be considered within the perimeter of trustworthiness.
The
notion of generalized dependence is one way of working within that
constraint without either sacrificing the power of the basic concepts or
introducing new vulnerabilities that result from informal deviations from
strict interpretations.
In this report, we consider the conceptual use of this kind of mandatory
basis for survivability. Strictly speaking, this would lead to a
lattice-based mandatory policy for multilevel survivability that directly
imitates the MLS, MLI, and MLA policies. For simplicity, we refer to this
policy as simply multilevel survivability (MLX). In an oversimplified
formulation of the multilevel survivability policy, no system or network
entity is allowed to depend on an entity that has been assigned a lower
survivability level (unless an explicit generalized-dependence mechanism is
established that permits the use of mechanisms of lower trustworthiness, as
illustrated in Section 1.2.5). These concepts are considered in this
report to include generalized dependence.
For descriptive purposes, we implicitly assume the possibility of
compartments in each of these policies (MLS, MLI, MLA, and MLX), although we
describe the policies in terms of levels (without categories). Because of
the compartments (familiar to aficionados of MLS and MLI), the ordering on
the levels and compartments generates a mathematical lattice in each
instance. Thus, when we refer to mandatory policies in this context, we
imply lattice-based policies rather than just completely ordered levels
(without compartments).
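To illustrate why levels with compartments yield a lattice rather than a total order, the following sketch models a label as a level plus a set of compartments. The `Label` class, the function names, and the compartment names in the usage note are hypothetical, chosen here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """A hypothetical label: a numeric level plus a set of compartments."""
    level: int
    compartments: frozenset = frozenset()


def dominates(a: Label, b: Label) -> bool:
    """a dominates b if its level is at least b's and it holds all of b's compartments."""
    return a.level >= b.level and a.compartments >= b.compartments


def join(a: Label, b: Label) -> Label:
    """Least upper bound: higher level, union of compartments."""
    return Label(max(a.level, b.level), a.compartments | b.compartments)


def meet(a: Label, b: Label) -> Label:
    """Greatest lower bound: lower level, intersection of compartments."""
    return Label(min(a.level, b.level), a.compartments & b.compartments)
```

For example, `Label(2, frozenset({"alpha"}))` and `Label(1, frozenset({"beta"}))` are incomparable (neither dominates the other), yet their join dominates both and both dominate their meet. That is the partial ordering with bounds that makes the structure a lattice.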
In the absence of generalized dependence, strict MLX ordering would most
likely suffer the same kind of problems that arise in the practical use of
strict MLI -- namely, the realization that enormous portions of any given
distributed system must be of high integrity and high survivability. The
notion of generalized dependence therefore allows the strict partial
ordering to be relaxed locally whenever it is possible to achieve greater
trustworthiness out of less trustworthy components, as illustrated in
Section 1.2.5 -- without relaxing it in the large.
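A rough sketch of this local relaxation follows. It checks a set of dependence edges against strict MLX ordering and exempts only those edges for which a generalized-dependence mechanism has been explicitly established. The component names, the numeric levels, and the voter arrangement are all hypothetical, and levels are treated as simple integers for brevity.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (dependent, provider)


def mlx_violations(dependencies: Iterable[Edge],
                   level: Dict[str, int],
                   generalized: Set[Edge]) -> List[Edge]:
    """Return the dependence edges that break strict MLX ordering and are
    not covered by an explicitly established generalized dependence."""
    violations = []
    for dependent, provider in dependencies:
        strict_ok = level[provider] >= level[dependent]
        if not strict_ok and (dependent, provider) not in generalized:
            violations.append((dependent, provider))
    return violations


# Hypothetical components and survivability levels (higher is more survivable).
LEVEL = {"controller": 3, "voter": 3,
         "sensor_a": 1, "sensor_b": 1, "sensor_c": 1}

DEPENDENCIES = [
    ("controller", "voter"),
    ("voter", "sensor_a"),
    ("voter", "sensor_b"),
    ("voter", "sensor_c"),
    ("controller", "sensor_a"),  # direct use that bypasses the voter
]

# Only the voter's use of the redundant sensors is explicitly exempted.
GENERALIZED = {("voter", "sensor_a"), ("voter", "sensor_b"), ("voter", "sensor_c")}
```

With these assumed inputs, only the direct `controller` to `sensor_a` edge is reported. The ordering is relaxed locally, at the voter, and remains in force everywhere else.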
For readers who shudder at the complexities and inconveniences introduced by
multilevel policies, we hasten to add that the MLX property is considered
only as a structural organizing concept rather than as an explicit goal of
design and implementation. Furthermore, even if MLX were interpreted
seriously, there is always a likelihood that the levels and compartments
might be set up in such a way that there would be a fundamental conflict
among the MLS, MLI, MLA, and MLX constraints that would prevent expected
results from happening. Consequently, MLX is introduced only to encourage
the intuitive design of systems in which we avoid unnecessary dependence on
components that are inherently less survivable (in the sense of generalized
dependence).
This initial discussion represents a first approximation to what is actually
needed. In Chapter 7, we address the possible conflicts
among the subrequirements of survivability in the context of generalized
dependence.
To illustrate the importance of dependence on properties of underlying
abstractions, consider the necessity of depending on a life-critical system
for the protection of human safety.5 In such a system, safety ultimately depends upon the confidentiality, integrity, and availability of both the
system and its data. It may also depend on information survivability. It
may further depend upon component and system reliability, and on
real-time performance.
It also usually depends upon the correctness
of much of the application code. In the sense that each layer in a
hierarchical system design depends upon the properties of the lower
layers, the way in which trusted computing bases are layered becomes
important for developing dependably safe systems -- particularly in those
cases in which the generalized "depends on" relation can be used more
appropriately instead of "depends upon" to accommodate an implementation
based on less trustworthy components.
The same dependence situation is true of secure systems, in which each layer
in the abstraction hierarchy (e.g.,
consisting of a kernel, a trusted computing base for primitive security,
databases, application software, and user software) must enforce some set of
security properties. The properties may differ from layer to layer, and
various trustworthy mechanisms may exist at each layer, but the properties
at a particular layer are derivable from lower-layer properties.
In the security context, many notions of compromise exist. For example,
compromise might entail accessing supposedly restricted data, inserting
unvalidated code into a trusted environment, altering existing user data or
operating-system parameters, causing a denial of service, finding an escape
from a highly restricted menu interface, or installing or modifying a rule
in a rule-base that results in subversion of an expert system.
There is an important distinction between having to depend on lower-layer
functionality (whether it is trustworthy or not) and having some meaningful
assurance that the lower-layer functionality is actually noncompromisible
under a wide range of actual threats. Noncompromisibility is particularly
important with respect to security, safety, and reliability.
Potentially, a supposedly sound system could be rendered unsound in any of
three basic ways:
- Compromise from outside (intuitively, above or laterally -
from elsewhere at the same abstraction layer)
- Compromise from within (intuitively, inside a component or layer)
- Compromise from below (intuitively, underneath)
Each of these situations could be caused intentionally, but could also
happen accidentally. (For descriptive simplicity, a user may be a
person, a process, an agent, a subsystem, another system, or any other
computer-related entity.)
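As a rough illustration of the three modes, the sketch below classifies a compromise by comparing where it originates with where it takes effect. The `Location` type, the numeric layer convention (a larger number is a higher layer of abstraction), and the component names in the usage note are assumptions made here. Real cases are often less clear-cut than this simple rule suggests.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    OUTSIDE = "outside"
    WITHIN = "within"
    BELOW = "below"


@dataclass(frozen=True)
class Location:
    component: str
    layer: int  # assumed convention: larger number = higher layer of abstraction


def classify(origin: Location, target: Location) -> Mode:
    """Classify a compromise of `target` that originates at `origin`."""
    if origin.component == target.component:
        return Mode.WITHIN   # inside the component itself
    if origin.layer < target.layer:
        return Mode.BELOW    # from an underlying layer
    return Mode.OUTSIDE      # from above, or laterally at the same layer
```

For instance, with a database management system at layer 3 and an operating system at layer 1, `classify(Location("os", 1), Location("dbms", 3))` yields `Mode.BELOW`, whereas an origin in another layer-3 component yields `Mode.OUTSIDE`.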
- Compromise from outside typically originates from an access
point that is nominally external to the component being compromised.
- In cases of purposeful compromise from outside, the perpetration is
typically that of a completely unprivileged user or a partially privileged
user who gains access to perpetrate a further compromise. In general,
authorization may be unnecessary, possibly because of an exploitable flaw in
the standard interface; in some cases, authorization may be bypassed.
- In cases of accidental compromise from outside, the compromise may
result from an inadvertent program error in a higher layer that somehow
affects a lower layer, or in another module at the same layer for which
there is inadequate isolation.
- Compromise from within typically originates inside a particular
component that is compromised, existing at a given layer of abstraction.
- In cases of purposeful compromise from within, the perpetration is
performed by a user or program that has somehow gained access (with or
without authorization) to the internals of a component, such as privileged
maintenance access to a database management system, network controller, or
automatic teller machine. The component could be compromised by an
authorized user who is misusing privileges, or by a penetrator; once a
perpetrator has gained access to the component internals, such a distinction
may be academic, because the penetrator is now more or less
indistinguishable from an insider. Thus, compromise from within may follow
from a compromise from outside that enables a subsequent penetration.
- In cases of accidental compromise from within, the compromise could
involve flaws or malfunctions associated with the particular system
component.
- Compromise from below is initiated at a lower layer of
abstraction than the layer at which the compromise of a given component
occurs. Compromise from below may result from malicious action or
accidental failure of an underlying mechanism on which the particular
component depends. It typically affects the particular component by
altering the state of lower-layer functionality, or in some cases merely by
gaining access to information in a lower-layer abstraction and using that
information in some unexpected way. This is roughly equivalent in meaning
to subversion.
- In cases of purposeful compromise from below, the perpetration is
performed by a user who has somehow gained access (with or without
authorization) to layers of abstraction underlying a particular component
that is being compromised, which can then be undermined without attacking
the component itself. Examples include (1) obtaining the unencrypted form
of an encrypted message by reading a temporary file in storage, (2) finding
an occurrence of a particular word in a restricted database to which access
is not permitted by scanning the disk on which that database is stored, and
(3) editing the raw text of an enqueued mail message after it is released by
a user but before it is actually sent out by the mailer. Thus, compromise
from below may follow from a compromise from outside or compromise from
within that enables a subsequent penetration to the lower-layer mechanisms.
- Cases of accidental compromise from below often involve the results of
flaws or malfunctions at lower layers of abstraction that in some way alter
or otherwise affect the expected behavior of the particular component. For
example, consider a rather dramatic error that occurred in the early 1960s
in the MIT Compatible Time-Sharing System
(CTSS). The entire unencrypted file of user passwords was printed out as the message of the day for each new user login.
This resulted from a shortsighted naming convention in the context editor
being used at the same time by two different operators in a shared system
directory [250]. Two temporary file names were the same
for each invocation of the editor, and the temporary files became
interchanged between two different users in the same directory. Notable
examples of hardware flaws include the Pentium floating divide
flaw (discussed in the Risks Forum in
RISKS-16.57-59,61,66,67,69,71,72,81, for example) and security-relevant
flaws in other processors (e.g., [357]).
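The CTSS incident turned on two editor invocations sharing one fixed temporary file name in the same directory. A minimal present-day sketch of the usual remedy, using Python's standard `tempfile.mkstemp`, which creates a uniquely named file that did not previously exist, is shown below. The function names and the `editor-` prefix are hypothetical, and this is not a description of how CTSS itself was repaired.

```python
import os
import tempfile
from typing import Tuple


def open_scratch_file(shared_dir: str) -> Tuple[int, str]:
    """Create a uniquely named scratch file for one editor invocation."""
    return tempfile.mkstemp(prefix="editor-", suffix=".tmp", dir=shared_dir)


def demo() -> bool:
    """Two invocations in one shared directory get distinct scratch files."""
    with tempfile.TemporaryDirectory() as shared_dir:
        fd_a, path_a = open_scratch_file(shared_dir)
        fd_b, path_b = open_scratch_file(shared_dir)
        try:
            os.write(fd_a, b"first operator's text")
            os.write(fd_b, b"second operator's text")
            return path_a != path_b
        finally:
            os.close(fd_a)
            os.close(fd_b)
```

Because each invocation receives its own name, the two operators' scratch contents cannot be interchanged through a name collision.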
The distinctions among these three modes tend to disappear in systems that
are not well structured, in which inside and outside are indistinguishable
(as in systems with only one protection state), or in which outside and
below are merged (as in flat systems that have no concept of
hierarchy). In addition, compromises from
outside may subsequently enable compromises from within, and compromises
from outside or within may subsequently enable compromises from below. The
distinctions are also murky in cases of emergency operations. Furthermore,
an egregious process whereby vendors can disable software remotely is
discussed in Section 2.4.
Certain attack modes may occur in any of these forms of compromise. For
example, consider the following Trojan-horse perpetrations, which can take
place in each form.
- Compromise from outside: a letter bomb (e.g., electronic mail or Word
macro virus) that when read or interpreted can result in unanticipated
executions, or a spoofing attack that piggybacks on a line or replays a
message
- Compromise from within: a surreptitious code patch that maintains a
hidden trickle file of sensitive information within the program data
- Compromise from below: a wiretap implanted inside a telephone switch,
or Ken Thompson's
now-classical object-code modification of the C compiler
that permitted a trapdoor routine to be planted in the
login [372] (whereby it becomes clear that system security also
depends upon the compiler). Thompson's Trojan horse was inserted into
the object code of the C compiler (with no change in the source of the
C compiler), lurking until the next recompilation of the login routine, when
it created a trapdoor in the object code of the login routine (with no
change to the source code of the login routine). The Trojan horse placed in
the compiler was capable of reinserting itself into the object code of
successive recompilations of the compiler itself, and thus was itself
survivable! This suggests that compilers have some special problems of
their own, as considered in Section 5.10.
Table 1: Illustrative Compromises

| Layer of abstraction | Compromise from outside: Needs exogirding | Compromise from within: Needs endogirding | Compromise from below: Needs undergirding |
|---|---|---|---|
| Outside environment | | Acts of God, earthquakes, lightning, etc. | Chernobyl-like disasters caused by users or operators |
| User | Masqueraders | Accidental mistakes; Intentional misuse | Application system outage or service denial |
| Application | Penetrations of application service integrity | Programming errors in application code | Application (e.g., DBMS) undermined within operating systems (OSs) |
| Middleware | Penetration of Web and DBMS servers | Trojan horsing of Web and DBMS servers | Subversion of middleware from OS or network operations |
| Networking | Penetration of routers, firewalls; Denials of service | Trojan horsing of network software | Capture of crypto keys within the OS; Exploitation of lower protocol layers |
| Operating system | Penetrations of OS by unauthorized users | Flawed OS software; Trojan-horsed OS; Tampering by privileged processes | OS undermined from within hardware: faults exceeding fault tolerance; hardware flaws or sabotage |
| Hardware | Externally generated electromagnetic or other interference; External power-utility glitches | Bad hardware design and implementation; Hardware Trojan horses; Unrecoverable faults; Internal interference | Internal power irregularities |
| Inside environment | Malicious or accidental acts | Internal power supplies, tripped breakers, UPS/battery failures | |

Table 1 summarizes some properties whose nonsatisfaction
could potentially compromise system behavior, by compromising
confidentiality, integrity, availability, real-time performance, or
correctness of application software, either accidentally or intentionally.
To illustrate such compromises, the table also indicates possible
compromises -- whether they involve modification (tampering) or not --
that can occur from outside, from within, or from below, for each
representative layer of abstraction. The distinctions are not always
precise: a penetrator may compromise from outside, but once having
penetrated, is then in position to compromise from below or from within.
Thus, one type of compromise may be used to enable another. For this
reason, the table characterizes only the primary modes of compromise. For
example, a user entering through a resource access control package such as
RACF or CA-TopSecret, or through a superuser mechanism, and gaining
apparently legitimate access to the underlying operating system may then be
able to undermine both operating-system integrity (compromise from within)
and database integrity (compromise from below if through the operating
system), even though the original compromise is from outside. Similarly, a
software implementation of an encryption algorithm or of a cryptographic
check sum used as an integrity seal can be compromised by someone gaining
access to the unencrypted information in memory or to the encryption
mechanism itself, at a lower layer of abstraction. A user exploiting an
Internet Protocol router vulnerability may
initially be able to compromise a system from within the logical layer of
its networking software, but subsequently may create further compromises
from outside or below.
The Thompson compiler Trojan horse is a
particularly interesting case, because it may not normally be thought of as
compromise from below if the compiler is not understood to be something that
is depended upon for its correct behavior. Indeed, it is a very bad policy
to use an untrustworthy compiler to generate an operating system, and
therefore the compiler must be considered "below" (or else the
dependence must be considered as a violation of layered trustworthiness,
as in MLX). Indeed, the entire software development process is a huge
opportunity for compromising the integrity of the resulting system
(intentionally or accidentally).
From the table, we observe that a system may be inherently compromisible, in
a variety of ways. The purpose of system design is not to make the system
completely noncompromisible (which is impossible), but rather to provide
some assurance that the most likely and most devastating compromises are
properly addressed by designs, architectures, development processes, and
operational practices, and -- if compromises do occur -- to be able to
determine the causes and effects, to limit the negative consequences, and to
take appropriate actions. Thus, it is desirable to provide underlying
mechanisms that are inherently difficult to compromise, and to build
consistently on those mechanisms. On the other hand, in the presence of
underlying mechanisms that are inherently compromisible, it may still be
possible to use Byzantine-like strategies to make the higher-layer
mechanisms less compromisible. However, flaws that permit compromise of the
underlying layers are inherently risky unless the effects of such
compromises can be strictly contained.
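One simple way to obtain a more trustworthy result from less trustworthy components is replication with voting. The sketch below is plain majority voting over replicated results, not a full Byzantine agreement protocol. It assumes that correct replicas all return the same value and that at most `max_faulty` replicas misbehave; the function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(results: Sequence[Hashable], max_faulty: int) -> Hashable:
    """Return a value reported by more than `max_faulty` replicas.

    With at least 2 * max_faulty + 1 replicas, of which at most max_faulty
    are faulty, any value seen more than max_faulty times must have been
    reported by at least one correct replica.
    """
    if len(results) < 2 * max_faulty + 1:
        raise ValueError("too few replicas to tolerate that many faults")
    value, count = Counter(results).most_common(1)[0]
    if count <= max_faulty:
        raise RuntimeError("no value has enough agreeing replicas")
    return value
```

For example, `majority_vote([42, 42, 7], max_faulty=1)` returns 42, masking the single deviant replica. If the fault assumption is violated, the vote can be wrong, which is why containment of lower-layer compromises still matters.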
Protection against the three forms of compromise noted in
Section 1.3 -- compromise from outside, compromise from
within, and compromise from below -- is referred to in this report as exogirding, endogirding, and
undergirding, respectively -- that is, providing
outside barrier defenses, internal defenses, and defenses that protect
underlying mechanisms, respectively.6
In general, all three types of protection are necessary. Various approaches
are considered in Chapters 5, 7,
and 8. For the purposes of this chapter, just a few
illustrative examples are given here, relating to a few of the layers of
abstraction shown in Table 1. As indicated by this summary,
some of the techniques are quite different from one case to another,
although other techniques are more generically applicable.
- Exogirding: Domain architectures protecting systems from
their users, firewalls protecting one system from another, authentication
mechanisms preventing penetrations at different layers, use of
encryption for confidentiality and integrity of information transmission,
electromagnetic shielding
- Endogirding: compile-time and run-time checks such as bounds
checks and type checks to minimize the effects of errant programs, fault
tolerance, integrity checks to prevent and detect Trojan horses, use of
encryption for confidentiality and integrity of stored information,
monitoring of program behavior to detect misuse or aberrant system operation
- Undergirding: use of secure kernels to prevent higher-layer
compromises, trustworthy operating systems and high-integrity hardware to
support critical software functionality, use of special-purpose hardware
(e.g., co-processors and cryptographic engines) to aid less
trustworthy higher-layer systems
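As one small example of endogirding, an integrity check can compare a stored program or data image against a previously recorded cryptographic digest. The sketch below uses SHA-256 from Python's standard `hashlib`. The function names are hypothetical, and the choice of hash is an assumption made here rather than something this report prescribes.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Compute a SHA-256 digest of the given bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(data: bytes, recorded_digest: str) -> bool:
    """Check data against a digest recorded when it was known to be good."""
    return hmac.compare_digest(digest(data), recorded_digest)
```

Such a check detects tampering only if the recorded digest and the checking code are themselves protected. If an attacker can alter them from below, the seal is worthless, which is why endogirding and undergirding are needed together.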
Some of the many stages of system development and use during which risks may
arise are listed below, along with a few examples of what might go wrong (and,
in most cases, what has gone wrong in the past). This list summarizes
some of the main threats.
Section 1.6 gives examples of specific illustrative cases.
Problems in the system development process involve people at each stage,
and are illustrated by the following examples:
- System conceptualization: inappropriate use of
technology when the risks were actually too great, and absence of
computerization when it would have been essential
- Requirements definition: erroneous, incomplete, and inconsistent
requirements
- Models: false assumptions about the physical world, the operating
environment, and human behavior
- System design: fundamental misconceptions and design flaws
- Implementation: program bugs, omissions, and Trojan horses causing
unanticipated effects
- Support systems: poor programming languages, faulty compilers and
debuggers, and misleading development tools whose use might permit the
development of weak systems
- Testing and verification: incomplete testing,
incomplete or erroneous verification
- Evolution: sloppy maintenance, misconceived system upgrades,
introduction of new flaws in attempts to fix old flaws
- Stagnation: infeasible expansion of a system beyond its initial
requirements (e.g., because software bloat, loss of key personnel, or
unavailability of compatible hardware impedes upgrades and retrofits)
- Decommission: premature removal of a primary or backup facility,
hidden dependence on an old version that is no longer available but that
is required (e.g., for compatibility)
Problems in system operation and use involve people and external factors,
and are illustrated by the following examples:
- Hardware malfunction, due to
- Environmental factors such as lightning, earthquakes,
extreme temperatures, electromagnetic and other interference including
cosmic radiation and sunspot activity,
animals (sharks, rats, and squirrels are included in the case histories, for
example) and many natural disasters.
(For recent House testimony on some of the risks of RF interference, see
"Radio Frequency Weapons and Proliferation: Potential Impact on the
Economy," http://www.house.gov/jec/hearings/02-25-8h.htm.) Systems developed
in the former Soviet Union were previously discussed by General Schweitzer
(http://jya.com/rfw-jec.htm).
- Loss of electrical power
- Component malfunction: aging, transient behavior, or inadequate design
- Software misbehavior: for example, due to problems in the system
development process, as noted above
- Human behavior in system use, whether in system operators,
administrators, staff, users, or unsuspecting bystanders, for example, in
- Installation: improper configuration, incompatible versions, erroneous
parameter settings, or linkage errors
- Misuse of the overall environment or the computer systems, including
- Unintentional misuse (including untimely use): entry of improper
inputs, misinterpretation of outputs, or execution of the wrong function
- Intentional misuse: penetration by unauthorized or unintended
users, misuse by authorized users, insertion of Trojan horses, or fraud
The last subcategory -- intentional misuse -- represents a particularly
worrisome area of concern and is considered in Section 2.1.
We consider here just a few illustrative problems that have been
encountered in the past, suggesting the rather pervasive nature of the
survivability problem -- with many diverse causes and effects.
The first seven items listed below involved massive outages triggered
accidentally by local events, each of which compromised overall system and
network survivability.
The eighth was triggered by a single human error, but the effects
propagated throughout the San Francisco Bay Area.
The ninth involved a local outage that was quickly
corrected, but whose after-effects continued to propagate for many hours.
These cases involved human factors as well as other causes.
- ARPAnet collapse.
On 27 October 1980, the ARPAnet accidentally shut itself down globally.
Collapse, analysis, and recovery took about
4 hours. The problem was due to a hardware design omission (the absence of
parity checking in memory), hardware failures (the coexistence of two bogus
versions of a node status message resulting from memory errors), and an
overly generous algorithm for garbage collection of status messages. Each
node in the network became contaminated, memory overflowed, and the network
became useless.
- Internet service outages.
On 23 April 1997, Internet service providers lost contact with nearly
all U.S. Internet backbone operators. As a result, much of the
Internet was disconnected, some parts for 20 minutes, some for as long as 3
hours. The problem was attributed to MAI Network Services in McLean,
Virginia (www.mai.net), which gave Sprint and other backbone providers
incorrect routing tables, the result of which was that MAI was flooded
with traffic. In addition, the InterNIC directory incorrectly listed
Florida Internet Exchange as the owner of the routing tables. A "technical
bug" was also blamed for causing one of MAI's Bay Networks routers not to
detect the erroneous data. Furthermore, the routing tables Sprint received
were designated as optimal, which gave them higher credibility than
otherwise. Something like 50,000 routing addresses all pointed to
MAI.
- Internet domain
blockage. On 16 July 1997, Network Solutions
Inc. attempted to run the autogeneration of the top-level domain zone files,
which resulted in the failure of a program converting Ingres data into the
DNS tables, corrupting the .com and .net domains in the top-level domain
name server (DNS), maintained by NSI. Quality-assurance alarms were
evidently ignored and the corrupted files were released at 2:30 a.m.
EDT on 17 July -- with widespread effects. Other servers copied the
corrupted files from the NSI version. Corrected files were issued 4 hours
later, although various problems lingered.
- Long-distance calling blockage.
On 15 January 1990, the AT&T long-distance network suffered a nationwide
congestion problem that effectively shut down long-distance calling for more
than 9 hours. The problem was traced to a flaw in the recovery
software in each Signaling System 7 switch
that enabled each neighboring switch to crash when receiving traffic from a
newly crashed switch that had attempted to reinitialize itself. This crash
phenomenon propagated repeatedly throughout the entire network, and resulted
in almost no long-distance calls getting through.
- AT&T frame-relay outage. On 13 April 1998, AT&T's packet-data
frame-relay network collapsed throughout the United States. This network is
used for business customers, credit-card activities, bank transactions,
travel reservations, among others. The outage resulted from a faulty
software upgrade of a circuit card in a single switch, and effectively
created a black-hole-like path that shut down the entire network, in some
cases for as long as 26 hours.
- Western power outages. Electric power outages on 2 July 1996
affected at least 10 Western states. The outage was reportedly triggered by
a single tree in Idaho coming in contact with a transmission line, although
many other events conspired to create the massive propagation. Further
extensive West Coast outages on 10 August 1996 affected 8 million customers
in 8 states, Canada, and Baja (Mexico).
- Galaxy IV satellite failure. The loss of the orientation of the
Hughes HS601 Galaxy IV satellite on 19 May 1998 (resulting from a
malfunction of both the primary system and the backup system) caused
as many as 40 million pager systems to fail across the United States, as
well as the loss of many other services also provided by that satellite --
affecting point-of-sale devices, hospital operations, and other activities.
It took several days to reconfigure the communications, using other
satellite facilities. (Two other failures of Hughes HS601 satellites
occurred -- Galaxy VII on 14 June 1998 and another on 4 July 1998 affecting
3.7 million DirecTV subscribers -- but the backup systems worked properly
in both of those cases.)
- San Francisco Bay Area power outage. At 8:15 a.m. on
8 December 1998, a power surge resulted from an attempt to
reconnect a power station to the grid -- but without first having removed a
temporary ground connection. More than a million people were affected, some
of whom were without power for as long as 8 hours. San Francisco Airport
was closed for 1.5 hours. The Pacific Stock Exchange, Rapid Transit, ATMs,
offices, and hospitals were without power. Various secondary effects arose;
for example, the surge caused SRI only a momentary blip, but that was enough
to take many computers down for hours. (See RISKS, vol. 20, nos. 11 and
12.)
- Kansas City power outage triggers national air-traffic snarl. At
the Kansas City (Olathe) Air Route Traffic Control Center, at 9:03 a.m. CST on 18 December 1997, a technician routed power through half of the
redundant "uninterruptible" power system, preparatory to performing annual
preventive maintenance on the other half. Unfortunately, he apparently
pulled the wrong circuit board, and took down the remaining half as well.
The maintenance procedure also bypassed the standby generators and emergency
batteries. The resulting outage took out radio communications with
aircraft, radar information, and phone lines to other control centers.
Power was out for only 4 minutes, communications were restored shortly
thereafter, and backup radar was working by 9:20 a.m. However, at
least 300 planes were in the Olathe-controlled airspace at the time, and the
effects piled up nationwide. Hundreds of flights were canceled, diverted,
or delayed. Many delays were as long as 2 hours, with some delays
continuing into the evening.
The remaining cases noted here are examples of other types of accidental
survivability problems, although less widespread in their resulting effects.
- Navy on-board problems. Two Ticonderoga-class cruisers
(the USS Hue City and USS Vicksburg) were put out of commission because of
difficulties in integrating new on-board weapon-control system software,
involving 8 million lines of code (RISKS vol. 19, no. 86). The Navy's
Windows NT based Smart Ship technology is also causing potentially serious
difficulties. For example, in September 1997, the Aegis missile cruiser USS
Yorktown suffered a systems failure during maneuvers off the coast of Cape
Charles, Virginia, as the result of an unchecked divide-by-zero in an NT
application. The ship was dead in the water for 2 hours and 45 minutes. An
earlier loss of propulsion also occurred on 2 May 1997 (RISKS vol. 19,
no. 88).
- Tomahawk missile abort.
On 2 August 1986, a Tomahawk missile suddenly made a soft landing in the
middle of an apparently successful launch. The abort sequence was
accidentally triggered as a result of a bit dropped by the hardware,
possibly due to stray electromagnetic radiation (a cosmic ray hit or other
form of interference?) on the computer.7
(On 8 December 1985, an earlier Tomahawk cruise missile crashed during
launch because its midcourse program had been accidentally erased on
loading.)
- Black Hawk helicopter crashes.
In tests, radio waves
triggered a complete hydraulic failure of a UH-60 Black Hawk helicopter,
effectively generating false electronic commands. Twenty-two people were
killed in five Black Hawk crashes before shielding was added to the
electronic controls. Failures in other systems have also been linked to
electromagnetic interference and rotor control failures. (On 13 December
1998, 60 Minutes reported that repeated notices of failures had not
been acted upon.)
- Other electromagnetic interference affecting defense systems.
Patriot defenses and Predator unmanned aerial vehicles reportedly
cannot work properly in certain foreign countries (Germany, Japan, South
Korea, and Bahrain are particular instances) because of frequency clashes.
For example, Patriot missile system radios, radars, and data-link terminals
clash with Korean cellular phones; pagers of U.S. forces clash with Japanese
aeronautical systems; crib monitors used on U.S. bases clash with German
telephone service. In Bahrain, SPS-40 and SPS-49 radars are unusable
because of interference from the national telecommunications services. (See
the Defense Week issue released 26 October 1998.)
- Phobos probe losses.
The Phobos I probe was doomed by a faulty software update, which caused a
loss of solar orientation, which in turn resulted in discharge of the solar
batteries. Phobos II encountered a similar fate when the automatic antenna
reorientation failed, causing a permanent loss of communications. Several
similar catastrophes are linked to faulty maintenance.
- Other space cases.
Other cases of nonsurvivable systems are also worth noting: (1)
the cosmic ray bombardment of TDRS in 1984 (which cut the Challenger's
communications in half), (2) the ill-fated Challenger launch in 1986 due to
an O-ring weak-link problem, (3) sunspot activity that affected the
computers and altered Skylab's orbit in 1979, and (4) an Atlas-Centaur whose
program was altered by lightning.
- Patriot missile defense. In addition,
many cases have occurred in which system functionality continued but the
overall performance was no longer consistent with expectations. An example
is provided by the Patriot software whose clock drifted far enough to
prevent it from adequately tracking the incoming SCUD missile that hit the
Dhahran barracks. It might be said that the computer hardware survived (it
continued to do its computations), but the necessary application
functionality did not survive over the 100 hours that the system had been
running (instead of the 14 hours specified by the requirements). This
illustrates the fact that we must always define survivability with respect
to specific requirements or expectations of the application, not just of
the system software.
- Cable cuts. In numerous cases, a single
cable was cut, with amazingly dispersed effects -- in one case resulting in
prolonged delays at all three airports in the New York City area. There
have also been cases in which multiple circuits have been severed
simultaneously. In 1986, the entire Northeast of the United States was
separated from the rest of the ARPAnet because all seven circuits actually
went through the same conduit in White Plains. In 1991, two fiber-optic
lines in Annandale, Virginia, that had been intentionally placed in separate
conduits to avoid such single weak links were simultaneously severed,
affecting 80,000 telephone circuits (including the Pentagon and news
services). In the last week of June 2000, during construction preparing for
connecting the Bay Area Rapid Transit (BART) to the San Francisco Airport, a
droplet of welding material in a manhole just south of San Francisco caused
a fire that destroyed portions of 27 cables, wiping out telephone service to
25,000 customers; the process of replacing 800 feet of cable and correctly
reconnecting the many thousands of wires was expected to take at least two
weeks.
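The ARPAnet and Annandale cable cuts above share one pattern: circuits that looked redundant on paper depended on a common physical element. The following is a minimal sketch of how such a hidden single weak link might be detected from routing records, assuming a simple mapping from each circuit to the conduits it traverses. The circuit and conduit names, the CIRCUITS table, and both function names are hypothetical illustrations, not data or tools from any of the cases cited.

```python
# Hypothetical routing table: circuit name -> conduits it passes through.
# All names are illustrative assumptions, not records from a real network.
CIRCUITS = {
    f"circuit-{i}": [f"local-{i}", "white-plains", f"remote-{i}"]
    for i in range(1, 8)
}


def shared_conduits(circuits):
    """Return the conduits traversed by every circuit.

    Any conduit in the result is a single weak link: cutting it
    severs all of the supposedly independent circuits at once.
    """
    routes = [set(path) for path in circuits.values()]
    if not routes:
        return set()
    return set.intersection(*routes)


def surviving_circuits(circuits, cut_conduits):
    """Return the sorted names of circuits that avoid every cut conduit."""
    cut = set(cut_conduits)
    return sorted(
        name for name, path in circuits.items() if cut.isdisjoint(path)
    )


if __name__ == "__main__":
    print("Shared by all circuits:", shared_conduits(CIRCUITS))
    print("Survive loss of local-3:", surviving_circuits(CIRCUITS, ["local-3"]))
    print("Survive loss of shared conduit:",
          surviving_circuits(CIRCUITS, ["white-plains"]))
```

A check of this kind helps only if the routing records reflect the physical layout. In the Annandale case the lines had been placed in separate conduits and were still severed together, so physical proximity would also have to be recorded for such an analysis to be meaningful.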
Next, we consider a few cases attributed to malicious acts.
- Denial-of-service attacks on ISPs and servers. The PANIX ISP
suffered a severe denial of service resulting from a SYN-flooding
attack (RISKS-18.45,
http://catless.ncl.ac.uk/Risks/18.45.html).
WebCom suffered a similar attack (RISKS-18.69,
http://catless.ncl.ac.uk/Risks/18.69.html).
Such flooding and resource-exhaustion attacks are relatively easy to
perpetrate, because they do not require any authentication within the
attacked systems.
- Distributed denials of service. Within a three-day period in
February 2000, attacks were targeted at many sites -- including Yahoo,
Amazon, eBay, CNN.com, Buy.com, ZDNet, E*Trade, and Excite.com. These
attacks were launched from intermediary sites (zombies) that had been
compromised externally, which made it difficult to track the perpetrators.
(See RISKS-20.79,
http://catless.ncl.ac.uk/Risks/20.79.html.)
- Personal computer virus attacks.
(See Section 2.1.2 for background on personal-computer viruses.) In
the personal computer world of Microsoft software, there is an
extraordinarily large number of user-propagated Trojan horses, usually
referred to as viruses, and herein referred to as PC viruses even though
they are technically not viruses (because their propagation requires human
actions). Some of these viruses have resulted in serious consequences such
as disabling systems. Particularly noteworthy are the Word macros in
the Melissa attacks (RISKS-20.26,28-34,39,40,44,45) and the scripting
attachments in the ILOVEYOU virus (4 May 2000, RISKS-20.88 and later) and
its copycat spinoffs, both of which exploited fundamental weaknesses in
Microsoft Outlook software relating to the executable nature of attachments.
(The ILOVEYOU virus actually infected some classified systems and networks,
which of course is simply not supposed to happen, although we know that
operationally the complete isolation of classified systems is a myth.)
- Australian sabotage. A massive communications blackout in
Sydney, Australia, on 22 November 1987 was caused by a knowledgeable
saboteur who was a former Telecom employee. The attacker severed 24
main cables in 10 different locations, which had been carefully selected to
have maximum effect. This attack knocked out 35,000 telephone lines,
shutting down many computers, banks, telephone offices, ATMs, point-of-sale
systems, stores, telexes, facsimile, and betting-office services. Because
all international services are routed through Sydney, the effects were not
just local. It was suspected that the attack had been based on 2-year-old
information, because the same attack 2 years earlier would have been
completely devastating.8
- San Francisco blackout attributed to sabotage. A total of 126,000 customers
in northern San Francisco experienced a power outage for as long as 3.5
hours beginning at 6:15 a.m. on 25 October 1997, when five
transformers at a single power station stopped working. The FBI
counterterrorism unit investigated what it considered to be sabotage,
whereby 39 of the 42 switches at one substation appeared to have been
manually opened. (See RISKS-19.42.)
- Hacked Web sites. Many organizations now depend on their
Web sites for dissemination of timely and reliable information.
However, serious risks exist of databases being altered by intruders.
Many cases of pranks and nuisance acts also clearly demonstrate how
vulnerable systems are to attack.
Such intrusions continue to occur, because the systems
remain vulnerable. Here are a few recent cases. Although they were
mostly intended as pranks rather than malicious activities, it is
clear that penetrations were involved, and the damages could have been
much worse -- especially in cases in which there were no firewalls
isolating these Web sites from internal systems, or where the firewalls
were improperly configured.
+ CIA (RISKS-18.49)
+ Justice Department (RISKS-18.35)
+ FBI (RISKS-20.43)
+ Department of Interior (RISKS-20.43)
+ Three Army Web sites (RISKS-19.63)
+ Air Force (RISKS-18.64)
+ NASA (RISKS-18.88)
+ Space Station problem reporting database (RISKS-20.47-48)
+ National Collegiate Athletic Association (RISKS-18.88)
+ U.S. Information Agency (RISKS-20.18)
+ George W. Bush campaign site, photo replaced (RISKS-20.64)
+ Gallup Organization (RISKS-20.83)
+ Japanese Government Science and Technology Agency,
census data erased (RISKS-20.77)
+ Swedish meatpacker site (RISKS-19.14)
+ Swedish National Board of Health and Welfare (RISKS-20.87)
+ National Hockey League denial of service, 5-day outage (RISKS-20.89)
In addition, a U.S. General Accounting Office study uncovered some rather
egregious security vulnerabilities in the Web site of the Environmental
Protection Agency. When threatened with exposure of those vulnerabilities
by an environmentally unsympathetic Congressman, the EPA chose to remove its
Web site from the Net altogether (RISKS-20.77).
There was also a report in Federal Computer Week that a DoD bloodtype
database had been subverted and bloodtype data altered (RISKS-19.97);
however, that report was subsequently corrected: no such penetration had
occurred, although a red team had identified the possibility of such an
attack and contemplated its possible effects (