Abstract

Avionics and control systems for aircraft use distributed, fault-tolerant computer systems to provide safety-critical functions such as flight and engine control. These systems are becoming modular, meaning that they are based on standardized architectures and components, and integrated, meaning that some of the components are shared by different functions of possibly different criticality levels. The modular architectures that support these functions must provide mechanisms for coordinating the distributed components that provide a single function (e.g., distributing sensor readings and actuator commands appropriately, and assisting replicated components to perform the function in a fault-tolerant manner), while protecting functions from faults in each other. Such an architecture must tolerate hardware faults in its own components and must provide very strong guarantees on the correctness and reliability of its own mechanisms and services.

One of the essential services provided by this kind of modular architecture is communication of information from one distributed component to another, so a (physical or logical) communication bus is one of its principal components, and the protocols used for control and communication on the bus are among its principal mechanisms. Consequently, these architectures are often referred to as buses (or databuses), although this term understates their complexity, sophistication, and criticality.

The capabilities once found in aircraft buses are becoming available in buses aimed at the automobile market, where the economies of scale ensure low prices. The low price of the automobile buses then renders them attractive to certain aircraft applications, provided they can achieve the required safety. In this report, I describe and compare the architectures of two avionics and two automobile buses in the interest of deducing principles common to all of them, the main differences in their design choices, and the tradeoffs made. The avionics buses considered are the Honeywell SAFEbus (the backplane data bus used in the Boeing 777 Airplane Information Management System) and the NASA SPIDER (an architecture being developed as a demonstrator for certification under the new DO-254 guidelines); the automobile buses considered are the TTTech Time-Triggered Architecture (TTA), recently adopted by Audi for automobile applications and by Honeywell for avionics and aircraft control functions, and FlexRay, which is being developed by a consortium of BMW, DaimlerChrysler, Motorola, and Philips. I consider these buses from the perspective of their fault hypotheses, mechanisms, services, and assurance.

Contents

1 Introduction
2 Comparison
  2.1 The Four Buses
    2.1.1 SAFEbus
    2.1.2 TTA
    2.1.3 SPIDER
    2.1.4 FlexRay
  2.2 Fault Hypothesis and Fault Containment Units
    2.2.1 SAFEbus
    2.2.2 TTA
    2.2.3 SPIDER
    2.2.4 FlexRay
  2.3 Clock Synchronization
    2.3.1 SAFEbus
    2.3.2 TTA
    2.3.3 SPIDER
    2.3.4 FlexRay
  2.4 Bus Guardians
    2.4.1 SAFEbus
    2.4.2 TTA
    2.4.3 SPIDER
    2.4.4 FlexRay
  2.5 Startup and Restart
    2.5.1 SAFEbus
    2.5.2 TTA
    2.5.3 SPIDER
    2.5.4 FlexRay
  2.6 Services
    2.6.1 SAFEbus
    2.6.2 TTA
    2.6.3 SPIDER
    2.6.4 FlexRay
  2.7 Flexibility
    2.7.1 SAFEbus
    2.7.2 TTA
    2.7.3 SPIDER
    2.7.4 FlexRay
  2.8 Assurance
    2.8.1 SAFEbus
    2.8.2 TTA
    2.8.3 SPIDER
    2.8.4 FlexRay
3 Conclusion

List of Figures

1.1 Generic Bus Configuration
1.2 Bus Interconnect
1.3 Star Interconnect
1.4 SPIDER Interconnect

Chapter 1  Introduction

Embedded systems generally operate as closed-loop control systems: they repeatedly sample sensors, calculate appropriate control responses, and send those responses to actuators. In safety-critical applications, such as fly- and drive-by-wire (where there are no direct connections between the pilot and the aircraft control surfaces, nor between the driver and the car steering and brakes), requirements for ultra-high reliability demand fault tolerance and extensive redundancy. The embedded system then becomes a distributed one, and the basic control loop is complicated by mechanisms for synchronization, voting, and redundancy management.
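To make the basic control loop concrete, here is a minimal sketch; the sensor and actuator stubs, the proportional gain, and the 20 ms period are all invented for illustration, and the mechanisms listed above (synchronization, voting, redundancy management) are exactly what is missing from it.

```python
# Minimal closed-loop control skeleton: sample, compute, actuate, repeat.
# All names and constants are hypothetical stand-ins, not from any real system.

import time

def read_sensor():        # stand-in for a real sensor driver
    return 0.0

def command_actuator(u):  # stand-in for a real actuator driver
    pass

SETPOINT, KP, PERIOD_S = 1.0, 0.5, 0.020   # assumed 20 ms control period

def control_loop(iterations=5):
    for _ in range(iterations):
        error = SETPOINT - read_sensor()   # sample
        command_actuator(KP * error)       # compute response and actuate
        time.sleep(PERIOD_S)               # wait for the next period

control_loop()
```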
Systems used in safety-critical applications have traditionally been federated, meaning that each function (e.g., autopilot or autothrottle in an aircraft, and brakes or suspension in a car) has its own fault-tolerant embedded control system, with only minor interconnections to the systems of other functions. This provides a strong barrier to fault propagation: because the systems supporting different functions do not share resources, the failure of one function has little effect on the continued operation of others. The federated approach is expensive, however (because each function has its own replicated system), so recent applications are moving toward more integrated solutions in which some resources are shared across different functions. The new danger here is that faults may propagate from one function to another; partitioning is the problem of restoring to integrated systems the strong defenses against fault propagation that are naturally present in federated systems. A dual issue is that of strong composability: here we would like to take separately developed functions and have them run without interference on an integrated system platform, with negligible integration effort.

The problems of fault tolerance, partitioning, and strong composability are challenging ones. If handled in an ad hoc manner, their mechanisms can become the primary sources of faults and of unreliability in the resulting architecture [Mac88]. Fortunately, most aspects of these problems are independent of the particular functions concerned, and they can be handled in a principled and correct manner by generic mechanisms implemented as an architecture for distributed embedded systems.

One of the essential services provided by this kind of architecture is communication of information from one distributed component to another, so a (physical or logical) communication bus is one of its principal components, and the protocols used for control and communication on the bus are among its principal mechanisms. Consequently, these architectures are often referred to as buses (or databuses), although this term understates their complexity, sophistication, and criticality. In truth, these architectures are the safety-critical core of the applications built above them, and the choice of services to provide to those applications, and the mechanisms of their implementation, are issues of major importance in the construction and certification of safety-critical embedded systems.

Capabilities and considerations once encountered only in buses for civil aircraft are now found in buses aimed at the automobile market, where the economies of scale ensure low prices. The low price of the automobile buses then renders them attractive to certain aircraft applications, provided they can achieve the required safety. In this report, I describe and compare the architectures of two avionics and two automobile buses in the interest of deducing principles common to all of them, the main differences in their design choices, and the tradeoffs made.
The avionics buses considered are the Honeywell SAFEbus [ARI93, HD92] (the backplane data bus used in the Boeing 777 Airplane Information Management System) and the NASA SPIDER [Min00] (an architecture being developed as a demonstrator for certification under the new DO-254 guidelines); the automobile buses considered are the TTTech Time-Triggered Architecture (TTA) [TTT99, KG94], recently adopted by Audi for automobile applications and by Honeywell for avionics and aircraft control functions, and FlexRay [B+01], which is being developed by a consortium of BMW, DaimlerChrysler, Motorola, and Philips.

All four of the buses considered here are primarily time triggered; this is a fundamental design choice that influences many aspects of their architectures and mechanisms, and sets them apart from fundamentally event-triggered buses such as Controller Area Network (CAN), Byteflight, and LonWorks. Time triggered means that all activities involving the bus, and often those involving components attached to the bus, are driven by the passage of time ("if it is 20 ms since the start of the frame, then read the sensor and broadcast its value"); this is distinguished from event triggered, which means that activities are driven by the occurrence of events ("if the sensor reading changes, then broadcast its new value"). A time-triggered system interacts with the world according to an internal schedule, whereas an event-triggered system responds to stimuli that are outside its control.

The time-triggered and event-triggered approaches to systems design find favor in different application areas, and each has strong advocates. For integrated, safety-critical systems, however, the time-triggered approach is generally preferred. The reason is that an integrated system brings different applications ("functions" in avionics terms) together, whereas in a safety-critical system we usually prefer them to be kept apart! This is so that a failure in one application cannot propagate and cause failures in other applications; such protection against fault propagation is called partitioning, and it is most rigorously achieved (for reasons I will explain shortly) in time-triggered systems.

Partitioning is a necessity in integrated safety-critical systems, but once achieved it also creates new opportunities. First, it simplifies the construction of fault-tolerant applications: such applications must be replicated across separate components with independent failure characteristics and no propagation of failures between them. Traditional federated architectures are typically hand-crafted to achieve these properties, but partitioning provides them automatically (indeed, they are the same as partitioning, reinterpreted to apply to the redundant components of a single application, rather than across different applications). Second, partitioning allows single applications to be deconstructed into smaller components that can be developed to different assurance levels: this can reduce costs and can also allow the provision of new, safety-related capabilities. For example, an autopilot has to be developed to DO-178B assurance Level A [RTC92]; this is onerous and expensive, and a disincentive to the introduction of desirable additional capabilities, such as extensive built-in self test (BIST). If the BIST could run in a separate partition, however, its assurance might be reduced to Level C, with a corresponding reduction in its cost of development.
Third, although the purpose of partitioning is to exclude fault propagation, it has the concomitant benefit that it promotes composability. A composable design is one in which individual applications are unaffected by the choice of the other applications with which they are integrated: an autothrottle, for example, could be developed, tested, and (in principle) certified in isolation, in full confidence that it will perform identically when integrated into the same system as an autolander and a coffee maker.

Partitioning and composability concern the predictability of the resources and services perceived by the clients (i.e., applications and their subfunctions) of an architecture. Predictability has two dimensions: value (i.e., logically correct behavior) and time (i.e., services are delivered at a predictable rate, and with predictable latency and jitter). It is temporal (time) predictability, especially in the presence of faults, that is difficult to achieve in event-triggered architectures, and this leaves time triggering as the only choice for safety-critical systems.

The problem in event-driven buses is that events arriving at different nodes may cause them to contend for access to the bus, so some form of media access control (i.e., a distributed mutual exclusion algorithm) is needed to ensure that each node eventually is able to transmit without interruption. The important issues are how predictable the access achieved by each node is, and how strong the assurance is that the predictions remain true in the presence of faults. Buses such as Ethernet resolve contention probabilistically and therefore can provide only probabilistic guarantees of timely access, and no assurance at all in the presence of faults. Buses for embedded systems such as CAN [ISO93], LonWorks [Ech99], or Profibus (Process Field Bus) [Deu95] use various priority, preassigned slot, or token schemes to resolve contention deterministically. In CAN, for example, the message with the lowest number always wins the arbitration and may therefore have to wait only for the current message to finish (though that message may be retransmitted in the case of transmission failure), while other messages also have to wait for any lower-numbered messages. Thus, although contention is resolved deterministically, latency increases with load and can be bounded with only probabilistic guarantees, and these can be quite weak in the presence of faults that cause some nodes to make excessive demands, thereby reducing the service available to others.

Event-triggered buses for safety-critical applications add various mechanisms to limit such demands. ARINC 629 [ARI56] (an avionics data bus used in the Boeing 777), for example, uses a technique sometimes referred to as minislotting that requires each node to wait a certain period after sending a message before it can contend to send another. Even here, latency is a function of load, so the Byteflight protocol [Byt] developed by BMW extends this mechanism with guaranteed, preallocated slots for critical messages. At this point, however, we are close to a time-triggered bus, and if we were to add mechanisms to provide fault tolerance and to contain the effects of node failures, then we would arrive at a design similar to one of the time-triggered buses that are the focus of this comparison.
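To illustrate the deterministic-but-load-dependent latency just described, here is a small Python model of priority arbitration in the style of CAN (lowest message number wins). It is a sketch with invented message identifiers and durations, not CAN's actual bit-level arbitration.

```python
# Sketch of priority-based arbitration as in CAN: whenever the bus goes idle,
# the pending message with the lowest identifier wins. Illustrative model only.

import heapq

def simulate(pending, durations):
    """pending: list of (release_time, msg_id); durations: msg_id -> bus time.
    Returns msg_id -> completion time."""
    pending = sorted(pending)          # by release time
    ready, done, now, i = [], {}, 0, 0
    while i < len(pending) or ready:
        # admit everything released by 'now' into the contention set
        while i < len(pending) and pending[i][0] <= now:
            heapq.heappush(ready, pending[i][1]); i += 1
        if not ready:                  # bus idle until the next release
            now = pending[i][0]; continue
        winner = heapq.heappop(ready)  # lowest identifier wins arbitration
        now += durations[winner]       # bus is busy; losers keep waiting
        done[winner] = now
    return done

# A high-numbered (low-priority) message waits behind every lower-numbered one:
print(simulate([(0, 7), (0, 1), (1, 2)], {1: 4, 2: 4, 7: 4}))
# {1: 4, 2: 8, 7: 12}: message 7's latency grows with load
```

The low-priority message is delayed by every higher-priority release, so its latency grows with load; this is why such latency bounds are, in practice, only probabilistic.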
In a time-triggered bus, there is a static preallocation of communication bandwidth in the form of a global schedule: each node knows the schedule and knows the time, and therefore knows when it is allowed to send messages, and when it should expect to receive them. Thus, contention is resolved at design time (as the schedule is constructed), when all its consequences can be examined, rather than at runtime. Because all communication is time triggered by the global schedule, there is no need to attach source or destination addresses to messages sent over the bus: each node knows the sender and intended recipients of each message by virtue of the time at which it was sent. Elimination of the address fields not only reduces the size of each message, thereby greatly increasing the message bandwidth of the bus (messages are typically short in embedded control applications), but it also eliminates a potential source of serious faults: the possibility that a faulty node may send messages to the wrong recipients or, worse, may masquerade as a sender other than itself.

Time-triggered operation provides efficiency, determinism, and partitioning, but at the price of flexibility. To reduce this limitation, most time-triggered buses are able to switch among several schedules. The different schedules may be optimized for different missions, or phases of a mission (e.g., startup vs. cruise), for operating in a degraded mode (e.g., when some major function has failed), or for optional equipment (e.g., for cars with and without traction control). In addition, some make provision for event-triggered services, either piggybacked on time-triggered mechanisms, or by timesharing between time- and event-triggered operation. Flexibility of operation is considered in more detail in Section 2.7.

Figure 1.1 portrays a generic bus architecture: application programs run in the host computers, while the interconnect medium provides broadcast communications; interface devices connect the hosts to the interconnect. All components inside the dashed box are considered part of the bus. Realizations of the interconnect may be a physical bus, as shown in Figure 1.2, or a centralized hub, as shown in Figure 1.3. The interfaces may be physically proximate to the hosts, or they may form part of a more complex central hub, as shown in Figure 1.4.

[Figure 1.1: Generic Bus Configuration]
[Figure 1.2: Bus Interconnect]
[Figure 1.3: Star Interconnect]
[Figure 1.4: SPIDER Interconnect]

Safety-critical aerospace functions (those at Level A of DO-178B) are generally required to have failure rates less than 10^-9 per hour, and an architecture that is intended to support several such functions should provide assurance of failure rates better than 10^-10 per hour.[1] Consumer-grade electronics devices have failure rates many orders of magnitude worse than this, so redundancy and fault tolerance are essential elements of a bus architecture. Redundancy may include replication of the entire bus, of the interconnect and/or the interfaces, or decomposition of those elements into smaller subcomponents that are then replicated. These topics are considered in more detail in Section 2.1.

[1] An explanation for this figure can be derived by considering a fleet of 100 aircraft, each flying 3,000 hours per year over a lifetime of 33 years (thereby accumulating about 10^7 flight-hours). The requirement is that no fault should lead to loss of an aircraft in the lifetime of the fleet [FAA88]. If hazard analysis reveals ten potentially catastrophic failure conditions in each of ten systems, then the budget for each is about 10^-9 [LT82, page 37]. Similar calculations can be performed for cars: higher rates of loss are accepted, but there are vastly more of them. See the MISRA guidelines [MIS94]. Also note that failure includes malfunction as well as loss of function.

Fault tolerance takes two forms in these architectures: the first ensures that the bus itself does not fail; the second eases the construction of fault-tolerant applications. Each of these mechanisms must be constructed and validated against an explicit fault hypothesis, and must deliver specified services (which may be specified to degrade in acceptable ways in the presence of faults). The fault hypothesis must describe the kinds (or modes) of faults that are to be tolerated, and their maximum number and arrival rate. These topics are considered in more detail in Section 2.2.

Although the bus as a whole must not fail, it may be acceptable for service to some specified number of hosts to fail in some specified manner (typically fail silent, meaning no messages are transmitted to or from the host); also, some host computers may themselves fail. In these circumstances (when a host has failed, or when the bus is unable to provide service to a host), applications software in other hosts must tolerate the failure and continue to provide the function concerned. For example, three hosts might provide an autopilot function in a triple modular redundant (TMR) fashion, or two might operate in a master/shadow manner. To coordinate their fault-tolerant operation, redundant hosts may need to maintain identical copies of relevant state data, or may need to be notified when one of their members becomes unavailable. The bus can assist this coordination by providing application-independent services such as interactively consistent message reception and group membership. These topics are considered in more detail in Section 2.6.

The global schedule of a time-triggered system determines when each node (i.e., host and corresponding interface) can access the interconnect medium. A global schedule requires a global clock and, for the reasons noted above, this clock must have a reliability of about 10^-10. It might seem feasible to locate a single hardware clock somewhere in the bus, then distribute it to the interfaces, and to achieve the required reliability by replicating the whole bus. The difficulty with this approach is that the clocks will inevitably drift apart over time, to the point where the two buses will be working on different parts of the schedule. This would present an unacceptable interface to the hosts, so it is clear that the clocks need to be synchronized. Two clocks do not suffice for fault-tolerant clock synchronization (we cannot tell which is wrong): at least four are required for the most demanding fault models [LMS85] (although three may be enough in certain circumstances [Rus94]). Rather than synchronize multiple buses, each with a single clock, it is better to replicate clocks within a single bus. Now, the hosts are full computers, equipped with clocks, so it might seem that they could undertake the clock synchronization.
The difficulty with this approach is that bus bandwidth is dependent on the quality of clock synchronization (message frames must be separated by a gap at least as long as the maximum clock skew), and clock synchronization is, in turn, dependent on the accuracy with which participants can estimate the differences between their clocks, or (for a different class of algorithms) on how quickly the participants can respond to events initiated by another clock. Specialized hardware is needed to achieve either of these with adequate performance, and this rules out synchronization by the hosts. Instead, the clocks and their synchronization mechanisms are made part of the bus; some designs locate the clocks within the interconnect, while others locate them within the interfaces. The topic of clock synchronization is considered in more detail in Section 2.3.

A fault-tolerant global clock is a key mechanism for coordinating multiple components according to a global schedule. The next point to be considered is where the global schedule should be stored, and how the time trigger should operate. In principle, the schedule could be held in the host computers, which would then determine when to send messages to their interfaces for transmission: in this case, the interfaces would perform only low-level (physical layer) protocol services. However, effective bus bandwidth depends on the global schedule being tight, with little slack for protocol processing delays or interrupt latency. As with clock synchronization, this argues for hardware assistance in message timing and transmission. Hence, most of the buses considered here hold the schedule in the interface units, which then take on responsibility for most of the protocol services associated with the bus.

Although buses differ in respect to the fault hypotheses they consider, all those that place responsibility for scheduling in the interface units must consider the possibility that some of these may fail in a way that causes them to transmit on the interconnect at the wrong time, thereby excluding or damaging properly timed transmissions from other interfaces. The worst manifestation of this failure is the so-called babbling idiot failure, where a faulty interface transmits constantly, thereby compromising the operation of the entire bus. To control this failure, it is necessary to introduce another component, called a guardian, that restricts the ability of an interface to transmit on the interconnect. A guardian should fail independently of the interfaces, and have independent access to the schedule and to the global time.

There are many ways to implement the guardian functionality. For example, we could duplicate each interface and arrange it so that the second instance acts as a check on the primary one (essentially, they are wired in series). The problem with this approach is the cost of providing a second duplicate for each interface. A lower-cost alternative reduces the functionality of the guardian at the expense of making it somewhat dependent on the primary interface. A third alternative locates the guardian functionality in the interconnect (specifically, in the hub of a star configuration), where its cost can be amortized over many interfaces, albeit at the cost of introducing a single point of failure (which is overcome by duplicating the entire hub). These topics are considered in more detail in Section 2.4.
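The interplay of global schedule, time-triggered sending, and guardian windows can be sketched in a few lines. Everything here is illustrative (the slot length, schedule contents, and node names are invented); a real guardian enforces the window electrically, using its own oscillator and its own copy of the schedule.

```python
# Illustrative sketch: a static TDMA schedule and an independent guardian that
# opens the transmit window only during the sending node's slot, so a babbling
# interface cannot disturb other nodes' slots. All values are hypothetical.

SLOT_LEN_US = 250                       # assumed slot length, in microseconds
SCHEDULE = ["brake_fl", "brake_fr", "engine", "autopilot"]  # one round

def slot_owner(now_us):
    """Who may transmit at global time now_us under the static schedule."""
    slot = (now_us // SLOT_LEN_US) % len(SCHEDULE)
    return SCHEDULE[slot]

def guardian_enable(node, now_us):
    """Guardian for 'node': pass traffic to the interconnect only in its slot.
    The guardian has its own copy of SCHEDULE and its own view of global time,
    so a fault in the node's interface cannot open the window at other times."""
    return slot_owner(now_us) == node

# A babbling "engine" interface that tries to transmit constantly is cut off
# everywhere outside its own slot:
for t in range(0, 1000, 250):
    print(t, slot_owner(t), "engine may transmit:", guardian_enable("engine", t))
```

The same schedule lookup is what lets receivers identify a message's sender purely from its time of transmission, with no address fields.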
It is nontrivial to start up a bus architecture that provides sophisticated services, especially if this must be performed in the presence of faults. Furthermore, it may be necessary to restart the system if faults outside its hypothesis cause it to fail: this must be done very quickly (within about 10 ms) or the overall system may go out of control. And it is necessary to allow individual hosts or interfaces that detect faults in their own operation to drop off the bus and later rejoin when they are restored to health. The topics of startup, restart, and rejoin are considered in Section 2.5.

Any bus architecture that is intended to support safety-critical applications, with its attendant requirement for a failure rate below 10^-10, must come with strong assurance that it is fit for the purpose. Assurance will include massive testing and fault injection of the actual implementation, and extensive reviews and analysis of its design and assumptions. Some of the analysis may employ formal methods, supported by mechanized tools such as model checkers and theorem provers [Rus95]. Industry or government guidelines for certification may apply in certain fields (e.g., [RTC92, RTC00] for airborne software and hardware, respectively, and [MIS94] for cars). These topics are considered in more detail in Section 2.8.

The comparison between the four bus architectures is described in the next chapter; conclusions are presented in Chapter 3.

Chapter 2  Comparison

We begin with a brief description of the topology and operation of each of the four bus architectures, and then consider each of them with respect to the issues introduced in the previous chapter. Within each section, the buses are considered in the order of their date of first publication: SAFEbus, TTA, SPIDER, FlexRay. Certain paragraphs are labeled by keywords (shown here in brackets) that are intended to aid navigation.

2.1 The Four Buses

We describe the general characteristics of each of the bus architectures considered.

2.1.1 SAFEbus

SAFEbus(TM) was developed by Honeywell (the principal designers are Kevin Driscoll and Ken Hoyme [HDHR91, HD92, HD93]) to serve as the core of the Boeing 777 Airplane Information Management System (AIMS) [SD95], which supports several critical functions, such as cockpit displays and airplane data gateways. The bus has been standardized as ARINC 659 [ARI93], and variations on Honeywell's implementation are being used or considered for other avionics and space applications.

SAFEbus uses a bus interconnect topology similar to that shown in Figure 1.2; the interfaces (called Bus Interface Units, or BIUs) are duplicated, and the interconnect bus is quad-redundant; in addition, the whole AIMS is duplicated. Most of the functionality of SAFEbus is implemented in the BIUs, which perform clock synchronization and message scheduling and transmission functions. Each BIU acts as its partner's bus guardian by controlling its access to the interconnect. Each BIU of a pair drives a different pair of interconnect buses but is able to read all four; the interconnect buses themselves each comprise two data lines and one clock line. The bus lines and their drivers have the electrical characteristics of OR gates (i.e., if several different BIUs drive the same line at the same time, the resulting signal is the OR of the separate inputs). Some of the protocols exploit this property.
Of the architectures considered here, SAFEbus is the most mature (it has been keeping Boeing 777s in the air for nearly a decade), but it is also the most expensive: the BIUs provide rich functionality and are fully duplicated at each node.

2.1.2 TTA

The Time-Triggered Architecture (TTA) was developed by Hermann Kopetz and colleagues at the Technical University of Vienna [KG93, KG94]. Commercial development of the architecture is undertaken by TTTech, and it is being deployed for safety-critical applications in cars by Audi and Volkswagen, and for flight-critical functions in aircraft and aircraft engines by Honeywell. Current implementations of TTA use a bus interconnect topology similar to that shown in Figure 1.2; I will refer to this version as TTA-bus. The next generation of TTA implementations will use a star interconnect topology similar to that shown in Figure 1.3; I will refer to this version as TTA-star. The interfaces are essentially the same in both designs; they are called controllers and implement the TTP/C protocol [TTT99] that is at the heart of TTA, providing clock synchronization, and message sequencing and transmission functions. The interconnect is duplicated and each controller drives both copies. In TTA-bus, each controller drives the buses through a bus guardian; in TTA-star, the guardian functionality is implemented in the central hub. TTA-star can also be arranged in distributed configurations in which subsystems are connected by hub-to-hub links.

Of the architectures considered here, TTA is unique in being used for both automobile applications, where volume manufacture leads to very low prices, and aircraft, where a mature tradition of design and certification for flight-critical electronics provides strong scrutiny of arguments for safety.

2.1.3 SPIDER

A Scalable Processor-Independent Design for Electromagnetic Resilience (SPIDER) is being developed by Paul Miner and colleagues at the NASA Langley Research Center as a research platform to explore recovery strategies for faults induced by high-intensity radiated fields and electromagnetic interference (HIRF/EMI), and to serve as a case study to exercise the recent design assurance guidelines for airborne electronic hardware (DO-254) [RTC00].

The SPIDER interconnect is composed of active elements called Redundancy Management Units, or RMUs. Its topology can be organized either as shown in Figure 1.4, where the RMUs and the interfaces (the BIUs) form part of a centralized hub, or as in Figure 1.3, where the RMUs form the hub, or similar to Figure 1.1, where the RMUs provide a distributed interconnect. The lines connecting hosts to their interfaces are optical fiber, and the whole system beyond the hosts (i.e., the optical fibers and the RMUs and BIUs) is called the Reliable Optical Bus (ROBUS). Clock synchronization and other services of SPIDER are achieved by distributed algorithms executed among the BIUs and RMUs [Min00]. The scheduling aspects of SPIDER are not well documented as yet, but the bus guardian functionality is handled in the RMUs following an approach due to Palumbo [Pal96].

SPIDER is an interesting design that uses a different topology and a different class of algorithms from the other buses considered here. However, its design and implementation are still in progress, and so it is omitted from some comparisons. I hope to increase coverage of SPIDER as more details become available.
2.1.4 FlexRay

FlexRay, which is being developed by a consortium including BMW, DaimlerChrysler, Motorola, and Philips, is intended for powertrain and chassis control in cars. It differs from the other buses considered here in that its operation is divided between time-triggered and event-triggered activities. FlexRay can use either an active star topology similar to that shown in Figure 1.3, or a passive bus topology similar to that shown in Figure 1.2. In both cases, duplication of the interconnect is optional. Each interface (called a communication controller) drives the lines to its interconnects through separate bus guardians located with the interface. As with TTA-star, FlexRay can also be deployed in distributed configurations in which subsystems are connected by hub-to-hub links. Published descriptions of the FlexRay protocols and implementation are sketchy at present [B+01] (see also the Web site www.flexray-group.com).

FlexRay is interesting because of its mixture of time- and event-triggered operation, and potentially important because of the industrial clout of its developers. However, full details of its design are not available to the general public, so comparisons are based on the informal descriptions that have been published.

2.2 Fault Hypothesis and Fault Containment Units

Any fault-tolerant system must be designed and evaluated against an explicit fault hypothesis that describes the number, type, and arrival rate of the faults it is intended to tolerate. The fault hypothesis must also identify the different fault containment units (FCUs) in the design: these are the components that can independently be afflicted by faults. The division of an architecture into separate FCUs needs careful justification: there must be no propagation of faults from one FCU to another, and no common mode failures in which a single physical event produces faults in multiple FCUs. We consider only physical faults (those caused by damage to, defects in, or aging of the devices employed, or by external disturbances such as cosmic rays and electromagnetic interference): design faults must be excluded, and must be shown to be excluded by stringent assurance and certification processes.

It is a key assumption of all reliability calculations that failures of separate FCUs are statistically independent. Knowing the failure rate of each FCU, we can then use Markov or other stochastic modeling techniques [But92] to calculate the reliability of the overall architecture. Of course, these calculations depend on claimed properties of the architecture (e.g., "this architecture can tolerate failures of any two FCUs"), and design assurance methods (e.g., formal verification) must be employed to justify these claims. The division of labor is that design assurance must justify a theorem of the form "enough nonfaulty components implies correct operation", and stochastic analysis must justify the antecedent to this theorem.

The assumption that failures of separate FCUs are independent must be ensured by careful design and assured by stringent analysis. True independence generally requires that different FCUs are served by different power supplies, and are physically and electrically isolated from each other. Providing this level of independence is expensive, and it is generally undertaken only in aircraft applications.
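The shape of the reliability calculation that independence enables can be sketched simply. Real analyses use Markov or other stochastic models [But92]; the following is a much simpler combinatorial stand-in, and all rates and redundancy levels are invented for illustration.

```python
# Sketch of a reliability calculation under the key assumption that FCU
# failures are statistically independent. Numbers are illustrative only,
# not drawn from any of the four buses.

from math import comb, exp

def p_fail(rate_per_hour, hours):
    """Probability an FCU has failed by 'hours' (exponential failure model)."""
    return 1.0 - exp(-rate_per_hour * hours)

def p_system_fails(n, k, p):
    """Architecture claim: 'correct while at least k of n FCUs are nonfaulty'.
    Returns the probability that more than n-k FCUs fail independently."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n - k + 1, n + 1))

p = p_fail(1e-5, 10)            # each FCU: 10^-5/hour over a 10-hour mission
print(p_system_fails(4, 3, p))  # 4 FCUs, need 3 nonfaulty: roughly 6e-8
print(p_system_fails(4, 2, p))  # tolerate two failed FCUs: roughly 4e-12
```

Design assurance justifies the "at least k of n nonfaulty implies correct operation" claim; the calculation above (or its Markov counterpart) justifies its antecedent. If failures are correlated, the independence assumption, and hence the whole calculation, collapses.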
In cars, it is common to make some small compromises on independence: for example, the guardians may be fabricated on the same chip as the interface (but with their own clock oscillators), or the interface may be fabricated on the same chip as the host processor. It is necessary to examine these compromises carefully to ensure that the loss in independence applies only to fault modes that are benign, extremely rare, or tolerated by other mechanisms.

The fault mode mentioned above is one aspect of a fault hypothesis; the others are the total number of faults and their rate of arrival. A fault mode describes the kind of behavior that a faulty FCU may exhibit. The same fault may exhibit different modes at different levels of a protocol hierarchy: for example, at the electrical level, the fault mode of a faulty line driver may be that it sends an intermediate voltage (one that is neither a digital 0 nor a digital 1), while at the message level the mode of the same fault may be Byzantine, meaning that different receivers interpret the same message in different ways (because some see the intermediate voltage as a 0, and others as a 1). Some protocols can tolerate Byzantine faults, others cannot; for those that cannot, we must show that the fault mode is controlled at the underlying electrical level.

[Spatial proximity faults] The basic dimensions that a fault can affect are value, time, and space. A value fault is one that causes an incorrect value to be computed, transmitted, or received (whether as a physical voltage, a logical message, or some other representation); a timing fault is one that causes a value to be computed, transmitted, or received at the wrong time (whether too early, too late, or not at all); a spatial proximity fault is one where all matter in some specified volume is destroyed (potentially afflicting multiple FCUs). Bus-based interconnects of the kind shown in Figure 1.2 are vulnerable to spatial proximity faults: all redundant buses necessarily come into close proximity at each node, and general destruction in that space could sever or disrupt them all. Interconnect topologies with a central hub are far more resilient in this regard: a spatial proximity fault that destroys one or more nodes does not disrupt communication among the others (the hub may need to isolate the lines to the destroyed nodes in case these are shorted), and destruction of a hub can be tolerated if there is a duplicate in another location.

[Hybrid fault model] There are many ways to classify the effects of faults in any of the basic dimensions. One classification that has proved particularly effective in analysis of the types of algorithms that underlie the architectures considered here is the hybrid fault model of Thambidurai and Park [TP88]. In this classification, the effect of a fault may be manifest, meaning that it is reliably detected (e.g., a fault that causes an FCU to cease transmitting messages); symmetric, meaning that whatever the effect, it is the same for all observers (e.g., an off-by-one error); or arbitrary, meaning that it is entirely unconstrained. In particular, an arbitrary fault may be asymmetric or Byzantine, meaning that its effect is perceived differently by different observers (as in the intermediate voltage example).
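The intermediate-voltage example can be made concrete in a few lines: the same marginal signal, digitized against slightly different receiver thresholds, yields different bits at different receivers. The voltages and thresholds below are invented for illustration.

```python
# Sketch of how a marginal electrical signal becomes Byzantine one protocol
# level up: each receiver digitizes the same line voltage against its own
# switching threshold. All values are hypothetical.

def received_bit(voltage, threshold):
    """Each receiver thresholds the analog line independently."""
    return 1 if voltage >= threshold else 0

line_voltage = 1.6        # faulty driver: neither a solid 0 (~0 V) nor 1 (~3.3 V)
receivers = {"A": 1.4, "B": 1.7, "C": 1.5}   # per-receiver thresholds

views = {r: received_bit(line_voltage, th) for r, th in receivers.items()}
print(views)   # {'A': 1, 'B': 0, 'C': 1}: an asymmetric (Byzantine) perception
```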
[Arbitrary faults] The great advantage of designs that can tolerate arbitrary fault modes is that we do not have to justify assumptions about specific fault modes: a system is shown to tolerate (say) two arbitrary faults by proving that it works in the presence of two faulty FCUs with no assumptions whatsoever on the behavior of the faulty components. A system that can tolerate only specific fault modes may fail if confronted by a different fault mode, so it is necessary to provide assurance that such modes cannot occur. It is this absence of assumptions that is so attractive, in safety-critical contexts, about systems that can tolerate arbitrary faults. This point is often misunderstood, and such systems are often derided as being focused on asymmetric or Byzantine faults that "never arise in practice". Byzantine faults are just one manifestation of arbitrary behavior, and they certainly cannot be asserted not to occur (in fact, they have been observed in several systems that have been monitored sufficiently closely).

[SOS faults] One situation that is likely to provoke asymmetric manifestations is a slightly-out-of-specification (SOS) fault, such as the intermediate electrical voltage mentioned earlier. SOS faults in the timing dimension include those that put a signal edge very close to a clock edge, or that have signals with very slow rise and fall times (i.e., weak edges). Depending on the timing of their own clock edges, some receivers may recognize and latch such a signal, others may not, resulting in asymmetric or Byzantine behavior.

[Active/passive faults] FCUs may be active (e.g., a processor) or passive (e.g., a bus); while an arbitrary-faulty active component can do anything, a passive component may change, lose, or delay data, but it cannot spontaneously create a new datum. Keyed checksums or cryptographic signatures can sometimes be used to reduce the fault modes of an active FCU to those of a passive one. (An arbitrary-faulty active FCU can always create its own messages, but it cannot create messages purporting to come from another FCU if it does not know the key of that FCU; signatures need to be managed carefully for this reduction in fault mode to be credible [GLR95].)

[Maximum number of faults] Any fault-tolerant architecture will fail if subjected to too many faults; generally speaking, it requires more redundancy to tolerate an arbitrary fault than a symmetric one, which in turn requires more redundancy than a manifest fault. The most effective fault-tolerant algorithms make this tradeoff between the number and the difficulty of the faults tolerated automatically. For example, the clock synchronization algorithm of [Rus94] can tolerate a arbitrary faults, s symmetric faults, and m manifest ones simultaneously, provided n, the number of FCUs, satisfies n > 3a + 2s + m. It is provably impossible (i.e., it can be proven that no algorithm can exist) to tolerate a arbitrary faults in clock synchronization with fewer than 3a + 1 FCUs and 2a + 1 disjoint communication paths (or a + 1 disjoint broadcast channels) [DHS86, FLM86] (unless digital signatures are employed, which is equivalent to reducing the severity of the arbitrary fault mode).
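As a small worked illustration of the redundancy condition just quoted (n > 3a + 2s + m), the following sketch checks a few combinations; it simply restates the inequality, and is not an implementation of the synchronization algorithm itself.

```python
# Checker for the hybrid-fault redundancy condition n > 3a + 2s + m
# (Thambidurai and Park's model, as used by the algorithm of [Rus94]).

def tolerates(n, a, s, m):
    """Can n FCUs tolerate a arbitrary, s symmetric, m manifest faults at once?"""
    return n > 3 * a + 2 * s + m

print(tolerates(4, 1, 0, 0))   # True: 4 FCUs mask one arbitrary fault
print(tolerates(4, 0, 1, 1))   # True: or one symmetric plus one manifest fault
print(tolerates(6, 2, 0, 0))   # False: two arbitrary faults need at least 7
```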
Synchronization is approximate (i.e., the clocks of different FCUs need to be close together, not exactly the same); those problems that require exact agreement (e.g., group membership, consensus, diagnosis) cannot be solved in the presence of a arbitrary faults unless there are at least 3a + 1 FCUs, 2a + 1 disjoint communication paths (or a + 1 disjoint broadcast channels) between them, and a + 1 levels (or rounds) of communication [Lyn96]. The number of FCUs and the number of disjoint paths required, but not the number of rounds, can be reduced by using digital signatures.

[Self-checking] Because it is algorithmically much easier to tolerate simple failure modes, some architectures (e.g., SAFEbus) arrange FCUs (the BIUs in the case of SAFEbus) in self-checking pairs: if the members of a pair disagree, they go offline, ensuring that the effect of their failure is seen as a manifest fault (i.e., one that is easily tolerated). The controllers and bus guardians in TTA-bus operate in a similar way. Most architectures also employ substantial self-checking in each FCU; any FCU that detects a fault will shut down, thereby ensuring that its failure will be manifest. (This kind of operation is often called fail silence.)

[Fail silence] Even with extensive self-checking and pairwise checking, it may be possible for some fault modes to escape, so it is generally necessary either to show that the mechanisms used have complete coverage (i.e., there will be no violation of fail silence), or to design the architecture so that it can tolerate the escape of at least one arbitrary fault.

[Fault arrival rate] Many architectures can tolerate only a single fault at a time, but can reconfigure to exclude faulty FCUs and are then able to tolerate additional faults. In such cases, the fault arrival rate is important: faults must not arrive faster than the architecture can reconfigure. The architectures considered here operate according to static schedules, which consist of rounds or frames that are executed repeatedly in a cyclic fashion. The acceptable fault arrival rate is often then expressed in terms of faults per round (or its inverse). It is usually important that every node is scheduled to make at least one broadcast in every round, since this is how fault status is indicated (and hence how reconfiguration is triggered).

[Reconfiguration] Algorithms to identify and exclude a faulty component are based on distributed diagnosis and group membership. It is provably impossible to correctly identify an arbitrary-faulty FCU in some circumstances [WLS97], so there is tension between leaving a faulty FCU in the system and the risk of excluding a nonfaulty one (and still leaving the faulty one in operation). Most faults are transient, meaning they correct themselves given a little time, so there is also tension between the desire to exclude faulty FCUs quickly and the hope that they may correct themselves if left in operation [Rus96]. A transient fault may contaminate state data in a way that leaves a permanent residue after the original fault has cleared, so mechanisms are needed to purge such effects [Rus93a]. An excluded FCU may perform a restart and self-check. If successful, it may then apply to rejoin the system. This is a delicate operation for most architectures, because one FCU may be going faulty at the same time as another (nonfaulty) one is rejoining: this presents two simultaneous changes in the behavior of the system, and may cause algorithms tolerant of only a single fault to fail.
[Never give up] Historical experience and analysis must be used to show that the hypothesized fault modes, numbers, and arrival rate are realistic, and that the architecture can indeed operate correctly under those hypotheses for its intended mission time. But sometimes things go wrong: the system may experience many simultaneous faults (e.g., from unanticipated HIRF), or other violations of its fault hypothesis. We cannot guarantee correct operation in such cases (otherwise our fault hypothesis was too conservative), but safety-critical systems generally are constructed to a "never give up" philosophy and will attempt to continue operation in a degraded mode. Although it is difficult to provide assurance of correct operation during these events (otherwise we could revise the fault hypothesis), it may be possible to provide assurance that the system returns to normal operation once the faults cease (assuming they were transient), using the ideas of self-stabilization [Sch93]. The usual method of operation in "never give up" mode is that each node reverts to local control of its own actuators using the best information available (e.g., each brake node applies braking force proportional to pedal pressure if it is still receiving that input, and removes all braking force if not), while at the same time attempting to regain coordination with its peers.

2.2.1 SAFEbus

The FCUs of SAFEbus are the BIUs (two per node), the hosts, and the buses (there are two, each of which is a self-checking pair). In addition, two copies of the entire system are located in separate cabinets in different parts of the aircraft. ARINC 659 shows a single host attached to each pair of BIUs [ARI93, Attachment 21], but in the Honeywell implementation these are also paired: each host (called a Core Processing Module, or CPM) is attached to a single BIU, and forms a separate FCU. The fault hypothesis of SAFEbus is the following.

Fault modes:
- Arbitrary active faults in BIUs and CPMs
- Arbitrary passive faults in the buses
- Spatial proximity faults that may take out an entire cabinet

Maximum faults: SAFEbus adopts a single-fault hypothesis: at most one component of any pair may fail. In more detail, the fault hypothesis of SAFEbus allows the following numbers of faults:
- At most one of the BIUs in any node (the entire node is then considered faulty)
- At most one of the CPMs in a node (the entire node is then considered faulty)
- At most one fault in either of the two buses
- Any number of nodes may fail, but for an application to be functional, at least one node that supports it must be nonfaulty

Fault arrival rate: able to tolerate any rate.

2.2.2 TTA

The FCUs of TTA depend on how the system is fabricated. It is anticipated that in high-volume applications, TTP controllers will be integrated on the same chip as the host, so these should be considered to belong to a single FCU. Current implementations of TTA-bus have the controller separate from the host, but the bus guardians are on the same chip as the controller. Guardians do, however, have their own clock oscillators, so they can be considered a separate FCU for time-partitioning purposes. TTA-star moves the guardians to the central hub, where they definitely form a separate FCU from the controllers. TTA-bus has two bus lines, which are separate FCUs; in TTA-star, the interconnect functionality is provided by the central hubs (so the bus guardian and interconnect are in the same FCU), which are duplicated. The fault hypothesis of TTA is the following.
Fault modes:
- Arbitrary active faults in controllers and the hub of TTA-star
- Arbitrary passive faults in the guardians and buses of TTA-bus
- Spatial proximity faults that may take out nodes and a hub in TTA-star; spatial proximity faults are not part of the fault hypothesis for TTA-bus

Maximum faults: TTA adopts a single-fault hypothesis. In more detail, the fault hypothesis of TTA assumes the following numbers of faults:
- For TTA-bus: in each node, either the controller or the bus guardian may fail (but not both). One of the buses may fail. To retain single-fault tolerance, at least four controllers and their bus guardians must be nonfaulty, and both buses must be nonfaulty. Provided at least one bus is nonfaulty, the system may be able to continue operation with fewer nonfaulty components.
- For TTA-star: to retain single-fault tolerance, at least four controllers and both hubs must be nonfaulty. Provided at least one hub is nonfaulty, the system may be able to continue operation with fewer nonfaulty components.

Fault arrival rate: at most one fault every two rounds.

2.2.3 SPIDER

The FCUs of SPIDER are the hosts, the BIUs, and the RMUs. The BIUs and RMUs may be distributed, or contained in the hub of the ROBUS. The fault hypothesis of SPIDER is the following.

Fault modes:
- Arbitrary active faults in any FCU
- Spatial proximity faults that may take out nodes and RMUs (depending on the physical topology employed)

Maximum faults: if there are n ≥ 3 BIUs and m ≥ 3 RMUs, then SPIDER can tolerate an arbitrary fault in any one of these FCUs. SPIDER's maximum-faults hypothesis is actually specified with respect to the hybrid fault model, and its constraints are n > 2b_a + 2b_s + b_m, m > 2r_a + 2r_s + r_m, and b_a + r_a ≤ 1, where b_a, b_s, and b_m are the numbers of arbitrary-, symmetric-, and manifest-faulty BIUs, and r_a, r_s, and r_m are the corresponding numbers for the RMUs.

Fault arrival rate: within the constraints presented above, SPIDER is able to tolerate multiple simultaneous faults. SPIDER's reconfiguration mechanisms are not documented at present (although its fault diagnosis is based on algorithms similar to those of [WLS97]). The fault arrival rate hypothesis is a function of the amount of redundancy in the ROBUS, and can be adjusted within certain parameters by employing different numbers of BIUs and RMUs.

2.2.4 FlexRay

Published diagrams of FlexRay indicate that a node consisting of a microcontroller host, a communication controller, and two bus guardians will be fabricated on a single chip. It appears that all four components will use separate clock oscillators, so that the controller and guardians can be considered as separate FCUs for time-partitioning purposes. The interconnects, whether passive buses or active stars, are separate FCUs. The fault hypothesis of FlexRay is not stated explicitly; the following are inferences based on available documents.

Fault modes:
- Asymmetric (and presumably, therefore, also arbitrary) faults in controllers, for the purposes of clock synchronization
- Fault modes for other services and components are not described
- Spatial proximity faults may take out nodes and an entire hub

Maximum faults: it appears that a single-fault hypothesis is intended: in each node, at most one bus guardian, or the controller, may be faulty. At most one of the interconnects may be faulty. For clock synchronization, fewer than a third of the nodes may be faulty.

Fault arrival rate: the fault arrival rate hypothesis is not described.
2.3 Clock Synchronization

Fault-tolerant clock synchronization is a fundamental requirement for a time-triggered bus architecture. The tightness of the bus schedule, and hence the throughput of the bus, is strongly related to the quality of global clock synchronization that can be achieved, and this is related to the quality of the clock oscillators local to each node, and to the algorithm used to synchronize them.

There are two basic classes of algorithm for clock synchronization: those based on averaging and those based on events. Averaging works by each node measuring the skew between its clock and that of every other node, and then setting its clock to some average value. A simple average (e.g., the mean or median) over all clocks may be affected by wild readings from faulty clocks (which, under an arbitrary fault model, may provide different readings to different observers), so we need a fault-tolerant average that is largely insensitive to a certain number of readings from faulty clocks. Event-based algorithms rely on nodes being able to sense events directly on the interconnect: each node broadcasts a ready event when it is time to synchronize, and sets its clock when it has seen a certain number of events from other nodes. Depending on the fault model, additional waves of echo or accept events may be needed to make this fault tolerant.

Schneider [Sch87] gives a general description that applies to all averaging clock synchronization algorithms; these algorithms differ only in their choice of fault-tolerant average. The Welch-Lynch algorithm [WL88] is a popular choice that is characterized by use of the fault-tolerant midpoint as its averaging function. Assume n clocks, and let the maximum number of simultaneous faults to be tolerated be t (3t < n); the fault-tolerant midpoint is the average of the (t+1)st and (n-t)th clock readings, when these are arranged in order from smallest to largest. If there are at most t faulty clocks, then some reading from a nonfaulty clock must be at least as small as the (t+1)st reading, and the reading from another nonfaulty clock must be at least as great as the (n-t)th; hence, the average of these two readings should be close to the middle of the spread of readings from good clocks. The most important event-based algorithm is that of Srikanth and Toueg [ST87]; it is attractive because it achieves optimal accuracy. Both averaging and event-based algorithms require at least 3a + 1 nodes to tolerate a arbitrary faults.

2.3.1 SAFEbus

The SAFEbus bus is quad-redundant (a pair of self-checking pairs), and each of its four components comprises two data lines and a separate clock line. SAFEbus uses the clock lines for an event-triggered clock synchronization algorithm. The schedule loaded in each interface (BIU in SAFEbus terminology) indicates when a synchronization event should be performed, and these must be sufficiently frequent to maintain the paired BIUs of each node within two bit-times of each other. In a clock synchronization event, each BIU asserts the clock lines of the two buses that it can write for four bit-times. The electrical characteristics of the SAFEbus cause it to act as an OR gate with the BIUs as its inputs. Thus, the near-simultaneous assertion of each clock line by multiple BIUs generates a pulse on each line that is the OR of its individual pulses. Each BIU synchronizes to the trailing edge of this composite pulse.
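Here is a sketch of the composite-pulse idea, with invented assert intervals measured in bit-times: because the line ORs its drivers, the trailing edge of the composite pulse is the latest deassertion. The second example shows how a BIU that holds the line too long would delay the synchronization event, the failure discussed next.

```python
# Sketch of the SAFEbus composite synchronization pulse: the bus line acts as
# an OR gate, so the composite pulse ends when the LAST BIU stops asserting.
# Assert intervals below are illustrative, in bit-times.

def trailing_edge(assert_intervals):
    """Each BIU asserts the line over (start, end); the OR of the pulses ends
    at the latest end time, which is the event every BIU synchronizes to."""
    return max(end for _, end in assert_intervals)

# Three well-behaved BIUs, slightly skewed, each asserting for four bit-times:
print(trailing_edge([(0.0, 4.0), (0.3, 4.3), (0.1, 4.1)]))   # 4.3

# A faulty BIU that holds the line too long delays the trailing edge on the
# buses it drives (until its partner or the receivers cut it off):
print(trailing_edge([(0.0, 4.0), (0.3, 4.3), (0.0, 9.0)]))   # 9.0
```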
A faulty BIU could attempt to assert its clock lines for far longer than the specified four bit-times, thereby delaying the trailing edge that is the global synchronization event. The guardian function of its partner BIU will cut it off once the transmit window closes, and all receiving BIUs will count it out after some number of bit-times greater than four, but the synchronization event will still be delayed. However, this fault affects only the two buses driven by the faulty BIU. Each BIU reads all four buses (although it can write only two of them), detects the trailing edge of the composite synchronization pulse on each of them, and then combines these in a fault-tolerant manner [ARI93, Attachment 410] to yield the event to which it actually synchronizes. SAFEbus also applies several other error detection and masking techniques to minimize the impact of faulty clocks, BIUs, and buses: for example, pulses are ignored from buses that have not changed state since the previous synchronization (to overcome stuck-at faults or failed power supplies).

There are several variants of the clock synchronization performed by SAFEbus: a Short Resync operates essentially as described above; a Long Resync is similar but provides additional information on the data lines to allow an unsynchronized BIU to rejoin the system; an Initial Sync is used at startup or following a disruption that requires a restart.

2.3.2 TTA

The TTA algorithm is basically the Welch-Lynch algorithm specialized for t = 1 (i.e., it tolerates a single fault): that is, clocks are set to the average of the 2nd and (n-1)st clock readings (i.e., the second-smallest and second-largest). This algorithm works and tolerates a single arbitrary fault whenever n ≥ 4. TTA does not use dedicated wires or signaling to communicate clock readings among the nodes attached to the network; instead, it exploits the fact that communication is time triggered by a global schedule. When a node x receives a message from a node y, it notes the reading of its local clock and subtracts a fixed correction term to account for the network delay; the difference between this adjusted clock reading and the time for y's transmission that is indicated in the global schedule yields x's perception of the difference between clocks x and y. Not all TTP nodes have accurate clock oscillators (because these are expensive); those that do have the SYF field set in the Message Descriptor List (MEDL, the global schedule known to all nodes), and the clocks used for synchronization are selected from those that have the SYF flag set.

For scalability, an implementation of the Welch-Lynch algorithm should use data structures that are independent of the value of n; that is, it should not be necessary for each node to store the clock difference readings for all n clocks. Clearly, the (t+1)st smallest clock difference reading can be determined with just t+1 registers, and the (t+1)st largest can be determined similarly, for a total of 2(t+1) registers per node. In TTA, with t = 1, this requires four registers. TTA does indeed use four registers, but not in quite this way. Each node maintains a queue of four clock-difference readings; whenever a message is received from a node that is in the current membership and that has the SYF field set, the clock difference reading is pushed on to the receiving node's queue (ejecting the oldest reading in the queue).
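This queue-and-midpoint scheme can be sketched in C as follows; the sketch is illustrative only (the actual TTP/C controller realizes it at the register level, and the units and data types here are assumptions):

    #include <string.h>

    #define QLEN 4  /* TTA keeps the four most recent readings */

    /* Clock-difference readings, in local clock ticks. */
    struct sync_queue { int diff[QLEN]; };

    /* Push the reading observed for a sender that is in the current
     * membership and has the SYF flag set; the oldest reading is ejected. */
    void push_reading(struct sync_queue *q, int reading)
    {
        memmove(&q->diff[0], &q->diff[1], (QLEN - 1) * sizeof q->diff[0]);
        q->diff[QLEN - 1] = reading;
    }

    /* Fault-tolerant midpoint for t = 1 over four readings: discard the
     * smallest and largest, and average the two that remain (the second-
     * smallest and second-largest). */
    int correction_term(const struct sync_queue *q)
    {
        int s[QLEN];
        memcpy(s, q->diff, sizeof s);
        for (int i = 1; i < QLEN; i++)          /* insertion sort */
            for (int j = i; j > 0 && s[j - 1] > s[j]; j--) {
                int tmp = s[j - 1]; s[j - 1] = s[j]; s[j] = tmp;
            }
        return (s[1] + s[2]) / 2;   /* average of 2nd and (n-1)st */
    }

Discarding the extreme readings ensures that a single arbitrary-faulty clock, however wild its reading, cannot displace the computed correction outside the spread of the good clocks.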
When the current slot has the synchronization field (CS) set in the MEDL, each node runs the synchronization algorithm using the four clock readings stored in its queue. This algorithm is able to tolerate a single arbitrary fault among the four clocks used in synchronization, but TTA is able to tolerate more than a single fault by reconfiguring to exclude nodes with faulty clocks. This is accomplished by the group membership service: any node with a clock that skews significantly from the global time will mistime its broadcast so that it occurs partially outside its assigned slot. This will cause its message to be truncated by its guardian, which will cause it to fail checksum and to be rejected by all nonfaulty nodes. The membership algorithm will then exclude this node. Only the clocks of current group members are eligible for use in synchronization, so the clock of the excluded node will not be placed in the clock-difference queue, and that of some other node having the SYF flag will be used instead.

2.3.3 SPIDER

SPIDER uses an event-based algorithm similar to that of Srikanth and Toueg, and also influenced by Davies and Wakerly [DW78] (this remarkably prescient paper anticipated many of the issues and solutions in Byzantine fault tolerance by several years) and Palumbo [Pal96]. The basic design of SPIDER is similar to that of the Draper FTP [Lal86] in that its active components are divided into two classes (BIUs and RMUs) that play slightly different roles in each of its algorithms. In SPIDER, the RMUs play the part of the interstages in FTP. Its clock synchronization algorithm operates in three phases as follows.

1. Each RMU broadcasts a ready event to all BIUs when its own clock reaches a specified value.
2. Each BIU broadcasts an accept event to all RMUs as soon as it has received events from t + 1 RMUs (where t is the number of faults to be tolerated).
3. Each RMU resets its clock as soon as it has received events from t + 1 BIUs.

This synchronizes the RMUs; the BIUs can be synchronized by one more wave of events from the RMUs.

2.3.4 FlexRay

FlexRay uses the standard Welch-Lynch algorithm, with clock differences presumably determined in a manner similar to TTA. However, FlexRay has no membership service and no mechanisms for detecting faulty nodes, nor for reconfiguring to exclude them. To tolerate two arbitrary faults (as claimed in published descriptions), FlexRay must therefore employ at least seven nodes, with five disjoint communication paths or three broadcast channels (3t + 1, 2t + 1, and t + 1, respectively, for t = 2), whereas TTA can do this with five nodes (providing the faults arrive sequentially), and with seven nodes it can tolerate four sequential faults.

2.4 Bus Guardians

Some kind of bus guardian functionality is necessary to prevent faulty nodes usurping the scheduled time slots of other nodes, or even, in the case of the babbling idiot fault mode, destroying all legitimate communication. Bus guardianship depends on message transmission by an interface to an interconnect being mediated by a separate FCU that has an independent copy of the schedule, and independent knowledge of the global time. Such a fully independent guardian is likely to be expensive, however, and equivalent almost to a second interface (as it is in SAFEbus). Most architectures, therefore, seek to reduce the cost of bus guardianship; they do so in different ways, and incur different penalties.

2.4.1 SAFEbus

SAFEbus makes no compromises: its BIUs are paired and each member of a pair acts as a guardian for the other.
Each BIU performs its own clock synchronization and has its own copy of the schedule. As a result, SAFEbus is expensive: its nodes cost a few hundred dollars each.

2.4.2 TTA

In TTA-bus, the guardians have their own clock oscillators and an independent copy of the schedule, but they are not able to synchronize independently, and they share the same power supply and physical environment as their controllers. Most of the functionality of a bus guardian is shared across both bus lines. TTA-bus guardians are synchronized by a start-of-round signal received from their controller. If this signal is given at the wrong time, then the guardian will open its window at the wrong time and will allow its (presumably faulty) controller to transmit at that wrong time. However, its transmission will either collide with a legitimate transmission, resulting in garbage, or will hit the slot of an already excluded node, or some other unused part of the frame. In no case will it be acknowledged, so the errant controller will shut down if it is not too faulty to follow the TTP/C protocol; in any case, the other nodes of the system will exclude both the errant node (since it will have failed to broadcast in its own slot) and the node (if any) whose slot it usurped, and will thereafter proceed unhindered. The errant node will not be able to repeat its trick because a guardian places tight limits on how far the start-of-round signal can move (which is enough to reduce this scenario to extremely low probability in the first place).

In TTA-star, guardian functionality is moved to the central hub. Since there is now only one guardian per interconnect, rather than one per node, more resources can be expended in its construction. The guardian in a hub is able to perform independent clock synchronization; it is therefore a fully independent FCU, and provides full coverage against clock synchronization faults and babbling fault modes in all controllers. The penalty is that a central hub is a single point of failure and, as an active entity, it is probably less reliable than a passive bus. This penalty is overcome by duplication, which has the concomitant benefit of providing tolerance for spatial proximity faults.

2.4.3 SPIDER

The scheduling aspects of SPIDER are not well documented as yet, but the bus guardian functionality is handled in the RMUs following an approach due to Palumbo [Pal96].

2.4.4 FlexRay

Operation of the bus guardians of FlexRay is not described in any detail in the available documents. Published diagrams show two guardians per node sharing the same chip and power supply as the controller. These guardians presumably operate in a manner similar to those of TTA-bus, with similar vulnerabilities, though at greater cost (since two guardians per node presumably require two oscillators).

2.5 Startup and Restart

It is obviously necessary to be able to start up the bus and its associated components from cold, preferably with no outside assistance. Modern aircraft systems generally have enough redundancy to operate for several days with some components failed or faulty [Hop88]. This allows repairs to be deferred until the aircraft's schedule brings it to a major maintenance site. Car owners are notorious for deferring maintenance and for operating their vehicles with faults present. These characteristics make it necessary that startup can be performed correctly in the presence of faults.
Restart during operation may be necessary if HIRF or other environmental influences lead to violation of the fault hypothesis and thereby cause complete failure of the bus. Notice that this failure must be detected by the bus, and the restart must be automatic and very fast: most control systems can tolerate loss of control inputs (e.g., by reverting to some form of local control and either releasing the actuators, or freezing them in the previously commanded position) for only a few cycles; longer outages will lead to loss of control. For example, Heiner and Thurner of DaimlerChrysler estimate that the maximum transient outage time for a steer-by-wire automobile application is 50 ms [HT98]. Given that other (e.g., host-level) activities may need to be performed on restart, this suggests 10 ms as a reasonable goal for bus restart. The presence of faulty components could complicate, or even prevent, restart (e.g., if multiple faults are present, but some of the algorithms can tolerate only single faults), so it is desirable that previous reconfigurations (e.g., those that excluded faulty components) should be recorded in a way that makes this information available during restart.

Restart is usually initiated when an interface detects no activity on any bus line for some interval; that interface will then transmit some wake-up message on all lines. Of course, it is possible that the interface in question is faulty (and there was bus activity all along but that interface did not detect it), or that two interfaces decide simultaneously to send the wake-up call. The first possibility must be avoided by careful checking, preferably by independent units (e.g., both interfaces of a pair, or an interface and its guardian); the second requires some form of collision detection and resolution: this should be deterministic, to guarantee an upper bound on the time to reach resolution (which will allow a single interface to send an uninterrupted wake-up message), and, ideally, should not depend on reliable collision detection (because there is no such thing).

Components that detect faults in their own operation, or that are notified of their faulty operation by others (e.g., through failed comparison with a paired component, or by exclusion from the group in a system that employs group membership), may drop off the bus and undergo local restart and self-test. If the test is successful (i.e., the fault was transient), then the component will attempt to reintegrate itself into the ongoing operations of the bus. This is a delicate operation because the sudden arrival of a new participant in the bus traffic can present symptoms rather like a fault, and can be particularly challenging to handle if a real fault is manifested simultaneously, or if another component rejoins at the same time.

Restart of the whole bus, and reintegration of individual components, can be interpreted as self-stabilizing services: self-stabilizing algorithms are those that converge to a stable state from any initial state [Dij74, Sch93]. Such algorithms generally assume that no faults are present (or that transient faults have ceased) once stabilization begins; in the circumstances considered here, however, it is possible that some residual faults remain during stabilization. Thus, the algorithms employed for restart and reintegration in bus architectures do not make explicit reference to self-stabilization, although this may provide an attractive framework for their formal analysis.
Frameworks that integrate self-stabilization with fault tolerance have been proposed [AG93, GP93] that may provide a useful foundation for this endeavor.

2.5.1 SAFEbus

A SAFEbus node in the out-of-sync state listens on the bus for a Long Resync; if it finds one, it uses the information in that message to integrate itself into the ongoing activity of the bus. If a Long Resync is not detected for a certain length of time, the node transmits an Initial Sync message on all buses (note that both BIUs in the node must agree on this action). Due to the OR-gate character of the SAFEbus lines, and the coding used for the Initial Sync message, it does no harm if several nodes attempt to send this message nearly simultaneously. After sending or receiving an Initial Sync message, a node waits a specified amount of time and then sends a Long Resync message; all nodes should be reintegrated and the bus restarted at this point. SAFEbus uses the same mechanisms for cold start and restart; these are very fast, as nodes will send Initial Sync messages after a timeout that is little longer than a single round of the cyclic schedule, and the bus will be synchronized and operational in the round after that. Reintegration is even faster, as the reintegrating node need wait only for a Long Resync to be sent, and each node initiates at least one of these per round. The SAFEbus mechanisms are fully decentralized and very robust (e.g., they do not depend on collision detection).

2.5.2 TTA

TTA's mechanisms for cold start, restart, and reintegration are conceptually similar to those of SAFEbus, but cannot use the electrical properties of the bus (because TTA operates above this level, and can be used with any transmission technology) and are therefore rather more complex. In addition, TTA uses distributed group membership information, and it is necessary to initialize or update this consistently. A TTA controller that has been excluded from the membership, or that has recently restarted, listens to activity on the bus until it recognizes an I-Frame message. These are broadcast periodically, and contain sufficient information to initialize the clock and membership vector of the controller. The controller then observes the bus activity for a further round to check that its C-State (i.e., the controller state information that is encoded in the CRCs attached to each message) is consistent with the other controllers on the bus, and thereafter resumes normal operation.

If no I-Frame is detected, the controller will transmit one itself after a certain interval, but on only one of the two buses (presumably this limitation to a single bus is intended to limit the harm caused by a faulty controller that attempts to cold start an already-functioning bus). The membership indicated in this I-Frame will contain only the controller that sends it. Other controllers that receive this I-Frame will synchronize to it and start executing the schedule indicated in their MEDLs. During the subsequent round of messages, each controller will add others to its membership when it observes their broadcasts. All nodes should be fully integrated by the end of the first round following the cold start I-Frame. Some ambiguities in the description of the state machine that specifies these transitions [TTT99, Section 9.3] were identified in simulation by Bradbury [Bra00].

It is possible that two controllers could send a cold start I-Frame at the same time, resulting in a collision on the bus.
This should cause no recognizable message to be received; the two initiating controllers will have different timeouts, so their subsequent attempts will not collide, and one of them will succeed in starting the bus (modulo the possibility of collisions with other controllers, which are resolved in the same way). The danger is that the colliding messages may not be received the same everywhere (they will be traveling down the bus from different sources, at finite speed), and that some nodes will receive one or the other of the messages, while others receive an invalid message. One proposed solution to this danger is for nodes sending cold start I-Frames always to act as if a collision had occurred. A related problem can arise because the cold start I-Frame is sent on only a single bus, where SOS or other faults may cause asymmetric reception. Here, as in the case of asymmetric collision detection, it is possible for some controllers to synchronize to a cold start I-Frame while others do not; the latter may subsequently synchronize to a different initial I-Frame, resulting in two coexisting cliques. A proposed solution for this case is to send cold start I-Frames on both buses, and to deal with faulty transmission of cold start frames in the bus guardian of the hub. Another problem can arise because there are no checks on the content of the cold start I-Frame. A faulty controller could provide bad data (e.g., an undefined mode number or MEDL position) and thereby cause good receivers to detect errors and shut down. Proposed solutions to this problem include more checking by the bus guardian of a central hub, or a more truly distributed start/restart algorithm. Recent analysis has exposed the problems in TTA startup and restart described above. These can arise only in highly unusual circumstances, and are being addressed in the design of the new TTA-star configuration.

2.5.3 SPIDER

These aspects of SPIDER are not documented as yet.

2.5.4 FlexRay

As described below in Section 2.7.4, FlexRay differs from SAFEbus and TTA in that the full schedule for the system is not installed in each node during construction. Instead, each node is initialized only with its own schedule, and learns the full configuration of the system by observing message traffic during startup. This seems vulnerable to masquerading faults. The method for initial synchronization of clocks in FlexRay is not described. It is difficult to initialize the Welch-Lynch algorithm if faults are present at startup: [Min93] describes scenarios that lead to independent cliques. It seems that TTA's clique-avoidance protocol will rescue it from these scenarios, but in the absence of such a mechanism, it is not clear how FlexRay can do so. There are clock synchronization algorithms that self-stabilize in the presence of faults (e.g., [DW95]), but these are complex, or rely on randomization. Randomization is generally considered unacceptable in a safety-critical system because correct operation is only probabilistically guaranteed, but it may be acceptable during startup (though recall the failure of the first attempt to launch the Space Shuttle [SG84]) or as part of a never-give-up strategy.

2.6 Services

The essential basic purpose of these architectures is to make it possible to build reliable distributed applications; a desirable purpose is to make it straightforward to build such applications. The basic services provided by the bus architectures considered here comprise clock synchronization, time-triggered activation, and reliable message delivery.
Some of the architectures provide additional services; their purpose is to assist straightforward construction of reliable distributed applications by providing these services in an application-independent manner, thereby relieving the applications of the need to implement these capabilities themselves. Not only does this simplify the construction of application software; it is sometimes possible to provide better services when these are implemented at the architecture level, and it is also possible to provide strong assurance that they are implemented correctly.

Applications that perform safety-critical functions must generally be replicated for fault tolerance. There are many ways to organize fault-tolerant replicated computations, but a basic distinction is between those that use exact agreement and those that use approximate agreement. Systems that use approximate agreement generally run several copies of the application in different nodes, each using its own sensors, with little coordination across the different nodes. The motivation for this is a folk belief that it promotes fault tolerance: coordination is believed to introduce the potential for common mode failures. Because different sensors cannot be expected to deliver exactly the same readings, the outputs (i.e., actuator commands) computed in the different nodes will also differ. Thus, the only way to detect faulty outputs is by looking for values that differ by a lot from the others. Hence, these systems use some form of selection or threshold voting to select a good value to send to the actuators (a sketch of such a selection function appears below), and similar techniques to identify faulty nodes that should be excluded.

Brilliant, Knight, and Leveson describe some of the difficulties with this approach in the context of N-version programming [BKL89]. The most troublesome of these for applications of the kind considered here is that hosts accumulate state that diverges from that of others over time (e.g., velocity and position as a result of integrating acceleration), and they execute mode switches that are discrete decisions based on local sensor values (e.g., change the gain schedule in the control laws if the altitude, or temperature, is above a specific value). Thus, small differences in sensor readings can lead to major differences in outputs, and this can mislead the approximate selection or voting mechanisms into choosing a faulty value, or excluding a nonfaulty node. The fix to these problems is to attempt to coordinate discrete mode switches and periodically to bring state data into convergence. But these fixes are highly application specific, and they are contrary to the original philosophy that motivated the choice of approximate agreement; hence, there is a good chance of doing them wrong. There are numerous examples that justify this concern; several that were discovered in flight tests are documented by Mackall and colleagues [IRM84, Mac85, Mac88] and summarized in [Rus93b, Section 3.3]. The essential point of Mackall's data is that all the failures observed in flight test were due to bugs in the design of the fault tolerance mechanisms themselves, and all these bugs could be traced to difficulties in organizing and coordinating systems based on approximate agreement. Systems based on exact agreement face up to the fact that coordination among replicated computations is necessary, and they take the necessary steps to do it right.
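For concreteness, here is a minimal C sketch of the kind of selection voting just described. It represents no particular system's voter; real voters add validity windows, rate limits, and similar checks.

    #include <stdlib.h>

    /* Comparator for qsort over doubles. */
    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Mid-value selection over redundant, approximately-agreeing actuator
     * commands: sort the candidates and take the median, so that a single
     * wild value from a faulty channel cannot be the one selected. */
    double midvalue_select(double *vals, size_t n)
    {
        qsort(vals, n, sizeof vals[0], cmp_double);
        return vals[n / 2];   /* with n = 3, classic mid-value select */
    }

With three channels, one wild value can never be chosen; but, as the flight-test experience cited above shows, the difficulty lies less in the voter than in keeping the replicas' states close enough for such selection to remain meaningful.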
If we are to use exact agreement, then every replica must perform the same computation on the same data: any disagreement on the outputs then indicates a fault; comparison can be used to detect those faults, and majority voting to mask them. A vital element in this approach to fault tolerance is that replicated components must work on the same data: thus, if one node reads a sensor, it must distribute that reading to all the redundant copies of the application running in other nodes. Now, a fault in that distribution mechanism could result in one node getting one value and another a different one (or no value at all). This would abrogate the requirement that all replicas obtain identical inputs, so we need to employ mechanisms to overcome this behavior. The problem of distributing data consistently in the presence of faults is variously called interactive consistency, consensus, atomic broadcast, or Byzantine agreement [PSL80, LSP82]. When a node transmits a message to several receivers, interactive consistency requires the following two properties to hold.

Agreement: All nonfaulty receivers obtain the same message (even if the transmitting node is faulty).

Validity: If the transmitter is nonfaulty, then nonfaulty receivers obtain the message actually sent.

Algorithms for achieving these requirements in the presence of arbitrary faults necessarily involve more than a single data exchange (basically, each receiver must compare the value it received against those received by others). It is provably impossible to achieve interactive consistency in the presence of a arbitrary faults unless there are at least 3a + 1 FCUs, 2a + 1 disjoint communication paths (or a + 1 disjoint broadcast channels) between them, and a + 1 levels (or rounds) of communication. Some of the parameters, but not the number of rounds required, can be reduced by using digital signatures.

The problem might seem moot in architectures that employ a physical bus, since a bus surely cannot deliver values inconsistently (so the agreement property is achieved trivially). Unfortunately, it can, though this is likely to be a very rare event. The scenarios involving SOS faults presented earlier exemplify some possibilities. Dealing properly with very rare events is one of the attributes that distinguishes a design that is fit for safety-critical systems from one that is not. It follows that either the application software must perform interactive consistency for itself (incurring the cost of n^2 messages to establish consistency across n nodes in the presence of a single arbitrary fault), or the bus architecture must do it, or the bus architecture must eliminate the fault modes that necessitate multiple rounds of information exchange (so that consistency is achieved by a simple broadcast). The first choice is so unattractive that it vitiates the whole purpose of a fault-tolerant bus architecture; the second is described below in separate subsections for those architectures that provide it. The third choice hinges on elimination of asymmetric transmissions (i.e., those that appear as one value to some receivers, and as different values, or the absence of values, to others). As noted, SOS faults are among the most plausible sources of asymmetric transmissions. SOS faults that cause asymmetric transmissions can arise in either the value or time domains (e.g., intermediate voltages, or weak edges, respectively).
In those architectures that employ a bus guardian in series with an interface, the bus guardian is a possible point of intervention for the control of SOS faults: a suitable guardian can reshape, in both value and time domains, the signal sent to it by the controller. Of course, the guardian could be faulty and may make matters worse, so this approach makes sense only when there are independent guardians on each of two (or more) replicated interconnects. Observe that for credible signal reshaping, the guardian must have a power supply that is independent of that of the controller (faults in power supply are the most likely cause of intermediate voltages and weak edges).

Interactively consistent message broadcast provides the foundation for fault tolerance based on exact agreement. There are several ways to use this foundation. One arrangement, confusingly called the state machine approach [Sch90], is based on majority voting: application replicas run on a number of different nodes, exchange their output values, and deliver a majority vote to the actuators. (There is often another round of voting performed directly by the actuators, through some form of physical force-summing: for example, outputs of different nodes may energize separate coils of a single solenoid, or multiple hydraulic pistons may be linked to a single shaft.) This approach was first developed by the SIFT project at SRI [WLG+78]. Usually, selected intermediate state values are voted as well as outputs (this promotes recovery from transients [Rus93a]), and the architecture can assist these activities by providing services that make the distribution and selection of voted values transparent to the application programs.

Another arrangement is based on self-checking (either by individuals or pairs) so that faults result in fail-silence. This will be detected by other nodes, and some backup application running in those other nodes can take over. The architecture can assist this master/shadow arrangement by providing services that support the rollover from one node to another. One such service automatically substitutes a backup node for a failed master (both the master and the backup occupy the same slot in the schedule, but the backup is inhibited from transmitting unless the master has failed). A variant has both master and backup operating in different slots, but the backup inhibits itself unless it is informed that the master has failed. A further variation, called compensation, applies when different nodes have access to different actuators: none is a direct backup to any other, but each changes its operation when informed that others have failed (an example is car braking: separate nodes controlling the braking force at each wheel will redistribute the force when informed that one of their number has failed).

The variations on master/shadow described above all depend on a failure notification, or equivalently a membership service. The crucial requirement on such a service is that it must produce consistent knowledge: that is, if one nonfaulty node thinks that a particular node has failed, then all other nonfaulty nodes must hold the same opinion; otherwise, the system will lose coordination, with potentially catastrophic results (e.g., if the nodes controlling braking at different wheels make different adjustments to their braking force based on different assessments of which others have failed).
Notice that this must also apply to a node's knowledge of its own status: a naive view might assume that a node that is receiving messages and seeing no problems in its own operation should assume it is in the membership. But if this node is unable to transmit, all other nodes will have removed it from their memberships and will be making suitable compensation on the assumption that this node has entered its blackout mode (and is, for example, applying no force to its brake). It could be catastrophic if this node does not adopt the consensus view and continues operation (e.g., applying force to its brake) based on its local assessment of its own health.

A membership service operates as follows. Each node maintains a private membership list, which is intended to comprise all and only the nonfaulty nodes. Since it can take a while to diagnose a faulty node, we have to allow the common membership to contain at most one faulty node. Thus, a membership service must satisfy the following two requirements.

Agreement: The membership lists of all nonfaulty nodes are the same.

Validity: The membership lists of all nonfaulty nodes contain all nonfaulty nodes and at most one faulty node.

These requirements can be achieved only under benign fault hypotheses (it is provably impossible to diagnose an arbitrary-faulty node with certainty). When unable to maintain accurate membership, the best recourse is to maintain agreement but sacrifice validity (nonfaulty nodes that are not in the membership can then attempt to rejoin). This weakened requirement is called clique avoidance. Note that it is quite simple to achieve consistent membership on top of an interactively consistent message service: each node broadcasts its own membership list to every other node, and each node runs a deterministic resolution algorithm on the (identical, by interactive consistency) lists received. It is much more difficult to achieve consistent membership in the absence of an interactively consistent message service (and will require multiple rounds of message exchange).

2.6.1 SAFEbus

SAFEbus provides two important services: interactively consistent message transmission, and automatic master/shadow rollover. Messages from one host to another traverse two BIUs on their way to the four separate buses, and two more BIUs on their way to a receiving host. Although not required by the ARINC 659 standard, Honeywell's implementation of SAFEbus has an additional path for cross-comparison of messages between the receiving BIUs. This additional cross-comparison is crucial to SAFEbus's ability to provide interactive consistency.

SAFEbus allows as many as four different nodes to occupy a single slot, managed as a master and as many as three shadows. If the master fails to send an expected message, then its first shadow will take over the slot within a few bit-times, and so on down to the third shadow. A faulty BIU in a shadow cannot usurp its master's slot inappropriately because it will be inhibited by the guardian function of its partner BIU. The interactively consistent message transmission provided by SAFEbus ensures that the master and shadows will all have seen an identical history of messages and can therefore provide seamless transfer of function (masters and shadows may not have the same state, since some shadows may be specified to provide degraded functionality). SAFEbus provides a special class of messages that allow masters to communicate specifically with their shadows.
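The rollover rule can be pictured with a short C sketch. It is entirely illustrative: the rank-proportional silence delay and its two-bit-time value are assumptions of mine, not SAFEbus parameters.

    /* A slot may be occupied by a master and up to three shadows. Each
     * occupant transmits only after observing a rank-proportional number
     * of bit-times of bus silence in the slot, so the highest-ranked live
     * occupant wins; an occupant whose BIU pair disagrees stays silent. */
    enum rank { MASTER = 0, SHADOW1, SHADOW2, SHADOW3 };

    #define DELAY_PER_RANK 2   /* assumed bit-times of silence per rank */

    int may_transmit(enum rank r, int silent_bits, int self_check_ok)
    {
        if (!self_check_ok)    /* fail-silent on internal disagreement */
            return 0;
        return silent_bits >= (int)r * DELAY_PER_RANK;
    }

The master (rank 0) transmits at the start of the slot; each shadow waits just long enough to be sure that all higher-ranked occupants are silent, so rollover completes within a few bit-times, as described above.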
In Honeywell's implementation, each node has a pair of hosts (CPMs) that cross-compare and fail silent on disagreement; the BIUs do the same (in any implementation), so the whole node is fail-silent. In this context, master/shadow rollover is an effective and straightforward way to provide high availability, and is preferred to other fault-masking methods such as majority voting. SAFEbus does not employ membership, but nodes read their own transmissions on the bus just as other receivers do and are therefore quickly able to detect if they have suffered a transmission fault, and can make suitable compensation, if necessary. Most SAFEbus applications rely on self-checking, fail-silence, and master/shadow rollover to provide fault tolerance, and do not require a membership service. It is, however, quite straightforward to implement such a service, if required, at the application level, given the underlying support for interactively consistent message transmission.

2.6.2 TTA

Interactive consistency requires that any transmission results in identical reception at all receivers. Unlike SAFEbus, TTA does not have enough independent signal paths and rounds of voting to achieve this property directly, but it achieves it indirectly in a very clever way. TTA uses high-grade checksums that can be considered equivalent to digital signatures. This eliminates the possibility that different recipients obtain different values from a single broadcast, leaving only a residual asymmetry between those that do receive a value and those that do not. This weak consistency is preferable to asymmetric reception, but is inadequate for the construction of consistent replicas and backups. Simple acknowledgments are insufficient to fix the difficulty, because they may be lost or received asymmetrically also. TTA, however, provides a property called clique avoidance as part of its membership service, and this is equivalent to a consistent acknowledgment: only those nodes that receive a message, or those that did not (whichever is in the majority), will remain in the membership [BP00, KBP01]. In common cases (where just one newly faulty node fails to receive, or a newly faulty sender fails to transmit), clique avoidance is achieved by the standard membership function and excludes exactly the faulty node. In rare cases, where there are multiple faults, or an asymmetric fault, the behavior of the clique-avoidance protocol is sound (the survivors have consistent state), but may be Draconian, in that nonfaulty processors may be removed from the membership (though they may rejoin at the next round).

The combination of checksummed transmissions and clique avoidance provides a form of interactive consistency related to Crusader Agreement [Dol82] and to Weak Byzantine Agreement [Lam83]. Crusader agreement is similar to Byzantine agreement (i.e., interactive consistency) except that when the transmitter is faulty, it is acceptable for some receivers not to agree with others on the value transmitted, provided they explicitly know that the transmitter is faulty. In the Draconian agreement achieved by TTA, this explicit knowledge is achieved through the clique-avoidance protocol and is associated with exclusion from the group. In weak Byzantine agreement, receivers may agree on any value whenever there is a fault (not just a fault in the transmitter).
In the agreement achieved by TTA, the surviving clique may agree in (incorrectly) ascribing no value received to a nonfaulty transmitter (and hence excluding it from the membership) when there are multiple faults within two rounds. The clique-avoidance component of TTA's group membership protocol (and hence the Draconian form of agreement) engages only when there are multiple faults within two rounds, or an asymmetric transmission; in all other circumstances, TTA's combination of checksummed transmissions and group membership provides interactive consistency of the classical form. The Draconian behavior is acceptable (and is the price paid for using fewer communication paths and messages than required for full interactive consistency) but undesirable, and its occurrences should be minimized. We cannot do much about multiple fault arrivals in a short interval, but asymmetric transmissions are most plausibly the consequences of SOS faults, and so it is these that we should seek to minimize. As noted earlier, it is difficult to control these faults with bus guardians that are integrated with the controllers, so this is a reason for preferring the TTA-star architecture to TTA-bus.

The TTA membership and clique-avoidance service not only supports a form of interactive consistency, but the membership information can be used to organize various master/shadow or compensation strategies for fault tolerance at the application level. TTA provides explicit support for shadow nodes, which occupy the same slots in the schedule as their master and can read all bus traffic, but cannot transmit until the master has failed. The membership service is also exploited internally by TTA, to allow it to operate in the presence of multiple faulty clocks (synchronization is performed only over nodes that are in the membership).

A proposed extension to TTA is a service that supports state machine replication in a transparent way [KB00]. The idea is to identify some of the state variables of an application as ones that should be voted. Exchange and voting of those variables is then managed by the TTA controllers in a way that is transparent to the application. This is accomplished by locating the voted variables in the area of memory that the host shares with its controller (hosts and controllers interface through dual-ported RAM). The application reads and writes those variables in the usual way; behind the scenes, multiple instances of the application will be running on different hosts; their controllers broadcast the values of voted variables to each other (exploiting the interactively consistent broadcasts provided by TTA), and replace their local copies by majority-voted versions. The attraction of this service is that it is truly transparent to the application: neither its function nor its timing is changed by the decision to make it fault tolerant using state machine replication.

2.6.3 SPIDER

As described in Section 2.3.3, the arrangement of BIUs and RMUs in the hub of SPIDER's ROBUS is similar to that of hosts and interstages in the Draper FTP. The motivation for the architecture of FTP (and possibly of SPIDER) was the desire to achieve interactive consistency with only three full processors, since this is all that is required to tolerate (using TMR) the failure of a host processor running the actual application. The problem, of course, is that interactive consistency requires at least four participants to tolerate a single arbitrary fault.
The architecture adopted in FTP overcomes this limitation by adding three impoverished processors (these are the interstages) that act rather like mirrors. The processors and interstages comprise six FCUs, so interactive consistency is feasible in theory, and it is achieved in practice by a very clever algorithm whose correctness has been formally verified [LR94]. The interactive consistency algorithm of SPIDER is similar to that of FTP (with RMUs taking the part of the interstages). The algorithm operates as follows.

1. A host sends its value to its BIU.
2. The BIU broadcasts this value to all RMUs.
3. RMUs broadcast the value received to all BIUs.
4. Each BIU performs a hybrid-majority vote on the values received and forwards the winner to its host. (A hybrid-majority vote is one from which manifestly bad values are excluded.)

This differs from the FTP algorithm in that the broadcast to all RMUs in the second step replaces (what would be in SPIDER terminology) a BIU-to-BIU broadcast and a BIU-to-own-RMU transfer.

2.6.4 FlexRay

In contrast to the other architectures considered, FlexRay provides no services beyond clock synchronization and reliable (best-efforts) message delivery. In particular, FlexRay does not provide interactively consistent message transmission (nor even weakly consistent transmission), provides no membership or failure notification service, and contains no mechanisms to control SOS faults.

2.7 Flexibility

The static schedule of a time-triggered bus is rather inflexible, so some bus architectures make provision for switching between different schedules at startup or during operation. The different schedules may be optimized for different missions, or phases of a mission (e.g., startup vs. cruise), for operating in a degraded mode (e.g., when some major function has failed), or to accommodate optional equipment (e.g., for cars with and without traction control). It is necessary to protect against inappropriate schedule switches (or switches initiated by faulty nodes), so some kind of voting is usually employed.

The physical wires or optical cables routed around an aircraft or car represent significant costs (e.g., in material, installation, maintenance, and weight) and there is strong interest in minimizing the number that are used. Some of the purposes typically performed by an event-driven bus such as CAN can be taken over by a time-triggered bus in a straightforward way; for other purposes, however, the flexible resource allocation of an event-driven bus is considered a necessity, so some way must be found to provide this capability within a time-triggered bus if it is to completely subsume existing event-triggered buses. The different buses approach the issue of flexibility in very different ways.

2.7.1 SAFEbus

A SAFEbus schedule (called a table) may comprise several frames; each frame is a self-contained description of the allocation of messages to time slots. Only one frame may be active at any one time. Slots allocated to Long Resync messages may be marked as allowing a frame change. In this case, the BIU that sends the Long Resync message in that slot indicates the new frame that is to be used. The old and new frames begin with different patterns of Short and Long Resync messages so that any receivers that enter the wrong frame (e.g., due to an asymmetric transmission fault) will fail to synchronize and drop off the bus (this is rather like the clique-avoidance protocol of TTA).
Frames can exist in several versions: the version of the new frame is also indicated in the Long Resync message, and any node whose table memory contains a different version will drop off the bus. Multiple frames provide some flexibility to choose among different modes of behavior, but it seems that the capability is used only at a very coarse level: for example, a table may have frames for hardware initialization, software initialization, self-test, and flight. The flight frame will contain no frame change commands: once entered, it cannot be left. SAFEbus provides no mechanisms for event-triggered behavior, but its BIUs are connected to an IEEE 1149.5 bus for test and maintenance purposes.

2.7.2 TTA

TTA schedules are precomputed and loaded into a data structure called the Message Descriptor List (MEDL) present in each controller. A limited form of the same data is present in each bus guardian. There is considerable flexibility in selection of the number and length of messages each node may transmit in each cycle, but the selection is fixed once it is loaded into the MEDL. TTA checks that all nodes have the same MEDL version during startup. The MEDL may allow certain nodes to request mode changes at certain points. A requested mode change may be either immediate or deferred; if the latter, nodes that transmit later in the cycle have the opportunity to override the mode change request, which occurs at the end of the cycle. All modes are based on the same schedule (so the bus guardians are not affected by mode changes): all that changes are the recipients and intended interpretation of the messages that are sent.

Space for diagnostic and other event data may be set aside in each message and used in some application-specific manner. This allows each node a fixed bandwidth for event data. A variation is for each node to interpret this data as a simulation of the traffic on a shared event-triggered bus. It is proposed to use this approach to provide a simulation of CAN; later versions of TTA are so fast that it is calculated that the simulation can go faster than a real CAN bus while absorbing only a small fraction of the TTA bandwidth. Observe that this approach brings all the safety attributes of TTA to the simulated event-triggered bus: if one node manifests faulty behavior on the simulated bus, the other nodes can remove it from their membership; this ability to detach a faulty node from the simulated bus is beyond the capability of any real event-triggered bus.

Although it appears that TTA can perform the role of a CAN bus in addition to its own, a full car or aircraft system is likely to need additional buses for secondary control functions (to say nothing of those for entertainment). For example, a car door contains motors and primitive controllers for the window, lock, and mirror; even a CAN bus provides excessive functionality and costs too much to connect each of these devices separately. Consequently, ultra-low-cost buses are emerging that can connect smart sensors and actuators to a gateway on a more muscular bus. These low-cost buses must operate with extremely primitive controllers that lack even a clock oscillator. TTA incorporates this kind of service through the TTP/A protocol [KHE00].

2.7.3 SPIDER

These elements of SPIDER are not yet developed.

2.7.4 FlexRay

FlexRay aims to be more flexible than the other buses considered here, and this seems to be reflected in the choice of its name.
FlexRay partitions each time cycle into a static time-triggered portion and a dynamic event-triggered portion. The division between the two portions is set at design time and loaded into the controllers and bus guardians. Nodes communicate using the Byteflight protocol during the event-driven portion of the cycle. A consortium similar to FlexRay's has developed the Local Interconnect Network (LIN) protocol [LIN00], and this is presumably used to provide a low-cost sensor bus in association with FlexRay (similar to TTP/A for TTA).

Unlike SAFEbus and TTA, FlexRay does not install the full schedule for the time-triggered portion in each controller. Instead, this portion of the cycle is divided into a number of slots of fixed size, and each controller and its bus guardians are informed of which slots are allocated to their transmissions. Nodes requiring greater bandwidth are assigned more slots than those that require less. Each controller learns the full schedule only when the bus starts up. Each node includes its identity in the messages that it sends; during startup, nodes use these identifiers to label their input buffers as the schedule reveals itself (e.g., if the messages that arrive in slots 1 and 7 carry identifier 3, then all nodes will thereafter deliver the contents of buffers 1 and 7 to the task that deals with input from node 3). There is an obvious vulnerability here: a faulty node could masquerade as another (i.e., send a message with the wrong identifier) during startup and thereby violate partitioning for the remainder of the mission. It is not clear how this fault mode is countered. Neither is it clear how configuration errors, in which two nodes are allocated to the same slot, are detected during startup. (Presumably the message received in that slot will be garbled by the collision and will fail checksum, but then what?)

2.8 Assurance

Safety-critical systems must be furnished with strong assurance that they are fit for their purpose. Regulation and certification establish requirements for assurance in some application areas (e.g., [RTC92, RTC00] for airborne software and hardware, respectively, and [MIS94] for cars). Assurance is generally achieved by a combination of testing the actual artifact, analysis and review of its design, and scrutiny of its design process. Since safety-critical systems, such as the bus architectures considered here, must be fault tolerant, some of the testing will involve fault injection. However, testing and fault injection can provide direct assurance only to failure rates of about 10^-4 or 10^-5 per hour, which are far short of those required for safety-critical applications. The remaining assurance must be derived by analysis of the system's design.

Formal methods can assist in this process. In formal methods, a mathematical model is constructed of key elements of the system's operation (e.g., its clock synchronization algorithm), and mechanized calculation is used to demonstrate that it meets its requirements, under its specified assumptions. The appropriate branch of applied mathematics for modeling discrete algorithms (whether they are destined for software or hardware implementation) is formal logic, and calculation is performed by the methods of automated deduction, such as theorem proving or model checking.
One attribute that renders formal methods particularly attractive in this domain is that they allow all behaviors of fault-tolerant algorithms to be examined through logical case analysis; this is especially powerful when considering arbitrary fault modes because, unlike explicit testing or simulation, we do not have to particularize the notion of arbitrary but can leave it totally unconstrained.

2.8.1 SAFEbus

SAFEbus was approved by the FAA as part of the certification for the Boeing 777 (the FAA certifies only complete aircraft, not components), and is a flight-critical part of every 777, whose commercial deliveries began in May 1995. Details of the assurance processes used have not been published, but must have been extensive, and are now supported by substantial field experience, with no failures recorded.

2.8.2 TTA

TTA implementations have been subjected to extensive fault-injection experiments in the context of the FIT project of the European Union, and evaluated in full-scale experimental applications developed by DaimlerChrysler and several other automobile companies and their suppliers. Aircraft engine controllers and cockpit automation systems under development by Honeywell will be certified under FAA requirements.

The basic Welch-Lynch clock synchronization protocol employed in TTA has been formally verified by Miner [Min93] and by Schwier and von Henke [SvH98]. The actual TTA protocol has been formally verified by Pfeifer, Schwier, and von Henke [PSvH99]. A new verification is planned (by me) that will extend the analysis beyond the standard fault hypothesis of TTA using a hybrid fault model developed by Schmid [Sch00]. The membership and clique-avoidance protocol of TTA has been formally verified by Pfeifer [Pfe00], but only under the standard fault hypothesis of TTA. Formal verification of its properties in the presence of asymmetric transmissions, and of fault numbers and arrival rates beyond those of the fault hypothesis, is in progress. Some of these properties have already been verified by traditional (i.e., not mechanically checked) mathematical proofs [BP00, Mer01]. Among the simpler properties of TTA, the timing rules for controllers and bus guardians have been formally verified [Rus01], and mutual exclusion on the bus has been examined by model checking [MMP99]. The main remaining challenges are to formally verify the properties of startup, restart, and reintegration, and to compose the many separate analyses into a single verification of the integrated TTP/C protocol.

2.8.3 SPIDER

The interactive consistency algorithm of SPIDER has been formally verified by Miner (the verification is similar to that previously performed for the Draper FTP architecture [LR94]). Its diagnosis algorithm has also recently been formally verified by Geser; it is similar to the algorithms developed for MAFT [KWFT88], whose verification is described by Walter, Lincoln, and Suri [WLS97]. Formal verification of the SPIDER clock synchronization algorithm is in progress. One of the main goals of the SPIDER project is to serve as a demonstration study for certification under the DO-254 guidelines for airborne hardware [RTC00].

2.8.4 FlexRay

FlexRay is still under development. The only assurance technique so far described is a simulation model being developed by Motorola. As noted above, the basic Welch-Lynch clock synchronization algorithm has been formally verified. However, FlexRay documents speak of the system being operational as soon as two clocks synchronize.
This is outside the parameters of the formal analyses, which would need to be revisited and extended to cover the cases n = 2 and n = 3. Furthermore, there are known pathologies in initialization of the Welch-Lynch algorithm in the presence of faults that can lead to clique formation [Min93]. It is not described how FlexRay avoids these, and verification of its startup mechanisms could be very challenging. The other protocols of FlexRay are not described in sufficient detail to assess the feasibility of their formal verification.

Chapter 3 Conclusion

The four buses considered here provide different solutions to very similar sets of requirements. All provide fault-tolerant, distributed clock synchronization, and support the time-triggered model of computation. They differ in their fault hypotheses, mechanisms, services, assurance, performance, and cost.

SAFEbus is the most mature of the four, and makes the fewest compromises. It employs paired bus interface units, with each member of a pair acting as a bus guardian for the other, and paired, self-checking buses. Its fault hypothesis includes arbitrary faults, faults in several nodes (but only one per node), and a high rate of fault arrivals, and it never gives up. It tolerates spatial proximity faults by duplicating the entire system. It provides interactively consistent message broadcasts (in the Honeywell implementation), and supports application-level fault tolerance (based on self-checking pairs) by providing automatic rapid rollover from masters to shadows. It is certified for use in passenger aircraft and has extensive field experience as the backbone for the integrated avionics on the Boeing 777. The Honeywell implementation is supported by an in-house tool chain. The raw bus operates at 30 MHz, is two bits wide, and achieves high utilization. Because each of its major components is paired (and its bus requires separate lines for clock and data), it is the most expensive of those available for commercial use (typically, a few hundred dollars per node).

TTA is the next most mature. In its TTA-bus configuration, it is vulnerable to spatial proximity faults, and its bus guardians are not fully independent of its interface controllers, so the TTA-star configuration is generally to be preferred. Its fault hypothesis includes arbitrary faults, and faults in several nodes (but only one per node), provided these arrive at least two rounds apart. It never gives up and has a well-defined recovery strategy for fault arrivals that exceed this hypothesis. It provides a form of interactively consistent message broadcasts and a consistent membership service. Proposed extensions provide state machine replication in a manner that is transparent to applications. Other proposed extensions provide a fully protected event-based service within the time-triggered framework. Its prototype implementations have been subjected to extensive testing and fault injections, and deployed in experimental vehicles. Several of its algorithms have been formally verified, and aircraft applications under development are planned to lead to FAA certification. It is supported by an extensive tool suite that interfaces to standard CAD environments (Matlab/Simulink). Current implementations provide 25 Mbit/s data rates; research projects are designing implementations for gigabit rates. TTA controllers and the star coupler (which is basically a modified controller) are quite simple and cheap to produce in volume.
SPIDER is a research project, so it is unfair to compare it directly with the commercial products. Its fault hypothesis uses a hybrid fault model, which includes arbitrary faults and allows some combinations of multiple faults. It provides interactively consistent message broadcasts. Its algorithms are novel and highly efficient, and are being formally verified. It is planned to be used as a demonstration study for certification under the DO-254 guidelines for airborne hardware. SPIDER is interesting because it uses very strong algorithms and can use different topologies than the other buses.

FlexRay is still under development. It has no stated fault hypothesis, and appears to have no mechanisms to counter certain fault modes (e.g., SOS faults or other sources of asymmetric broadcasts, and masquerading on startup). Its bus guardians are not fully independent of their controllers. Its clock synchronization can tolerate faults in no more than a third of its nodes, and its initialization in the presence of faults is not described. A never-give-up strategy is not described. It provides no services to its applications beyond best-effort message delivery. Event-based services share the same bus; bus guardians protect only the time-triggered section of the bus cycle. No systematic or formal approaches to assurance or certification are described. It is the slowest of the commercial buses, with a claimed data rate of no more than 10 Mbit/s. It is asserted to be cheap to produce in volume, but this is questionable since each node requires three clock oscillators (one for the controller and one for each of the bus guardians).

Some of the deficiencies of FlexRay may be overcome as its development proceeds, but the decision to provide no services to support fault-tolerant applications seems a deliberate and irreversible design choice. This means that all mechanisms for fault-tolerant applications must be provided by the application programs themselves. Thus, application programmers, who may have little experience in the subtleties of fault-tolerant systems, become responsible for the design, implementation, and assurance of very delicate mechanisms with no support from the underlying bus architecture. Not only does this increase the cost and difficulty of making sure that things are done right, it also increases the computational cost and latency of the resulting mechanisms. For example, in the absence of an interactively consistent message service provided by the architecture, application programs must explicitly transmit the multiple rounds of cross-comparisons that are needed to implement this service at a higher level, thereby substantially increasing the message load (see the sketch below). Such a cost will invite inexperienced developers to seek less expensive ways to achieve fault tolerance, in probable ignorance of the impossibility results in the theoretical literature and of the history of intractable Heisenbugs (rare, unrepeatable failures) encountered by practitioners who pushed for 10⁻⁹ with inadequate foundations.
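To make that message load concrete, here is a sketch in the style of the classical single-fault oral-messages algorithm, OM(1); the function names and values are mine, not from any FlexRay document. After the source's ordinary broadcast, every receiver relays the value it received to all the others, and each then votes over all the copies it holds, so even this smallest case costs on the order of n² messages per agreed value.

```python
from collections import Counter

def majority(values):
    """Majority vote, with None as a deterministic default on a tie."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

def cross_compare(received):
    """One application-level round of cross-comparison (single-fault case).
    `received` maps each receiver to the value the source sent it in the
    first round (the values differ if the source is faulty); each receiver
    relays its value to the others, and every receiver votes over its own
    copy plus all relayed copies."""
    receivers = list(received)
    return {r: majority([received[r]] +
                        [received[q] for q in receivers if q != r])
            for r in receivers}

# A faulty source sends inconsistent values, yet after the extra round of
# relays all non-faulty receivers reach the same decision.
print(cross_compare({"B": 1, "C": 1, "D": 0}))  # {'B': 1, 'C': 1, 'D': 1}
```

An architecture that provides interactively consistent broadcast performs this exchange (or avoids it altogether, as with self-checking hardware) below the application, where it can be scheduled and assured once for all functions.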
A safety-critical bus architecture provides certain properties and services that assist in the construction of safety-critical systems. As with any system framework or middleware package, these buses offer a tradeoff to system developers: they provide a coherent collection of services, with strong properties and highly assured implementations, but developers must sacrifice some design freedom to gain the full benefit of these services. For example, all these buses use a time-triggered model of computation, and system developers must build their applications within that framework. In return, the buses are able to guarantee strong partitioning: faults in individual components or applications (functions, in avionics terms) cannot propagate to others, nor can they bring down the entire bus (within the constraints of the fault hypothesis).

Partitioning is the minimum requirement, however. It ensures that one failed function will not drag down others, but in most safety-critical systems the failure of even a single function can be catastrophic, so the individual functions must themselves be made fault tolerant. Accordingly, most of the buses provide mechanisms to assist the development of fault-tolerant applications. The key requirement here is interactively consistent message transfer: this ensures that all masters and shadows (or masters and monitors), or all members of a voting pool, maintain consistent state. Three of the buses considered here provide this basic service; some of them do so in association with other services, such as master/shadow rollover or group membership, that can be provided with much reduced latency when implemented at a low level. FlexRay, alone, provides none of these services.

It is unlikely that any single bus architecture will satisfy all needs and markets, and it is to be expected that new or modified designs will emerge to satisfy new requirements. I hope that the comparison provided here will help potential users to select the existing bus best suited to their needs, and that it will help designers of new buses to learn from and build on the design choices made by their predecessors.

Acknowledgments

I am grateful for helpful comments received from Bruno Dutertre, Kurt Liebel, Paul Miner, Ginger Shao, and Christian Tanzer.

Bibliography

[AG93] Anish Arora and Mohamed Gouda. Closure and convergence: A foundation of fault-tolerant computing. IEEE Transactions on Software Engineering, 19(11):1015–1027, November 1993.

[ARI93] Aeronautical Radio, Inc., Annapolis, MD. ARINC Specification 659: Backplane Data Bus, December 1993. Prepared by the Airlines Electronic Engineering Committee.

[ARI56] Aeronautical Radio, Inc., Annapolis, MD. ARINC Specification 629: Multi-Transmitter Data Bus; Part 1, Technical Description (with five supplements); Part 2, Application Guide (with one supplement), December 1995/6. Prepared by the Airlines Electronic Engineering Committee.

[B+01] Josef Berwanger et al. FlexRay: The communication system for advanced automotive control systems. In SAE 2001 World Congress, Detroit, MI, April 2001. Society of Automotive Engineers. Paper number 2001-01-0676.

[BKL89] Susan S. Brilliant, John C. Knight, and Nancy G. Leveson. The consistent comparison problem in N-Version software. IEEE Transactions on Software Engineering, 15(11):1481–1485, November 1989.

[BP00] Günther Bauer and Michael Paulitsch. An investigation of membership and clique avoidance in TTP/C. In 19th Symposium on Reliable Distributed Systems, Nuremberg, Germany, October 2000.

[Bra00] David Bradbury. Simulation of a Time Triggered Protocol. Honours thesis, Basser Department of Computer Science, Sydney University, Australia, 2000. Available from http://www.cs.usyd.edu.au/agathe/pub/Thesis091100.pdf.

[But92] Ricky W. Butler. The SURE approach to reliability analysis. IEEE Transactions on Reliability, 41(2):210–218, June 1992.

[Byt] Byteflight Specification. Available at http://www.byteflight.com.
[Deu95] Deutsche Industrie Norm, Berlin, Germany. Profibus Standard: DIN 19245, 1995. Two volumes; see also http://www.profibus.com.

[DHS86] Danny Dolev, Joseph Y. Halpern, and H. Raymond Strong. On the possibility and impossibility of achieving clock synchronization. Journal of Computer and System Sciences, 32(2):230–250, April 1986.

[Dij74] Edsger W. Dijkstra. Self-stabilizing systems in spite of distributed control. Communications of the ACM, 17(11):643–644, November 1974.

[Dol82] Danny Dolev. The Byzantine Generals strike again. Journal of Algorithms, 3(1):14–30, March 1982.

[DW78] Daniel Davies and John F. Wakerly. Synchronization and matching in redundant systems. IEEE Transactions on Computers, C-27(6):531–539, June 1978.

[DW95] Shlomi Dolev and Jennifer L. Welch. Self-stabilizing clock synchronization with Byzantine faults. In Fourteenth ACM Symposium on Principles of Distributed Computing, page 256, Ottawa, Ontario, Canada, August 1995. Association for Computing Machinery.

[Ech99] Echelon Corporation, Palo Alto, CA. Introduction to the LonWorks System, 1999. Available at http://osa.echelon.com/Program/LonWorksIntroPDF.htm.

[FAA88] Federal Aviation Administration. System Design and Analysis, June 21, 1988. Advisory Circular 25.1309-1A.

[FLM86] Michael J. Fischer, Nancy A. Lynch, and Michael Merritt. Easy impossibility proofs for distributed consensus problems. Distributed Computing, 1:26–39, 1986.

[GLR95] Li Gong, Patrick Lincoln, and John Rushby. Byzantine agreement with authentication: Observations and applications in tolerating hybrid and link faults. In Ravishankar K. Iyer, Michele Morganti, W. Kent Fuchs, and Virgil Gligor, editors, Dependable Computing for Critical Applications 5, volume 10 of Dependable Computing and Fault Tolerant Systems, pages 139–157, Champaign, IL, September 1995. IEEE Computer Society.

[GP93] Ajei S. Gopal and Kenneth J. Perry. Unifying self-stabilization and fault-tolerance. In Twelfth ACM Symposium on Principles of Distributed Computing, pages 195–206, Ithaca, NY, August 1993. Association for Computing Machinery.

[HD92] Kenneth Hoyme and Kevin Driscoll. SAFEbus™. In 11th AIAA/IEEE Digital Avionics Systems Conference, pages 68–73, Seattle, WA, October 1992.

[HD93] Kenneth Hoyme and Kevin Driscoll. SAFEbus™. IEEE Aerospace and Electronic Systems Magazine, 8(3):34–39, March 1993.

[HDHR91] Kenneth Hoyme, Kevin Driscoll, Jack Herrlin, and Kathie Radke. ARINC 629 and SAFEbus™: Data buses for commercial aircraft. Scientific Honeyweller, pages 57–70, Fall 1991.

[Hop88] Harry Hopkins. Fit and forget fly-by-wire. Flight International, pages 89–92, December 3, 1988.

[HT98] Günter Heiner and Thomas Thurner. Time-triggered architecture for safety-related distributed real-time systems in transportation systems. In Fault Tolerant Computing Symposium 28, pages 402–407, Munich, Germany, June 1998. IEEE Computer Society.

[IRM84] Stephen D. Ishmael, Victoria A. Regenie, and Dale A. Mackall. Design implications from AFTI/F16 flight test. NASA Technical Memorandum 86026, NASA Ames Research Center, Dryden Flight Research Facility, Edwards, CA, 1984.

[ISO93] International Standards Organization, Switzerland. ISO Standard 11898: Road Vehicles - Interchange of Digital Information - Controller Area Network (CAN) for High-Speed Communication, November 1993.

[KB00] Hermann Kopetz and Günther Bauer. Transparent redundancy in the time-triggered architecture. In The International Conference on Dependable Systems and Networks, pages 5–13, New York, NY, June 2000. IEEE Computer Society.
[KBP01] Hermann Kopetz, Günther Bauer, and Stefan Poledna. Tolerating arbitrary node failures in the Time-Triggered Architecture. In SAE 2001 World Congress, Detroit, MI, March 2001. Society of Automotive Engineers. SAE paper number 2001-01-0677.

[KG93] H. Kopetz and G. Grünsteidl. TTP: A time-triggered protocol for fault-tolerant real-time systems. In Fault Tolerant Computing Symposium 23, pages 524–533, Toulouse, France, June 1993. IEEE Computer Society.

[KG94] Hermann Kopetz and Günter Grünsteidl. TTP: A protocol for fault-tolerant real-time systems. IEEE Computer, 27(1):14–23, January 1994.

[KHE00] H. Kopetz, M. Holzmann, and W. Elmenreich. A universal smart transducer interface: TTP/A. In Third IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, Newport Beach, CA, March 2000. IEEE Computer Society.

[KWFT88] R. M. Kieckhafer, C. J. Walter, A. M. Finn, and P. M. Thambidurai. The MAFT architecture for distributed fault tolerance. IEEE Transactions on Computers, 37(4):398–405, April 1988.

[Lal86] Jaynarayan H. Lala. A Byzantine resilient fault tolerant computer for nuclear power application. In Fault Tolerant Computing Symposium 16, pages 338–343, Vienna, Austria, July 1986. IEEE Computer Society.

[Lam83] Leslie Lamport. The weak Byzantine Generals problem. Journal of the ACM, 30(3):668–676, July 1983.

[LIN00] Local Interconnect Network (LIN). See http://www.lin-subbus.org/, 2000.

[LMS85] L. Lamport and P. M. Melliar-Smith. Synchronizing clocks in the presence of faults. Journal of the ACM, 32(1):52–78, January 1985.

[LR94] Patrick Lincoln and John Rushby. Formal verification of an interactive consistency algorithm for the Draper FTP architecture under a hybrid fault model. In COMPASS '94 (Proceedings of the Ninth Annual Conference on Computer Assurance), pages 107–120, Gaithersburg, MD, June 1994. IEEE Washington Section.

[LSP82] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine Generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, July 1982.

[LT82] E. Lloyd and W. Tye. Systematic Safety: Safety Assessment of Aircraft Systems. Civil Aviation Authority, London, England, 1982. Reprinted 1992.

[Lyn96] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, San Francisco, CA, 1996.

[Mac85] Dale A. Mackall. Qualification needs for advanced integrated aircraft. NASA Technical Memorandum 86731, NASA Ames Research Center, Dryden Flight Research Facility, Edwards, CA, 1985.

[Mac88] Dale A. Mackall. Development and flight test experiences with a flight-crucial digital control system. NASA Technical Paper 2857, NASA Ames Research Center, Dryden Flight Research Facility, Edwards, CA, 1988.

[Mer01] Agathe Merceron. Proving no cliques in a protocol. In 24th Australasian Computer Science Conference, Gold Coast, Queensland, Australia, January/February 2001. IEEE Computer Society. Available from http://www.cs.usyd.edu.au/agathe/pub/goldCoastR.pdf.

[Min93] Paul S. Miner. Verification of fault-tolerant clock synchronization systems. NASA Technical Paper 3349, NASA Langley Research Center, Hampton, VA, November 1993.

[Min00] Paul S. Miner. Analysis of the SPIDER fault-tolerance protocols. In C. Michael Holloway, editor, LFM 2000: Fifth NASA Langley Formal Methods Workshop, Hampton, VA, June 2000. NASA Langley Research Center. Slides available at http://shemesh.larc.nasa.gov/fm/Lfm2000/Presentations/lfm2000-spider/.
[MIS94] The Motor Industry Software Reliability Association, Nuneaton, UK. Development Guidelines for Vehicle Based Software, November 1994. PDF version 1.1, January 2001 edition. Available at http://www.misra.org.uk.

[MMP99] Agathe Merceron, Monika Muellerburg, and G. Michele Pinna. No collisions in a protocol with n stations: A comparative study of formal proofs. In 5th International Workshop on Formal Methods for Industrial Critical Systems, part of the Federated Logic Conference, Trento, Italy, July 1999.

[Pal96] Daniel L. Palumbo. Fault-tolerant processing system. United States Patent 5,533,188, July 2, 1996.

[Pfe00] Holger Pfeifer. Formal verification of the TTA group membership algorithm. In Tommaso Bolognesi and Diego Latella, editors, Formal Description Techniques and Protocol Specification, Testing and Verification FORTE XIII/PSTV XX 2000, pages 3–18, Pisa, Italy, October 2000. Kluwer Academic Publishers.

[PSL80] M. Pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults. Journal of the ACM, 27(2):228–234, April 1980.

[PSvH99] Holger Pfeifer, Detlef Schwier, and Friedrich W. von Henke. Formal verification for time-triggered clock synchronization. In Charles B. Weinstock and John Rushby, editors, Dependable Computing for Critical Applications 7, volume 12 of Dependable Computing and Fault Tolerant Systems, pages 207–226, San Jose, CA, January 1999. IEEE Computer Society.

[RTC92] Requirements and Technical Concepts for Aviation, Washington, DC. DO-178B: Software Considerations in Airborne Systems and Equipment Certification, December 1992. This document is known as EUROCAE ED-12B in Europe.

[RTC00] Requirements and Technical Concepts for Aviation, Washington, DC. DO-254: Design Assurance Guidelines for Airborne Electronic Hardware, April 2000.

[Rus93a] John Rushby. A fault-masking and transient-recovery model for digital flight-control systems. In Jan Vytopil, editor, Formal Techniques in Real-Time and Fault-Tolerant Systems, Kluwer International Series in Engineering and Computer Science, chapter 5, pages 109–136. Kluwer, Boston, Dordrecht, London, 1993.

[Rus93b] John Rushby. Formal methods and digital systems validation for airborne systems. Technical Report SRI-CSL-93-7, Computer Science Laboratory, SRI International, Menlo Park, CA, December 1993. Also available as NASA Contractor Report 4551, December 1993.

[Rus94] John Rushby. A formally verified algorithm for clock synchronization under a hybrid fault model. In Thirteenth ACM Symposium on Principles of Distributed Computing, pages 304–313, Los Angeles, CA, August 1994. Association for Computing Machinery. Also available as NASA Contractor Report 198289.

[Rus95] John Rushby. Formal methods and their role in the certification of critical systems. In Roger Shaw, editor, Safety and Reliability of Software Based Systems (Twelfth Annual CSR Workshop), pages 1–42, Bruges, Belgium, September 1995. Springer. Also issued as part of the FAA Digital Systems Validation Handbook (the guide for aircraft certification).

[Rus96] John Rushby. Reconfiguration and transient recovery in state-machine architectures. In Fault Tolerant Computing Symposium 26, pages 6–15, Sendai, Japan, June 1996. IEEE Computer Society.

[Rus01] John Rushby. Formal verification of transmission window timing for the time-triggered architecture. Project report, Computer Science Laboratory, SRI International, Menlo Park, CA, March 2001. Available at http://www.csl.sri.com/rushby/papers/windowtiming.pdf.
[Sch87] Fred B. Schneider. Understanding protocols for Byzantine clock synchronization. Technical Report 87-859, Department of Computer Science, Cornell University, Ithaca, NY, August 1987.

[Sch90] Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.

[Sch93] Marco Schneider. Self stabilization. ACM Computing Surveys, 25(1):45–67, March 1993.

[Sch00] Ulrich Schmid. How to model link failures: A perception-based fault model. Technical Report 183/1-108, Technical University of Vienna, Department of Automation, October 2000. (To be presented at DSN 2001.)

[SD95] William Sweet and Dave Dooling. Boeing's seventh wonder. IEEE Spectrum, 32(10):20–23, October 1995.

[SG84] Alfred Spector and David Gifford. Case study: The space shuttle primary computer system. Communications of the ACM, 27(9):872–900, September 1984.

[ST87] T. K. Srikanth and Sam Toueg. Optimal clock synchronization. Journal of the ACM, 34(3):626–645, July 1987.

[SvH98] D. Schwier and F. von Henke. Mechanical verification of clock synchronization algorithms. In Formal Techniques in Real-Time and Fault-Tolerant Systems, volume 1486 of Lecture Notes in Computer Science, pages 262–271, Lyngby, Denmark, September 1998. Springer-Verlag.

[TP88] Philip Thambidurai and You-Keun Park. Interactive consistency with multiple failure modes. In 7th Symposium on Reliable Distributed Systems, pages 93–100, Columbus, OH, October 1988. IEEE Computer Society.

[TTT99] TTTech Computertechnik AG (Time-Triggered Technology), Vienna, Austria. Specification of the TTP/C Protocol, July 1999.

[WL88] J. Lundelius Welch and N. Lynch. A new fault-tolerant algorithm for clock synchronization. Information and Computation, 77(1):1–36, April 1988.

[WLG+78] John H. Wensley, Leslie Lamport, Jack Goldberg, Milton W. Green, Karl N. Levitt, P. M. Melliar-Smith, Robert E. Shostak, and Charles B. Weinstock. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proceedings of the IEEE, 66(10):1240–1255, October 1978.

[WLS97] Chris J. Walter, Patrick Lincoln, and Neeraj Suri. Formally verified online diagnosis. IEEE Transactions on Software Engineering, 23(11):684–721, November 1997.