Presented at EMSOFT 2001: First Workshop on Embedded Software, 8–10 October 2001, Lake Tahoe, CA. Springer-Verlag Lecture Notes in Computer Science, Vol. 2211, pp. 306–323. © Springer-Verlag

Bus Architectures For Safety-Critical Embedded Systems*

John Rushby
Computer Science Laboratory
SRI International
333 Ravenswood Avenue
Menlo Park, CA 94025, USA
rushby@csl.sri.com

* This research was supported by the DARPA MOBIES and NEST programs through USAF Rome Laboratory contracts F33615-00-C-1700 and F33615-01-C-1908, and by NASA Langley Research Center under contract NAS1-20334 and Cooperative Agreement NCC-1-377 with Honeywell Incorporated.

Abstract. Embedded systems for safety-critical applications often integrate multiple functions and must generally be fault-tolerant. These requirements lead to a need for mechanisms and services that provide protection against fault propagation and ease the construction of distributed fault-tolerant applications. A number of bus architectures have been developed to satisfy this need. This paper reviews the requirements on these architectures, the mechanisms employed, and the services provided. Four representative architectures (SAFEbus™, SPIDER, TTA, and FlexRay) are briefly described.

1 Introduction

Embedded systems generally operate as closed-loop control systems: they repeatedly sample sensors, calculate appropriate control responses, and send those responses to actuators. In safety-critical applications, such as fly- and drive-by-wire (where there are no direct connections between the pilot and the aircraft control surfaces, nor between the driver and the car steering and brakes), requirements for ultra-high reliability demand fault tolerance and extensive redundancy. The embedded system then becomes a distributed one, and the basic control loop is complicated by mechanisms for synchronization, voting, and redundancy management.

Systems used in safety-critical applications have traditionally been federated, meaning that each function (e.g., autopilot or autothrottle in an aircraft, and brakes or suspension in a car) has its own fault-tolerant embedded control system with only minor interconnections to the systems of other functions. This provides a strong barrier to fault propagation: because the systems supporting different functions do not share resources, the failure of one function has little effect on the continued operation of others. The federated approach is expensive, however (because each function has its own replicated system), so recent applications are moving toward more integrated solutions in which some resources are shared across different functions. The new danger here is that faults may propagate from one function to another; partitioning is the problem of restoring to integrated systems the strong defenses against fault propagation that are naturally present in federated systems. A dual issue is that of strong composability: here we would like to take separately developed functions and have them run without interference on an integrated system platform with negligible integration effort.

The problems of fault tolerance, partitioning, and strong composability are challenging ones. If handled in an ad-hoc manner, their mechanisms can become the primary sources of faults and of unreliability in the resulting architecture [10].
Fortunately, most aspects of these problems are independent of the particular functions concerned, and they can be handled in a principled and correct manner by generic mechanisms implemented as an architecture for distributed embedded systems. One of the essential services provided by this kind of architecture is communication of information from one distributed component to another, so a (physical or logical) communication bus is one of its principal components, and the protocols used for control and communication on the bus are among its principal mechanisms. Consequently, these architectures are often referred to as buses (or databuses), although this term understates their complexity, sophistication, and criticality. In truth, these architectures are the safety-critical core of the applications built above them, and the choice of services to provide to those applications, and the mechanisms of their implementation, are issues of major importance in the construction and certification of safety-critical embedded systems.

In this paper, I survey some of the issues in the design of bus architectures for safety-critical embedded systems. I hope this will prove useful to potential users of these architectures and will alert others to the benefits of building on such well-considered foundations. My presentation is derived from a review of four representative architectures: two of these were primarily designed for aircraft applications and two for automobiles. The economies of scale make the automobile buses quite inexpensive, which then renders them attractive in certain aircraft applications. The aircraft buses considered are the Honeywell SAFEbus [1, 7] (the backplane data bus used in the Boeing 777 Airplane Information Management System) and the NASA SPIDER [11] (an architecture being developed as a demonstrator for certification under the new DO-254 guidelines [15]); the automobile buses considered are the TTTech Time-Triggered Architecture (TTA) [8, 24], recently adopted by Audi and Volkswagen for automobile applications, and by Honeywell for avionics and aircraft control functions, and FlexRay [3], which is being developed by a consortium of BMW, DaimlerChrysler, Motorola, and Philips. A detailed comparison of these four architectures, along with more extended discussion of the issues, is available as a technical report [17].

The paper is organized as follows: Section 2 examines general issues in time-triggered systems and bus architectures, Section 3 examines the fault hypotheses under which they operate, and Section 4 describes the services that they provide; Section 5 briefly describes the four representative architectures, and conclusions are provided in Section 6.

[Figure 1. Bus Interconnect: four hosts, each attached through an interface to a shared bus.]

2 Time-Triggered Buses

The architectures considered here are called buses because multicast or broadcast communication is one of the services that they provide, and their implementations are based on a logical or physical bus. In a generic bus architecture, application programs run in host computers, and sensors and actuators are also connected to the hosts; an interconnect medium provides broadcast communications, and interface devices connect the hosts to the interconnect. The interfaces and interconnect comprise the bus; the combination of a host and its interface(s) is referred to as a node. Realizations of the interconnect may be a physical (passive) bus, as shown in Figure 1, or a centralized (active) hub, as shown in Figure 2.
The interfaces may be physically proximate to the hosts, or they may form part of a more complex central hub. Many of the components will be replicated for fault tolerance.

All four of the buses considered here are primarily time triggered; this is a fundamental design choice that influences many aspects of their architectures and mechanisms, and sets them apart from fundamentally event-triggered buses such as Byteflight, CAN, Ethernet, LonWorks, or Profibus. The time-triggered and event-triggered approaches to systems design find favor in different application areas, and each has strong advocates; for integrated, safety-critical systems, however, the time-triggered approach is generally preferred. Time triggered means that all activities involving the bus, and often those involving components attached to the bus, are driven by the passage of time (if it is 20 ms since the start of the frame, then read the sensor and broadcast its value); this is distinguished from event triggered, which means that activities are driven by the occurrence of events (if the sensor reading changes, then broadcast its new value). A prime contrast between these two approaches is their locus of control: a time-triggered system controls its own activity and interacts with the environment according to an internal schedule, whereas an event-triggered system is under the control of its environment and must respond to stimuli as they occur.

[Figure 2. Star Interconnect: four hosts, each attached through an interface to a central star hub.]

Event-triggered systems allow flexible allocation of resources, and this is attractive when demands are highly variable. However, in safety-critical applications it is necessary to guarantee some basic quality of service to all participants, even (or especially) in the presence of faults. Because the clients of the bus architecture are real-time embedded control systems, the required guarantees include predictable communications with low latency and low jitter (assured bandwidth is not enough). The problem with event-driven buses is that events arriving at different nodes may cause them to contend for access to the bus, so some form of media access control (i.e., a distributed mutual exclusion algorithm) is needed to ensure that each node eventually is able to transmit without interruption. The important issues are how predictable the access achieved by each node is, and how strong the assurance is that the predictions remain true in the presence of faults. Buses such as Ethernet resolve contention probabilistically and therefore can provide only probabilistic guarantees of timely access, and no assurance at all in the presence of faults. Buses for non-safety-critical embedded systems such as CAN, LonWorks, or Profibus use various priority, preassigned slot, or token schemes to resolve contention deterministically. In CAN, for example, the message with the lowest number always wins the arbitration and therefore has to wait only for the current message to finish, while other messages must also wait for any lower-numbered messages. Thus, although contention is resolved deterministically, latency increases with load and can be bounded with only probabilistic guarantees, and these can be quite weak in the presence of faults (e.g., the current message may be retransmitted in the case of transmission failure, thereby delaying the next message, even if this has higher priority).
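To illustrate this latency behavior, here is a toy model in Python, not the CAN protocol itself: the message identifiers and durations are invented, and retransmissions (which weaken the bound further) are ignored.

```python
# Toy model of lowest-number-wins arbitration (CAN-like). The highest-priority
# message waits only for the frame already on the wire; a low-priority message
# can be delayed by every higher-priority frame in the backlog, so its latency
# grows with load. IDs and durations (in ms) are invented for illustration.

def worst_case_wait(msg_id, backlog, longest_frame):
    """Upper bound on the wait before msg_id wins arbitration."""
    higher_priority = [dur for mid, dur in backlog if mid < msg_id]
    return longest_frame + sum(higher_priority)

backlog = [(1, 0.5), (3, 0.5), (7, 1.0), (9, 1.0)]  # (id, duration in ms)

print(worst_case_wait(1, backlog, longest_frame=1.0))  # -> 1.0 (one frame)
print(worst_case_wait(9, backlog, longest_frame=1.0))  # -> 3.0 (the backlog)
```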
Furthermore, faulty nodes may not adhere to expected patterns of use and may make excessive demands for service, thereby reducing that available to others. Event-triggered buses for safety-critical applications add various mechanisms to limit such demands. ARINC 629 (an avionics data bus used in the Boeing 777), for example, uses a technique sometimes referred to as minislotting that requires each node to wait a certain period after sending a message before it can contend to send another. Even here, however, latency is a function of load, so the Byteflight automobile protocol developed by BMW extends this mechanism with guaranteed, preallocated slots for critical messages. But even preallocated slots provide no protection against a faulty node that fails to recognize them. The worst manifestation of this kind of fault is the so-called babbling idiot failure, where a faulty node transmits constantly, thereby compromising the operation of the entire bus.

In a time-triggered bus, there is a static preallocation of communication bandwidth in the form of a global schedule: each node knows the schedule and knows the time, and therefore knows when it is allowed to send messages, and when it should expect to receive them. Thus, contention is resolved at design time (as the schedule is constructed), when all its consequences can be examined, rather than at run time. A static schedule makes it possible to control the babbling idiot failure mode. This is achieved by interposing an independent component, called a bus guardian, that permits each node to transmit on the bus only when it is entitled to do so. The guardian must know when its node is allowed to access the bus, which is difficult to achieve in an event-triggered system but is conceptually simple in a time-triggered system: the guardian has an independent clock and independent knowledge of the schedule and allows its node to broadcast only when indicated by the schedule.

Because all communication is triggered by the global schedule, there is no need to attach source or destination addresses to messages sent over a time-triggered bus: each node knows the sender and intended recipients of each message by virtue of the time at which it is sent. Elimination of the address fields not only reduces the size of each message, thereby greatly increasing the message bandwidth of the bus (messages are typically short in embedded control applications), but it also eliminates a potential source of serious faults: namely, the possibility that a faulty node may send messages to the wrong recipients or, worse, may masquerade as a sender other than itself.

Fault-tolerant clock synchronization is a fundamental requirement for a time-triggered bus architecture: the abstraction of a global clock is realized by each node having a local clock that is closely synchronized with the clocks of all other nodes. Tightness of the bus schedule, and hence the throughput of the bus, is strongly related to the quality of global clock synchronization that can be achieved, and this is related to the quality of the clock oscillators local to each node, and to the algorithm used to synchronize them. There are two basic classes of algorithm for clock synchronization: those based on averaging and those based on events. Averaging works by each node measuring the skew between its clock and that of each other node (e.g., by comparing the arrival time of each message with its expected value) and then setting its clock to some average value.
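Before describing the averaging functions in more detail, here is a minimal sketch in Python of the fault-tolerant midpoint that features below (the skew readings and the fault bound t are hypothetical): discard the t largest and t smallest readings, then take the midpoint of the surviving extremes.

```python
# Minimal sketch of the fault-tolerant midpoint: sort the clock-skew readings,
# discard the t largest and t smallest, and return the midpoint of the extremes
# that survive. With enough clocks (the Welch-Lynch analysis assumes n >= 3t+1),
# up to t arbitrary readings cannot drag the result outside the range spanned
# by the nonfaulty clocks. Readings (in microseconds) and t are hypothetical.

def fault_tolerant_midpoint(readings, t):
    assert len(readings) > 2 * t, "need more than 2t readings"
    surviving = sorted(readings)[t:len(readings) - t]
    return (surviving[0] + surviving[-1]) / 2

# Five peers; one clock is wildly faulty (900), but with t = 1 it is discarded.
print(fault_tolerant_midpoint([-12, -3, 0, 5, 900], t=1))  # -> 1.0
```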
A simple average (e.g., the mean or median) over all clocks may be affected by wild readings from faulty clocks (which might provide different, or missing, readings to different observers), so we need a fault-tolerant average that is largely insensitive to a certain number of readings from faulty clocks. Schneider [18] gives a general description that applies to all averaging clock synchronization algorithms; these algorithms differ only in their choice of fault-tolerant average. The Welch-Lynch algorithm [25] is a popular choice that is characterized by use of the fault-tolerant midpoint (sketched above) as its averaging function. Event-based algorithms rely on nodes being able to sense events directly on the interconnect: each node broadcasts a ready event when it is time to synchronize and sets its clock when it has seen a certain number of events from other nodes. Depending on the fault model, additional waves of echo or accept events may be needed to make this fault tolerant. The number of faulty nodes that can be tolerated, and the quality of synchronization that can be achieved, depend on the details of the algorithm, and on the fault hypothesis under which it operates. The event-based algorithm of Srikanth and Toueg [21] is particularly attractive because it achieves optimal accuracy.

3 Fault Hypotheses and Fault Containment Units

Safety-critical aerospace functions are generally required to have failure rates less than 10⁻⁹ per hour [5], and an architecture that is intended to support several such functions should provide assurance of failure rates better than 10⁻¹⁰ per hour. Similar requirements apply to cars (although higher rates of loss are accepted for individual cars than for aircraft, there are vastly more of them, so the required failure rates are similar). Consumer-grade electronics devices have failure rates many orders of magnitude worse than this, so redundancy and fault tolerance are essential elements of a bus architecture. Redundancy may include replication of the entire bus, of the interconnect and/or the interfaces, or decomposition of those elements into smaller subcomponents that are then replicated.

Fault tolerance takes two forms in these architectures: first, that which ensures that the bus itself does not fail; second, that which eases the construction of fault-tolerant applications. Each of these mechanisms must be constructed and validated against an explicit fault hypothesis, and must deliver specified services (which may be specified to degrade in acceptable ways in the presence of faults). The fault hypothesis must describe the modes (i.e., kinds) of faults that are to be tolerated, and their maximum number and arrival rate. The fault hypothesis must also identify the different fault containment units (FCUs) in the design: these are the components that can independently be afflicted by faults. The division of an architecture into separate FCUs needs careful justification: there must be no propagation of faults from one FCU to another, and no common mode failures where a single physical event produces faults in multiple FCUs. Only physical faults (those caused by damage to, defects in, or aging of the devices employed, or by external disturbances such as cosmic rays and electromagnetic interference) are considered in this analysis: design faults must be excluded, and must be shown to be so by stringent assurance and certification processes. The assumption that failures of separate FCUs are independent must be ensured by careful design and assured by stringent analysis.
True independence generally requires that different FCUs are served by different power supplies, and are physically and electrically isolated from each other. Providing this level of independence is expensive, and it is generally undertaken only in aircraft applications. In cars, it is common to make some small compromises on independence: for example, the guardians may be fabricated on the same chip as the interface (but with their own clock oscillators), or the interface may be fabricated on the same chip as the host processor. It is necessary to examine these compromises carefully to ensure that the loss in independence applies only to fault modes that are benign, extremely rare, or tolerated by other mechanisms.

A fault mode describes the kind of behavior that a faulty FCU may exhibit. The same fault may exhibit different modes at different levels of a protocol hierarchy: for example, at the electrical level, the fault mode of a faulty line driver may be that it sends an intermediate voltage (one that is neither a digital 0 nor a digital 1), while at the message level the mode of the same fault may be Byzantine, meaning that different receivers interpret the same message in different ways (because some see the intermediate voltage as a 0, and others as a 1). Some protocols can tolerate Byzantine faults, others cannot; for those that cannot, we must show that the fault mode is controlled at the underlying electrical level.

The basic dimensions that a fault can affect are value, time, and space. A value fault is one that causes an incorrect value to be computed, transmitted, or received (whether as a physical voltage, a logical message, or some other representation); a timing fault is one that causes a value to be computed, transmitted, or received at the wrong time (whether too early, too late, or not at all); a spatial proximity fault is one where all matter in some specified volume is destroyed (potentially afflicting multiple FCUs). Bus-based interconnects of the kind shown in Figure 1 are vulnerable to spatial proximity faults: all redundant buses necessarily come into close proximity at each node, and general destruction in that space could sever or disrupt them all. Interconnect topologies with a central hub are far more resilient in this regard: a spatial proximity fault that destroys one or more nodes does not disrupt communication among the others (the hub may need to isolate the lines to the destroyed nodes in case these are shorted), and destruction of a hub can be tolerated if there is a duplicate in another location.

There are many ways to classify the effects of faults in any of the basic dimensions. One classification that has proved particularly effective in analysis of the types of algorithms that underlie the architectures considered here is the hybrid fault model of Thambidurai and Park [23]. In this classification, the effect of a fault may be manifest, meaning that it is reliably detected (e.g., a fault that causes an FCU to cease transmitting messages); symmetric, meaning that whatever the effect, it is the same for all observers (e.g., an off-by-1 error); or arbitrary, meaning that it is entirely unconstrained. In particular, an arbitrary fault may be asymmetric or Byzantine, meaning that its effect is perceived differently by different observers (as in the intermediate voltage example).
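To make the intermediate-voltage example concrete, here is a minimal sketch (the voltages and receiver thresholds below are invented for illustration) of how a single transmission can be latched differently by receivers whose logic thresholds differ slightly:

```python
# Illustration of an SOS (slightly-out-of-specification) value fault becoming
# asymmetric: a faulty driver emits an intermediate voltage, and two receivers
# whose logic thresholds differ slightly (both within a plausible tolerance)
# latch different bits. All voltages and thresholds are invented.

def latch(line_voltage, threshold):
    """Each receiver digitizes the line against its own threshold."""
    return 1 if line_voltage >= threshold else 0

intermediate = 1.6                           # neither a clean 0 V nor a clean 3.3 V
thresholds = {"receiver A": 1.4, "receiver B": 1.7}

for name, threshold in thresholds.items():
    print(name, "sees", latch(intermediate, threshold))
# receiver A sees 1, receiver B sees 0: the same transmission is perceived
# differently by different observers, i.e., Byzantine behavior at the message level.
```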
The great advantage of designs that can tolerate arbitrary fault modes is that we do not have to justify assumptions about more specific fault modes: a system is shown to tolerate (say) two arbitrary faults by proving that it works in the presence of two faulty FCUs with no assumptions whatsoever on the behavior of the faulty components. A system that can tolerate only specific fault modes may fail if confronted by a different fault mode, so it is necessary to provide assurance that such modes cannot occur. It is this absence of assumptions that is the attraction, in safety-critical contexts, of systems that can tolerate arbitrary faults. This point is often misunderstood, and such systems are derided as being focused on asymmetric or Byzantine faults that supposedly never arise in practice. Byzantine faults are just one manifestation of arbitrary behavior, and cannot simply be asserted not to occur (in fact, they have been observed in several systems that have been monitored sufficiently closely). One situation that is likely to provoke asymmetric manifestations is a slightly out of specification (SOS) fault, such as the intermediate electrical voltage mentioned earlier. SOS faults in the timing dimension include those that put a signal edge very close to a clock edge, or that have signals with very slow rise and fall times (i.e., weak edges). Depending on the timing of their own clock edges, some receivers may recognize and latch such a signal, others may not, resulting in asymmetric or Byzantine behavior.

FCUs may be active (e.g., a processor) or passive (e.g., a bus); while an arbitrary-faulty active component can do anything, a passive component may change, lose, or delay data, but it cannot spontaneously create a new datum. Keyed checksums or digital signatures can sometimes be used to reduce the fault modes of an active FCU to those of a passive one. (An arbitrary-faulty active FCU can always create its own messages, but it cannot create messages purporting to come from another FCU if it does not know the key of that FCU; signatures need to be managed carefully for this reduction in fault mode to be credible.)

Any fault-tolerant architecture will fail if subjected to too many faults; generally speaking, it requires more redundancy to tolerate an arbitrary fault than a symmetric one, which in turn requires more redundancy than a manifest fault. The most effective fault-tolerant algorithms make this tradeoff between the number and difficulty of faults tolerated automatically. For example, the clock synchronization algorithm of [16] can tolerate a arbitrary faults, s symmetric ones, and m manifest ones simultaneously, provided n, the number of FCUs, satisfies n > 3a + 2s + m. It is provably impossible (i.e., it can be proven that no algorithm can exist) to tolerate a arbitrary faults in clock synchronization with fewer than 3a + 1 FCUs (unless digital signatures are employed, which is equivalent to reducing the severity of the arbitrary fault mode).

Because it is algorithmically much easier to tolerate simple failure modes, some architectures (e.g., SAFEbus) arrange FCUs (the Bus Interface Units in the case of SAFEbus) in self-checking pairs: if the members of a pair disagree, they go offline, ensuring that the effect of their failure is seen as a manifest fault (i.e., one that is easily tolerated). Most architectures also employ substantial self-checking in each FCU; any FCU that detects a fault will shut down, thereby ensuring that its failure will be manifest. (This kind of operation is often called fail silence.)
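A minimal sketch of the self-checking-pair idea follows (the control computation and the injected fault are stand-ins, not any particular architecture's logic): both halves compute independently, and the pair falls silent on any disagreement, converting an internal fault into a manifest one.

```python
# Minimal sketch of a self-checking pair: both halves compute independently and
# the pair transmits only when their outputs agree; any disagreement silences
# the pair, so an arbitrary internal fault is converted into a manifest
# (fail-silent) one that is easy to tolerate. The control law is a stand-in.

def self_checking_pair(half_a, half_b, inputs):
    """Return the agreed output, or None (silence) if the halves disagree."""
    out_a, out_b = half_a(inputs), half_b(inputs)
    return out_a if out_a == out_b else None   # disagreement -> go offline

control_law = lambda x: 2 * x + 1    # hypothetical control computation
corrupted   = lambda x: 2 * x + 3    # the same computation with a fault

print(self_checking_pair(control_law, control_law, 5))  # -> 11 (healthy pair)
print(self_checking_pair(control_law, corrupted, 5))    # -> None (fail silent)
```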
Even with extensive self-checking and pairwise checking, it may be possible for some fault modes to escape, so it is generally necessary either to show that the mechanisms used have complete coverage (i.e., there will be no violation of fail silence), or to design the architecture so that it can tolerate the escape of at least one arbitrary fault. Some architectures can tolerate only a single fault at a time, but can reconfigure to exclude faulty FCUs and are then able to tolerate additional faults. In such cases, the fault arrival rate is important: faults must not arrive faster than the architecture can reconfigure. The architectures considered here operate according to static schedules, which consist of rounds or frames that are executed repeatedly in a cyclic fashion. The acceptable fault arrival rate is often then expressed in terms of faults per round (or the inverse). It is usually important that every node is scheduled to make at least one broadcast in every round, since this is how fault status is indicated (and hence how reconfiguration is triggered).

Historical experience and analysis must be used to show that the hypothesized modes, numbers, and arrival rate are realistic, and that the architecture can indeed operate correctly under those hypotheses for its intended mission time. But sometimes things go wrong: the system may experience many simultaneous faults (e.g., from unanticipated high-intensity radiated fields (HIRF)), or other violations of its fault hypothesis. We cannot guarantee correct operation in such cases (otherwise our fault hypothesis was too conservative), but safety-critical systems generally are constructed to a "never give up" philosophy and will attempt to continue operation in a degraded mode. The usual method of operation in never-give-up mode is that each node reverts to local control of its own actuators using the best information available (e.g., each brake node applies braking force proportional to pedal pressure if it is still receiving that input, and removes all braking force if not), while at the same time attempting to regain coordination with its peers. Although it is difficult to provide assurance of correct operation during these upsets, it may be possible to provide assurance that the system returns to normal operation once the faults cease (assuming they were transients) using the ideas of self-stabilization [20].

Restart during operation may be necessary if HIRF or other environmental influences lead to violation of the fault hypothesis and cause a complete failure of the bus. Notice that this failure must be detected by the bus, and the restart must be automatic and very fast: most control systems can tolerate loss of control inputs for only a few cycles; longer outages will lead to loss of control. For example, Heiner and Thurner estimate that the maximum transient outage time for a steer-by-wire automobile application is 50 ms [6]. Restart is usually initiated when an interface detects no activity on any bus line for some interval; that interface will then transmit some "wake up" message on all lines. Of course, it is possible that the interface in question is faulty (and there was bus activity all along but that interface did not detect it), or that two interfaces decide simultaneously to send the wake-up call.
The first possibility must be avoided by careful checking, preferably by independent units (e.g., both interfaces of a pair, or an interface and its guardian); the second requires some form of collision resolution: this should be deterministic, to guarantee an upper bound on the time to reach resolution (so that a single interface can send an uninterrupted wake-up message), and, ideally, should not depend on collision detection (because this cannot be done reliably). Notice that it must be possible to perform startup and restart reliably even in the presence of faulty components.

4 Services

The essential basic purpose of these bus architectures is to make it possible to build reliable distributed applications; a desirable purpose is to make it straightforward to build such applications. The basic services provided by the bus architectures considered here comprise clock synchronization, time-triggered activation, and reliable message delivery. Some of the architectures provide additional services; their purpose is to assist straightforward construction of reliable distributed applications by providing these services in an application-independent manner, thereby relieving the applications of the need to implement these capabilities themselves. Not only does this simplify the construction of application software, it is sometimes possible to provide better services when these are implemented at the architecture level, and it is also possible to provide strong assurance that they are implemented correctly.

Applications that perform safety-critical functions must generally be replicated for fault tolerance. There are many ways to organize fault-tolerant replicated computations, but a basic distinction is between those that use exact agreement and those that use approximate agreement. Systems that use approximate agreement generally run several copies of the application in different nodes, each using its own sensors, with little coordination across the different nodes. The motivation for this is a folk belief that it promotes fault tolerance: coordination is believed to introduce the potential for common mode failures. Because different sensors cannot be expected to deliver exactly the same readings, the outputs (i.e., actuator commands) computed in the different nodes will also differ. Thus, the only way to detect faulty outputs is by looking for values that differ markedly from the others. Hence, these systems use some form of selection or threshold voting to select a good value to send to the actuators, and similar techniques to identify faulty nodes that should be excluded from the configuration. A difficulty for applications of the kind considered here is that hosts accumulate state that diverges from that of others over time (e.g., velocity and position as a result of integrating acceleration), and they execute mode switches that are discrete decisions based on local sensor values (e.g., change the gain schedule in the control laws if the altitude, or temperature, is above a specific value). Thus, small differences in sensor readings can lead to major differences in outputs, and this can mislead the approximate selection or voting mechanisms into choosing a faulty value, or excluding a nonfaulty node. The fix to these problems is to attempt to coordinate discrete mode switches and periodically to bring state data into convergence.
But these fixes are highly application specific, and they are contrary to the original philosophy that motivated the choice of approximate agreement; hence, there is a good chance of doing them wrong. There are numerous examples that justify this concern; several that were discovered in flight tests are documented by Mackall and colleagues [10]. The essential point of Mackall's data is that all the failures observed in flight test were due to bugs in the design of the fault tolerance mechanisms themselves, and all these bugs could be traced to difficulties in organizing and coordinating systems based on approximate agreement.

Systems based on exact agreement face up to the fact that coordination among replicated computations is necessary, and they take the necessary steps to do it right. If we are to use exact agreement, then every replica must perform the same computation on the same data: any disagreement on the outputs then indicates a fault; comparison can be used to detect those faults, and majority voting to mask them. A vital element in this approach to fault tolerance is that replicated components must work on the same data: thus, if one node reads a sensor, it must distribute that reading to all the redundant copies of the application running in other nodes. Now, a fault in that distribution mechanism could result in one node getting one value and another a different one (or no value at all). This would abrogate the requirement that all replicas obtain identical inputs, so we need to employ mechanisms to overcome this behavior.

The problem of distributing data consistently in the presence of faults is variously called interactive consistency, consensus, atomic broadcast, or Byzantine agreement [9, 12]. When a node transmits a message to several receivers, interactive consistency requires the following two properties to hold.

Agreement: All nonfaulty receivers obtain the same message (even if the transmitting node is faulty).

Validity: If the transmitter is nonfaulty, then nonfaulty receivers obtain the message actually sent.

Algorithms for achieving these requirements in the presence of arbitrary faults necessarily involve more than a single data exchange (basically, each receiver must compare the value it received against those received by others). It is provably impossible to achieve interactive consistency in the presence of a arbitrary faults unless there are at least 3a + 1 FCUs, 2a + 1 disjoint communication paths between them, and a + 1 levels (or rounds) of communication. The number of FCUs and the number of disjoint paths required, but not the number of rounds, can be reduced by using digital signatures. The problem might seem moot in architectures that employ a physical bus, since a bus surely cannot deliver values inconsistently (so the agreement property is achieved trivially). Unfortunately, it can, though this is likely to be a very rare event. The scenarios involving SOS faults presented earlier exemplify some possibilities. Dealing properly with very rare events is one of the attributes that distinguishes a design that is fit for safety-critical systems from one that is not. It follows that either the application software must perform interactive consistency for itself (incurring the cost of n² messages to establish consistency across n nodes in the presence of a single arbitrary fault), or the bus architecture must do it. The first choice is so unattractive that it vitiates the whole purpose of a fault-tolerant bus architecture.
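To show the shape of this exchange, here is a minimal sketch of the single-relay-round, majority-vote structure of the oral-messages algorithm [9] for one arbitrary fault (the node names and values are invented, and timeouts and missing messages are ignored):

```python
# Sketch of the n-squared exchange behind interactive consistency for a single
# arbitrary fault, after the oral-messages algorithm of Lamport, Shostak, and
# Pease [9]: the transmitter sends to every receiver, each receiver relays what
# it got to the others, and each receiver majority-votes over all it has seen.
# This needs at least 4 FCUs and 2 rounds, matching the 3a+1 and a+1 bounds.

from collections import Counter

def vote(values):
    return Counter(values).most_common(1)[0][0]

def interactive_consistency(direct, relayed):
    """direct[r]: value receiver r got from the transmitter;
    relayed[r][s]: value receiver s claims it got, as reported to r."""
    return {r: vote([direct[r]] + list(relayed[r].values())) for r in direct}

# A faulty transmitter sends 1 to r1 and r2 but 0 to r3; the receivers are
# nonfaulty and relay honestly.
direct = {"r1": 1, "r2": 1, "r3": 0}
relayed = {"r1": {"r2": 1, "r3": 0},
           "r2": {"r1": 1, "r3": 0},
           "r3": {"r1": 1, "r2": 1}}

print(interactive_consistency(direct, relayed))
# -> {'r1': 1, 'r2': 1, 'r3': 1}: all nonfaulty receivers agree (Agreement);
#    had the transmitter been nonfaulty, the vote would return its value (Validity).
```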
Most bus architectures therefore provide some type of interactively consistent message broadcast as a basic service. In addition, most architectures take steps to reduce the incidence of asymmetric transmissions (i.e., those that appear as one value to some receivers, and as different values, or the absence of values, to others). As noted, SOS faults are among the most plausible sources of asymmetric transmissions. SOS faults that cause asymmetric transmissions can arise in either the value or time domains (e.g., intermediate voltages, or weak edges, respectively). In those architectures that employ a bus guardian in a central hub or in series with each interface, the bus guardians are a possible point of intervention for the control of SOS faults: a suitable guardian can reshape, in both value and time domains, the signal sent to it by the controller. Of course, the guardian could be faulty and may make matters worse, so this approach makes sense only when there are independent guardians on each of two (or more) replicated interconnects. Observe that for credible signal reshaping, the guardian must have a power supply that is independent of that of the controller (faults in power supply are the most likely cause of intermediate voltages and weak edges).

Interactively consistent message broadcast provides the foundation for fault tolerance based on exact agreement. There are several ways to use this foundation. One arrangement, confusingly called the state machine approach [19], is based on majority voting: application replicas run on a number of different nodes, exchange their output values, and deliver a majority vote to the actuators. Another arrangement is based on self-checking (either by individuals or pairs) so that faults result in fail silence. This will be detected by other nodes, and some backup application running in those other nodes can take over. The architecture can assist this master/shadow arrangement by providing services that support the rollover from one node to another. One such service automatically substitutes a backup node for a failed master (both the master and the backup occupy the same slot in the schedule, but the backup is inhibited from transmitting unless the master has failed). A variant has both master and backup operating in different slots, but the backup inhibits itself unless it is informed that the master has failed. A further variation, called compensation, applies when different nodes have access to different actuators: none is a direct backup to any other, but each changes its operation when informed that others have failed (an example is car braking: separate nodes controlling the braking force at each wheel will redistribute the force when informed that one of their number has failed).

The variations on master/shadow described above all depend on a failure notification or, equivalently, a membership service. The crucial requirement on such a service is that it must produce consistent knowledge: that is, if one nonfaulty node thinks that a particular node has failed, then all other nonfaulty nodes must hold the same opinion; otherwise, the system will lose coordination, with potentially catastrophic results (e.g., if the nodes controlling braking at different wheels make different adjustments to their braking force based on different assessments of which others have failed).
Notice that this must also apply to a node's knowledge of its own status: a naïve view might assume that a node that is receiving messages and seeing no problems in its own operation should assume it is in the membership. But if this node is unable to transmit, all other nodes will have removed it from their memberships and will be making suitable compensation on the assumption that this node has entered its blackout mode (and is, for example, applying no force to its brake). It could be catastrophic if this node does not adopt the consensus view and continues operation (e.g., applying force to its brake) based on its local assessment of its own health.

A membership service operates as follows. Each node maintains a private membership list, which is intended to comprise all and only the nonfaulty nodes. Since it can take a while to diagnose a faulty node, we have to allow the common membership to contain at most one faulty node. Thus, a membership service must satisfy the following two requirements.

Agreement: The membership lists of all nonfaulty nodes are the same.

Validity: The membership lists of all nonfaulty nodes contain all nonfaulty nodes and at most one faulty node.

These requirements can be achieved only under benign fault hypotheses (it is provably impossible to diagnose an arbitrary-faulty node with certainty). When unable to maintain accurate membership, the best recourse is to maintain agreement but sacrifice validity (nonfaulty nodes that are not in the membership can then attempt to rejoin). This weakened requirement is called clique avoidance [2]. Note that it is quite simple to achieve consistent membership on top of an interactively consistent message service: each node broadcasts its own membership list to every other node, and each node runs a deterministic resolution algorithm on the (identical, by interactive consistency) lists received. Conversely, a membership and clique avoidance service can assist the construction of an interactively consistent message service: simply exclude from the membership any node that receives a message different from the majority (TTA does this).
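A minimal sketch of the first construction follows (the membership lists and the majority resolution rule are illustrative choices, not the TTP/C algorithm):

```python
# Sketch of consistent membership built on interactively consistent broadcast:
# every node broadcasts its private membership list, interactive consistency
# guarantees that all nonfaulty nodes receive the identical collection of
# lists, and each then applies the same deterministic resolution rule (here,
# majority occurrence, an illustrative choice). The lists are invented.

def resolve_membership(lists):
    """Apply a deterministic rule to the identical lists every node holds."""
    candidates = set().union(*lists)
    quorum = len(lists) // 2 + 1
    return sorted(n for n in candidates
                  if sum(n in mlist for mlist in lists) >= quorum)

# Node 4 can no longer transmit: nodes 1-3 have dropped it, while node 4
# still believes itself healthy.
lists = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 3, 4}]
print(resolve_membership(lists))  # -> [1, 2, 3] at every node, including node 4
```

Because node 4 computes the same result as everyone else, it learns that it must adopt the consensus view of its own status rather than its local self-assessment.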
5 Practical Implementations

Here, we provide sketches of four bus architectures that provide concrete solutions to the requirements and design challenges outlined in the previous sections. More details are available in a companion report to this paper [17]. All four buses support the time-triggered model of computation, employ fault-tolerant distributed clock synchronization, and use bus guardians or some equivalent mechanism to protect against babbling idiot failure modes. They differ in their fault hypotheses, the mechanisms employed, the services provided, and in their assurance, performance, and cost.

SAFEbus. Honeywell developed SAFEbus™ (the principal designers are Kevin Driscoll and Ken Hoyme [7]) to serve as the core of the Boeing 777 Airplane Information Management System (AIMS) [22], which supports several critical functions, such as cockpit displays and airplane data gateways. The bus has been standardized as ARINC 659 [1], and variations on Honeywell's implementation are being used or considered for other avionics and space applications. It uses a bus interconnect similar to that shown in Figure 1; the interfaces (they are called Bus Interface Units, or BIUs) are duplicated, and the interconnect bus is quad-redundant. Most of the functionality of SAFEbus is implemented in the BIUs, which perform clock synchronization and message scheduling and transmission functions. Each BIU of a pair is a separate FCU and acts as its partner's bus guardian by controlling its access to the interconnect. Each BIU of a pair drives a different pair of interconnect buses but is able to read all four; the interconnect buses themselves each comprise two data lines and one clock line and operate at 30 MHz. The bus lines and their drivers have the electrical characteristics of OR gates (i.e., if several different BIUs drive the same line at the same time, the resulting signal is the OR of the separate inputs). Some of the protocols exploit this property; in particular, clock synchronization is achieved using an event-based algorithm.

The paired BIUs at sender and receiver, and the quad-redundant buses, provide sufficient redundancy for SAFEbus to provide interactively consistent message broadcasts (in the Honeywell implementation) using an approach similar to that described by Davies and Wakerly [4] (this remarkably prescient paper anticipated many of the issues and solutions in Byzantine fault tolerance by several years). It also supports application-level fault tolerance (based on self-checking pairs) by providing automatic rapid rollover from masters to shadows. Its fault hypothesis includes arbitrary faults, faults in several nodes (but only one per node), and a high rate of fault arrivals. It never gives up and has a well-defined restart and recovery strategy for fault arrivals that exceed its fault hypothesis. It tolerates spatial proximity faults in the AIMS application by duplicating the entire system. SAFEbus is certified for use in passenger aircraft and has extensive field experience in the Boeing 777. The Honeywell implementation is supported by an in-house tool chain. SAFEbus is the most mature of the four buses considered, and makes the fewest compromises. But because each of its major components is paired (and its bus requires separate lines for clock and data), it is the most expensive of those available for commercial use (typically, a few hundred dollars per node).

TTA. The Time-Triggered Architecture (TTA) was developed by Hermann Kopetz and colleagues at the Technical University of Vienna [8]. Commercial development of the architecture is undertaken by TTTech, and it is being deployed for safety-critical applications in cars by Audi and Volkswagen, and for flight-critical functions in aircraft and aircraft engines by Honeywell. Current implementations of TTA use a bus interconnect similar to that shown in Figure 1. The interfaces (they are called controllers) implement the TTP/C protocol [24] that is at the heart of TTA, providing clock synchronization, and message sequencing and transmission functions. The interconnect bus is duplicated, and each controller drives both of them through partially independent bus guardians. TTA uses an averaging clock synchronization algorithm based on that of Lundelius and Lynch [25]. This algorithm is implemented in the controllers, but requires too many resources to be replicated in the bus guardians. The guardians, which have independent clocks, therefore rely on their controllers for a start-of-frame signal. This compromises their independence somewhat (they also share the power supply and some other resources with their controllers), so forthcoming implementations of TTA use a star interconnect similar to that shown in Figure 2. Here, the guardian functionality is implemented in the central hub, which is fully independent of the controllers: the hubs and controllers comprise separate FCUs in this implementation.
Hubs are duplicated for fault tolerance and located apart to withstand spatial proximity faults. They also perform signal reshaping to reduce the incidence of SOS faults. TTA employs algorithms for group membership and clique avoidance [2]; these enable its clock synchronization algorithm to tolerate multiple faults (by reconfiguring to exclude faulty members) and combine with its use of checksums (which can be considered as digital signatures) to provide a form of interactively consistent message broadcasts. The membership service supports application-level fault tolerance based on master/backup or compensation. Proposed extensions provide state machine replication in a manner that is transparent to applications.

The fault hypothesis of TTA includes arbitrary faults, and faults in several nodes (but only one per node), provided these arrive at least two rounds apart (this allows the membership algorithm to exclude the faulty node). It never gives up and has a well-defined restart and recovery strategy for fault arrivals that exceed this hypothesis. The prototype implementations of TTA have been subjected to extensive testing and fault injection, and deployed in experimental vehicles. Several of its algorithms have been formally verified [13, 14], and aircraft applications under development are planned to lead to FAA certification. It is supported by an extensive tool suite that interfaces to standard CAD environments (e.g., Matlab/Simulink and Beacon). Current implementations provide 25 Mbit/s data rates; research projects are designing implementations for gigabit rates. TTA controllers and the star hub (which is basically a modified controller) are quite simple and cheap to produce in volume. Of the architectures considered here, TTA is unique in being used for both automobile applications, where volume manufacture leads to very low prices, and aircraft, where a mature tradition of design and certification for flight-critical electronics provides strong scrutiny of arguments for safety.

SPIDER. A Scalable Processor-Independent Design for Electromagnetic Resilience (SPIDER) is being developed by Paul Miner and colleagues at the NASA Langley Research Center as a research platform to explore recovery strategies for radiation-induced (HIRF/EMI) faults, and to serve as a case study to exercise the recent design assurance guidelines for airborne electronic hardware (DO-254) [15]. SPIDER uses a star configuration similar to that shown in Figure 2, in which the interfaces (called BIUs) may be located either with their hosts or in the centralized hub, which also contains active elements called Redundancy Management Units, or RMUs. Clock synchronization and other services of SPIDER are achieved by novel distributed algorithms executed among the BIUs and RMUs [11]. The services provided include interactively consistent message broadcasts, and identification of failed nodes (from which a membership service can easily be synthesized). SPIDER's fault hypothesis uses a hybrid fault model, which includes arbitrary faults, and allows some combinations of multiple faults. Its algorithms are novel and highly efficient and are being formally verified. SPIDER is an interesting design that uses a different topology and a different class of algorithms from the other buses considered here. However, it is a research project whose design and implementation are still in progress, so it cannot be compared directly with the commercial products.
FlexRay. A consortium including BMW, DaimlerChrysler, Motorola, and Philips is developing FlexRay for powertrain and chassis control in cars. It differs from the other buses considered here in that its operation is divided between time-triggered and event-triggered activities. Published descriptions of the FlexRay protocols and implementation are sketchy at present [3] (see also the Web site www.flexray-group.com). FlexRay can use either an active star interconnect similar to that shown in Figure 2, or a passive bus similar to that shown in Figure 1. In both cases, duplication of the interconnect is optional. The star configuration of FlexRay (and also that of TTA) can also be deployed in distributed configurations where subsystems are connected by hub-to-hub links. Each FlexRay interface (it is called a communication controller) drives the lines to its interconnects through separate bus guardians located with the interface. (This means that with two buses, each node has three clocks: one for the controller and one for each of the two guardians; this differs from the bus configuration of TTA, where there is one clock for the controller and both guardians share a second clock.) Like the bus configuration of TTA, the guardians of FlexRay are not fully independent of their controllers.

FlexRay aims to be more flexible than the other buses considered here, and this seems to be reflected in the choice of its name. As noted, one manifestation of this flexibility is its combination of time- and event-triggered operation. FlexRay partitions each time cycle into a static time-triggered portion and a dynamic event-triggered portion. The division between the two portions is set at design time and loaded into the controllers and bus guardians. Communication during the event-driven portion of the cycle uses the Byteflight protocol. Unlike SAFEbus and TTA, FlexRay does not install the full schedule for the time-triggered portion in each controller. Instead, this portion of the cycle is divided into a number of slots of fixed size, and each controller and its bus guardians are informed only of those slots allocated to their transmissions (nodes requiring greater bandwidth are assigned more slots than those that require less). Controllers learn the full schedule only when the bus starts up. Each node includes its identity in the messages that it sends; during startup, nodes use these identifiers to label their input buffers as the schedule reveals itself (e.g., if the messages that arrive in slots 1 and 7 carry identifier 3, then all nodes will thereafter deliver the contents of buffers 1 and 7 to the task that deals with input from node 3). There appears to be a vulnerability here: a faulty node could masquerade as another (i.e., send a message with the wrong identifier) during startup and thereby violate partitioning for the remainder of the mission. It is not clear how this fault mode is countered.

Like TTA, FlexRay uses the Lundelius-Welch clock synchronization algorithm but, unlike TTA, it does not use a membership algorithm to exclude faulty nodes. FlexRay provides no services to its applications beyond best-efforts message delivery; in particular, it does not provide interactively consistent message broadcasts. This means that all mechanisms for fault-tolerant applications must be provided by the application programs themselves.
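As a hypothetical sketch of the cycle structure just described (the slot counts, slot size, and assignment below are invented for illustration):

```python
# Hypothetical sketch of a FlexRay-style cycle: a static segment of fixed-size,
# preassigned slots followed by a dynamic, Byteflight-style segment. Slot
# counts, sizes, and the assignment are invented; each controller (and its
# guardians) would be configured only with its own rows of this table.

STATIC_SLOTS = 8      # fixed-size slots in the static segment
SLOT_LEN_US = 50      # invented slot duration, in microseconds

slot_assignment = {1: 3, 2: 3, 3: 1, 4: 1}   # node -> number of static slots

def build_static_schedule(assignment, n_slots):
    """Lay out each node's slots as contiguous blocks; in FlexRay the full
    layout becomes known to all nodes only as the bus starts up."""
    schedule, slot = {}, 0
    for node, count in sorted(assignment.items()):
        for _ in range(count):
            schedule[slot] = node
            slot += 1
    assert slot <= n_slots, "assignment exceeds the static segment"
    return schedule

print(build_static_schedule(slot_assignment, STATIC_SLOTS))
# -> {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 4}; the dynamic segment
#    follows, with lower-numbered messages winning minislot arbitration.
```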
Published descriptions of FlexRay do not specify its fault hypothesis, and it appears to have no mechanisms to counter certain fault modes (e.g., SOS faults or other sources of asymmetric broadcasts, and masquerading on startup). A never-give-up strategy has not been described, nor have systematic or formal approaches to assurance and certification. FlexRay is interesting because of its mixture of time- and event-triggered operation, and potentially important because of the industrial clout of its developers. Currently, it is the slowest of the commercial buses, with a claimed data rate of no more than 10 Mbit/s.

6 Summary and Conclusion

A safety-critical bus architecture provides certain properties and services that assist in the construction of safety-critical systems. As with any system framework or middleware package, these buses offer a tradeoff to system developers: they provide a coherent collection of services, with strong properties and highly assured implementations, but developers must sacrifice some design freedom to gain the full benefit of these services. For example, all these buses use a time-triggered model of computation, and system developers must build their applications within that framework. In return, the buses are able to guarantee strong partitioning: faults in individual components or applications (functions, in avionics terms) cannot propagate to others, nor can they bring down the entire bus (within the constraints of its fault hypothesis).

Partitioning is the minimum requirement, however. It ensures that one failed function will not drag down others, but in many safety-critical systems the failure of even a single function can be catastrophic, so the individual functions must themselves be made fault tolerant. Accordingly, most of the buses provide mechanisms to assist the development of fault-tolerant applications. The key requirement here is interactively consistent message transfer: this ensures that all masters and shadows (or masters and monitors), or all members of a voting pool, maintain consistent state. Three of the buses considered here provide this basic service; some of them do so in association with other services, such as master/shadow rollover or group membership, that can be provided with much increased efficiency and much reduced latency when implemented at a low level.

FlexRay, alone, provides none of these services. In their absence, all mechanisms for fault tolerance must be implemented in application programs. Thus, application programmers, who may have little experience in the subtleties of fault-tolerant systems, become responsible for the design, implementation, and assurance of very delicate mechanisms with no support from the underlying bus architecture. Not only does this increase the cost and difficulty of making sure that things are done right, it also increases their computational cost and latency. For example, in the absence of an interactively consistent message service provided by the architecture, application programs must explicitly transmit the multiple rounds of cross-comparisons that are needed to implement this service at a higher level, thereby substantially increasing the message load. Such a cost will invite inexperienced developers to seek less expensive ways to achieve fault tolerance, in probable ignorance of the impossibility results in the theoretical literature, and of the history of intractable Heisenbugs (rare, unrepeatable failures) encountered by practitioners who pushed for 10⁻⁹ with inadequate foundations.
It is unlikely that any single bus architecture will satisfy all needs and markets, so it is possible that FlexRay's lack of application-level fault-tolerance services will find favor in some areas. It is also to be expected that new or modified architectures will emerge to satisfy new markets and requirements. (For example, it is proposed that TTA could match FlexRay's ability to support event-triggered as well as time-triggered communications by allocating certain time slots to a simulation of CAN; the simulation is actually faster than a real CAN bus, while retaining all the safety attributes of TTA.) I hope that the description provided here will help potential users to evaluate existing architectures against their own needs, and that it will help designers of new architectures to learn from and build on the design choices made by their predecessors.

References

1. ARINC Specification 659: Backplane Data Bus. Aeronautical Radio, Inc., Annapolis, MD, December 1993. Prepared by the Airlines Electronic Engineering Committee.
2. Günther Bauer and Michael Paulitsch. An investigation of membership and clique avoidance in TTP/C. In 19th Symposium on Reliable Distributed Systems, Nuremberg, Germany, October 2000.
3. Josef Berwanger et al. FlexRay: the communication system for advanced automotive control systems. In SAE 2001 World Congress, Society of Automotive Engineers, Detroit, MI, April 2001. Paper number 2001-01-0676.
4. Daniel Davies and John F. Wakerly. Synchronization and matching in redundant systems. IEEE Transactions on Computers, C-27(6):531–539, June 1978.
5. System Design and Analysis. Federal Aviation Administration, June 21, 1988. Advisory Circular 25.1309-1A.
6. Günter Heiner and Thomas Thurner. Time-triggered architecture for safety-related distributed real-time systems in transportation systems. In Fault Tolerant Computing Symposium 28, pages 402–407, IEEE Computer Society, Munich, Germany, June 1998.
7. Kenneth Hoyme and Kevin Driscoll. SAFEbus™. In 11th AIAA/IEEE Digital Avionics Systems Conference, pages 68–73, Seattle, WA, October 1992.
8. Hermann Kopetz and Günter Grünsteidl. TTP: a protocol for fault-tolerant real-time systems. IEEE Computer, 27(1):14–23, January 1994.
9. Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine Generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, July 1982.
10. Dale A. Mackall. Development and flight test experiences with a flight-crucial digital control system. NASA Technical Paper 2857, NASA Ames Research Center, Dryden Flight Research Facility, Edwards, CA, 1988.
11. Paul S. Miner. Analysis of the SPIDER fault-tolerance protocols. In C. Michael Holloway, editor, LFM 2000: Fifth NASA Langley Formal Methods Workshop, NASA Langley Research Center, Hampton, VA, June 2000. Slides available at http://shemesh.larc.nasa.gov/fm/Lfm2000/Presentations/lfm2000-spider/.
12. M. Pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults. Journal of the ACM, 27(2):228–234, April 1980.
13. Holger Pfeifer. Formal verification of the TTA group membership algorithm. In Tommaso Bolognesi and Diego Latella, editors, Formal Description Techniques and Protocol Specification, Testing and Verification: FORTE XIII/PSTV XX 2000, pages 3–18, Pisa, Italy, October 2000.
14. Holger Pfeifer, Detlef Schwier, and Friedrich W. von Henke. Formal verification for time-triggered clock synchronization. In Charles B. Weinstock and John Rushby, editors, Dependable Computing for Critical Applications 7, Volume 12 of IEEE Computer Society Dependable Computing and Fault Tolerant Systems, pages 207–226, San Jose, CA, January 1999.
15. DO-254: Design Assurance Guidelines for Airborne Electronic Hardware. Requirements and Technical Concepts for Aviation, Washington, DC, April 2000.
16. John Rushby. A formally verified algorithm for clock synchronization under a hybrid fault model. In Thirteenth ACM Symposium on Principles of Distributed Computing, pages 304–313, Association for Computing Machinery, Los Angeles, CA, August 1994. Also available as NASA Contractor Report 198289.
17. John Rushby. A comparison of bus architectures for safety-critical embedded systems. Technical report, Computer Science Laboratory, SRI International, Menlo Park, CA, June 2001. Available at http://www.csl.sri.com/rushby/papers/buscompare.pdf.
18. Fred B. Schneider. Understanding protocols for Byzantine clock synchronization. Technical Report 87-859, Department of Computer Science, Cornell University, Ithaca, NY, August 1987.
19. Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(4):299–319, December 1990.
20. Marco Schneider. Self-stabilization. ACM Computing Surveys, 25(1):45–67, March 1993.
21. T. K. Srikanth and Sam Toueg. Optimal clock synchronization. Journal of the ACM, 34(3):626–645, July 1987.
22. William Sweet and Dave Dooling. Boeing's seventh wonder. IEEE Spectrum, 32(10):20–23, October 1995.
23. Philip Thambidurai and You-Keun Park. Interactive consistency with multiple failure modes. In 7th Symposium on Reliable Distributed Systems, pages 93–100, IEEE Computer Society, Columbus, OH, October 1988.
24. Specification of the TTP/C Protocol. Time-Triggered Technology TTTech Computertechnik AG, Vienna, Austria, July 1999.
25. J. Lundelius Welch and N. Lynch. A new fault-tolerant algorithm for clock synchronization. Information and Computation, 77(1):1–36, April 1988.