SRI International Computer Science Laboratory

Anomalies in Digital Flight Control Systems (DFCS)

These notes are excerpted from Formal Methods and the Certification of Critical Systems by John Rushby, SRI-CSL-93-07, November 1993. A book based on this material will be published by Cambridge University Press in 1995.

AFTI-F16

The flight tests of the experimental Advanced Fighter Technology Integration (AFTI) F16 were conducted by NASA, and are unusually well-documented (IRM84,Mac88). The AFTI-F16 had a triple-redundant digital flight-control system (DFCS), with an analog backup. The DFCS had different control modes optimized for air-to-air combat and air-to-ground attack. The Stores Management System (SMS) was responsible for signaling requests for mode change to the DFCS. On flight test 15, an unknown failure in the SMS caused it to request mode changes 50 times a second. The DFCS could not keep up, but responded at a rate of 5 mode changes per second. The pilot reported that the aircraft felt like it was in turbulence; subsequent analysis showed that if the aircraft had been maneuvering at the time, the DFCS would have failed.

The DFCS of the AFTI-F16 employed an ``asynchronous'' design. In such designs, the redundant channels run fairly independently of each other: each computer samples sensors independently, evaluates the control laws independently, and sends its actuator commands to an averaging or selection component that drives the actuator concerned. Because the unsynchronized individual computers may sample sensors at slightly different times, they can obtain readings that differ quite appreciably from one another. The gain in the control laws can amplify these input differences to provide even larger differences in the results submitted to the output selection algorithm. During ground qualification of the AFTI-F16, it was found that these differences sometimes resulted in a channel being declared failed when no real failure had occurred (p. 478 of Mac84). Accordingly, a rather wide spread of values must be accepted by the threshold algorithms that determine whether sensor inputs and actuator outputs are to be considered ``good.'' For example, the output thresholds of the AFTI-F16 were set at 15% plus the rate of change of the variable concerned; in addition, the gains in the control laws were reduced. This increases the latency for detection of faulty sensors and channels, and also allows a failing sensor to drag the value of any averaging functions quite a long way before it is excluded by the input selection threshold; at that point, the average will change with a thump that could have adverse effects on the handling of the aircraft (Figure 20 of Mac88).

An even more serious shortcoming of asynchronous systems arises when the control laws contain decision points. Here, sensor noise and sampling skew may cause independent channels to take different paths at the decision points and to produce widely divergent outputs. This occurred on Flight 44 of the AFTI-F16 flight tests (p. 44 of Mac88). Each channel declared the others failed; the analog back-up was not selected because the simultaneous failure of two channels had not been anticipated and the aircraft was flown home on a single digital channel. Notice that all protective redundancy had been lost, and the aircraft was flown home in a mode for which it had not been designed---yet no hardware failure had occurred.

Another illustration is provided by a 3-second ``departure'' on Flight 36 of the AFTI-F16 flight tests, during which sideslip exceeded 20 degrees, normal acceleration exceeded first -4g, then +7g, angle of attack went to -10 degrees, then +20 degrees, the aircraft rolled 360 degrees, the vertical tail exceeded design load, all control surfaces were operating at rate limits, and failure indications were received from the hydraulics and canard actuators. The problem was traced to a fault in the control laws, but subsequent analysis showed that the side air-data probe was blanked by the canard at the high angle of attack and sideslip achieved during the excursion; the wide input threshold passed the incorrect value through, and different channels took different paths through the control laws. Analysis showed this would have caused complete failure of the DFCS and reversion to analog backup for several areas of the flight envelope (pp.41--42 of Mac88).

Several other difficulties and failure indications on the AFTI-F16 were traced to the same source: asynchronous operation allowing different channels to take different paths at certain selection points. The repair was to introduce voting at some of these ``software switches.'' (The problems of channels diverging at decision points, and also the thumps caused as channels and sensors are excluded and later readmitted by averaging and selection algorithms, are sometimes minimized by modifying the control laws to amp in and out more smoothly in these cases. However, modifying control laws can bring other problems in its train and raises further validation issues.) In one particular case, repeated channel failure indications in flight were traced to a roll-axis software switch. It was decided to vote the switch (which, of course, required ad hoc synchronization) and extensive simulation and testing were performed on the changes necessary to achieve this. On the next flight, the problem was there still. Analysis showed that although the switch value was voted, it was the unvoted value that was used (p. 38 of Mac88). (This bug is an illuminating example. At first, it looks like programming slip---the sort of late-lifecycle fault that was earlier claimed to be very reliably eliminated by conventional V&V. Further thought, however, shows that it is really a manifestation of a serious design oversight in the early lifecycle (the requirement to synchronize channels at decision points in the control laws) that has been kludged late in lifecycle.)

The AFTI-F16 flight tests revealed numerous other problems of a similar nature. Summarizing, Mackall, the engineer who conducted the flight-test program, writes (pp. 40--41 of Mac88):

``The criticality and number of anomalies discovered in flight and ground tests owing to design oversights are more significant than those anomalies caused by actual hardware failures or software errors.

``...qualification of such a complex system as this, to some given level of reliability, is difficult ... [because] the number of test conditions becomes so large that conventional testing methods would require a decade for completion. The fault-tolerant design can also affect overall system reliability by being made too complex and by adding characteristics which are random in nature, creating an untestable design.

``As the operational requirements of avionics systems increase, complexity increases... If the complexity is required, a method to make system designs more understandable, more visible, is needed.

``... The asynchronous design of the [AFTI-F16] DFCS introduced a random, unpredictable characteristic into the system. The system became untestable in that testing for each of the possible time relationships between the computers was impossible. This random time relationship was a major contributor to the flight test anomalies. Adversely affecting testability and having only postulated benefits, asynchronous operation of the DFCS demonstrated the need to avoid random, unpredictable, and uncompensated design characteristics.'' Clearly, much of Mackall's criticism is directed at the consequences of the asynchronous design of the AFTI-F16 DFCS. Beyond that, however, I think the really crucial point is that captured in the phrase ``random, unpredictable characteristics.'' Surely, a system worthy of certification in the ultra-dependable region should have the opposite properties---should, in fact, be predictable: that is, it should be possible to achieve a comprehensive understanding of all its possible behaviors. What other basis for an ``engineering judgment'' that a system is fit for its purpose can there be, but a complete understanding of how the thing works and behaves? Furthermore, for the purpose of certification, that understanding must be communicated to others---if you understand why a thing works as it should, you can write it down, and others can see if they agree with you. Of course, writing down how something as complicated as how a fault-tolerant flight-control system works is a formidable task---and one that will only be feasible if the system is constructed on rational principles, with aggressive use of abstraction, layering, information-hiding, and any other technique that can advance the intellectual manageability of the task. This calls strongly for an architecture that promotes separation of concerns (whose lack seems to be the main weakness of asynchronous designs), and for a method of description that exposes the rationale for design decisions and that allows, in principle, the behavior of the system to be calculated (i.e., predicted or, in the limit, proved). It is, in my view, in satisfying this need for design descriptions which, in principle at least, would allow properties of the designs to be proved, that formal methods can make their strongest contribution to quality assurance for ultra-dependable systems: they address (as nothing else does) Mackall's plea for ``a method to make system designs more understandable, more visible.''

The AFTI-F16 flight tests are unusually well documented; I know of no other flight-control system for which comparable data are publicly available. However, press accounts and occasional technical articles reinforce the AFTI-F16 data by suggesting that timing, redundancy management, and coordination of replicated computing channels are tricky problems that are routinely debugged during flight test.

X-29

The danger of wide sensor selection thresholds is illustrated by a problem discovered in the X29A. This aircraft has three sources of air data: a nose probe and two side probes. The selection algorithm used the data from the nose probe, provided it was within some threshold of the data from both side probes. The threshold was large to accommodate position errors in certain flight modes. It was discovered in simulation that if the nose probe failed to zero at low speed, it would still be within the threshold of correct readings, causing the aircraft to become unstable and ``depart.'' Although this fault was found in simulation, 162 flights had been at risk before it was detected (MA89).

HiMAT

During flight tests of the HiMAT remotely piloted vehicle, an anomaly occurred that resulted in the aircraft landing with its landing skids retracted. ``The anomaly was caused by a timing change made in the ground-based system and the onboard software for uplinking the [landing] gear deployment command. This coupled with the on-board failure of one uplink receiver to cause the anomaly. The timing change was thoroughly tested with the on-board flight software for unfailed conditions. However, the flight software operated differently when an uplink failure was present'' (page 112 of MA89).

X-31

In the flight tests of the X31 the control system ``went into a reversionary mode four times in the first nine flights, usually due to disagreement between the two air-data sources. The air data logic dates back to the mid-1960s and had a divide-by-zero that occurred briefly. This was not a problem in its previous application, but the X31 flight-control system would not tolerate it.'' (Dor91). It seems that either a potentially dangerous condition (i.e., divide-by-zero) had been present but undetected in the previous application, or it was known (and known not to be dangerous in that application) but undocumented. In either case, it seems to indicate inadequate assurance. This example also points to one of the perils of reuse: just because a component worked in a previous application, you cannot assume it will work in a new one unless all the relevant characteristics and assumptions are known and taken into account.

C-17

The C17 has a quad-redundant digital flight-control system(KSQ92). During the initial flight test of the C17 ``On three occasions, warning/caution lights in the cockpit annunciated that flight-control computer (FCC) `dropoffs' had occurred... FCC 3 dropped offline twice, and both FCC 3 and FCC 4 dropped off at the same time once''(Sco92). For an account of software engineering and management on the C17, see(GAO92).

YC-14

A significant software fault was discovered in flight testing the YC-14. ``The error, which caused mistracking of the control-law computation in the three channels, was the result of incorrect use of cross-channel data for one parameter. Each synchro output was multiplied in software by a factor equal to the ratio of the nominal reference voltage to the actual reference voltage. Both the synchro outputs and the reference voltages were transmitted between channels, and the three inputs would be compensated in each channel prior to signal selection. However, because of an error in timing, each channel was using the current correction factor for its own sensor, whereas the correction factors for the other two sensors were from the previous frame. Thus, each channel performed signal selections on a different set of values, resulting in different selected input data for the three channels. Although the discrepancies were small, the effect of threshold detectors and integrators led to large mistracking between channels during flight. In the laboratory, the variations in the simulated synchro reference voltages were sufficiently small that this error would not be detected unless a bit-by-bit comparison between channels had been made'' (MG78).

XXXX

One of the purposes of flight test is to uncover problems, and so the discovery of those just described can be considered a vindication of the value of flight test. Some might even consider these problems merely a matter of tuning, and regard their identification and repair during flight test as the proper course. (For example: ``the FMS of the A320 `was still revealing software bugs until mid-January,' according to Gerard Guyot (Airbus test and development director). There was no particular type of bug in any particular function, he says. `We just had a lot of flying to do in order to check it all out. Then suddenly it was working,' he says with a grin'' (Lea88).) Others might argue the opposite point of view: flight test is for evaluating and tuning handling and controls, and the discovery of basic software problems indicates that the traditional methods of assurance are seriously deficient. Whatever view is taken of the seriousness of these problems, the salient fact seems to be that software problems discovered in flight test often concern redundancy management, coordination, and timing.

Sources

(Dor91): Michael A. Dornheim. X-31 flight tests to explore combat agility to 70 deg. AOA. Aviation Week and Space Technology, pages 38--41, March 11, 1991.
(GAO92): Embedded Computer Systems: Significant Software Problems On C-17 Must Be Addressed. United States General Accounting Office, Washington, DC, May 1992. GAO/IMTEC-92-48.
(IRM84): Stephen D. Ishmael, Victoria A. Regenie, and Dale A. Mackall. Design implications from AFTI/F16 flight test. NASA Technical Memorandum 86026, NASA Ames Research Center, Dryden Flight Research Facility, Edwards, CA, 1984.
(KSQ92): Brian W. Kowal, Carl J. Scherz, and Richard Quinliven. C-17 flight control system overview. IEEE Aerospace and Electronic Systems Magazine, 7(7):24--31, July 1992.
(Lea88): David Learmount. A320 certification: The quiet revolution. Flight International, pages 21--24, February 27, 1988.
(MA89): Dale A. Mackall and James G. Allen. A knowledge-based system design/information tool for aircraft flight control systems. In AIAA Computers in Aerospace Conference VII, pages 110--125, Monterey, CA, October 1989. Collection of Technical Papers, Part 1.
(Mac84): Dale A. Mackall. AFTI/F-16 digital flight control system experience. In Gary P. Beasley, editor, NASA Aircraft Controls Research 1983, pages 469--487. NASA Conference Publication 2296, 1984. Proceedings of workshop held at NASA Langley Research Center, October 25--27, 1983.
(Mac88): Dale A. Mackall. Development and flight test experiences with a flight-crucial digital control system. NASA Technical Paper 2857, NASA Ames Research Center, Dryden Flight Research Facility, Edwards, CA, 1988.
(MG78): D. L. Martin and D. Gangsaas. Testing the YC-14 flight control system software. AIAA Journal of Guidance and Control, 1(4):242--247, July--August 1978.
(Sco92): William B. Scott. C-17 first flight triggers Douglas/Air Force test program. Aviation Week and Space Technology, page 21, September 23, 1992.

John Rushby: Rushby@csl.sri.com