SRI International Computer Science Laboratory
Anomalies in Digital Flight Control Systems (DFCS)
These notes are excerpted from Formal
Methods and the Certification of Critical Systems by John Rushby,
SRI-CSL-93-07, November 1993. A book based on this material will be
published by Cambridge University Press in 1995.
The flight tests of the experimental Advanced Fighter Technology
Integration (AFTI) F16 were conducted by NASA, and are unusually
well-documented (IRM84, Mac88). The AFTI-F16 had
a triple-redundant digital flight-control system (DFCS), with an
analog backup. The DFCS had different control modes optimized
for air-to-air combat and air-to-ground attack. The Stores
Management System (SMS) was responsible for signaling requests for
mode change to the DFCS. On flight test 15, an unknown failure in
the SMS caused it to request mode changes 50 times a second. The
DFCS could not keep up, but responded at a rate of 5 mode changes per
second. The pilot reported that the aircraft felt like it was in
turbulence; subsequent analysis showed that if the aircraft had been
maneuvering at the time, the DFCS would have failed.
The DFCS of the AFTI-F16 employed an ``asynchronous'' design. In
such designs, the redundant channels run fairly independently of each
other: each computer samples sensors independently, evaluates the
control laws independently, and sends its actuator commands to an
averaging or selection component that drives the actuator concerned.
Because the unsynchronized individual computers may sample
sensors at slightly different times, they can obtain readings that
differ quite appreciably from one another. The gain in the control
laws can amplify these input differences to provide even larger
differences in the results submitted to the output selection
algorithm. During ground qualification of the AFTI-F16, it was found
that these differences sometimes resulted in a channel being declared
failed when no real failure had occurred (p. 478 of Mac84).
Accordingly, a rather wide spread of values must be accepted by the
threshold algorithms that determine whether sensor inputs and actuator
outputs are to be considered ``good.'' For example, the output
thresholds of the AFTI-F16 were set at 15% plus the rate of change of
the variable concerned; in addition, the gains in the control laws
were reduced. This increases the latency for detection of faulty
sensors and channels, and also allows a failing sensor to drag the
value of any averaging functions quite a long way before it is
excluded by the input selection threshold; at that point, the average
will change with a thump that could have adverse effects on the
handling of the aircraft (Figure 20 of Mac88).
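The averaging-and-threshold behavior described above can be sketched as follows. This is an illustrative assumption, not the AFTI-F16 algorithm: the three-channel setup, the function names, and the reuse of the 15% figure as a fractional threshold are all invented for the sketch.

```python
def select_and_average(readings, threshold=0.15):
    """Average the readings within `threshold` (fractional) of the median.

    A drifting sensor drags the average until it strays past the
    threshold; once excluded, the average jumps (the 'thump').
    """
    mid = sorted(readings)[len(readings) // 2]   # median as reference value
    good = [r for r in readings if abs(r - mid) <= threshold * abs(mid)]
    return sum(good) / len(good)

# A failing third sensor reading 113 is still "good" and drags the average up ...
print(select_and_average([100.0, 101.0, 113.0]))   # ~104.67
# ... but at 118 it crosses the threshold and the average thumps back down.
print(select_and_average([100.0, 101.0, 118.0]))   # 100.5
```

Widening `threshold` delays the exclusion of the failing sensor, which is exactly the latency/thump trade-off described in the text.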
An even more serious shortcoming of asynchronous systems arises when
the control laws contain decision points. Here, sensor noise and
sampling skew may cause independent channels to take different paths
at the decision points and to produce widely divergent outputs. This
occurred on Flight 44 of the AFTI-F16 flight tests (p.
44 of Mac88). Each channel declared the others failed; the analog
back-up was not selected because the simultaneous failure of two
channels had not been anticipated, and the aircraft was flown home on
a single digital channel. Notice that all protective redundancy had
been lost, and the aircraft was flown home in a mode for which it had
not been designed---yet no hardware failure had occurred.
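The effect of a decision point on unsynchronized channels can be illustrated with a toy control law. The gains, the branch structure, and the sampled quantity are invented for the sketch; the real AFTI-F16 control laws are not reproduced here.

```python
# Two unsynchronized channels sample the same physical quantity a few
# milliseconds apart, land on opposite sides of a decision point, and
# command very different outputs.

def control_law(q):
    """q: a sampled quantity, e.g. a rate signal near zero."""
    if q >= 0.0:                 # decision point in the control law
        return 10.0 * q          # one branch of the law
    return -25.0 * q + 5.0       # a quite different branch

ch_a = control_law(+0.01)        # channel A samples just above zero
ch_b = control_law(-0.01)        # channel B samples just below zero
print(ch_a, ch_b)                # 0.1 vs 5.25: tiny skew, wide divergence
```

Each channel's output is individually reasonable; it is the comparison between channels, downstream in the selection and monitoring logic, that turns the split into mutual failure declarations.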
Another illustration is provided by a 3-second ``departure'' on
Flight 36 of the AFTI-F16 flight tests, during which sideslip
exceeded 20 degrees, normal acceleration exceeded first -4g, then
+7g, angle of attack went to -10 degrees, then +20 degrees, the
aircraft rolled 360 degrees, the vertical tail exceeded design
load, all control surfaces were operating at rate limits, and failure
indications were received from the hydraulics and canard actuators.
The problem was traced to a fault in the control laws, but
subsequent analysis showed that the side air-data probe was blanked
by the canard at the high angle of attack and sideslip achieved
during the excursion; the wide input threshold passed the incorrect
value through, and different channels took different paths through the
control laws. Analysis showed this would have caused complete
failure of the DFCS and reversion to analog backup for several areas
of the flight envelope (pp.41--42 of Mac88).
Several other difficulties and failure indications on the AFTI-F16
were traced to the same source: asynchronous operation allowing
different channels to take different paths at certain selection
points. The repair was to introduce voting at some of these
``software switches.'' (The problems of channels diverging at
decision points, and also the thumps caused as channels and sensors
are excluded and later readmitted by averaging and selection
algorithms, are sometimes minimized by modifying the control laws to
ramp in and out more smoothly in these cases. However, modifying
control laws can bring other problems in its train and raises further
validation issues.) In one particular case, repeated channel failure
indications in flight were traced to a roll-axis software switch.
It was decided to vote the switch (which, of course, required ad
hoc synchronization) and extensive simulation and testing were
performed on the changes necessary to achieve this. On the next
flight, the problem was still there. Analysis showed that
although the switch value was voted, it was the unvoted value that was
used (p. 38 of Mac88). (This bug is an instructive
example. At first, it looks like a programming slip---the sort of
late-lifecycle fault that was earlier claimed to be very reliably
eliminated by conventional V&V. Further thought, however, shows
that it is really a manifestation of a serious design oversight in the
early lifecycle (the unrecognized requirement to synchronize channels
at decision points in the control laws) that was kludged late in the
lifecycle.)
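A minimal sketch of the defect pattern, with all names invented: the switch value is voted across channels, but the branch test mistakenly reads the local, unvoted copy, so the vote has no effect.

```python
def majority(values):
    """Simple majority vote over boolean switch values."""
    return max(set(values), key=values.count)

def roll_axis(local_switch, other_switches, roll_cmd):
    voted_switch = majority([local_switch] + other_switches)
    # BUG: tests `local_switch` where `voted_switch` was intended.
    if local_switch:
        return roll_cmd * 0.5    # reduced-gain path
    return roll_cmd              # full-gain path

# A channel whose local switch disagrees with the majority still takes
# the minority path, so the channels keep diverging despite the vote:
print(roll_axis(True, [False, False], 1.0))   # 0.5, though the vote is False
```

Note that the vote itself presupposes the ad hoc synchronization mentioned above: all channels must exchange and vote the same frame's switch values for `voted_switch` to be meaningful.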
The AFTI-F16 flight tests revealed numerous other problems of a
similar nature. Summarizing, Mackall, the engineer who conducted the
flight-test program, writes (pp. 40--41 of Mac88):
``The criticality and number of anomalies discovered in flight and
ground tests owing to design oversights are more significant than
those anomalies caused by actual hardware failures or software errors.
``...qualification of such a complex system as this, to some
given level of reliability, is difficult ... [because] the number
of test conditions becomes so large that conventional testing methods
would require a decade for completion. The fault-tolerant design
can also affect overall system reliability by being made too complex
and by adding characteristics which are random in nature ...''
``As the operational requirements of avionics systems increase, ...
If the complexity is required, a method to make system designs more
understandable, more visible, is needed.''
``... The asynchronous design of the [AFTI-F16] DFCS introduced a random,
unpredictable characteristic into the system. The system became
untestable in that testing for each of the possible time relationships
between the computers was impossible. This random time relationship
was a major contributor to the flight test anomalies. Adversely
affecting testability and having only postulated benefits,
asynchronous operation of the DFCS demonstrated the need to avoid
random, unpredictable, and uncompensated design characteristics.''
Clearly, much of Mackall's criticism is directed at the consequences
of the asynchronous design of the AFTI-F16 DFCS. Beyond that,
however, I think the really crucial point is that captured in the
phrase ``random, unpredictable characteristics.'' Surely, a system
worthy of certification in the ultra-dependable region should have
the opposite properties---should, in fact, be predictable:
that is, it should be possible to achieve a comprehensive
understanding of all its possible behaviors. What other basis for an
``engineering judgment'' that a system is fit for its purpose can
there be, but a complete understanding of how the thing works and
behaves? Furthermore, for the purpose of certification, that
understanding must be communicated to others---if you understand why
a thing works as it should, you can write it down, and others can see
if they agree with you. Of course, writing down how something as
complicated as a fault-tolerant flight-control system works is a
formidable task---and one that will only be feasible if the system is
constructed on rational principles, with aggressive use of
abstraction, layering, information-hiding, and any other technique
that can advance the intellectual manageability of the task. This
calls strongly for an architecture that promotes separation of
concerns (whose lack seems to be the main weakness of asynchronous
designs), and for a method of description that exposes the rationale
for design decisions and that allows, in principle, the behavior of
the system to be calculated (i.e., predicted or, in the limit,
proved). It is, in my view, in satisfying this need for design
descriptions which, in principle at least, would allow properties of
the designs to be proved, that formal methods can make their
strongest contribution to quality assurance for ultra-dependable
systems: they address (as nothing else does) Mackall's plea for ``a
method to make system designs more understandable, more visible.''
The AFTI-F16 flight tests are unusually well documented; I know of no
other flight-control system for which comparable data are publicly
available. However, press accounts and occasional technical articles
reinforce the AFTI-F16 data by suggesting that timing, redundancy
management, and coordination of replicated computing channels are
tricky problems that are routinely debugged during flight test.
The danger of wide sensor selection thresholds is illustrated by a
problem discovered in the X29A. This aircraft has three sources of
air data: a nose probe and two side probes. The selection algorithm
used the data from the nose probe, provided it was within some
threshold of the data from both side probes. The threshold was large
to accommodate position errors in certain flight modes. It was
discovered in simulation that if the nose probe failed to zero at low
speed, it would still be within the threshold of correct readings,
causing the aircraft to become unstable and ``depart.'' Although
this fault was found in simulation, 162 flights had been at risk
before it was detected (MA89).
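The X29A selection logic described above might be sketched as follows. The threshold value and the units are invented; only the structure (nose probe accepted if it agrees with both side probes) follows the description.

```python
def select_air_data(nose, side1, side2, threshold=30.0):
    """Use the nose-probe reading if it agrees with BOTH side probes."""
    if abs(nose - side1) <= threshold and abs(nose - side2) <= threshold:
        return nose                    # nose probe accepted
    return (side1 + side2) / 2.0       # otherwise fall back to side probes

# At low speed the correct readings are themselves small, so a nose
# probe failed to zero still passes the wide agreement check:
print(select_air_data(0.0, 25.0, 28.0))    # 0.0: the failed value is selected
```

The hazard is not the selection structure itself but the width of the threshold: sized to tolerate position errors at one corner of the envelope, it silently admits a dead sensor at another.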
During flight tests of the HiMAT remotely piloted vehicle, an
anomaly occurred that resulted in the aircraft landing with its
landing skids retracted. ``The anomaly was caused by a timing change
made in the ground-based system and the onboard software for uplinking
the [landing] gear deployment command. This coupled with the
on-board failure of one uplink receiver to cause the anomaly. The
timing change was thoroughly tested with the on-board flight software
for unfailed conditions. However, the flight software operated
differently when an uplink failure was present'' (page
112 of MA89).
In the flight tests of the X31 the control system ``went into a
reversionary mode four times in the first nine flights, usually due to
disagreement between the two air-data sources. The air data logic
dates back to the mid-1960s and had a divide-by-zero that occurred
briefly. This was not a problem in its previous application, but the
X31 flight-control system would not tolerate it.'' (Dor91). It
seems that either a potentially dangerous condition (i.e.,
divide-by-zero) had been present but undetected in the previous
application, or it was known (and known not to be dangerous in that
application) but undocumented. In either case, it seems to indicate
inadequate assurance. This example also points to one of the perils
of reuse: just because a component worked in a previous application,
you cannot assume it will work in a new one unless all the
relevant characteristics and assumptions are known and taken into account.
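The X31 episode lends itself to a toy sketch of this reuse peril. Everything here is invented (the real air-data logic is not public): the point is only that the same legacy routine can be benign or hazardous depending on how its new host treats a transient bad sample.

```python
def airspeed_ratio(p_dynamic, p_static):
    try:
        return p_dynamic / p_static      # p_static can transit through zero
    except ZeroDivisionError:
        return float("nan")              # the "brief" bad sample

def tolerant_monitor(ratio, last_good):
    # Previous application: a bad sample is simply skipped.
    # (NaN != NaN is True, so this detects the bad sample.)
    return last_good if ratio != ratio else ratio

def strict_monitor(ratio):
    # New application: any bad sample forces a reversionary mode.
    return "reversionary" if ratio != ratio else "normal"

bad = airspeed_ratio(1.0, 0.0)
print(tolerant_monitor(bad, 2.5))    # the previous host sails on
print(strict_monitor(bad))           # the new host reverts
```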
The C17 has a quad-redundant digital flight-control system (KSQ92). During the initial flight test of
the C17 ``On three occasions, warning/caution lights in the cockpit
annunciated that flight-control computer (FCC) `dropoffs' had
occurred... FCC 3 dropped offline twice, and both FCC 3 and FCC 4
dropped off at the same time once'' (Sco92). For an account of software
engineering and management on the C17, see (GAO92).
A significant software fault was discovered in flight testing
the YC-14. ``The error, which caused mistracking of the control-law
computation in the three channels, was the result of incorrect use of
cross-channel data for one parameter. Each synchro output was
multiplied in software by a factor equal to the ratio of the
nominal reference voltage to the actual reference voltage. Both the
synchro outputs and the reference voltages were transmitted between
channels, and the three inputs would be compensated in each channel
prior to signal selection. However, because of an error in timing,
each channel was using the current correction factor for its own
sensor, whereas the correction factors for the other two sensors were
from the previous frame. Thus, each channel performed signal
selections on a different set of values, resulting in different
selected input data for the three channels. Although the
discrepancies were small, the effect of threshold detectors and
integrators led to large mistracking between channels during flight.
In the laboratory, the variations in the simulated synchro reference
voltages were sufficiently small that this error would not be detected
unless a bit-by-bit comparison between channels had been performed.''
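A hypothetical reconstruction of the timing error, with invented names and numbers: each channel compensates its own synchro with the current frame's reference-voltage factor, but compensates the cross-channel synchros with the previous frame's factors, so each channel performs signal selection on a slightly different set of values.

```python
NOMINAL_VREF = 26.0   # nominal synchro reference voltage, volts (assumed)

def compensate(raw, vref):
    """Scale a synchro output by nominal/actual reference voltage."""
    return raw * (NOMINAL_VREF / vref)

def channel_inputs(ch, raws, vrefs_now, vrefs_prev):
    """The three compensated values channel `ch` sees before selection."""
    return [compensate(raws[i],
                       vrefs_now[i] if i == ch else vrefs_prev[i])  # the bug
            for i in range(3)]

raws       = [100.0, 100.0, 100.0]   # identical raw sensor outputs
vrefs_now  = [26.1, 25.9, 26.0]      # actual reference voltages this frame
vrefs_prev = [26.0, 26.2, 25.8]      # ... and in the previous frame
for ch in range(3):
    print(channel_inputs(ch, raws, vrefs_now, vrefs_prev))
# Each channel sees a slightly different value set, so mid-value
# selection can pick different inputs in different channels; threshold
# detectors and integrators then grow the small skew into mistracking.
```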
One of the purposes of flight test is to uncover problems, and so the
discovery of those just described can be considered a vindication of
the value of flight test. Some might even consider these problems
merely a matter of tuning, and regard their identification and repair
during flight test as the proper course. (For example: ``the
FMS of the A320 `was still revealing software bugs until
mid-January,' according to Gerard Guyot (Airbus test and
development director). There was no particular type of bug in any
particular function, he says. `We just had a lot of flying to do in
order to check it all out. Then suddenly it was working,' he says
with a grin'' (Lea88).) Others might argue the opposite
point of view: flight test is for evaluating and tuning handling and
controls, and the discovery of basic software problems indicates that
the traditional methods of assurance are seriously deficient.
Whatever view is taken of the seriousness of these problems, the
salient fact seems to be that software problems discovered in flight
test often concern redundancy management, coordination, and timing.
- Michael A. Dornheim.
X-31 flight tests to explore combat agility to 70 deg. AOA.
Aviation Week and Space Technology, pages 38--41, March 11, 1991.
- Embedded Computer Systems: Significant Software Problems on C-17 Must Be Addressed.
United States General Accounting Office, Washington, DC, May 1992.
- Stephen D. Ishmael, Victoria A. Regenie, and Dale A. Mackall.
Design implications from AFTI/F16 flight test.
NASA Technical Memorandum 86026, NASA Ames Research Center, Dryden
Flight Research Facility, Edwards, CA, 1984.
- Brian W. Kowal, Carl J. Scherz, and Richard Quinliven.
C-17 flight control system overview.
IEEE Aerospace and Electronic Systems Magazine, 7(7):24--31, 1992.
- David Learmount.
A320 certification: The quiet revolution.
Flight International, pages 21--24, February 27, 1988.
- Dale A. Mackall and James G. Allen.
A knowledge-based system design/information tool for aircraft flight
In AIAA Computers in Aerospace Conference VII, pages 110--125,
Monterey, CA, October 1989.
Collection of Technical Papers, Part 1.
- Dale A. Mackall.
AFTI/F-16 digital flight control system experience.
In Gary P. Beasley, editor, NASA Aircraft Controls Research
1983, pages 469--487. NASA Conference Publication 2296, 1984.
Proceedings of workshop held at NASA Langley Research Center, October 1983.
- Dale A. Mackall.
Development and flight test experiences with a flight-crucial digital control system.
NASA Technical Paper 2857, NASA Ames Research Center, Dryden Flight
Research Facility, Edwards, CA, 1988.
- D. L. Martin and D. Gangsaas.
Testing the YC-14 flight control system software.
AIAA Journal of Guidance and Control, 1(4):242--247, 1978.
- William B. Scott.
C-17 first flight triggers Douglas/Air Force test program.
Aviation Week and Space Technology, page 21, September 23, 1991.