An Overview of My Papers on Assurance Cases

John Rushby, 2017,
Updated with Speculation on Autonomous Systems February 2019,
And What about the Boeing 737 MAX? March 2019.

I have several papers on assurance cases, some of which share text and diagrams, and you might wonder how they relate to each other, or whether I am plagiarizing myself. So on this page I will try to provide a guide to what I think are the significant aspects of each paper.

History and Overview

The various papers record the evolution of my thinking on the topic of assurance cases. The first is this Technical Report, which is updated from my part of a NASA Contractor Report developed with colleagues at Boeing. I think sections 2, 3, and 4 remain useful; they document, respectively, the origin and history of safety and assurance cases, how a section of DO-178C might be interpreted as an assurance case, and something of the notations and tools available (in 2015). Section 5, on evaluation of assurance cases, is still OK, especially for background on logic and related topics, but my thinking has developed since then.

My current thinking is that the leaves and the interior parts of an assurance case argument (viewed as a tree) should be interpreted differently from each other. The leaves concern evidence about the system and are best evaluated using methods from epistemology, whereas the interior nodes document reasoning (based on the evidence) and should be evaluated by the methods of logic. This idea is first adumbrated in my SafeComp 2013 paper.

Evidential and Reasoning Steps

A detailed proposal and example how evidential steps can be evaluated using the ideas and measures of Bayesian confirmation theory is developed in my AAA 2015 paper, and a comparable treatment for reasoning steps is developed in my Shonan 2016 paper (updated 2019), which builds on the epistemological notion of indefeasibility.

Natural Language Deductivism, NLD

Together, these two approaches provide an interpretation for assurance case arguments that is a systematic version of Natural Language Deductivism (NLD), as documented in my Marktoberdorf 2016 paper. NLD is the idea that an assurance case should provide a deductively valid argument; it differs from deductive proof in formal mathematics and logic in that its premises are "reasonable or plausible" rather than certain, and hence its conclusions are likewise reasonable or plausible rather than certain. The criteria of AAA 2015 systematize what it means for the premises to be "reasonable or plausible." I should note that this interpretation is consistent with that of the Claims-Argument-Evidence (CAE) approach to assurance cases, but not for the Goal-Structuring Notation (GSN).

Note that NLD is not the same as formal verification even though both are tantamount to proof. Formal verification establishes that one formal model (typically for an implementation) is consistent with another (typically for a specification). Within an assurance case, that's an item of evidence. Other parts of the assurance case must establish that the lower model actually conforms to the real implementation and that the upper one accurately formalizes the specification used elsewhere in the case. And yet other parts of the case must ensure that the versions verified are the same as those used, and that our method of verification (tools, theorem prover etc.) are sound, and that our assumptions are discharged etc.

The evidential claims in an assurance case are not formal statements but statements about "big" parts of the overall development: e.g., "this specification is correct wrt. its intended use". And its reasoning steps are not intricate proofs but simple conjunctions that (indefeasibly) entail their parent claim.

From Assurance to Certification: Few Faults to Rare Failures

The NLD interpretation of an assurance case gives us confidence that the system is free of faults but certification is about absence (or rarity) of failures.

However, the degree of confidence in the absence of faults can be expressed as a subjective probability, and a little arithmetic then allows us to calculate the probability that the system will survive n independent demands (e.g., hours of flight) without failure, given some estimate for the probability that it fails if (despite our assurance) it is faulty. Prior experience (e.g., failure-free tests) gives us some handle on the latter, and the overall problem is then one of Bayesian inference: given m successful demands and our confidence (as a probability) that the system is free of faults, what is the probability of surviving n-m future demands without failure? To perform this calculation, we need some prior distribution for the probability of failure if faulty, and this will be hard to obtain and hard to justify. However, Strigini and Povyakalo have shown that there is a worst possible distribution and this allows calculation of estimates that are guaranteed to be conservative. If assurance gives us 90% confidence that the system is free of faults and we have observed m failure-free demands, then we can be reasonably confident of surviving another 10×m demands without failure (whereas without the confidence derived from assurance we can barely expect another m). At the end of that period, we have a lot more observations, and that gives us confidence for another 10-fold increase in exposure and so we can "bootstrap" our way to the (FAA required) lifetime of the fleet. This is a remarkable result (again due to Povyakalo and Strigini plus Bishop) and, in my opinion, the best explanation we have for how assurance actually works.

Assurance 2.0

These ideas have been further devloped in joint work with Robin Bloomfield of Adelard to yield an approach we call Assurance 2.0.

Tool Support and Applications

I am part of a DARPA project with Adelard and UT Dallas and led by Honeywell that is developing tool support and applications for Assurance 2.0.

What to Read (note updates at bottom)

The Technical Report is good for history, while the Marktoberdorf 2016 paper gives the broadest technical overview of these topics--however, it was written before the Shonan 2016 paper and so it is not quite the only thing you need for the full picture. For that, I suggest reading the AAA 2015 and Shonan 2016 papers for detail on evidential and reasoning steps, respectively, and the Marktoberdorf 2016 paper for the overall picture.

Update: The Shonan 2016 paper was heavily revised in 2019, so now I would be inclined to suggest reading that first, followed by the AAA 2015 paper.

More recent Update: This paper on Assurance 2.0 is now the most up to date presentation. But you will still need the other papers for detail.

Even More recent Update: This paper on Confidence in Assurance 2.0 subsumes many others.

 *NEW* Most recent Update: This paper on CLARISSA: Foundations, Tools & Automation for Assurance Cases is now the most recent summary. Start here.

For autonomous systems, see below

Please note that many of the papers here are slightly updated from their published form--so it's always best to get my papers from my bibliography page.

What about the Boeing 737 MAX?

I've noted above and below and elsewhere that software assurance for commercial aircraft is remarkably effective: there have been no crashes or serious incidents due to software. "But", people say, "what about the fatal Lion Air and Ethiopian crashes by 737 MAX aircraft?" Although the accident investigations are still underway, it is generally assumed that a MAX feature called MCAS was the primary cause of both crashes.

First of all, let's understand what MCAS does. It stands for "Maneuvering Characteristic Augmentation System" and is often described as an "anti-stall" device. It is not; it is there to deal with uncertifiable handling characteristics that could be the precursor to a stall. It applies only when flying the plane by hand, and is needed only in rather rare circumstances: light load, rear CG (center of gravity), and high AOA.

AOA is "angle of attack": it is the angle at which the air strikes the wing. It is not the same as pitch angle, which is the angle between the wing and the horizon. AOA can be sensed only by having a vane of some kind out there in the wind.

The 737 is a 60-year old design with stubby undercarriage, so it is difficult to attach modern fuel-efficient high-bypass engines without them scraping along the ground. The MAX gets around this by having the engines on long pylons that hold the engine high up and far in front of the wing. As AOA increases, the aerodynamic lift from the cowling around the engine, located far in front of the centers of lift and gravity, generates an upward pitching moment.

There is a certification requirement that when hand-flying the aircraft, the force on the control column required to pull up the nose must not decrease as AOA increases. This is to be demonstrated at fixed thrust levels so there is no change in thrust pitching moment due to engine power. The problem with 737 MAX is that as AOA increases the lift provided by the engine cowling reduces the column pull needed to maintain a steady positive AOA rate. That is not compliant with the certification requirements, so MCAS was added to put in nose-down force during this maneuver by moving the horizontal stabilizer.

The net effect of engine cowling lift and MCAS nose down stabilizer is that the control column force needed to complete the pitch up maneuver does not decrease part way through the range of AOA for which the certification requirement must be demonstrated.

Now, this is a fairly remote part of the flight envelope (hand flying at rather extreme AOA) and Boeing originally designed MCAS to trigger only when the AOA was past some limit and the g-forces (measured by an accelerometer) also. Pretty safe.

The problem is that Boeing later decided to invoke MCAS in less extreme maneuvers and removed the g-force requirement. So now MCAS is triggered off a single AOA sensor and therefore a faulty one can cause MCAS to activate in routine flight. Because MCAS was thought of as "part of the aircraft" rather than part of the flight control system, Boeing did not disclose its details to pilots, nor provide clear instructions on how to diagnose faults and turn it off.

In both the Lion Air and Ethiopia crashes it seems that a faulty AOA sensor caused MCAS to put the nose down soon after takeoff and the pilots were unable to understand or correct it.

Unlike automated stabilizer trim, which can be overridden and disconnected by pulling back on the control column, MCAS just keeps going and resists, since its entire purpose is to increase column force. In fact, it seems MCAS could put in more stabilizer deflection than generally realized and could essentially drive it hard over. Furthermore, the aerodynamic loads on the stabilizer could then be such that it is difficult or impossible to correct the out of trim condition (and unloading the stabilizer requires near-aerobatic maneuvers that are beyond most pilots).

It should now be clear that the MCAS flaws that led to the crashes are entirely due to to faults in its conception and design, and these have nothing to do with software. Just because much of MCAS is implemented in software does not mean that its flaws indicate failures in software development or assurance: as far as we know, the software behaved correctly according to its requirements.

Nothing excuses Boeing's shocking derogation of basic safety engineering and certification practices, but these did not concern software. Just because the jack screw that moves the horizontal stabilizer was found to be fully extended in the Ethiopia crash, that doesn't mean the jackscrew caused the crash: it did what it was supposed to do, just like the software did. If you know the standards, the flaws were not in application of DO-178C (software), but of ARP 4754A (highly integrated or complex systems).

Where I think it is fair to implicate software is that its very existence allows system designs to become increasingly complex. You simply could not have contemplated the complex lipstick on the pig that turned the 737 into the 737 MAX if software had not been available.

So software enables construction of systems whose complex functionality and design outstrips the ability of its developers to fully understand (or face up to) all the hazards that it may present.

But that is not the same as flaws in software or its assurance.

In fact, I think tools similar to those used for software assurance (model checkers, static analyzers) might help the designers of those highly integrated and complex systems get their requirements right.

What About Autonomous Systems?

The analysis summarized above is focused on traditional systems--quintessentially commercial airplanes--where compromises are made to ensure predictability and determinism, even in the presence of faults. It works well: modern airplanes have no incidents due to software (they do have incidents due to poor requirements that are implemented by software, but those faults must be addressed by improved systems engineering, not software engineering).

But what about new systems, such as self-driving cars and "increasingly autonomous" (IA) airplanes where some of the software is generated by opaque methods like machine learning implemented in deep neural nets? I participated in a NASA study on this topic: Considerations in Assuring Safety of Increasingly Autonomous Systems but my thinking has evolved somewhat since then and is described in this paper on Model-Centered Assurance For Autonomous Systems with Susmit Jha and Shankar.

The Big Picture Remains The Same

I think the overall method for assurance and certification should be the same for autonomous systems as for traditional ones. That is:
  1. We develop an assurance case to provide confidence that our system is free of faults
  2. We apply this in the Bayesian worst-case model of Strigini and Povyakalo
  3. So that failure-free experience (initially tests, then real-life) provides confidence that significantly greater exposure will also be failure-free.
There are some differences between IA airplanes and self-driving cars, but they concern arithmetic, not method. For airplanes, the requirement is no catastrophic failure condition in the entire lifetime of all airplanes of one type. That's a total exposure of about 109 hours, so an acceptable failure rate around 10-9 per hour.

For human-driven cars, the fatal accident rate in the USA is a bit over 10 per billion miles. We intend that self-driving cars should be safer than humans, so a target of 5 per billion miles might be reasonable. That is 5×10-9 per mile or 10-7 per hour, if we assume 20 mph. So that's a factor of about 100 less than airplanes. The main impact is that we do not expect the experience in step 3 to be failure-free, but we do expect it to be within the rate that society considers acceptable, and we do not expect our confidence in assurance at step 1 to be as high as for conventional systems.

But At Present It's All Experience and No Prior Assurance

The traditional aspects of automobile software are assured to very high levels using standards such as ISO 26262, but I am focusing on the non-traditional, autonomous aspects.

Here, it seem that confidence is developed by huge quantities of simulation tests and real experience: this approach is sometimes referred to as "collecting miles". The tests need to cover many possible scenarios and so there is much current research in the specification, generation, and curation of scenarios. One difficulty is that of "fat tails": I heard of one set of scenarios where 26 cases covered half of all pedestrian accidents, and the other half required 5,287 cases.

But more fundamentally, the problem with purely "collecting miles" is that with no prior confidence from assurance, exposure of n units (miles, or hours) gives you merely weak confidence in the safety of another n, whereas with prior confidence from assurance you could be good for as many as 10×n.

Assurance For Autonomous Systems: World Model plus Actions

The argument above says that to get useful confidence in the safety of autonomous systems, we must start off with some prior assurance based on scrutiny of its design and construction, for then (and only then) will we get a useful multiplier on the confidence we get by collecting miles.

For traditional software, there are well established and historically effective methods for developing assurance, as documented in guidelines like ARP 4754A and DO-178C (for aircraft) and ISO 26262 (for cars), or as formulated in assurance cases. But, as we know, autonomous systems are full of deep learning and other elements that are not amenable to traditional assurance, so what can we do?

Typically, the deep learning components are concerned with perception: for an autonomous car, this will include finding the traffic lanes, identifying and locating other cars and obstacles around us, reading road signs and so on. The car uses its interpretation and integration of these perceptions in planning its next steps. The exact path from perception to plan seems to be part of the "secret sauce" of each system.

I propose a somewhat different approach: the task of deep learning and other perception components is to build a model of the (local) world. This is not a new idea: Josh Tenenbaum and his colleagues at MIT and NYU say much the same in their paper "Building Machines that Learn and Think Like People" (PDF). However, Tenenbaum is explicitly concerned with human-like AI and learning, whereas I am interested in assurable autonomy. Tenenbaum takes his inspiration from the way the human brain seems to work, while I think that this is a good design, independently of any biological precedents.

The reason I think this is a good design is that it separates the autonomous system into two simpler parts:

  1. One is concerned with building a model of the world
  2. The other is concerned with planning and delivering actions, based on that model
And its assurance also divides into two parts
  1. How good is the model of the world?
  2. How safe are the actions, given the model?
I'm going to make a big assumption: that the action component is traditional software and can be assured in traditional ways. So now we need only to worry about the model construction.

In the words of my former boss, Brian Randell, "we have succeeded in reducing the problem to a previously unsolved problem". That's progress!

Before I discuss assurance for world models, I'll take an excursion to describe a plausible and attractive architecture for constructing these models. This architecture seems to be the one used by the human brain, but that is not the reason I advocate it: as with model building as the goal, I think there are independent reasons for considering this architecture a good one.

Cognitive Architecture for Model Construction

It's not a new idea that a system's actions are driven by a model: that's the way control systems work, and there's a famous paper by Conant and Ashby: Every Good Regulator of a System Must be a Model of that System. (PDF)

What is new is that the model is built by machine learning.

Now part of the inspiration for machine learning and neural nets is to reproduce in simplified form some of the mechanisms that are believed to operate in the human brain. So let us take this a bit further and look a little more deeply into what is known or conjectured about the operation of the brain.

The best current explanation (in my opinion) for how the brain works is called predictive processing (PP) or sometimes predictive coding or predictive error minimization. I'll explain it in terms of a self-driving car.

One thing the autonomy system of the car has to do is locate its traffic lane and stay in it. A camera is one important element in performing this task. Now, the array of numbers indicating light and color intensities at each pixel of the camera field does not directly indicate traffic lanes. Instead, we have to bring the idea of traffic lanes to the image and invent some algorithm (e.g., based on the Hough transform, invented here at SRI) for interpreting it so that it reveals the lanes.

It is rather wasteful to interpret each camera frame ab-initio--the lane in this frame is surely going to be related to the lane in the previous one--so we could seed our lane detector with information gleaned from previous frames and start our interpretation of this frame with our best guess where the lane is going to be. Not only is this computationally more efficient, it is more accurate and robust (e.g., it can deal with a few frames of scuffed or missing lane markings, where an ab-initio algorithm might fail). There will be uncertainty in our detection of the lanes and we should manage this explicitly and record our confidence in lane location as a probability distribution function (pdf).

In its completed form, this approach to lane detection is a Bayesian estimator: we have a "prior" pdf representing our best estimate (based on previous frames) where we expect the lanes to be, we obtain new information (a new camera frame) and we update our estimate according to Bayes' rule to give us the best "posterior" pdf for the lane locations (so it is like a Kalman filter for images).

The parts of the brain concerned with interpretation of the senses work like this (as first proposed by Helmholtz in the 1860s). Contrary to naive expectations of a bottom-up flow of sense data, there are more neural pathways going from upper to lower levels of the brain than vice versa, and this is because predictions are flowing down, and only corrections are flowing up. The upper levels of the sensory interpretation systems of the brain maintain a best guess (i.e., a model) of the way the world is, and the senses and (approximations to) Bayesian estimation provide corrections that minimize prediction error.

There's a rather difficult extended version of predictive processing known as "Free Energy" ("free" as in "available"; it comes from Hamiltonian mechanics). Here's a hagiography of its founder Karl Friston. Free Energy extends the theory to include actions: the brain "predicts" that the hand, say, is in a certain place and to minimize prediction error the hand actually moves there. Prediction error can be partitioned between that in sensing and that in acting. I have a hunch that my proposal for assurance can be cast in this framework by assuming zero action error (because in this specific context, we can verify correct action, given the model, so all prediction error is in the model).

Assurance for Constructed Models of the World

What I propose is that the sense-interpretation part of an autonomous system should be focused on model building, and that PP is a plausible architecture for accomplishing that.

The idea of a model is that it allows us to perform calculations and experiments: i.e., to make predictions about future courses of action. Provided our model (and method of calculating on it) are accurate, our predictions will be accurate and, presumably, safe.

The attraction of PP is that it allows for integration of different "points of view" so that the model likely to be both more robust and more accurate. For example, above the camera interpretation component of the lane detector we could have a component that observes the locations and movements of the cars around us, and above that we may have mapping and GPS information that tells us where the road is, if not the individual lanes. And if we've driven the road before, some of that experience may be recorded in another component. The goal is to develop a federation of components (it does not have to be a linear hierarchy) that collectively constructs a model of the road and its lanes.

The components that contribute to a PP model need not all be based on perception. Some could be based on general knowledge about the world. For example, if a certain item in the scene is classified as a chicken, that component might check that it is less then 2 feet tall, or if another item is said to be a boat, it could check there is water.

How can we develop assurance that the model constructed by an autonomous system accurately reflects the real world? We do not have much insight into its method of construction (other than selecting the training set) since it is produced by components that were generated by machine learning. So it seems the best we can do are tests and experiments. We have to be cautious about tests, since they can turn into collecting miles and will not deliver the independent assurance we need.

One possibility is to start with a modeled world, generate scenarios from that, run the system under test against those, and compare the world model that it constructs against the one we started with. We can then introduce perturbations (e.g., obscured lane markings) into the scenario generation until the constructed model departs from the original and thereby learn something of its robustness.

The first two steps are the same as for collecting miles, but then they run the complete system and check for hazardous outcomes. In effect, they are confounding the actions and the model, whereas we explicitly focus on the model.

Redundancy: Ensembles and Runtime Monitors

Another possibility is to use redundancy. We could build diverse implementations for some of the components: they would use the same sensors and data sources, but different algorithms and different training data.

There are two ways of using the diverse components: one is to "sum" their outputs, the other is to use them as runtime monitors.

Many of the most effective AI systems (e.g., competition winners) use the "summing" approach to create ensembles built from multiple diverse components. One of the attractions of the PP architecture is that it naturally lends itself to this approach. Both within a level and up and down the hierarchy, different systems and different "points of view" are combined to construct a robust and accurate model of the world.

We can imagine that some of the diverse components are explicitly constructed to deliver a different "point of view": for example, we can imagine a "safety conscious" lane detector that is more conservative in its judgments; that is, the main system has a "best guess" model for where the lanes are, whereas the "safety conscious" component has a higher threshold ("certainty") on a weaker model.

Such "safety conscious" components could alternatively be used as runtime monitors rather than folded into the overall PP architecture: that is actions are derived from the regular model but then checked against the safety model.


This is all highly speculative. I do not know if it really is feasible to derive assurance for autonomous systems in the manner described above, but I do think the ingredients are likely to prove useful: the PP architecture, redundancy with diverse "points of view", runtime monitors, and closed-loop model inspection.

Having trouble reading our papers?
Return to John Rushby's bibliography page
Return to the Formal Methods Program home page
Return to the Computer Science Laboratory home page