
Safety-Critical Systems – [Jan 10, 2017]

Software is an essential part of many safety-critical systems. Modern cars and aircraft contain dozens of processors and millions of lines of computer software.

This lecture looks at the standards and guidance that are used when regulators certify these systems for use. Do these standards measure up to the recommendations of a report on Certifiably Dependable Software from the US National Academies? Are they based on sound computer science?

Lecture Date: Tuesday, 10th January 2017 – 6:00pm at The Museum of London, 150 London Wall, London, EC2Y 5HN

No reservations are required for this lecture. It will be run on a ‘first come, first served’ basis.
Doors will open 30 minutes before the start of the lecture.

 

DOWNLOADS:

Lecture Transcript (Word file)
Lecture Slides (PPT)

2 Responses

  1. Nick James

    Dear Professor Thomas,

    First of all, thank you for a fascinating lecture yesterday. I enjoy all of your lectures and yesterday's was particularly thought-provoking. The insights on how we value life, and the contrast between the reality of politics and the course that logic would suggest, were absolutely fascinating.

    The lecture provoked a couple of thoughts which I hope you don’t mind me sharing.

    I wonder if a useful distinction in the approach to safety in critical systems is the distinction between system and operator. You rightly say that, “in the old days”, safety engineering was a matter of looking at dangerous states and seeing what component failures could produce them. This is certainly true, but there was also the additional dimension of operator error. In a large system it is usually theoretically possible for an operator to take a number of actions which will, without any component failure, lead the system to a dangerous state. This can be dealt with by system design (interlocks, state warnings, etc.), but is often also dealt with by process flows and secondary operator checks. Thus, there are two very distinct categories of failure: one caused by the failure of a component of the system, and one caused by misuse of a correctly operating system.

    In software control systems this distinction has become somewhat blurred, as the software has taken over some (if not all) of the functions of the operator. It seems to me that it might be useful to reinstate the distinction, as the two categories may require different approaches.

    Software can fail because of what you might call component failures. A disc can crash, there can be a power failure, or a new set of circumstances can cause a branch of software to be traversed which contains a line of code that is simply incorrect. To some extent this can be dealt with by a classical safety approach: what is the likelihood of such a failure, and what do we need to do to ensure that it does not lead to a dangerous state?

    However, there is also now the issue of the “system acting as operator”. Here, we come to much more nebulous categories, such as:
    the system design failing to accommodate a set of conditions that was not anticipated;
    the system drawing the wrong conclusion from an unusual set of inputs;
    and, of course, the combined effect of a human operator directing the system to do something without realising the consequences of their command.

    These categories are only going to get worse (much worse) with AI and machine-learning-based systems. I would contend that it is simply not feasible to design or test against such errors. To quote the saying, “you don’t know what you don’t know”, so testing against a condition that has not been foreseen is a logical impossibility. Testing a system which, by its nature, is designed to generate new causal links is also logically ridiculous. It’s like saying you can test a baby to make sure that it will never do any harm in society as an adult.

    Rather, I would suggest that a better approach is to go back to outcomes. It is more useful, I think, to try to identify dangerous system states and to guard against those, without really caring how you got there. There is usually a more limited set of states that can do harm, and guarding against these states, or against sequences of actions that might lead to them, is, I think, a more useful approach.
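
    (To make that concrete, here is a minimal, purely illustrative sketch of the kind of state guard I have in mind. The flag names and dangerous combinations are hypothetical, not taken from any real system.)

```python
# Purely illustrative sketch of "guard against dangerous states, regardless of
# how you got there". The flags and dangerous combinations are hypothetical.

DANGEROUS_COMBINATIONS = [
    {"doors_open", "train_moving"},      # hypothetical example
    {"reactor_heating", "coolant_off"},  # hypothetical example
]

def is_dangerous(active_flags):
    """True if the current state contains any known dangerous combination."""
    return any(combo <= active_flags for combo in DANGEROUS_COMBINATIONS)

def apply_command(active_flags, new_flag):
    """Accept a command only if the resulting state is not dangerous.
    The guard does not care whether the command came from a human operator,
    a conventional controller or a machine-learnt component."""
    proposed = active_flags | {new_flag}
    if is_dangerous(proposed):
        raise RuntimeError(f"Interlock: '{new_flag}' refused; it would enter a dangerous state")
    return proposed

if __name__ == "__main__":
    state = {"train_moving"}
    try:
        state = apply_command(state, "doors_open")
    except RuntimeError as err:
        print(err)  # the guard trips regardless of why the command was issued
```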

    Making a recognised distinction between these two failure modes would, perhaps, allow classical safety analysis to do the job it was designed to do, and also allow us to learn lessons from past plant operation, where designers have had to cope with the possibility of novice, hung-over and malicious operators.

    Lastly, on a lighter note, I thought you might be amused by the approach that a colleague of mine suggested adopting to improve confidence in software testing. The release of software is always a scary time. All of one’s tests have passed but, as you pointed out last night, that is absolutely no guarantee that there are no bugs left. His solution to this problem was to get his architect to “seed” 100 bugs into the system. If, say, the testing found 98 of the seeded bugs, and also 1000 other bugs, he maintained that you could reasonably infer that there were about 20 unfound bugs in the main system. Not entirely comforting, but at least it gave a metric on the risk of deploying to live!
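
    (For what it's worth, the arithmetic behind that inference is a simple capture-recapture estimate. A tiny sketch, using only the figures quoted above:)

```python
# Capture-recapture style estimate behind the "seeded bugs" anecdote.
# Only the figures quoted above are used; the method, not the numbers, is the point.

seeded = 100        # bugs deliberately injected by the architect
seeded_found = 98   # of the injected bugs, how many testing found
real_found = 1000   # genuine (unseeded) bugs found by the same testing

detection_rate = seeded_found / seeded              # 0.98
estimated_real_total = real_found / detection_rate  # about 1020
estimated_remaining = estimated_real_total - real_found

print(f"Estimated unfound bugs: {estimated_remaining:.0f}")  # roughly 20
```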

    Very best wishes,

    Nick James.

    1. Dear Mr James

      Thank you for these thoughts and for the encouragement about my lectures.

      The role of the operator or user is an interesting one. Recent academic work in dependability has moved towards regarding most systems as sociotechnical (which means that the operators or users are system components, and their roles and training become the responsibility of the system designers). There are useful papers at http://www.dirc.org.uk/research/DIRC-Results/index.html if you are motivated to follow this up.

      Considering the operator as outside the system makes it easier to ascribe system design faults to “operator error”. This is particularly prevalent in healthcare, where thousands of avoidable injuries to patients every year are really due to badly designed or implemented HCIs. (Look at Harold Thimbleby’s research papers for the evidence and proposed solutions – he’s also a past Gresham professor, so you may find his Gresham lectures interesting – they are online.)

      The injection of bugs before debugging became popular in the 1970s with the name “bebugging”. Weinberg refers to it in The Psychology of Computer Programming – which is a brilliant book. IBM’s Harlan Mills used the same technique in Federal Systems Division in the 1970s. It fell into disuse because (as I recall) it became clear that real bugs and artificial bugs were of a different nature, both in terms of their detection by debugging techniques and in terms of their effect on system reliability.

      Regards

      Martyn
