
No Safety without (Cyber-)Security!


It’s a common experience: I talk to people developing safety-critical embedded systems, be it cars or medical devices, and, while clearly serious about product safety, they show little interest in security. A great example was when, as part of a delegation of Australian academics, I visited an automotive testing facility in another country. They showed us their impressive driving range, which featured all imaginable road conditions, from unsealed tracks via city roads with pedestrians and public transport to 200 km/h freeways. They proudly talked about all the safety recalls they had triggered, and how they were now focussing on autonomous driving. I asked how much of the testing was about cyber-security. The answer was “nothing”.

This is crazy, given the spate of car-hacking attacks over the past several years, which are obviously a huge safety problem! But those safety testers are not alone – the attitude is common across the car industry, and not much different in the space of medical implants. Most of these systems have serious security vulnerabilities, which are typically a result of mushrooming functionality and hence complexity. Complexity is always the arch-enemy of security, as most security failures result from unforeseen interactions of features. At the same time, features drive sales, so the more of them the better.

This even applies to the kind of devices where we do not typically think of “features”, such as medical implants. Consider a heart pacemaker. Decades ago that was an (electronically) simple device that had a sensor or two, a control loop of probably at most a few hundred source lines of code (SLOC), and an electrode (actuator). That was enough to keep a person alive as long as they behaved carefully, within a relatively constrained operating envelope. But there was a clear motivation for doing better, to expand the operating envelope (for instance, allowing the wearer to exercise). So the number of sensors increased, as did the complexity of the control logic. But the expanded envelope required tuning for patient specifics, meaning adjusting operating parameters. That requires an external interface to allow the physician to access performance data and update settings. This needs to happen without opening up the patient, so it needs a radio interface.

Before you know it, you have device drivers for radios, and complex protocol stacks for providing communication over a lossy wireless medium. We are now easily into hundreds of kSLOC of software complexity. This communications software will have bugs that can be exploited by an attacker. And once the attacker is on the device, they can interfere with its operation and harm, possibly kill, the patient (while leaving no traces).

But these devices have to undergo rigorous evaluations to be certified for safety, right?

Well, sort of. The reality is that these safety certification regimes do not really consider systematic attacks (remember the automotive testing facility?). How is that possible, given the strong and sustained safety focus in areas where devices can maim or kill?

I believe that at the root of the problem are attitudes, ways of looking at the problem. Traditional safety engineering is about avoiding faults and minimising their impact, and is based on some core assumptions:

  1. faults are random, and the risk they pose can be expressed as a probability;
  2. faults are rare, i.e. the probability can be made small;
  3. faults are independent, meaning that the probability of multiple faults is the product of the probabilities, and therefore very small.
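These assumptions can be made concrete with a bit of arithmetic. The numbers below are made-up illustrative probabilities, not real failure-rate data, but they show why the classical safety argument feels so comfortable:

```python
# Illustrative only: hypothetical failure probabilities for physical components.
p_sensor = 1e-4    # assumed chance the primary sensor fails in a given interval
p_backup = 1e-4    # assumed chance the backup fails in the same interval

# Assumption 3 (independence): the joint probability is the product,
# so a simultaneous double fault looks vanishingly unlikely.
p_both = p_sensor * p_backup
print(p_both)  # around 1e-08
```

The whole comfort rests on that multiplication being legitimate, which, as argued below, it is not for software.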

These assumptions are appropriate for physical artefacts. The problem is that, when dealing with software, these assumptions are all wrong!

Firstly, software behaviour is never random; it is highly deterministic. This means that if a certain sequence of events triggers a fault, the same sequence will trigger the fault again. If an attacker knows that sequence, they can trigger the fault at will.

Secondly, software faults are not rare at all. The rule of thumb is 1–5 faults per kSLOC; with unusually well-engineered software it may get as low as 0.3 faults/kSLOC. But with hundreds of kSLOC in your system, you have dozens, more likely hundreds, of faults.
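To put that rule of thumb into numbers, here is a back-of-envelope estimate, assuming a hypothetical 300 kSLOC software stack (the size is my illustration, not a measured figure):

```python
ksloc = 300                  # assumed size of the system's software
best, typical = 0.3, 5.0     # faults per kSLOC: exceptional vs. rule of thumb

low_estimate = ksloc * best
high_estimate = ksloc * typical
print(low_estimate, high_estimate)  # roughly 90 to 1500 latent faults
```

Even under the most optimistic defect density, the system ships with dozens of faults on board.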

And software faults are definitely not independent. If you know how to trigger one fault, you can combine it with the next one you find. And that means you can daisy-chain exploits to get deeper and deeper into the system. This also applies to systems that, at first glance, seem to employ physical separation. For example, cars have multiple networks (traditional CAN busses as well as wireless networks) for different purposes, where the critical control functionality is on a separate network from infotainment and other less critical parts. However, there are devices that connect to multiple busses and therefore act as gateways. Once an attacker has intruded into an externally facing network, they can attack the gateway and work their way to the place where they can take over the vehicle. These days, physical separation is basically an illusion, except in some military settings.
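The gateway problem can be sketched as simple graph reachability. The topology below is hypothetical (the node names are invented for illustration, not taken from any real vehicle): once any single device bridges two networks, the “separated” critical bus becomes reachable from the outside world.

```python
from collections import deque

# Hypothetical in-vehicle network topology; edges are "can talk to".
links = {
    "cellular":     ["infotainment"],   # externally facing wireless interface
    "infotainment": ["gateway"],        # head unit sits on the gateway's bus
    "gateway":      ["critical_can"],   # gateway bridges to the control network
    "critical_can": [],
}

def reachable(start, links):
    """Breadth-first search: every node an attacker at `start` can reach."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in links[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable("cellular", links)))  # the critical bus is in the set
```

The point of the sketch: separation is a property of the whole graph, and a single bridging node quietly collapses it.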

So, the bottom line is that with systems containing significant amounts of software (that’s about all of them nowadays), safety must be looked at as a security problem. This requires a radical change of developer mindset. Instead of thinking about the likelihood of a fault, one has to assume that there is a fault, unless there’s conclusive proof of the opposite. And when we are looking at a 100 kSLOC network stack, we know there is a fault (almost certainly many). And, instead of thinking that the fault has some random effect (such as overwriting a random memory location) we have to assume it has the worst possible effect — overwriting the most critical memory location with the most damaging value. Because that’s exactly what an attacker will do.

So the focus must be on preventing those faults from becoming fatal. In particular, we have to assume the enemy is on the device, and must keep them away from the really critical assets. In the pacemaker example, we have to assume that the attacker hijacks the network stack and can execute arbitrary code, and we must therefore prevent interference with the life-critical control loop.

Obviously, the fundamental requirement here is isolation, enabling security by architecture: components that are not fully trustworthy (meaning anything that is more than a few hundred SLOC or has not undergone a thorough security assurance process) must only be able to interact with critical (and presumably highly assured) components via well-defined and small interfaces. And the isolation must be provided by an OS that is as highly assured as the critical components: if you can compromise the OS, the interfaces can be bypassed.
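As a toy illustration of that principle (the names and the safe range are invented, and in a real system the isolation would be enforced by the OS, not by a language construct): the untrusted radio stack only ever sees one narrow interface to the critical component, and every input is validated at that boundary.

```python
class ControlLoop:
    """Stand-in for the highly assured component owning life-critical state."""

    def __init__(self):
        self._rate_bpm = 60

    # The ONLY operations exposed across the isolation boundary:
    def read_rate(self):
        return self._rate_bpm

    def set_rate(self, bpm):
        # Validate at the boundary: reject anything outside a safe envelope
        # (the 30-180 range is illustrative, not a medical specification).
        if not 30 <= bpm <= 180:
            raise ValueError("rejected: pacing rate outside safe envelope")
        self._rate_bpm = bpm

loop = ControlLoop()
loop.set_rate(75)        # a legitimate physician update succeeds
try:
    loop.set_rate(500)   # a hijacked radio stack tries a lethal setting
except ValueError as e:
    print(e)             # the narrow interface refuses it
```

The small, validated interface is what makes the critical component's assurance argument tractable; everything behind it can be assumed hostile.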

People who know me will not be surprised when I say that the seL4 microkernel is a great choice here: it is the world’s first OS kernel with a mathematical proof of implementation correctness (and security properties based on that). And it is by far the best-performing microkernel (verified or not).

In any case, you need an OS that you can trust to enforce isolation, and an architecture that uses the OS-enforced isolation to protect the critical parts of the system. In other words, a security architecture enforced by a secure OS. Without that, there will be no safety.

This post originally appeared on 2020-11-05 on the ACM SIGBED blog.

One Comment
  1. Michael von Tessin permalink

    Fully agree!
    I know first-hand that what you wrote also applies to the hearing instrument and cochlear implant industry.
    In my current job, I am trying to solve exactly those problems for hearing device firmware. A lot of work ahead!
