"The Safest Train Is One that Never Leaves the Station"

25 Apr 2017 • 7 minute read

The Sunday and Monday at IRPS are tutorial days, with multiple tracks. On Monday I attended the automotive track. IRPS has historically been very focused on device reliability, things like device aging and single-event effects. However, automotive is pushing reliability up to the system and software levels. Semiconductor companies are salivating since their biggest market, mobile, is flattening.

Andreas Aal of Volkswagen

Andreas Aal of Volkswagen pointed out that VW shipped over 36B automotive semiconductors last year. As part of the overall semiconductor business, automotive is less than 10%, but it is the fastest growing segment. By 2020, 15% of all connected devices will be in cars. He thinks (and he works for VW, so probably not an objective opinion!) that the key driver for semiconductors was the PC, then the smartphone, and next will be the smart car.

But the numbers back him up. ADAS is growing at 23% and expected to reach $60B by 2020. Automotive connectivity is growing even faster, at 33% and should reach $141B by 2020.

He covered a huge amount of material, more than I can possibly hope to summarize in a single post. He had 110 slides, some including integral signs and logarithms. But I'll point out a few somewhat random key points:

Quality control becomes reliability control once stress is applied
30-50% of electronic system failures in cars are related to a change (the initial electronics worked). Change management is thus a key issue. Changes to a leadframe or mould compound may seem trivial and benign but can have long-term reliability effects
Below 45nm, even when AEC qualified, semiconductor reliability is not assured
There is a lot of specific testing required beyond the silicon, especially packaging under mechanical and thermal stress, along with analysis of the degradation of the package over the lifetime of the vehicle

Karl Greb of NVIDIA

Karl gave an introduction to ISO 26262. He emphasized that he had 90 minutes, whereas an introduction to ISO 26262 from a good training company is 20 to 40 hours of classwork. So we were not going to become experts (and you are not going to become an expert just from reading this post). Also, given that this was at IRPS, he focused on hardware development, not software.

He opened with a quote from the 19th century, where the attribution is unknown: the safest train is one that never leaves the station. The point being that risks can never be reduced to zero. There are two ways to reduce risk: reliability engineering and functional safety engineering. To show the difference, he discussed a motor winding that may overheat and cause a hazard. Reliability engineering would be designing the winding to be more resilient to over-temperature conditions. Functional safety engineering might add a temperature sensor to detect over-temperature and switch off the motor.

In more detail:

Risk can never be reduced to zero
For each application, there is a non-zero acceptable risk level
Avoid failures that can be avoided
Detect failures that cannot be avoided, and transition to a safe state to avoid harm
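The motor-winding example can be sketched in a few lines. This is only an illustration of the functional safety idea (detect the condition, then move to a safe state), not anything Karl showed. The `Motor` class, the `monitor_step` function, and the 150°C limit are all hypothetical names and values I made up for the sketch.

```python
from dataclasses import dataclass

# Hypothetical limit for illustration only; a real limit would come from
# the winding's specification and the safety analysis.
OVER_TEMP_LIMIT_C = 150.0


@dataclass
class Motor:
    running: bool = True

    def switch_off(self) -> None:
        # In this sketch the safe state is simply "motor de-energized".
        self.running = False


def monitor_step(motor: Motor, winding_temp_c: float) -> bool:
    """Check one temperature sample; return True if the safety mechanism tripped."""
    if winding_temp_c >= OVER_TEMP_LIMIT_C:
        motor.switch_off()
        return True
    return False


if __name__ == "__main__":
    motor = Motor()
    for sample_c in (90.0, 120.0, 155.0):
        tripped = monitor_step(motor, sample_c)
        print(sample_c, "tripped" if tripped else "ok")
```

The reliability engineering alternative would not show up in code at all: it would be a winding designed to tolerate the higher temperature in the first place.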

There are two broad classes of faults, systematic and random. A systematic fault is deterministic and always occurs under the same conditions. The focus is on fault avoidance. A random fault is non-deterministic and may be described with a probability distribution. The focus is on fault detection.
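To make "described with a probability distribution" concrete, here is a minimal sketch assuming the simplest model, a constant failure rate (exponential distribution), with the rate expressed in FIT (failures per billion device-hours). The talk did not give this formula; the function name and the example numbers are my own assumptions.

```python
import math

HOURS_PER_FIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def failure_probability(fit: float, operating_hours: float) -> float:
    """Probability of at least one random failure within operating_hours.

    Assumes a constant failure rate, so P = 1 - exp(-rate * t).
    """
    rate_per_hour = fit / HOURS_PER_FIT
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-rate_per_hour * operating_hours)


if __name__ == "__main__":
    # Hypothetical part: 100 FIT over 8,000 operating hours.
    print(f"{failure_probability(100, 8000):.4%}")
```

With those made-up numbers the probability comes out at roughly 0.08%. A systematic fault has no such curve: under the triggering conditions it happens every time, which is why the focus there is on avoidance.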

A History of Functional Safety

Going back far enough, in an agrarian society there was poor record keeping, and low population density. There were also high accident rates with poor identification of reasons. There were no systematic attempts to improve safety.

Then came the industrial revolution. Population densities increased as people moved to the cities. Record keeping improved. There was a realization of extreme injury rates to workers. Unions were one of the key forces improving worker safety.

Early aircraft had catastrophic failure rates, often attributed to pilot error. Early rocketry suffered similar or worse failure rates, but since the rockets were unmanned, humans were taken out of the equation. Attention started to shift from people to equipment.

In the late 1950s and early 1960s, the space race began. "Failure is not an option." This was the birth of system safety, and some of the key analysis techniques were developed: failure mode and effects analysis (FMEA) and fault-tree analysis (FTA).

Two big industrial accidents in the 1970s, Flixborough in the UK and Seveso in Italy, led to the UK COMAH (Control of Major Accident Hazards) rules and the EU Seveso Directives. In 1998, this led to IEC 61508 Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems. This was a general standard, not specific to any particular application, but derived from industrial approaches. It covered software and also introduced probabilistic evaluation of safety. It became the mother standard for the later standards (including ISO 26262, but we are getting ahead of the story).

The first derivative standards were for industrial process (oil and gas), industrial machinery, nuclear, and rail, introduced between 2001 and 2005. There were problems taking this approach and generalizing it to cars:

Hazard analysis and risk assessment (HARA) does not include the notion of controllability by the driver
Safety functions are not easily distinguished from the primary control system
Lack of normative approach to setting the level of required risk reduction
No support for the tiered (OEM, tier-1, tier-2...) supply chain used in automotive (see diagram below)

ISO 26262

The first edition of ISO 26262 was released in 2011, based on development that started in 2006. It completely replaces IEC 61508 and is not a child standard like the early ones for industrial and rail. It separately addresses vehicle (which it calls an "item"), system, hardware, and software. It covered just the electrical and electronic systems of production cars under 3500kg. It did not cover hydraulic and mechanical systems, specialist vehicles like Formula 1 race cars, trucks, buses, motorcycles, or off-road vehicles.

There are detailed rules as to how the vehicle (item) is broken down hierarchically and then different parts of the standard applied.

The standard defines Automotive Safety Integrity Levels (ASIL) based on the severity of the hazard, exposure to the hazard (the probability of occurrence), and controllability by the driver to mitigate the hazard. Safety goals are set to achieve the necessary risk reduction. The ASIL levels are shown in the table below:
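As a rough sketch of how the three classifications combine, the snippet below assumes the commonly cited form of the ISO 26262 table: severity classes S1 to S3, exposure classes E1 to E4, controllability classes C1 to C3, and the well-known shortcut that the sum of the three class numbers picks the level. Those class ranges and the shortcut are not from the talk, the `asil` function is a hypothetical helper, and the standard itself remains the authority.

```python
def asil(severity: int, exposure: int, controllability: int) -> str:
    """Map S1-S3, E1-E4, C1-C3 class numbers to QM or ASIL A-D.

    Uses the sum shortcut: 10 -> D, 9 -> C, 8 -> B, 7 -> A, otherwise QM
    (quality management only, no ASIL assigned). The "0" classes
    (S0, E0, C0) are deliberately left out of this sketch.
    """
    if not (1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3):
        raise ValueError("expected S in 1..3, E in 1..4, C in 1..3")
    total = severity + exposure + controllability
    return {7: "ASIL A", 8: "ASIL B", 9: "ASIL C", 10: "ASIL D"}.get(total, "QM")


if __name__ == "__main__":
    print(asil(3, 4, 3))  # worst case on all three axes
    print(asil(3, 4, 2))  # same hazard, but easier for the driver to control
    print(asil(1, 2, 1))  # minor, rare, easily controlled
```

Only the worst case on all three axes lands on ASIL D; relaxing any one axis by a single class drops the result one level.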

For hardware, there is a lot that defines how the requirements are pushed down the hierarchy. The standard does not assume a clean slate with no history: existing parts, especially "simple" components, can be used. There are four compliance options for hardware:

In-context development, the default option, where the hardware is developed along with the functional safety approach. Most true hardware has lower level components, so this typically applies to SoCs/ASICs
Hardware qualification, integrating "simple" hardware into ISO 26262-compliant products (this was intended to cover basic components like resistors but apparently "simple" is not defined rigorously)
SEOOC (or SEooC), which stands for Safety Element out of Context: safely reusing system elements in the whole vehicle context. Semiconductor IP seems to fit in here, where the default assumption has to be that it might be used in a safety-critical context
Proven in use...what it says. Justifies use based on operational history

ISO 26262 2nd Edition and Beyond

One change is that the first edition uses the term "hardware qualification" but it means something different from when we talk about qualification or qual in the semiconductor world. It will become "hardware evaluation."

ISO standards have a three-year cooling-off period for the new standard to be used and issues to emerge. Work on the second edition of ISO 26262 started in early 2015. In the meantime, there have been four important intermediate publications:

SAE J2980 Considerations for ISO 26262 ASIL Hazard Classification (2015)
ISO/PAS 19695 Motorcycles—Functional Safety (2015)
ISO/PAS 19451 Application of ISO 26262:2011-2012 to Semiconductors (2016)
SAE J3061 Cybersecurity Guidebook for Cyber-Physical Vehicle Systems (2016)

By the way, the reason motorcycles have their own standard is that it only makes sense to put minimal effort into improving motorcycle electronics when users already accept such an elevated risk. The easiest way to improve motorcycle safety is to use a car instead (riding a motorcycle for six miles carries a risk of one micromort; for the same risk you can go 230 miles by car, or 10,000 miles by commercial airliner).
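The micromort figures make for easy arithmetic. Here is a minimal sketch using only the three numbers quoted above; the function names are hypothetical.

```python
# Miles travelled per one micromort, as quoted in the text above.
MILES_PER_MICROMORT = {"motorcycle": 6, "car": 230, "airliner": 10_000}


def micromorts(mode: str, miles: float) -> float:
    """Risk, in micromorts, of travelling the given distance by this mode."""
    return miles / MILES_PER_MICROMORT[mode]


def relative_risk(mode: str, baseline: str = "car") -> float:
    """Per-mile risk of a mode as a multiple of the baseline's per-mile risk."""
    return MILES_PER_MICROMORT[baseline] / MILES_PER_MICROMORT[mode]


if __name__ == "__main__":
    print(micromorts("motorcycle", 60))               # a 60-mile ride
    print(round(relative_risk("motorcycle"), 1))      # motorcycle vs. car, per mile
```

On these figures a motorcycle is about 38 times riskier per mile than a car (230 / 6), which is the "elevated risk" that riders have already accepted.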

The draft of the second edition is planned to be available in 3Q 2017, with publication in 1Q 2018. Some of the changes are to incorporate the above publications (the new standard will not just apply to cars, but also trucks, buses, and motorcycles). Although it will partially cover cybersecurity, that is a topic of its own, and work is ongoing to take the last of the above publications and build it into its own ISO standard.

A new concept is SOTIF, safety of the intended functionality. This addresses safety when there isn't a failure. There is inherent error present in sensors such as cameras and lidar, and in their associated neural-network-based control systems. Work on this started in 2016 on a two-year schedule under ISO/PAS 21448.