CDNLive EMEA: Do You Know What a FIT Is?

12 May 2016

ADAS changes everything about automotive. It requires a whole range of skills that the automotive ecosystem has not needed before: video and image processing, deep learning, high-bandwidth networking. One area that I hadn't really thought through was brought home to me by Adam Sherer's presentation on functional safety at CDNLive EMEA. Traditionally, automotive semiconductors have been designed in trailing-edge processes such as 130nm or 90nm, where there are already years of experience qualifying the silicon for the hostile environment of a vehicle, with its extended temperature ranges, stricter safety and reliability requirements, and so on.

At an earlier presentation during the conference, Matthias Wenzel of Renesas talked about their latest ECU and how ADAS is changing the way cars are architected. Infotainment and instrumentation are merging fast, but ADAS is still a separate ECU. ADAS is heavily influenced by Europe, so Renesas, which of course is a Japanese company (formed from the semiconductor businesses of Hitachi, Mitsubishi and NEC), has the business unit in Europe. He expects that in the long term people will want ADAS as a software-enabled feature, not a black box added to the car. In the meantime, the Renesas R-Car H3 automotive computing platform is the first 16nm automotive-compliant SoC designed for performance and functional safety.

Adam reiterated the same message: the performance requirements mean that leading-edge processes need to be used, but the functional safety requirements mean that the process cannot simply be used as if the chip were going into a cellphone. It is not the end of the world if we need to reboot our cellphone, but our automatic braking needs to be handled differently.

Functional safety is measured in FITs. FIT stands for "failures in time" and is the number of failures per billion hours of operation. The requirement for the whole car is perhaps FIT < 10, so fewer than ten failures in a billion hours of vehicle operation. This is then hierarchically partitioned out to the various subsystems, 1 FIT here, 1 FIT there. By the time you get down to an individual SoC, the requirement may be FIT < 0.1. The challenge is that 28nm has a FIT of around 500 without any additional error handling, such as error-correcting codes or redundancy. And 16nm is even worse.
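To make the arithmetic concrete, here is a minimal Python sketch of the numbers above. It assumes, as a simplification, that all the safety mechanisms can be lumped into a single coverage fraction that scales the raw failure rate. The function names are mine, not from the presentation or from any standard.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per billion device-hours


def expected_failures(fit, devices, hours_each):
    """Expected number of failures across a fleet for a given FIT rate."""
    return fit * devices * hours_each / HOURS_PER_FIT_UNIT


def required_coverage(raw_fit, target_fit):
    """Fraction of failures that must be corrected or detected
    to bring raw_fit down to target_fit (simplified single-number model)."""
    return 1.0 - target_fit / raw_fit


def residual_fit(raw_fit, coverage):
    """FIT left over once the given fraction of failures is handled."""
    return raw_fit * (1.0 - coverage)


if __name__ == "__main__":
    raw, target = 500.0, 0.1  # figures quoted in the text
    cov = required_coverage(raw, target)
    print(f"required coverage: {cov:.4%}")
    print(f"reduction factor:  {raw / target:.0f}x")
    print(f"residual FIT:      {residual_fit(raw, cov):.3f}")
```

With the figures quoted above, getting from a raw FIT of 500 to a FIT of 0.1 means correcting or detecting 99.98% of failures, which is a 5000x reduction.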

There are two problems with leading-edge processes. One is that there isn't the same experience of using the process for years and so having everything characterized. Car electronics have to last for 20 years, so aging is an important part of that. The other is that smaller transistors are more vulnerable to single-event effects (SEEs), such as upsets caused by high-energy neutrons. These are inevitable since we live on a radioactive planet bombarded by cosmic rays.

The way to address this is to use failure mode and effects analysis (FMEA) and failure modes, effects, and diagnostic analysis (FMEDA) to reduce the FIT by several orders of magnitude. This is a four-step process:

Continue FMEA and FMEDA analysis, creating smaller FITs for each function down to block level
Transform failure modes into safety goals through error detection
Run functional verification campaign for SoCs to validate that functionality goals have been achieved
Run safety verification with statistical fault sampling based on allowable fault rates to validate that safety goals have been achieved
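The last step samples faults rather than injecting every possible one. The presentation did not say how the sample size is chosen, so the following is only a sketch using a standard textbook sample-size formula for estimating a proportion from a finite population. The function name, the 95% confidence default (z = 1.96), and the worst-case p = 0.5 are my assumptions.

```python
import math


def fault_sample_size(population, margin, z=1.96, p=0.5):
    """Number of faults to inject so that the measured proportion
    (for example, the detected fraction) is within +/- margin of the
    true value at the confidence level implied by z.

    population: total number of candidate faults in the design
    margin:     acceptable error as a fraction, e.g. 0.01 for 1%
    p:          assumed true proportion; 0.5 is the worst case
    """
    n = population / (1 + margin**2 * (population - 1) / (z**2 * p * (1 - p)))
    return math.ceil(n)


if __name__ == "__main__":
    for margin in (0.05, 0.01, 0.001):
        n = fault_sample_size(1_000_000, margin)
        print(f"margin {margin:.1%}: inject {n} of 1,000,000 faults")
```

The point of the sketch is that a tighter allowable error pushes the sample size up quickly, which is why the sampling has to be tied to the fault rate you are trying to demonstrate.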

A separate issue is that the chip development process uses a lot of tools (I believe Cadence supplies some of them!) and so it is also necessary to do tool confidence level analysis (TCL, nothing to do with the scripting language).

Adam is the product manager of a very old product called VeriFault. Before scan testing became standard, manufacturing test would use the functional simulation vectors, perhaps augmented with some additional vectors. The potential test vectors would be evaluated by running a fault simulation to see what percentage of faults would be "detected" in the sense that some output vector would fail to match. In this terminology, which operated at the gate level, a fault was modeled as a signal being stuck-at-0 (so 0 no matter what the cell driving it tried to do) or stuck-at-1. As scan test got more powerful, sales of VeriFault declined, partly because designs were too large for gate-level simulation and partly because scan test generally didn't require running a separate fault simulation. Then sales started to pick up.
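For anyone who has never seen stuck-at fault simulation, here is a toy sketch in Python. The three-gate circuit, the net names, and the functions are all made up for illustration and have nothing to do with how VeriFault is implemented. A fault counts as detected if at least one test vector produces an output that differs from the fault-free circuit.

```python
from itertools import product

# Hypothetical circuit: out = (a AND b) OR (NOT c)
NETS = ["a", "b", "c", "n1", "n2", "out"]


def simulate(vector, fault=None):
    """Evaluate the circuit for one input vector (a, b, c).

    fault is None for the good circuit, or (net, stuck_value) to
    force that net to 0 or 1 regardless of what drives it.
    """
    def drive(net, value):
        if fault is not None and fault[0] == net:
            return fault[1]
        return value

    a = drive("a", vector[0])
    b = drive("b", vector[1])
    c = drive("c", vector[2])
    n1 = drive("n1", a & b)
    n2 = drive("n2", 1 - c)
    return drive("out", n1 | n2)


def fault_coverage(vectors):
    """Return (detected, total) over all stuck-at-0/1 faults on every net."""
    faults = [(net, value) for net in NETS for value in (0, 1)]
    detected = sum(
        any(simulate(vec, fault) != simulate(vec) for vec in vectors)
        for fault in faults
    )
    return detected, len(faults)


if __name__ == "__main__":
    print(fault_coverage([(1, 1, 1)]))                      # one vector
    print(fault_coverage(list(product((0, 1), repeat=3))))  # exhaustive
```

In this toy circuit a single vector catches only 4 of the 12 faults, while the exhaustive set of eight vectors catches all 12. Real designs cannot be tested exhaustively, which is why the coverage number mattered.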

The reason was that the same issue underlies making automotive chips more resilient. If a transistor fails, will we notice or will the system just misbehave in a random manner? I have seen estimates that half the "blue screen of death" crashes on old PCs might have been due to badly handled SEEs. Certainly Cisco came to the conclusion that many mysterious crashes of high-end routers were. But in a car, where a blue screen of death might involve an actual death, that is unacceptable. Errors either need to be corrected (such as by memory ECCs) or at least detected and have the SoC default to some safe behavior.

The same basic approach is used in the Incisive Functional Safety Simulator. Faults are chosen and then the simulation is run. Since typically the place where the fault first shows up is an internal pin inside the SoC, it is good to know two things. First, did the fault get out of the logic block and second, was it either corrected or detected (and signalled at one of the SoC pins)? Another tool called Incisive vManager is used to run the fault campaign, applying all the faults one after another, and accumulating and displaying the results.
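The bookkeeping for such a campaign can be sketched in a few lines of Python. This is not the Incisive or vManager API, just a hypothetical illustration of the two questions above: each injected fault is recorded as three booleans (did it propagate out of the block, was it corrected, was an alarm raised), and the category names are my own.

```python
from collections import Counter


def classify(propagated, corrected, alarm_raised):
    """Bucket one injected fault by what happened to it."""
    if not propagated:
        return "masked"      # never left the logic block
    if corrected:
        return "corrected"   # e.g. fixed by ECC
    if alarm_raised:
        return "detected"    # signalled so the SoC can go to a safe state
    return "undetected"      # the dangerous case


def summarize(results):
    """Accumulate a campaign: results is an iterable of
    (propagated, corrected, alarm_raised) tuples, one per fault."""
    counts = Counter(classify(*r) for r in results)
    escaped = counts["corrected"] + counts["detected"] + counts["undetected"]
    handled = counts["corrected"] + counts["detected"]
    coverage = handled / escaped if escaped else 1.0
    return counts, coverage


if __name__ == "__main__":
    campaign = [
        (False, False, False),
        (True, True, False),
        (True, False, True),
        (True, False, False),
    ]
    counts, coverage = summarize(campaign)
    print(dict(counts), f"coverage of propagated faults: {coverage:.1%}")
```

Here coverage is computed only over faults that actually escaped the block, which is one possible convention. A real flow would follow whatever metric definitions the safety plan calls for.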

The other aspect is TCL (tool confidence level) analysis. There are two approaches that are acceptable under ISO 26262 for doing this:

Certification
"Fit for purpose"

Just to be confusing, the "fit" in "fit for purpose" is just the regular English word and has nothing to do with the FITs discussed earlier. Certification is a rigorous procedure that is done on named tools with named versions and a constrained flow. Any change (such as a bug fix) requires recertification, so that approach is not going to work for something as dynamic as EDA tools for SoCs. Both approaches can achieve what is called TCL1 compliance with ISO 26262.

Fit for purpose, which is better for SoC design tools:

Documents safe use cases with user and vendor responsibilities
Is relatively simple when it comes to adding additional use cases
Is enabled by a report from an auditor
Is flexible across tool releases (typically a range of version numbers is OK)

One thing I learned at Virtutech, where we were involved in aerospace and automotive, was that tools are divided into two classes. Tools that don't affect the source code, such as verification, are in one group. The second group is tools that could potentially inject a problem into the chip, such as synthesis. When I was VP of engineering at Ambit, we split bugs into functional bugs (the netlist did something different from the RTL) and other bugs (the negative slack was huge, or the tool crashed). The above diagram shows the Cadence flow split up, with tools in black in the first category (they may fail to detect an error) versus the tools in grey (these might inject one).

Cadence is now working with TÜV SÜD to obtain compliance reports for four flows, with completion expected by the end of Q3:

Digital D&V
Digital implementation
Analog/mixed signal
PCB

Functional safety is changing from what it was when automakers had the luxury of not having to go near leading-edge processes. Now, achieving a FIT < 0.1 from a process that inherently starts at 500-1000 requires a multi-pronged approach. It involves dealing with every potential problem, verifying that it is correctly handled, and even verifying the tools that did the verification (and the creation).
