Texas Instruments on Automotive Reliability

30 Oct 2018 • 8 minute read

Recently, I seem to have been running into people from Texas Instruments (TI) talking about various aspects of automotive reliability. I'm going to try and summarize three presentations that I attended in this post:

Functional Safety Fault-Injection Journey with Cadence Safety Solution, by Prashantkumar Sonavane, presented at CDNLive India.
Automotive IP: Design Methods for Robust Aging Assessment and Validation, by Frank Cano, presented at TSMC's OIP Symposium.
Estimation of Electromigration Reliability by Shane Stelmach, also from OIP.

Frank's and Shane's papers each won the best-paper award for their tracks at OIP (as voted on by the attendees), and Prashant's paper won the best-paper award for the verification track at CDNLive India. So congratulations to all three (and their Cadence co-authors), and all the more reason to read the rest of this post.

Reliability is measured in FITs. FIT stands for Failures In Time, and 1 FIT is one failure per billion device-hours of operation. Since a billion hours of operation is over 100,000 years, no single device will operate that long; it is a statistical measure. If there are a million cars, then 1 FIT means one failure in the fleet per thousand hours of operation. If cars are used 10-15% of the time (roughly 900 to 1,300 hours per year), then this translates into approximately one failure per year in a fleet of a million vehicles.
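The fleet arithmetic can be checked with a few lines of Python. This is just a sketch of the calculation in the paragraph above; the function name and the assumption of one device per vehicle are mine.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def fleet_failures_per_year(fit, fleet_size, duty_cycle):
    """Expected failures per year across a fleet.

    fit        -- failures per 1e9 device-hours of operation
    fleet_size -- number of vehicles (assumed: one device per vehicle)
    duty_cycle -- fraction of wall-clock time the vehicle is in use
    """
    operating_hours = fleet_size * HOURS_PER_YEAR * duty_cycle
    return fit * operating_hours / 1e9


low = fleet_failures_per_year(fit=1, fleet_size=1_000_000, duty_cycle=0.10)
high = fleet_failures_per_year(fit=1, fleet_size=1_000_000, duty_cycle=0.15)
print(f"{low:.2f} to {high:.2f} failures per year")
```

Both ends of the range land close to one failure per year, which is where the rule of thumb comes from.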

These three presentations address different aspects of the automotive reliability problem. Functional safety involves injecting potential faults into the chip, and seeing which ones are correctly handled. Aging is something that has come along relatively recently since transistors didn't really age until about 28nm, and the issue got more important with FinFETs, which are subject to a lot more self-heating due to the physical configuration. Electromigration is a phenomenon where the current literally moves the atoms of the interconnect metal in the direction of the electron flow. Unfortunately, there is a positive feedback effect, so that when a metal line gets thinner, the current density goes up, and so electromigration gets worse, and potentially metal will open causing a failure.

Oversimplifying, the chip can fail because a transistor fails (the second presentation) or because the interconnect fails (the third presentation). Even in the face of these failures (and others, such as software bugs), automotive chips should detect and handle the problem where possible (the first presentation).

Functional Safety

Functional safety (often abbreviated to FuSa) is the absence of unacceptable risk due to hazards caused by malfunctioning behavior of electrical and/or electronic systems. Since semiconductors are less reliable than the requirement for automotive reliability, it is not possible to simply design-in zero risk components. Instead, the philosophy has to be "detect and be safe".

There are two failure modes: permanent faults that remain until the vehicle is repaired (such as a metal line opening), and transient faults that occur and subsequently disappear (such as a bit flip caused by an alpha particle).

Analyzing the functional safety of an integrated circuit consists of considering all the things that might go wrong, and seeing how the system handles them (and redesigning if appropriate). This is known as FMEDA, which stands for Failure Modes, Effects, and Diagnostic Analysis. The approach is similar to the old way that we did IC test before scan-test became dominant. We would consider each gate as being stuck-at-0 or stuck-at-1, and then see which of those failures was detected by the test vectors (this is an oversimplification). In this context, "detected" means that the output from the IC was different from the "known good simulation" where there were no faults present.
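To make the stuck-at idea concrete, here is a toy fault-injection loop in Python. The three-gate netlist, the net names, and the helper functions are all invented for illustration; this is serial fault simulation on a tiny circuit, not how a production fault simulator (or Cadence's concurrent engine) works.

```python
from itertools import product

# Hypothetical netlist: net -> (gate type, input nets), in topological order.
NETLIST = {
    "n1": ("AND", ("a", "b")),
    "n2": ("OR", ("b", "c")),
    "out": ("XOR", ("n1", "n2")),
}
INPUTS = ("a", "b", "c")
GATES = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}


def simulate(vector, fault=None):
    """Evaluate the netlist; fault is (net, stuck_value) or None."""
    values = dict(zip(INPUTS, vector))
    if fault and fault[0] in values:
        values[fault[0]] = fault[1]  # force a stuck primary input
    for net, (kind, ins) in NETLIST.items():
        values[net] = GATES[kind](*(values[i] for i in ins))
        if fault and fault[0] == net:
            values[net] = fault[1]  # force a stuck internal net
    return values["out"]


def detected_faults(vectors):
    """Return (detected, all_faults) for stuck-at-0/1 on every net."""
    nets = list(INPUTS) + list(NETLIST)
    faults = [(net, stuck) for net in nets for stuck in (0, 1)]
    detected = [
        f for f in faults
        if any(simulate(v, f) != simulate(v) for v in vectors)
    ]
    return detected, faults


one_vector = [(1, 1, 0)]
exhaustive = list(product((0, 1), repeat=len(INPUTS)))
for name, vectors in (("one vector", one_vector), ("exhaustive", exhaustive)):
    detected, faults = detected_faults(vectors)
    print(f"{name}: {len(detected)}/{len(faults)} faults detected")
```

A fault counts as detected only if at least one vector makes the faulty output differ from the fault-free ("known good") output, which is exactly the definition above.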

In the automotive area, faults are divided up into 4 categories, depending on whether they are dangerous and/or detected. Obviously dangerous and undetected is the situation we especially want to avoid (such as an undetected failure of the braking system). Detected can mean anything from turning on a light on the dashboard (showing, say, an airbag issue) to correcting a single bit memory error, to putting an autonomous vehicle into limp-home mode.
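The four categories are just the combinations of two yes/no questions. A small sketch, with made-up fault names and results, shows how fault-injection outcomes might be tallied; the "fraction of dangerous faults detected" at the end is my own summary number for illustration, not a metric quoted in the talk.

```python
from collections import Counter


def classify(dangerous, detected):
    """Map the two yes/no answers onto one of the four categories."""
    severity = "dangerous" if dangerous else "safe"
    visibility = "detected" if detected else "undetected"
    return f"{severity}-{visibility}"


# Hypothetical results: (fault, dangerous?, detected by a safety mechanism?)
RESULTS = [
    ("alu.n17 stuck-at-0", True, True),
    ("alu.n17 stuck-at-1", True, False),
    ("dbg.n3 stuck-at-0", False, False),
    ("dbg.n3 stuck-at-1", False, True),
    ("ecc.n8 stuck-at-1", True, True),
]

counts = Counter(classify(dng, det) for _, dng, det in RESULTS)
dangerous_total = counts["dangerous-detected"] + counts["dangerous-undetected"]
detected_fraction = counts["dangerous-detected"] / dangerous_total

print(dict(counts))
print(f"{detected_fraction:.0%} of dangerous faults detected")
```

The dangerous-undetected bucket is the one to drive toward zero; everything else is either harmless or handled.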

TI have been using an early version of Cadence's concurrent engine for this. There are some limitations since this is a pre-release version. For example, only permanent faults are considered, not transient ones. It works at the netlist level, not the RTL level. There is no fault optimization (merging faults that are equivalent); ideally, you would only do fault injection on primary faults, not on all their equivalents.

Prashant wrapped up with:

Drastic improvement in fault simulation time using concurrent engine.
RTL/Memory support in concurrent engine is a (future) must.
Tool has improved a lot since its inception.
Compute resource utilization and fault-injection management is much improved with vManager, but still scope for improvement.
Looking forward to vManager support for end-to-end safety solution.

Aging

Somehow I managed to work in the IC and EDA industries until a few years ago before I discovered that transistors "wear out". This is known as aging, although typically it is more a function of how much the transistors are used than of pure wall-clock time passing. For an introduction to the topic, see my post Aging and Self-Heating in FinFET Transistors.

Figure: aging requirements

Frank started out with some warnings. Today's semiconductor ecosystem is geared to consumer products (your iPhone doesn't need to last 15 years or operate at 150°C) but automotive (for now I'm just going to say automotive, but you can include industrial, medical, aerospace and other environments requiring high reliability) needs a different mindset. Some failure modes can be accelerated, meaning it is possible to do silicon qual to get a number, but some cannot. Some analog problems take 15 years to develop with no efficient way to accelerate the stress, meaning that they have to be caught by accurate up-front analysis.

Today, aging coverage is 100% on the shoulders of the designer to identify the high-risk circuits and stress state, then develop stimulus to bring about worst-case aging, and verify that the circuit still functions at end-of-life (after all the worst-case aging has taken place).

One challenge is that worst-case performance conditions may not align to worst case aging conditions, and the most critical performance mode might not align to the most damaging aging mode. TI's rules for their aging validation flow are:

All operational modes (active and inactive) are simulated at 125°C and -40°C before and after aging.
"Fresh" and "Aged" simulations must show that the design still meets specs at end of life.
No single device can have a change in Idsat from BTI of more than 25%.
No single device can have a change in Idsat from HCI and NCS of more than 10%.
Even in power-down mode, some transistors may have finite Vds and nA of current and so can age.
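The two per-device limits in the rules above lend themselves to an automated check. Here is a minimal sketch, assuming the aging simulation reports a fractional Idsat change per device and per mechanism; the data layout and device names are hypothetical, and I am treating HCI and NCS as a single combined number, which may not match TI's actual flow.

```python
from dataclasses import dataclass

BTI_LIMIT = 0.25      # max fractional Idsat change from BTI
HCI_NCS_LIMIT = 0.10  # max fractional Idsat change from HCI and NCS (assumed combined)


@dataclass
class DeviceAging:
    name: str
    bti_shift: float      # fractional Idsat change attributed to BTI
    hci_ncs_shift: float  # fractional Idsat change attributed to HCI and NCS


def violations(devices):
    """Return (device, mechanism, shift) for every limit that is exceeded."""
    found = []
    for dev in devices:
        if abs(dev.bti_shift) > BTI_LIMIT:
            found.append((dev.name, "BTI", dev.bti_shift))
        if abs(dev.hci_ncs_shift) > HCI_NCS_LIMIT:
            found.append((dev.name, "HCI/NCS", dev.hci_ncs_shift))
    return found


# Hypothetical end-of-life results for three devices.
report = [
    DeviceAging("M1", bti_shift=-0.08, hci_ncs_shift=-0.02),
    DeviceAging("M2", bti_shift=-0.27, hci_ncs_shift=-0.01),
    DeviceAging("M3", bti_shift=-0.05, hci_ncs_shift=-0.12),
]
for name, mechanism, shift in violations(report):
    print(f"{name}: {mechanism} shift {shift:+.0%} exceeds limit")
```

A check like this would need to run for every operational mode, including power-down, since the last rule says those transistors can age too.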

Frank had an example, a USB2 in 16FFC which he called "a challenge for aging reliability". The things they found were:

HCI (hot carrier injection) and NCS (non-conductive stress) were the most damaging, but BTI (bias temperature instability) was the most prevalent.
Power down was the worst case for some circuits.
Unexpected worst case aging states.
Long durations at -40°C challenging.
More details in the table below: the driver threats were difficult but expected, but the NCS risks were a surprise.

Wrapping up, Frank acknowledged that ensuring aging reliability is tedious and significantly increases design time. On the motivational side, field failures are really bad and expensive. The tools are continuing to mature and are quite robust; the challenge is coverage during design validation. But, as he said at the beginning, it is 100% on the designer's shoulders.

Electromigration

Reliability hasn't really been in the EDA industry's crosshairs until the leap in the importance of automotive over the last few years. That means that typically reliability flows have been home-grown. In the case of electromigration, that means rigorous per-wire/per-via computations for the EM FIT rate. However, without standard methodologies, the amount of characterization done on 3rd-party IP was limited.

Electromigration in a modern process (Cu interconnect) causes the copper atoms to move across grain boundaries. This leaves voids where the atoms deplete, which can cause opens, and buildups where the atoms accumulate, which can cause shorts. It is very strongly correlated with temperature, which means it can be accelerated with high-temperature testing, but also means that it is a bigger issue with the extended automotive temperature range up to 125°C.
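The temperature dependence is commonly modeled with Black's equation, MTTF = A * J^(-n) * exp(Ea / kT), which was not part of the presentation but shows why high-temperature testing accelerates the failure. The sketch below uses illustrative parameter values (Ea = 0.9 eV, n = 2, A = 1) that I picked for the example; real values come from foundry characterization.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def black_mttf(current_density, temp_c, prefactor=1.0, n=2.0, ea_ev=0.9):
    """Black's equation: MTTF = A * J**(-n) * exp(Ea / (k * T)).

    All parameter defaults are illustrative assumptions, so the result is
    only meaningful as a ratio between two conditions.
    """
    temp_k = temp_c + 273.15
    arrhenius = math.exp(ea_ev / (BOLTZMANN_EV_PER_K * temp_k))
    return prefactor * current_density ** (-n) * arrhenius


def acceleration_factor(temp_use_c, temp_stress_c, ea_ev=0.9):
    """How much faster EM failure develops at the stress temperature."""
    t_use = temp_use_c + 273.15
    t_stress = temp_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1 / t_use - 1 / t_stress))


af = acceleration_factor(temp_use_c=105, temp_stress_c=125)
print(f"105°C -> 125°C acceleration factor: {af:.1f}x")
```

Because temperature sits inside an exponential, a modest increase shortens lifetime by a large factor, while current density enters only as a power law.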

In a modern process, such as 16nm or 7nm, IR-drop effects can cause a 10% performance degradation, and EM effects can cause an order of magnitude increase in the FIT rate. Metal resistance is an increasingly hard problem to address.

TSMC has introduced a flow called Statistical Electromigration Budgeting (SEB) which works with Voltus and Voltus-Fi. This flow is transformative because, without a statistical approach, achieving a low EM FIT typically means redefining signoff limits to be much tighter than the raw foundry EM rules. But SoC reliability is very different from component reliability: an SoC contains a huge number of wires and vias, so each component meeting a failure rate of (say) 0.1% does not imply that the SoC failure rate will be no worse than 0.1%. The other aspect of this flow is that it is standardized, making it reasonable to expect it to be used by 3rd-party IP suppliers.
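A quick calculation shows why per-component limits do not carry over to the chip, and what a budget buys you. This is my own simplified illustration, assuming independent failures and additive FIT rates; it is not TSMC's SEB algorithm, and the per-wire numbers are invented.

```python
def chip_failure_probability(per_element_probability, element_count):
    """Probability that at least one of N independent elements fails."""
    return 1.0 - (1.0 - per_element_probability) ** element_count


def total_fit(element_fits):
    """For small rates, FIT contributions of independent elements add."""
    return sum(element_fits)


def within_budget(element_fits, budget_fit):
    """Budgeting: sign off on the chip-level sum, not on each element."""
    return total_fit(element_fits) <= budget_fit


# 1,000 elements, each with a 0.1% chance of failing over the lifetime.
p_chip = chip_failure_probability(0.001, 1000)
print(f"chip failure probability: {p_chip:.3f}")

# Hypothetical per-wire FITs: many quiet wires plus a few hot ones.
wire_fits = [0.002] * 400 + [0.05] * 4
hot_wires = sum(1 for f in wire_fits if f > 0.01)
print(f"{hot_wires} wires exceed a fixed 0.01 FIT per-wire limit")
print(f"within a 1.5 FIT chip budget: {within_budget(wire_fits, 1.5)}")
```

With a thousand elements at 0.1% each, the chip-level failure probability is about 63%, nowhere near 0.1%. Going the other way, a chip-level budget can tolerate a handful of hot wires that a fixed per-wire limit would flag, as long as the total stays in bounds.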

The above diagram shows the flow. Some issues Shane pointed out were:

Limited access to EDA documentation.
Need to define a method for standard-cell IP vendors to model EM FIT dependent on use conditions.
Inaccessible statistical "knobs" such as varying temperature, voltage, and frequency vs lifetime.
EDA spec could be enhanced to calculate MTTF to eliminate inaccuracy due to simulated lifetime.

Shane's conclusion was that the SEB flow is a major step forward in standardizing ad-hoc methods to estimate EM FIT rate for an SoC design. There is now broad support from the EDA industry. This makes it feasible to assess EM FIT contribution from 3rd-party IP, not just internally built IP.
