Recently, I seem to have been running into people from Texas Instruments (TI) talking about various aspects of automotive reliability. I'm going to try to summarize three presentations that I attended in this post.
Both Frank's and Shane's papers won the best-paper award for their tracks at OIP (as voted on by the attendees), and Prashant's paper won the best-paper award for the verification track at CDNLive India. All three papers thus won their respective best-paper awards, so congratulations to them (and their Cadence co-authors), and all the more reason to read the rest of this post.
Reliability is measured in FITs, which stands for Failures In Time; one FIT is one failure per billion hours of operation. Since a billion hours of operation is over 100,000 years, no single device will operate that long. It is a statistical measure: if there are a million cars, then 1 FIT means one failure in the fleet per thousand hours of operation. If cars are used 10-15% of the time, this translates into approximately one failure per year across a fleet of a million vehicles.
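The arithmetic above can be checked with a small sketch. The 8,760 hours in a year and the 10-15% usage assumption are from the paragraph; everything else is just unit conversion.

```python
# FIT arithmetic: 1 FIT = one failure per billion device-hours.
FIT_HOURS = 1e9

def fleet_failures_per_year(fit, fleet_size, usage_fraction):
    """Expected failures per year across a fleet of devices."""
    hours_per_year = 8760  # wall-clock hours in a year
    operating_hours = fleet_size * hours_per_year * usage_fraction
    return fit * operating_hours / FIT_HOURS

# 1 FIT, a million cars, ~12% usage: roughly one failure per year
print(round(fleet_failures_per_year(1, 1_000_000, 0.12), 2))
```

At 12% usage this comes out to about 1.05 failures per year, which matches the "approximately one failure per year" figure.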
These three presentations address different aspects of the automotive reliability problem. Functional safety involves injecting potential faults into the chip and seeing which ones are correctly handled. Aging is something that has come along relatively recently, since transistors didn't really age until about 28nm, and the issue got more important with FinFETs, which are subject to a lot more self-heating due to their physical configuration. Electromigration is a phenomenon where the current literally moves the atoms of the interconnect metal in the direction of the electron flow. Unfortunately, there is a positive feedback effect: as a metal line gets thinner, the current density goes up, so electromigration gets worse, and eventually the metal line may open, causing a failure.
Oversimplifying, the chip can fail due to a transistor failing (the second presentation) or the interconnect failing (the third presentation). Even in the face of these failures (and others, such as software failures), automotive chips should address the problem if possible (the first presentation).
Functional safety (often abbreviated to FuSa) is the absence of unacceptable risk due to hazards caused by malfunctioning behavior of electrical and/or electronic systems. Since semiconductors are less reliable than automotive requirements demand, it is not possible to simply design in zero-risk components. Instead, the philosophy has to be "detect and be safe".
There are two failure modes: permanent faults that remain until the vehicle is repaired (such as a metal line opening), and transient faults that occur and subsequently disappear (such as an alpha particle).
Analyzing the functional safety of an integrated circuit consists of considering all the things that might go wrong, and seeing how the system handles them (and redesigning if appropriate). This is known as FMEDA, which stands for Failure Mode Effect and Diagnostic Analysis. The approach is similar to the old way that we did IC test before scan-test became dominant. We would consider each gate as being stuck-at-0 or stuck-at-1, and then see which of those faults were detected by the test vectors (this is an oversimplification). In this context, "detected" means that the output from the IC differed from the "known good simulation" where there were no faults present.
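To make the analogy concrete, here is a toy illustration of stuck-at fault injection (not any vendor tool, and the two-gate netlist is made up): inject each stuck-at fault, run the test vectors, and compare against the known-good simulation.

```python
# Toy stuck-at fault injection on a tiny netlist: c = a AND b, out = c OR d.
from itertools import product

def simulate(a, b, d, fault=None):
    """Simulate the netlist, optionally forcing one node to a stuck value."""
    def node(name, value):
        if fault and fault[0] == name:
            return fault[1]          # stuck-at-0 or stuck-at-1
        return value
    c = node("c", node("a", a) & node("b", b))
    return node("out", c | node("d", d))

def detected(fault, vectors):
    """A fault is detected if any vector makes the faulty output
    differ from the known-good (fault-free) simulation."""
    return any(simulate(*v) != simulate(*v, fault) for v in vectors)

vectors = list(product([0, 1], repeat=3))   # exhaustive input vectors
faults = [(n, v) for n in ("a", "b", "c", "d", "out") for v in (0, 1)]
coverage = sum(detected(f, vectors) for f in faults) / len(faults)
print(coverage)  # exhaustive vectors detect every fault in this tiny circuit
```

Real fault simulators do the same comparison at vastly larger scale, with fault collapsing and concurrent evaluation to keep it tractable.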
In the automotive area, faults are divided up into 4 categories, depending on whether they are dangerous and/or detected. Obviously dangerous and undetected is the situation we especially want to avoid (such as an undetected failure of the braking system). Detected can mean anything from turning on a light on the dashboard (showing, say, an airbag issue) to correcting a single bit memory error, to putting an autonomous vehicle into limp-home mode.
TI have been using an early version of Cadence's concurrent engine for this. There are some limitations since this is a pre-release version. For example, only permanent faults are considered, not transient. It works at the netlist level, not the RTL level. And there is no fault optimization (collapsing equivalent faults), whereas ideally you would only do fault injection on primary faults, not all their equivalents.
Prashant wrapped up with his conclusions.
Somehow I managed to work in the IC and EDA industries for years before discovering, only a few years ago, that transistors "wear out". This is known as aging, although typically it is more a function of how much the transistors are used than of pure wall-clock time passing. For an introduction to the topic, see my post Aging and Self-Heating in FinFET Transistors.
Frank started out with some warnings. Today's semiconductor ecosystem is geared to consumer (your iPhone doesn't need to last 15 years or operate at 150°C) but automotive (for now I'm just going to say automotive, but you can include industrial, medical, aerospace, and other environments requiring high reliability) needs a different mindset. Some failure modes can be accelerated, meaning it is possible to do silicon qual to get a number, but some cannot. Some analog problems take 15 years to develop, with no efficient way to accelerate the stress, meaning that they have to be handled by accurate up-front analysis.
Today, aging coverage is 100% on the shoulders of the designer to identify the high-risk circuits and stress state, then develop stimulus to bring about worst-case aging, and verify that the circuit still functions at end-of-life (after all the worst-case aging has taken place).
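The end-of-life check described above can be sketched in a few lines. This is only a first-order illustration: the power-law threshold-voltage shift (dVth = A·t^n) is a common simplified aging model, and the coefficients and margin here are invented, not foundry numbers.

```python
# Illustrative end-of-life aging check using a power-law Vth-shift model.
# The coefficients a_mv and n are made up for illustration; real values
# come from foundry aging models under specific stress conditions.

def vth_shift_mv(stress_years, a_mv=8.0, n=0.25):
    """Threshold-voltage shift (mV) after a given number of stress years."""
    hours = stress_years * 8760
    return a_mv * hours ** n

def functions_at_eol(vth_margin_mv, lifetime_years=15):
    """Does the design margin survive worst-case aging at end of life?"""
    return vth_shift_mv(lifetime_years) < vth_margin_mv

print(functions_at_eol(vth_margin_mv=160))  # True: margin covers 15-year shift
```

The designer's job, as Frank described it, is the hard part that this sketch glosses over: finding the circuits and stress states where the shift is worst, and building stimulus that actually exercises them.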
One challenge is that worst-case performance conditions may not align with worst-case aging conditions, and the most critical performance mode might not align with the most damaging aging mode. TI have a set of rules for their aging validation flow to handle this.
Frank had an example, a USB2 in 16FFC, which he called "a challenge for aging reliability". He walked through the issues they found.
Wrapping up, Frank acknowledged that ensuring aging reliability is tedious and significantly increases design time...on the motivational side, field failures are really bad and expensive. The tools are continuing to mature and are quite robust—the challenge is coverage during design validation. But, as he said at the beginning, it is 100% on the designer's shoulders.
Reliability hasn't really been in the EDA industry's crosshairs until the leap in the importance of automotive over the last few years. That means that typically reliability flows have been home-grown. In the case of electromigration, that means rigorous per-wire/per-via computations for the EM FIT rate. However, without standard methodologies, the amount of characterization done on 3rd-party IP was limited.
Electromigration in a modern process (with copper interconnect) moves copper atoms across grain boundaries, causing both opens (where voids form) and shorts (where the Cu atoms build up). It is very strongly correlated with temperature, which means it can be accelerated with high-temperature testing, but also means it is a bigger issue over the extended automotive temperature range, up to 125°C.
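The temperature acceleration mentioned above is usually modeled with Black's equation, whose exp(Ea/kT) term sets how much faster EM progresses at stress temperature than at use temperature. The activation energy below is an illustrative textbook-range value; real numbers come from foundry characterization.

```python
# Temperature acceleration factor for electromigration, from the
# exp(Ea/kT) term of Black's equation (current density held equal).
# Ea = 0.9 eV is illustrative, not a foundry-characterized value.
import math

def em_acceleration(t_use_c, t_stress_c, ea_ev=0.9):
    k = 8.617e-5  # Boltzmann constant, eV/K
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp(ea_ev / k * (1 / t_use - 1 / t_stress))

# Stressing at 175°C vs. use at 125°C: roughly an order of magnitude
# or more of acceleration, which is what makes high-temperature
# qualification testing practical.
print(round(em_acceleration(125, 175), 1))
```

The same exponential works against automotive parts: running at 125°C rather than consumer temperatures accelerates EM wear-out in the field, not just on the tester.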
In a modern process, such as 16nm or 7nm, IR-drop effects can cause a 10% performance degradation, and EM effects can cause an order of magnitude increase in the FIT rate. Metal resistance is an increasingly hard problem to address.
TSMC has introduced a flow called Statistical Electromigration Budgeting (SEB), which works with Voltus and Voltus-Fi. The flow is transformative: without a statistical approach, achieving a low EM FIT rate typically means redefining signoff limits to be much tighter than the raw foundry EM rules. But SoC reliability is very different from component reliability: meeting a failure rate of (say) 0.1% on each component does not imply that the SoC failure rate will be no worse than 0.1%, since an SoC contains an enormous number of such components whose failure rates accumulate. The other aspect of this flow is that it is standardized, making it reasonable to expect it to be used by 3rd-party IP suppliers.
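The statistical-budgeting idea can be sketched generically (this is not the actual SEB math, which is TSMC/Cadence proprietary; the toy failure-rate model and the budget number below are invented): rather than forcing every wire under one worst-case current limit, sum each wire's small FIT contribution and check the total against a chip-level budget.

```python
# Toy statistical EM budgeting: per-segment FIT contributions are summed
# and compared against a chip-level budget, instead of pass/fail per wire.
# The power-law model and all numbers are illustrative only.

def segment_fit(current_density, limit, base_fit=0.001, exponent=2.0):
    """Toy model: FIT contribution grows with current density
    relative to the foundry limit."""
    return base_fit * (current_density / limit) ** exponent

def chip_fit(segments, limit):
    return sum(segment_fit(j, limit) for j in segments)

limit = 1.0
segments = [0.4, 0.9, 1.1, 0.7]   # one wire slightly over the raw limit
total = chip_fit(segments, limit)
print(total <= 0.01)  # True: chip-level budget met despite one violator
```

The point of the statistical view is visible even in this toy: one segment exceeds the per-wire limit, yet the chip-level FIT budget is comfortably met, so no over-design is forced.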
The above diagram shows the flow. Shane also pointed out some of its issues.
Shane's conclusion was that the SEB flow is a major step forward in standardizing the ad-hoc methods used to estimate the EM FIT rate for an SoC design, and there is now broad support from the EDA industry. This makes it feasible to assess the EM FIT contribution from 3rd-party IP, not just internally built IP.