Automotive Reliability: The Bathtub Curve

21 Apr 2020 • 4 minute read

Automotive reliability has many aspects. So do aerospace and medical, which have additional issues of their own, so I'm going to focus on automotive. I'm also going to focus on semiconductor reliability, although obviously other components can fail too.

FITs

Reliability is measured in FITs, where FIT stands for failures in time. A FIT value is the number of failures expected in a billion hours of operation. Since nothing actually lasts a billion hours, it is a statistical measure.
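To make the unit concrete, here is a minimal Python sketch of the arithmetic. It assumes a constant failure rate, which is the usual reading of a FIT figure. The function names, the fleet size, and the operating hours per vehicle are made up for illustration and do not come from any standard or datasheet.

```python
import math

BILLION_HOURS = 1e9  # one FIT = one expected failure per 1e9 device-hours


def fit_to_rate_per_hour(fit):
    """Convert a FIT value to a failure rate in failures per hour."""
    return fit / BILLION_HOURS


def expected_failures(fit, units, hours_per_unit):
    """Expected failures across `units` devices, each run for `hours_per_unit` hours."""
    return fit_to_rate_per_hour(fit) * units * hours_per_unit


def failure_probability(fit, hours):
    """Probability that one device fails within `hours` (constant-rate assumption)."""
    # 1 - exp(-rate * hours), written with expm1 to stay accurate for tiny values.
    return -math.expm1(-fit_to_rate_per_hour(fit) * hours)


# Hypothetical fleet: one million vehicles, 8,000 operating hours each, 10 FIT per vehicle.
print(expected_failures(fit=10, units=1_000_000, hours_per_unit=8_000))
print(failure_probability(fit=10, hours=8_000))
```

Nobody waits a billion hours for a single part. The number only becomes meaningful when it is spread across many devices and many hours, which is what makes it a statistical measure.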

The big challenge in automotive semiconductors is to build a reliable car out of unreliable components. The requirement for a whole car might be FIT<10, that is, fewer than ten failures in a billion hours of operation of the vehicle. This FIT budget is then parceled out so that the requirement for an individual semiconductor chip may be FIT<0.1. But a modern semiconductor process has a FIT measured in the hundreds without any additional mitigation.
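The budgeting can be sketched in a few lines of Python. The snippet assumes the simplest possible model: any single chip failure counts as a vehicle failure, and the failure rates are independent and constant, so the FIT values simply add. The chip count and the raw process figure of 200 FIT are hypothetical stand-ins (the latter for "in the hundreds"), and the helper names are mine.

```python
import math


def total_fit(component_fits):
    """FIT of a series system: with independent constant rates, component FITs add."""
    return math.fsum(component_fits)


def required_reduction(raw_fit, target_fit):
    """Factor by which mitigation must cut the raw FIT to reach the target."""
    return raw_fit / target_fit


VEHICLE_BUDGET_FIT = 10.0  # whole-car requirement used as an example in the text
CHIP_TARGET_FIT = 0.1      # per-chip requirement used as an example in the text
RAW_PROCESS_FIT = 200.0    # hypothetical value for "in the hundreds"
NUM_CHIPS = 60             # hypothetical chip count

used = total_fit([CHIP_TARGET_FIT] * NUM_CHIPS)
reduction = required_reduction(RAW_PROCESS_FIT, CHIP_TARGET_FIT)
print(f"chips use {used:.1f} of the {VEHICLE_BUDGET_FIT:.0f} FIT budget")
print(f"mitigation must cut the raw FIT by about {reduction:.0f}x")
```

Under these assumptions the gap between the raw process and the per-chip target is three orders of magnitude, which is why mitigation is not optional.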

In the past, automotive semiconductors addressed this largely by not building anything in a modern semiconductor process. They would use older processes with a decade of reliability data. Most "old school" automotive chips had a high analog component, and perhaps an 8-bit or 16-bit microcontroller. After all, the highest performance function was typically engine control, and most electronic control units (ECUs) were adjusting the seats or the radio volume.

That has all changed with ADAS and the gradual move towards autonomous vehicles. These require multiple cameras, radars, lidar, high-performance networks, and high-performance vision and AI processing. It is simply not possible to build functionality like that on a ten-year-old process. However, the need for reliability has not changed. There is a sense in which this is a clash of cultures. The engineers who know a lot about automotive reliability know little about how to build a state-of-the-art SoC in a leading-edge process like 16nm or 7nm. The people who do typically grew up in markets like mobile, so they know how to build the SoC but know little about reliability. Nobody dies if your phone crashes, and nobody expects to keep their phone for 15 or 20 years.

Wearout

One aspect of making cars last 15 or 20 years is that transistors wear out. You might not have known that. I managed to be in the semiconductor industry for decades without discovering this. The effect is more extreme in FinFET processes due to more self-heating. Above 28nm (where processes were planar) this wasn't really a problem. This wear-out is normally called "aging", although it is less a function of the years passing than of how much the transistors are used. The gradual move towards autonomy has a downside and an upside here. Autonomous vehicles are expected to have a much higher duty cycle, since they won't sit in a parking lot or a garage most of the time. On the other hand, they are not expected to last for 20 years.

In the digital and software realms there are lots of things we can do to improve reliability and reduce the FIT rate, such as triply redundant backup safety processors and error-correcting memory. I wrote about this aspect of automotive in a post, "Make Sure Your Car Doesn't Break Too Often...When It Does, Make Sure You Catch It". The title is a pretty good summary of the requirements.

But the dirty secret of automotive, at least today, is that 80-95% of automotive semiconductor failures are due to the analog portion.

The Bathtub Curve

Here's a bathtub curve. The reason for the name is pretty obvious. Time goes from left to right as usual, and the vertical axis measures the failure rate. On the left in dark grey are failures due to infant mortality, or early life fails. On the right in light grey are failures due to wear-out, or end of life.

Before the invention of vaccination and antibiotics, when many children died from "childhood diseases" like measles or polio, human mortality followed a bathtub curve. When you see that, in 1700, life expectancy was 40, that is not because there were no old people. Lots of children never made it to adulthood: for example, 5% of babies died in their first week and 60% before the age of 16. But if you made it that far, your life expectancy, now that you were immune to many diseases, was over 60.

Transistors are like that, although the percentages are nowhere near as lethal. There are early life failures, mostly due to some sort of manufacturing defect or what is known as a tester escape: the part was bad all along but the test program didn't detect it. Then there is a long period with a very low failure rate, ideally zero. The graph is not really to scale, since that mid-grey part in the middle is the 20-year lifetime of a vehicle. Then the transistors wear out. They typically don't fail completely. Instead, their threshold voltage gradually shifts over time, and eventually you end up with something like an analog component that fails to detect a voltage correctly.
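One common textbook way to get this shape (not necessarily how the curve in the figure was drawn) is to add three hazard rates: a decreasing one for early life failures, a constant one for random failures, and an increasing one for wear-out. The Python sketch below uses Weibull hazards for the first and last. Every parameter value is invented purely to produce a bathtub, and the failure-rate units are arbitrary.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years):
    """Illustrative bathtub curve: sum of three hazard rates (all parameters hypothetical)."""
    infant = weibull_hazard(t_years, shape=0.5, scale=2000.0)  # shape < 1: rate falls over time
    random_failures = 1e-4                                     # constant background rate
    wearout = weibull_hazard(t_years, shape=6.0, scale=40.0)   # shape > 1: rate rises over time
    return infant + random_failures + wearout


# Sample the curve early, in mid-life, and late.
for t in (0.01, 1.0, 10.0, 20.0, 30.0):
    print(f"t = {t:5.2f} years  hazard = {bathtub_hazard(t):.5f}")
```

With these made-up numbers the rate is high at the very start, low through the middle years, and climbs again towards the end, which is the shape described above.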

What Can We Do About It?

This post has got long, so I'll cover this topic tomorrow. But this diagram shows what we are aiming for. We want to extend the useful life in the middle as much as possible: weed out parts with early life fails sooner, have a long uneventful twenty years or more in the middle, and postpone aging effects as long as possible.
