Soft Error Rates in Satellites and Cars

9 May 2017 • 8 minute read

Space turns out to be an interesting area for semiconductors, especially when looking at the soft error rate (SER). Soft errors are glitches, known as single event effects (SEE) or single event upsets (SEU), caused by atomic particles that can flip a bit in memory (including flops), cause latchup that can only be undone by powering down the chip, or cause catastrophic damage that permanently disables full or partial functionality.

There are two different aspects where space and commercial semiconductors interact:

Space is a more intense environment than the Earth's surface, so learning and techniques required for designing chips for space can be re-used for chips not destined to go into orbit but where safety requirements are high.
Price, schedule, and expertise pressures mean that not every semiconductor for space can be designed specially, so commercial products not characterized for harsh conditions need to be used. Often cost and schedule force the use of COTS (Commercial Off The Shelf) parts.

I attended several talks at IRPS that looked at different facets of this issue. Here are two.

Autonomous Driving—Soft Errors to Security Challenges

This was an invited talk by Martin Duncan of ST. With a name like Duncan you might guess he is Scottish, a fact that was confirmed the moment he opened his mouth. "I'm in deep trouble," he started. "I need help." I was glad that I'd lived many years in Scotland so I wasn't struggling with his Glaswegian accent.

The numbers for semiconductor reliability in a vehicle are brutal. The failure rate is 1 part per million per year. But there are 8,000 of them in a car. Over 15 years that means that 12% of cars will fail due to random semiconductor issues (with software and mechanical issues on top of that). It is a meta version of the problem that one in a quadrillion sounds like a tiny issue until you apply it to a chip with a billion transistors switching a billion times a second, where it works out to about a thousand times per second.
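The 12% figure can be checked with a few lines of arithmetic. The sketch below uses only the three numbers quoted in the talk; the function names are mine, and the second estimate assumes every part fails independently at a constant rate, which is a simplification rather than anything Martin claimed.

```python
# Inputs quoted in the talk.
FAILURE_RATE_PER_PART_YEAR = 1e-6  # 1 part per million per year
PARTS_PER_CAR = 8_000
LIFETIME_YEARS = 15


def linear_estimate(rate, parts, years):
    """Add up the per-part, per-year rates (fine while the total stays small)."""
    return rate * parts * years


def exact_probability(rate, parts, years):
    """Chance that at least one part fails, assuming independent failures."""
    return 1.0 - (1.0 - rate) ** (parts * years)


approx = linear_estimate(FAILURE_RATE_PER_PART_YEAR, PARTS_PER_CAR, LIFETIME_YEARS)
exact = exact_probability(FAILURE_RATE_PER_PART_YEAR, PARTS_PER_CAR, LIFETIME_YEARS)

print(f"linear estimate:              {approx:.1%}")
print(f"independent-failure estimate: {exact:.1%}")
```

The simple multiplication gives 12%. Allowing for the fact that a car can only fail for the first time once brings it down to a little over 11%, which does not change the conclusion.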

The next challenge is that autonomous driving is changing the type of semiconductors he is working with, too. "We are already designing at 7nm, it is a new world."

The third challenge is that reliability is based on a car being used for 90 minutes per day. But shared electric vehicles, which everyone expects will become the norm as the next generation switches from caring about cars to just wanting to get from A to B, will rarely be off.

For a reliability engineer, this is being hit on all sides at once: 7nm designs, intense use, so many semiconductors in total that small failure rates are still too large.

Reliability has a familiar shape that everyone calls a bathtub curve (due to the shape). There are errors up front in the early life of a semiconductor, due to infant mortality. This can be accelerated at high temperature (burn-in) but at increased cost. Then, at the other end of life, there are aging effects which eventually shift transistor thresholds or cause other failures. In between is most of the life of the car, and during that phase the biggest issues are single event effects (SEE) caused by alpha contaminants in things like solder, atmospheric neutrons, and X-ray IC inspections. With 26 cores and 200MB of RAM on board, there are now so many gates and so much memory that SER is a major concern, even though the techniques for addressing it have improved.
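To make the bathtub shape concrete, here is a minimal sketch that builds a failure rate from three pieces: a falling infant-mortality term, a constant random-failure term, and a rising wear-out term. Using Weibull hazards for the first and last pieces is a common textbook choice, not something from the talk, and every parameter value below is an illustrative assumption rather than measured data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years):
    """Failure rate at age t_years as the sum of three illustrative terms."""
    # All numbers here are made up to show the shape, not real reliability data.
    infant = weibull_hazard(t_years, shape=0.5, scale=200.0)  # shape < 1: falls with age
    constant = 0.01                                           # random failures such as SEE
    wearout = weibull_hazard(t_years, shape=5.0, scale=20.0)  # shape > 1: rises with age
    return infant + constant + wearout


for t in (0.1, 1, 5, 10, 15, 20):
    print(f"age {t:>4} years: hazard {bathtub_hazard(t):.4f} per year")
```

With these assumed parameters the rate starts high, bottoms out around year five, and climbs again towards year twenty, which is the bathtub.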

A couple of other things make it worse. Lowering VDD, which we all want to do to reduce power, causes an exponential rise in SER. Soft errors increase with temperature and so SER is a challenge at the high end of the automotive temperature range. Also, at those temperatures, single event latchup becomes a challenge, where an SEE doesn't just flip a bit but locks up an area of the chip completely.

But Martin had some sobering statistics on why autonomous vehicles will be a huge improvement:

1.5M accidents on European roads each year
1/3 caused by leaving lane
1/10 caused by too close spacing
2/3 caused by inattention
In 1/2 of accidents the driver doesn't brake at all
95% have some human error contribution
75% have only human error to blame

One challenge that I've heard another autonomous vehicle expert complain about is that the industry doesn't get to compare itself to those numbers. If all they had to do was have fewer accidents than humans it would be easy. But they are measured against perfection, zero accidents. You may have seen the recent crash of an autonomous Uber vehicle. I just found a picture of it, and the caption reads "A self-driven Volvo SUV owned and operated by Uber Technologies Inc. is flipped on its side after a collision in Tempe, Arizona." Sounds bad, you'd be insane to trust a self-driving car, right? Then read the second paragraph of the article, like most people don't, and you find "another car 'failed to yield' and struck it, according to police." So the fact that it was autonomous was irrelevant; the other vehicle hit it so hard it flipped. That's the type of merciless publicity that autonomous vehicles will need to overcome. A more accurate caption would have been something like "Distracted speeding driver destroys expensive autonomous driving development platform."

You've probably seen the autonomous driving levels from 1 to 4/5. Martin had a roadmap of when different types of sensors would be required, along with a nice characterization of the levels as:

Feet off (cruise control)
Hands off (lane following, intelligent cruise control, etc.)
Eyes off (take over driving if the vehicle requires it)
Mind off (full autonomous driving)

Cubesat: Real-Time Soft Error Measurements at Low Earth Orbits

Finally, Cubesat. This was a paper by a long list of authors from Vanderbilt University and AMSAT. Cubesats are to a communications satellite what a Raspberry Pi is to a datacenter. They are small spacecraft with a totally different mindset to the billion dollar flagship missions. They need to have a different tolerance for risk, up to and including mission loss. They are built entirely out of COTS parts, especially semiconductors. By definition, COTS devices are not radiation hard for space, so the best that can be done is to have an idea of how failures will happen and how to recover. There were 12 Cubesat launches in 2015. ISRO (India) launched 104 of them in a single flight. Of those, 88 were from Planet Labs, which is building a constellation to image the whole earth every day. So a Cubesat is an extremely cheap satellite, typically built on a tight schedule, and with low survival requirements.

The threats to electronics in space are many and radiation is only one. There are three categories of particles:

Galactic cosmic rays originate beyond the solar system and consist of protons, heavier ions, and alpha particles with energies up to GeVs. Some of these get trapped in radiation belts and affect low-earth orbits (LEO) with energies in the 100s of MeV.
Trapped electrons with energies up to 7MeV are more important in the medium earth orbits (MEO) and also reach down to the poles (due to the shape of the earth's magnetic field).
Particles accelerated by solar mass ejections, mostly protons up to 100s of MeV, arrive in bursts where the flux can increase by several orders of magnitude.

Shielding is effective against low-energy electron and proton fluxes but even a 100 MeV proton can penetrate 10cm of aluminum, so shielding on its own is clearly not enough (weight is important on a satellite, obviously).

The presentation covered the experience of putting up a satellite explicitly designed to measure the effects of radiation on microelectronics. The telemetry was crowd-sourced, buried in the subaudible range of voice transmissions. Hundreds of ham radio operators have recorded and submitted hundreds of thousands of telemetry packets. In fact this satellite probably has the largest ground network in the world.

On the satellite, they had an experiment aimed at identifying radiation-induced SEU. There are eight 4Mb SRAMs to absorb the hits, and the satellite broadcasts all SEE, resets, and power failures.

The results. After one and a half years, they have seen about 3000 upsets in the memories, about four per day. Some days have none, some up to 12. It is pretty close to a Poisson distribution so this is effectively random. They also looked at multiple bit upsets (which were identified as two or more upsets within five minutes). 76% are single bits, 17% are 2 bits, 3% are 3 bits, and 2% are 4 or more. It is possible that a few of these are separate single-bit SEE misclassified as one multiple bit upset, since there is no way to discriminate between a high-energy particle that upsets 2 or more bits, and two particles that each upset one bit in a short time window.
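Both the Poisson claim and the "a few of these" caveat can be sanity-checked from the numbers above. This is a back-of-envelope sketch, not the authors' analysis: it takes the quoted mean of four upsets per day and the five-minute window, and assumes upsets arrive independently at a constant rate. The function names are my own.

```python
import math

MEAN_UPSETS_PER_DAY = 4.0  # "about four per day"
WINDOW_MINUTES = 5.0       # window used to group multiple bit upsets


def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected count is mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def poisson_tail(k, mean):
    """Probability of k or more events."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(k))


def chance_coincidence(rate_per_day, window_minutes):
    """Probability that an unrelated upset follows a given one within the window."""
    rate_per_minute = rate_per_day / (24 * 60)
    return 1.0 - math.exp(-rate_per_minute * window_minutes)


p_quiet_day = poisson_pmf(0, MEAN_UPSETS_PER_DAY)
p_busy_day = poisson_tail(12, MEAN_UPSETS_PER_DAY)
p_coincidence = chance_coincidence(MEAN_UPSETS_PER_DAY, WINDOW_MINUTES)

print(f"P(no upsets in a day)           = {p_quiet_day:.3f}")
print(f"P(12 or more upsets in a day)   = {p_busy_day:.5f}")
print(f"P(unrelated upset within 5 min) = {p_coincidence:.3f}")
```

Under these assumptions roughly 2% of days would see no upsets at all and a day with 12 or more would be rare but not impossible over a year and a half, which fits the observed spread. The chance of two unrelated particles landing within the same five-minute window is on the order of 1%, so only a small share of the reported multiple bit upsets is likely to be coincidence.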

Looking at the map, you can see that most upsets occur over the South Atlantic anomaly where the proton belt dips down to lower altitude.

In the Q&A someone asked about coronal mass ejections. The satellite only showed one storm in that time, and it is hard to extrapolate to the chances of devastating events since we are in a solar minimum right now so the space environment is benign. When a coronal mass ejection does hit earth, the flux could go from four per day to hundreds or more, and there is more likelihood of single event latchup.

One conclusion from the research is that using COTS in space requires better prediction of how robust the parts are and how they will behave. Otherwise it is hard to know how to use them safely. Which brings me back full circle to where I started, that studying the extreme conditions of space with a few satellites is a good way to get a handle on behavior in the much more benign conditions on the surface, but with millions of vehicles.