Paul McLellan

Statistical Power...or Why You Shouldn't Be Allowed to Turn Right on Red

6 May 2019 • 6 minute read

I wrote last Friday in my post TSMC: Zero Excursion, Zero Defect about the statistical processes that are essential in semiconductor manufacturing to get high yield, and to catch any issues that arise early enough to fix them. I suspect that high-volume semiconductor manufacturing uses some of the most advanced statistical techniques of any industry.

On the other hand, there seems to be a lot of bad statistics in academia. I wrote about some of that in my post Do Cellphones Cause Brain Cancer? and the problem of p-hacking. I think that this is actually a special case of Goodhart's Law:

When a measure becomes a target, it ceases to be a good measure.

Originally, p-values for statistical significance were a way for scientists to test whether they were fooling themselves. They could derive some conclusion from their data, say that drinking coffee causes cancer, and then calculate the p-value: the chance of getting results at least as extreme as theirs even if coffee had no effect at all. However, once p<0.05 became the gold standard for publication, it ceased to be a good measure, because people would simply go trolling through the data looking for effects that would earn them a publication but were spurious and would not replicate when someone else tried the experiment.
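To make that concrete, here is a minimal sketch in Python (my own illustration, with invented numbers, not from any real study): it simulates a study in which coffee truly has no effect on any of 20 measured outcomes, yet roughly one outcome in twenty still clears p<0.05 purely by chance.

    # Simulate trawling one data set for 20 different outcomes when coffee
    # truly has no effect on any of them (illustrative numbers only).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_outcomes = 20      # health outcomes checked in the same data set
    n_per_group = 100    # coffee drinkers vs. non-drinkers

    false_positives = 0
    for _ in range(n_outcomes):
        coffee = rng.normal(size=n_per_group)      # the null is true:
        no_coffee = rng.normal(size=n_per_group)   # identical distributions
        _, p = stats.ttest_ind(coffee, no_coffee)
        if p < 0.05:
            false_positives += 1

    print(f"'Significant' outcomes despite no real effect: {false_positives} of {n_outcomes}")
    # On average about 1 in 20 independent comparisons clears p < 0.05 by chance.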

In 2005, Stanford's John Ioannidis published Why Most Published Research Findings Are False, and it is now generally held that there is a replication crisis in social science. In fact, there are articles in mainstream publications like The Atlantic's Psychology's Replication Crisis Is Running Out of Excuses. Three of the problems causing this are:

  • p-hacking (that I talked about in my earlier post)
  • Publication bias, whereby negative results don't get published, so results look stronger if you just survey the published papers. It really does look like coffee causes cancer if you never publish the papers that don't find that result
  • The wonderfully-named HARKing, which is deciding what you are going to look for after you have collected your data (Hypothesizing After the Results are Known)

The move to pre-registration of experiments is something of a defense against all three: before doing an experiment, you must state what you are going to look for.

Statistical Power

But today I want to talk about another problem: experiments without enough statistical power. When I was taught statistics as an undergraduate, I never even came across the concept. If p-values let you check whether you might be fooling yourself, because results like yours were fairly likely even if your theory was false, then statistical power looks at the opposite question:

What if your experiment was so badly designed that it would miss the effect even if it was there?

The typical example involves, as it so often does, tossing a coin. Let's say you suspect a coin is biased and comes up heads more often than it should. You decide your test will be to toss it four times. It comes up heads all four times. Can you conclude the coin is biased? No, since that will happen 1/16 of the time anyway. This test (four tosses) is so woefully underpowered that even getting the most extreme result possible is not enough to conclude that the coin is biased ("the results were not statistically significant", you might say).
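Here is a rough sketch in Python (my own, using an illustrative 70%-heads coin) of the power calculation behind that statement: with four tosses, no outcome is extreme enough to reach p<0.05, so the test's power is exactly zero, while longer tests have a real chance of catching the bias.

    # Power of an n-toss test for "this coin favors heads" (illustrative sketch).
    from scipy.stats import binom, binomtest

    def power(n_tosses, true_p_heads, alpha=0.05):
        """Chance the one-sided binomial test rejects 'the coin is fair' at level
        alpha when the coin really lands heads with probability true_p_heads."""
        # Smallest number of heads k whose p-value under a fair coin is <= alpha.
        for k in range(n_tosses + 1):
            if binom.sf(k - 1, n_tosses, 0.5) <= alpha:   # P(heads >= k | fair)
                # Power = chance the biased coin produces at least k heads.
                return binom.sf(k - 1, n_tosses, true_p_heads)
        return 0.0   # no outcome is extreme enough to reject: zero power

    # Four heads out of four tosses: p = 1/16 = 0.0625, not below 0.05.
    print(binomtest(4, 4, 0.5, alternative="greater").pvalue)

    for n in (4, 20, 100):
        print(f"{n} tosses -> power {power(n, true_p_heads=0.7):.2f} against a 70%-heads coin")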

Why Is This Important?

 A lot of medical trials are designed to test something like whether a drug works. This is usually done through an RCT, a randomized controlled trial. The group of patients is split in two, half are given the drug, half are not, and then the differences are analyzed (this is an oversimplification). The basic test is rarely underpowered, since the drug company knows it needs enough patients to get a statistically valid result. But then you might see a statement that:

No statistically significant side-effects were reported.

That sounds good. But you need to look at the statistical power of that part of the experiment. Were there enough patients to have a chance of detecting statistically significant side-effects even if they were there? If not, you can't draw any conclusion about the lack of side-effects; you were simply never going to find them. Many drug trials are apparently like this: short of a side-effect like killing half the patients, it was simply not possible to find a statistically significant side-effect. I have no idea whether this is through incompetence, or because drug companies' incentive is to enroll enough patients to show the drug works, but not enough to require side-effects to be reported.
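As a hedged illustration (the 1% and 3% side-effect rates and the trial sizes below are invented, not taken from any real trial), a quick power calculation shows the problem: a side effect that triples in frequency is usually missed by a trial sized only to demonstrate that the drug works.

    # Power to detect a hypothetical side effect: 3% of patients on the drug
    # vs. 1% on placebo, at various trial sizes (illustrative numbers only).
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    effect = proportion_effectsize(0.03, 0.01)
    analysis = NormalIndPower()
    for n_per_arm in (100, 500, 2000):
        power = analysis.solve_power(effect_size=effect, nobs1=n_per_arm,
                                     alpha=0.05, alternative="two-sided")
        print(f"{n_per_arm:5d} patients per arm -> power to detect the side effect = {power:.2f}")
    # With 100 patients per arm the trial catches this side effect less than
    # 20% of the time; it takes a couple of thousand per arm to be fairly sure.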

It's actually worse. Surveys of papers in biology and medicine regularly find that most tests are so badly designed that they have no chance of finding the effect they are looking for, even if it makes a ridiculously large 50% difference to the outcome.

In the semiconductor world, if you want to work out whether some change (such as a different temperature in the annealing ovens, or extending an equipment service interval) actually makes a difference, then you need to make sure you have enough "patients", which in this case means running enough wafers. Otherwise, you will miss the effect even if it is there.
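As a rough sketch of what "enough wafers" might mean (the effect sizes below are illustrative assumptions, not real process numbers), a standard two-sample power calculation shows how quickly the required split size grows as the shift you want to detect shrinks.

    # How many wafers per split to detect a given shift in a measured parameter,
    # expressed in standard deviations (illustrative effect sizes only).
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for shift_in_sigmas in (0.3, 0.5, 1.0):
        for target_power in (0.80, 0.95):
            n = analysis.solve_power(effect_size=shift_in_sigmas, power=target_power,
                                     alpha=0.05, alternative="two-sided")
            print(f"shift = {shift_in_sigmas} sigma, power = {target_power:.0%}: "
                  f"about {round(n)} wafers per split")
    # A 0.3-sigma shift at 80% power already needs roughly 175 wafers per split;
    # halving the detectable shift roughly quadruples the wafer count.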

Right on Red

A famous example is the move in most of the US to allow cars to turn right on a red light, starting in the 1970s. Traffic engineers said RTOR (right-turn-on-red) would cause an increase in accidents and injuries. So tests were run, and a typical test would find that there was a slight increase in the number of injuries, but no statistically significant increase. Gradually states allowed drivers to turn right on red. But these studies didn't have enough statistical power: they were never going to find a statistically significant increase in their data, even if the effect was very large.

And it was.

In 1981, the US Department of Transportation commissioned a properly powered study, The Effect of Right-Turn-On-Red on Pedestrian and Bicyclist Accidents. The study compared accidents due to RTOR to other accidents that presumably would have happened without RTOR, such as running a red light, not yielding to a pedestrian on green, cyclists ignoring the law, and so on.

The conclusions of the report are sobering:

This study found that the frequency of pedestrians and bicyclists struck by a motor vehicle turning right at a signalized location increased significantly after the adoption of RTOR ... Over one half of the accidents in the post period involved a vehicle turning right on a red signal.

It can be concluded from this study that the adoption of RTOR resulted in an increase in both pedestrian/ and bicycle/motor vehicle accidents. The increase began as soon as the law became operative and probably persists as long as the law is in effect. Pedestrian and bicycle accidents involving a motor vehicle turning right on red constitute between 1% and 3% of a jurisdiction's total pedestrian and bicycle accidents.

In some cases, the accident rate doubled, but the earlier tests were so underpowered that even that large effect was not statistically significant in the data they gathered: it simply had no chance of reaching significance given how rare accidents are. They just didn't gather enough data. If they had, then we would probably still not be allowed to turn right on red, as is the case in most countries outside the US. According to Wikipedia, China also allows it, as does Saudi Arabia.
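To see why those early studies were doomed, here is a toy before/after simulation in Python (the accident counts are invented): when the expected number of accidents is small, even a genuine doubling of the rate rarely reaches statistical significance.

    # Chance that a before/after study detects a doubled accident rate,
    # as a function of how many accidents were expected in the "before" period.
    import numpy as np
    from scipy.stats import binomtest

    rng = np.random.default_rng(1)

    def detection_rate(expected_before, rate_ratio, n_sim=5_000, alpha=0.05):
        """Fraction of simulated studies that find a significant increase, using the
        exact conditional test: if the rates were equal, the 'after' count would be
        Binomial(total, 0.5) given the total count."""
        hits = 0
        for _ in range(n_sim):
            before = rng.poisson(expected_before)
            after = rng.poisson(expected_before * rate_ratio)   # the rate really doubled
            total = before + after
            if total and binomtest(after, total, 0.5, alternative="greater").pvalue < alpha:
                hits += 1
        return hits / n_sim

    for expected_accidents in (5, 20, 100):
        rate = detection_rate(expected_accidents, rate_ratio=2.0)
        print(f"{expected_accidents:3d} expected accidents before RTOR -> "
              f"doubling detected in {rate:.0%} of simulated studies")
    # With only a handful of expected accidents, a study usually fails to find a
    # statistically significant increase even though the rate really has doubled.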

So if you are in semiconductor manufacturing, make sure you run enough wafers. You probably need a lot more than you think. And be careful about turning right on red when you are driving.

Statistics Done Wrong

If you are interested in this sort of stuff, and I realize for many people it's about as interesting as a root canal, then I recommend the book Statistics Done Wrong by Alex Reinhart. Its cover is at the top of this post. It is also available in German, Korean, Italian, Chinese (Simplified and Traditional), and Japanese. Much of the book (all of it?) seems to be online for free.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.