The Tale of the Silicon Re-Spin and the Bug That Got Away

17 Feb 2011 • 4 minute read

I'd like to continue my blog series discussing corner-case conditions of various kinds that I have encountered in my engineering career. So far they've all had happy endings. I discussed a software bug that was only in a prototype, not an actual product, so no real damage was done. I described a subtractor bug and a class of interface bugs in hardware, all of which were caught in verification prior to chip fabrication. I even told a Murphy's Law story that had as positive an outcome as possible. This post does not have a happy ending; it's the tale of the bug that got away and required a silicon re-spin to fix.

I'm going to be intentionally vague on the company, the product involved, and my exact role. But I will say that I was in charge of the project to make it clear that I am not shirking responsibility for the bug or the trouble it caused. Without going into details on the overall design, I'll focus in on the problem area: a FIFO crossing between two asynchronous clock domains. As is well known, asynchronous clocks present all sorts of tricky problems for the hardware designer. In preparing for this post, I found an excellent handout from a Stanford course outlining many of the challenges and solutions.

This FIFO was one of a pair that sent data back and forth between the two clock domains. It was a fairly standard design, much like the one in the Stanford handout except that it had free-running clocks on both interfaces with "read" and "write" signals to control FIFO operations.

The designer of this FIFO was very experienced. He knew that he had to synchronize signals crossing the asynchronous clock boundary to avoid metastability. He knew that he needed a single point of synchronization for the data bus. He knew that the read and write pointers had to be gray-coded so that only a single bit changed at a time. In fact, the FIFO he designed worked perfectly for any arbitrary relationship between the two clocks. In fact, this FIFO had been used successfully in several generations of chips. Since it worked so well, it had become essentially a library element to be reused again and again.

So where's the bug? One day the same designer was working on a new generation of the design that required a 64-bit data path to cross the asynchronous boundary. As a big believer in reuse, he constructed a 64-bit FIFO using the well-proven 32-bit FIFO as a building block.

I'm sure that some readers are already gasping in horror, but for the rest let me finish the story. The chip was fabricated and bring-up commenced in the lab. It was quickly clear that something was wrong. The chip booted up correctly and sent data back and forth between its various ports, but at some point, typically after a couple of hours, wrong data would begin transmitting. The debug engineer discerned that the erroneous data happened on only one half of the 64-bit bus and eventually that the two halves of the bus were getting out of phase. Suspecting a synchronization error on the asynchronous interface, he looked at the RTL and found the bug by inspection.

The FIFO designer, in his eagerness for reuse, had violated one of the cardinal rules for asynchronous interfaces. Note that the "Read" and "Write" inputs are split and fed into the two 32-bit FIFOs independently. Since this is an asynchronous design, and the delays on the signals and clocks are not identical, it is possible for a read or write signal to arrive at one FIFO before its clock pulse and at the other FIFO after its clock pulse. This results in half of the 64-bit bus being delayed one cycle behind the other half, thus passing corrupted data through the FIFO and beyond.

Of course, the FIFO designer should have realized this, and he all but pounded his head on the floor when he was informed about the bug. Lab bring-up was able to proceed by manipulating the two clocks so that they were not asynchronous, but complete validation of the design could not happen until the bug was fixed. The chip was re-fabricated to fix a few other problems as well, so while it's not literally true that this bug alone forced a re-spin it was certainly a contributing factor. As the engineering lead responsible for this project, I took a lot of heat.

Being a tool-oriented engineer (that's why I eventually ended up in EDA), I did some research on how we might have found and killed the FIFO bug. I learned that some design teams built into their simulation the ability to shift key signals forward or back one cycle to emulate the effect of missing a clock edge or triggering metastability. At the time, I do not recall finding any commercial EDA tools that could statically analyze clock crossings for bugs of this nature. Of course, today the clock-domain-crossing (CDC) checks in the Cadence Encounter Conformal Constraint Designer provide a push-button way to ensure correct design across asynchronous clock boundaries.

During my time as a hands-on hardware designer, I do not recall any silicon re-spins due to bugs that I personally introduced in a chip and missed during verification. I was not as lucky as an engineering manager; I know of two cases where designers on my team had bugs that got away and caused re-spins. Both experiences were painful but instructive, and at least I can say that we did not make the same mistakes again. If I can recall enough details of the other re-spin, I will do a blog post on that bug as well. I'm willing, even eager, to drag some skeletons out of my closet if they will help communicate my message to "verify, verify and verify some more."

Tom A.

The truth is out there...sometimes it's in a blog.