I'd like to continue my blog series discussing corner-case conditions of various kinds that I have encountered in my engineering career. So far they've all had happy endings. I discussed a software bug that was only in a prototype, not an actual product, so no real damage was done. I described a subtractor bug and a class of interface bugs in hardware, all of which were caught in verification prior to chip fabrication. I even told a Murphy's Law story that had as positive an outcome as possible. This post does not have a happy ending; it's the tale of the bug that got away and required a silicon re-spin to fix.
I'm going to be intentionally vague on the company, the product involved, and my exact role. But I will say that I was in charge of the project to make it clear that I am not shirking responsibility for the bug or the trouble it caused. Without going into details on the overall design, I'll focus in on the problem area: a FIFO crossing between two asynchronous clock domains. As is well known, asynchronous clocks present all sorts of tricky problems for the hardware designer. In preparing for this post, I found an excellent handout from a Stanford course outlining many of the challenges and solutions.
This FIFO was one of a pair that sent data back and forth between the two clock domains. It was a fairly standard design, much like the one in the Stanford handout except that it had free-running clocks on both interfaces with "read" and "write" signals to control FIFO operations.
The designer of this FIFO was very experienced. He knew that he had to synchronize signals crossing the asynchronous clock boundary to avoid metastability. He knew that he needed a single point of synchronization for the data bus. He knew that the read and write pointers had to be gray-coded so that only a single bit changed at a time. In fact, the FIFO he designed worked perfectly for any arbitrary relationship between the two clocks, and it had been used successfully in several generations of chips. Since it worked so well, it had become essentially a library element to be reused again and again.
So where's the bug? One day the same designer was working on a new generation of the design that required a 64-bit data path to cross the asynchronous boundary. As a big believer in reuse, he constructed a 64-bit FIFO using the well-proven 32-bit FIFO as a building block.
I'm sure that some readers are already gasping in horror, but for the rest let me finish the story. The chip was fabricated and bring-up commenced in the lab. It was quickly clear that something was wrong. The chip booted up correctly and sent data back and forth between its various ports, but at some point, typically after a couple of hours, wrong data would begin transmitting. The debug engineer discerned that the erroneous data happened on only one half of the 64-bit bus and eventually that the two halves of the bus were getting out of phase. Suspecting a synchronization error on the asynchronous interface, he looked at the RTL and found the bug by inspection.
The FIFO designer, in his eagerness for reuse, had violated one of the cardinal rules for asynchronous interfaces. Note that the "Read" and "Write" inputs are split and fed into the two 32-bit FIFOs independently. Since this is an asynchronous design, and the delays on the signals and clocks are not identical, it is possible for a read or write signal to arrive at one FIFO before its clock pulse and at the other FIFO after its clock pulse. This results in half of the 64-bit bus being delayed one cycle behind the other half, thus passing corrupted data through the FIFO and beyond.
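To make the failure mode concrete, here is a minimal behavioral sketch in Python rather than the original RTL, which I no longer have. It assumes, purely for illustration, that the upper 32-bit FIFO misses exactly one write strobe; the function name `transfer` and the flag `high_misses_first_write` are hypothetical. Each test word carries the same tag in both halves, so any slip between the halves is easy to spot.

```python
from collections import deque

MASK32 = 0xFFFFFFFF

def transfer(words, high_misses_first_write=False):
    """Push 64-bit words through two independent 32-bit queues and read them back."""
    low_fifo, high_fifo = deque(), deque()
    for index, word in enumerate(words):
        low_fifo.append(word & MASK32)
        # Hypothetical fault: the upper FIFO sees the first write strobe one
        # clock too late and so never stores that half-word.
        if high_misses_first_write and index == 0:
            continue
        high_fifo.append(word >> 32)
    received = []
    # Reassemble 64-bit words for as long as both halves have data.
    while low_fifo and high_fifo:
        received.append((high_fifo.popleft() << 32) | low_fifo.popleft())
    return received

# Both halves of each test word hold the same tag.
words = [(tag << 32) | tag for tag in range(1, 6)]
clean = transfer(words)
slipped = transfer(words, high_misses_first_write=True)
corrupted = [w for w in slipped if (w >> 32) != (w & MASK32)]
print(len(clean), len(slipped), len(corrupted))
```

In this model a single missed strobe does not corrupt one word; it leaves the two halves permanently offset, so every word read afterwards is wrong. That matches the lab symptom of a chip that ran correctly for hours and then went bad for good.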
Of course, the FIFO designer should have realized this, and he all but pounded his head on the floor when he was informed about the bug. Lab bring-up was able to proceed by manipulating the two clocks so that they were not asynchronous, but complete validation of the design could not happen until the bug was fixed. The chip was re-fabricated to fix a few other problems as well, so while it's not literally true that this bug alone forced a re-spin it was certainly a contributing factor. As the engineering lead responsible for this project, I took a lot of heat.
Being a tool-oriented engineer (that's why I eventually ended up in EDA), I did some research on how we might have found and killed the FIFO bug. I learned that some design teams built into their simulation the ability to shift key signals forward or back one cycle to emulate the effect of missing a clock edge or triggering metastability. At the time, I do not recall finding any commercial EDA tools that could statically analyze clock crossings for bugs of this nature. Of course, today the clock-domain-crossing (CDC) checks in the Cadence Encounter Conformal Constraint Designer provide a push-button way to ensure correct design across asynchronous clock boundaries.
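The cycle-shifting idea can be sketched in a few lines of Python. This is my own illustration, not the code those teams used: the helper `sample_with_jitter` and the example strobe are assumptions. Each edge of a signal is randomly seen either on time or one cycle late, and two independent synchronizers watching the same write strobe are compared.

```python
import random

def sample_with_jitter(signal, rng):
    """Return `signal` as a synchronizer might see it: each edge lands on time or one cycle late."""
    seen = [signal[0]]
    for i in range(1, len(signal)):
        edge = signal[i] != signal[i - 1]
        if edge and rng.random() < 0.5:
            seen.append(signal[i - 1])   # edge missed this cycle, caught on the next
        else:
            seen.append(signal[i])
    return seen

rng = random.Random(2024)                # fixed seed so runs are repeatable
write_strobe = [0, 1, 1, 1, 0, 0]        # three writes intended

mismatches = 0
for _ in range(200):
    # Two separate synchronizers sample the same strobe independently.
    writes_seen_a = sum(sample_with_jitter(write_strobe, rng))
    writes_seen_b = sum(sample_with_jitter(write_strobe, rng))
    if writes_seen_a != writes_seen_b:
        mismatches += 1

print(f"{mismatches} of 200 trials disagreed on the number of writes")
```

The model assumes the strobe is stable for at least two cycles around each edge. A real testbench would apply the same kind of perturbation to the synchronizer outputs in the RTL simulation and let the existing data checkers flag the divergence.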
During my time as a hands-on hardware designer, I do not recall any silicon re-spins due to bugs that I personally introduced in a chip and missed during verification. I was not as lucky as an engineering manager; I know of two cases where designers on my team had bugs that got away and caused re-spins. Both experiences were painful but instructive, and at least I can say that we did not make the same mistakes again. If I can recall enough details of the other re-spin, I will do a blog post on that bug as well. I'm willing, even eager, to drag some skeletons out of my closet if they will help communicate my message to "verify, verify and verify some more."
The truth is out there...sometimes it's in a blog.
I just read your response, tomacadance, of February 22nd. That description gives a little more information. The bottom line is that there was a violation in this design of the #1 rule (or at least one of the top three rules) of synchronizing between clock domains, and that is, "Synchronize only at one point in the design." If what you describe about the write signal is true, then the write signal was synchronized at two points and the resultant output of the two synchronizers will occasionally mismatch.
I have to agree with the comments that what is typically synchronized across the clock domains is the gray-coded address and not the read and write signals which should be synchronous to their respective clock domains. If the two empty signals from both FIFOs were logically OR'ed and the two full signals from both FIFOs were logically OR'ed, then this might have worked, but I would have to give this some thought. The better solution, of course, is a design where only one set of flip flops is synchronizing the gray-coded address, i.e. a FIFO design with a parameterized data width.
Dave and Muzaffer, this bug was quite a few years ago and I retained no confidential notes from that employer, so I'm relying only on imperfect memory. The read signal is indeed synchronous with the FIFO, but my recollection is that the write signal went directly into a synchronizer in the control logic of the 32-bit FIFO. Thus the write signal split into two synchronizers in the 64-bit FIFO, with the problem I described. But even if the problem occurred deeper in the control logic the root issue is the same - at some point signals that are supposed to be identical cross the clock boundary and may no longer match. Eric is correct that metastability as well as routing difference could trigger the mismatch. Thanks to all for the comments!
Good post. On the issue, I agree w/John's comments. Parameterized designs are superior for library building blocks which would have avoided this issue by design.
One point of clarification: even if all signals are delivered w/identical timing, it still would have been broken functionally as no 2 flip-flops have identical metastability characteristics. Given there were 2 FIFOs so 2 sets of pointers, there were 2 bits changing for any given operation (write or read) for what design-wise was intended to be a grey-code (1 bit changing) for the entire datapath vs. 1 bit per half datapath. So the design was functionally flawed (the net pointer for the entire datapath was not a grey-code), independent of timing/silicon implementation characteristics.
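A quick Python check illustrates the point. The 3-bit pointer width and the helper names are assumptions chosen for the example, not details of the original design. It counts how many bits flip per step for one Gray pointer, and for the concatenation of two Gray pointers that advance together.

```python
def to_gray(value):
    """Standard binary-reflected Gray encoding."""
    return value ^ (value >> 1)

def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

WIDTH = 3  # hypothetical pointer width per FIFO

def combined_pointer(count):
    """Concatenate the Gray pointers of two FIFOs that advance together."""
    g = to_gray(count % (1 << WIDTH))
    return (g << WIDTH) | g

single = [bits_changed(to_gray(i), to_gray((i + 1) % 8)) for i in range(8)]
paired = [bits_changed(combined_pointer(i), combined_pointer(i + 1)) for i in range(8)]
print(single)  # one bit flips on every step, including the wrap from 7 to 0
print(paired)  # two bits flip on every step
```

With two bits in flight across the boundary on every operation, the receiving side can capture one of them and miss the other, which is exactly the situation Gray coding is meant to rule out.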
I agree a formal verification tool should catch this issue easily.
Actually the description of the bug doesn't make too much sense to me. Read/write signals are synchronous to their respective side so they should be checked for timing and they're not propagated to the other side so they couldn't have been involved in the bug.
Most likely the problem was that the two gray-coded buffer pointers didn't move to the other side in lock step which is perfectly possible and expected as the clock incrementing them and synchronizing them on the other side are async to each other.
This bug doesn't make sense to me as described. The write or read operation on a given interface will occur on the clock edge for that interface, as the control signals are synchronous to the clock on that interface. I believe the hazard would lie in the fact that there are two sets of control logic, one for each FIFO, which maintain their own sets of pointers. This could create race conditions where the full/empty indication being used from one FIFO does not accurately represent the state of the second FIFO.
Twenty years ago I saw the exact same design flaw in a board design using two 4-bit FIFOs to move a byte of data around. The engineer used the flags from one FIFO and ignored them from the other. The product checked out in the lab, but once in production we started noticing problems crop up on some (but not all) cards. Since the engineer didn't add any pads for spare ICs (his design was 'perfect' and therefore should never require a wiring mod), we had to re-spin the card as dead bugs had to be added to logically AND the flags of the two FIFOs together.
Thanks for two good points. I believe that the gray-to-binary and binary-to-gray conversions can be parameterized as long as you use an algorithmic approach rather than a look-up table. And you are absolutely correct about the game of chicken. My team was not responsible for the entire chip and so I did not know for sure whether any of the other bugs would have also required a re-spin. But our bug could not be fixed without a re-spin and so we got the blame. And we deserved it.
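As a sketch of what I mean by an algorithmic approach, here are both conversions in Python with the width as an argument. The function names are mine, and the loop is only a software analogue of what a parameterized RTL loop would do.

```python
def binary_to_gray(value):
    """Binary-reflected Gray encoding; works for any width."""
    return value ^ (value >> 1)

def gray_to_binary(gray, width):
    """Invert the Gray encoding by XOR-folding the bits from the top down."""
    value = 0
    for bit in range(width - 1, -1, -1):
        # Each binary bit is the Gray bit XOR the binary bit just above it.
        above = (value >> (bit + 1)) & 1
        value |= (((gray >> bit) & 1) ^ above) << bit
    return value

# Round-trip check over every value for several pointer widths.
for width in range(1, 11):
    for value in range(1 << width):
        assert gray_to_binary(binary_to_gray(value), width) == value
print("round trip holds for widths 1 through 10")
```

Nothing in either function depends on a table sized for one particular width, which is what makes the approach easy to parameterize.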
Lesson Learned: Do not hard code magic numbers into your design. If the original design had made WIDTH a parameter with a default of 32 then this never would have happened. The reuser would have simply bumped it up to 64 and everything would have worked fine.
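A rough behavioral model in Python shows the shape of that fix. The class name `FifoModel`, the default depth, and the error handling are all assumptions for illustration, and the model deliberately ignores clocks and synchronizers. The point is only that the width is a parameter and that a single pointer pair governs the whole word.

```python
class FifoModel:
    """Behavioral FIFO with one pointer pair shared by the whole data word."""

    def __init__(self, width=32, depth=8):
        self.width = width
        self.depth = depth
        self.mask = (1 << width) - 1
        self.storage = [0] * depth
        self.write_ptr = 0   # free-running counts; their difference is the fill level
        self.read_ptr = 0

    @property
    def empty(self):
        return self.write_ptr == self.read_ptr

    @property
    def full(self):
        return self.write_ptr - self.read_ptr == self.depth

    def write(self, word):
        if self.full:
            raise OverflowError("write to full FIFO")
        self.storage[self.write_ptr % self.depth] = word & self.mask
        self.write_ptr += 1

    def read(self):
        if self.empty:
            raise IndexError("read from empty FIFO")
        word = self.storage[self.read_ptr % self.depth]
        self.read_ptr += 1
        return word

# Widening the data path is a one-argument change; the control logic is untouched.
fifo = FifoModel(width=64)
fifo.write(0x0123456789ABCDEF)
print(hex(fifo.read()))
```

Because both halves of a 64-bit word are stored and retrieved under the same pointers, there is no second copy of the control state that could drift out of step.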
Remember it is all about winning the game of "respin chicken". If you have an issue but can reconfigure the chip as a workaround, then you are not blamed for the respin. But when you have one must-have feature that is broken and no workaround, then you get the blame and everyone else gets to piggyback their fixes onto that respin.
Thanks; I'm glad that you enjoyed it. I'm trying to remember enough about the other bug to blog about that too.
Thanks for sharing this painful experience. Good clear description.