I'd like to continue my blog series discussing corner-case conditions of various kinds that I have encountered in my engineering career. So far they've all had happy endings. I discussed a software bug that was only in a prototype, not an actual product, so no real damage was done. I described a subtractor bug and a class of interface bugs in hardware, all of which were caught in verification prior to chip fabrication. I even told a Murphy's Law story that had as positive an outcome as possible. This post does not have a happy ending; it's the tale of the bug that got away and required a silicon re-spin to fix.
I'm going to be intentionally vague on the company, the product involved, and my exact role. But I will say that I was in charge of the project to make it clear that I am not shirking responsibility for the bug or the trouble it caused. Without going into details on the overall design, I'll focus in on the problem area: a FIFO crossing between two asynchronous clock domains. As is well known, asynchronous clocks present all sorts of tricky problems for the hardware designer. In preparing for this post, I found an excellent handout from a Stanford course outlining many of the challenges and solutions.
This FIFO was one of a pair that sent data back and forth between the two clock domains. It was a fairly standard design, much like the one in the Stanford handout except that it had free-running clocks on both interfaces with "read" and "write" signals to control FIFO operations.
The designer of this FIFO was very experienced. He knew that he had to synchronize signals crossing the asynchronous clock boundary to avoid metastability. He knew that he needed a single point of synchronization for the data bus. He knew that the read and write pointers had to be gray-coded so that only a single bit changed at a time. In fact, the FIFO he designed worked perfectly for any arbitrary relationship between the two clocks and had been used successfully in several generations of chips. Since it worked so well, it had become essentially a library element to be reused again and again.
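To illustrate why gray-coded pointers matter, here is a minimal Python sketch (not the original RTL). The names `binary_to_gray`, `bits_changed`, and `DEPTH` are hypothetical and chosen only for this example. It shows that consecutive gray-coded pointer values, including the wrap-around, differ in exactly one bit, whereas a plain binary counter can change several bits at once.

```python
def binary_to_gray(n: int) -> int:
    """Convert a binary count to its reflected gray-code equivalent."""
    return n ^ (n >> 1)


def bits_changed(a: int, b: int) -> int:
    """Count how many bit positions differ between two values."""
    return bin(a ^ b).count("1")


DEPTH = 8  # hypothetical FIFO depth, a power of two

# Gray-coded pointer values in the order the pointer would step through them.
codes = [binary_to_gray(i) for i in range(DEPTH)]

# Bits that change on each increment, including the wrap from the last
# entry back to the first.
steps = [bits_changed(codes[i], codes[(i + 1) % DEPTH]) for i in range(DEPTH)]

# For comparison: a plain binary pointer stepping from 3 (011) to 4 (100).
binary_step_3_to_4 = bits_changed(3, 4)

print(codes)
print(steps)
print(binary_step_3_to_4)
```

Because only one bit changes per step, a synchronizer that samples the pointer mid-transition sees either the old value or the new one, never an unrelated value.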
So where's the bug? One day the same designer was working on a new generation of the design that required a 64-bit data path to cross the asynchronous boundary. As a big believer in reuse, he constructed a 64-bit FIFO using the well-proven 32-bit FIFO as a building block.
I'm sure that some readers are already gasping in horror, but for the rest let me finish the story. The chip was fabricated and bring-up commenced in the lab. It was quickly clear that something was wrong. The chip booted up correctly and sent data back and forth between its various ports, but at some point, typically after a couple of hours, wrong data would begin transmitting. The debug engineer discerned that the erroneous data happened on only one half of the 64-bit bus and eventually that the two halves of the bus were getting out of phase. Suspecting a synchronization error on the asynchronous interface, he looked at the RTL and found the bug by inspection.
The FIFO designer, in his eagerness for reuse, had violated one of the cardinal rules for asynchronous interfaces. Note that the "Read" and "Write" inputs are split and fed into the two 32-bit FIFOs independently. Since this is an asynchronous design, and the delays on the signals and clocks are not identical, it is possible for a read or write signal to arrive at one FIFO before its clock pulse and at the other FIFO after its clock pulse. This results in half of the 64-bit bus being delayed one cycle behind the other half, thus passing corrupted data through the FIFO and beyond.
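The failure mode can be sketched with a very simplified cycle-based model in Python. This is an assumption-laden illustration, not the actual design: the function `run_split_fifo` and its `delay_lo` and `delay_hi` parameters are hypothetical, and each delay stands in for the extra cycle one half may take to see the write signal.

```python
from collections import deque

MASK32 = 0xFFFFFFFF


def run_split_fifo(words, delay_lo, delay_hi):
    """Push 64-bit words through two independent 32-bit halves.

    delay_lo / delay_hi model how many cycles late each half sees the
    write signal (0 or 1 in this sketch). Returns the reassembled words.
    """
    n = len(words)
    # One word per cycle on the bus, followed by two idle cycles of zeros.
    write = [True] * n + [False] * 2
    bus = list(words) + [0] * 2

    lo, hi = deque(), deque()
    for t in range(len(bus)):
        # Each half captures whatever is on its part of the bus in the
        # cycle where its own (possibly late) copy of write is asserted.
        if t - delay_lo >= 0 and write[t - delay_lo]:
            lo.append(bus[t] & MASK32)
        if t - delay_hi >= 0 and write[t - delay_hi]:
            hi.append((bus[t] >> 32) & MASK32)

    # The read side glues the two halves back together entry by entry.
    return [(h << 32) | l for h, l in zip(hi, lo)]


words = [0x0000000100000001, 0x0000000200000002, 0x0000000300000003]

matched = run_split_fifo(words, delay_lo=0, delay_hi=0)
skewed = run_split_fifo(words, delay_lo=0, delay_hi=1)

print([hex(w) for w in matched])
print([hex(w) for w in skewed])
```

With matching delays the words come out intact. With the upper half one cycle late, every output word pairs the lower half of one word with the upper half of the next, which matches the "out of phase" symptom seen in the lab.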
Of course, the FIFO designer should have realized this, and he all but pounded his head on the floor when he was informed about the bug. Lab bring-up was able to proceed by manipulating the two clocks so that they were not asynchronous, but complete validation of the design could not happen until the bug was fixed. The chip was re-fabricated to fix a few other problems as well, so while it's not literally true that this bug alone forced a re-spin, it was certainly a contributing factor. As the engineering lead responsible for this project, I took a lot of heat.
Being a tool-oriented engineer (that's why I eventually ended up in EDA), I did some research on how we might have found and killed the FIFO bug. I learned that some design teams built into their simulation the ability to shift key signals forward or back one cycle to emulate the effect of missing a clock edge or triggering metastability. At the time, I do not recall finding any commercial EDA tools that could statically analyze clock crossings for bugs of this nature. Of course, today the clock-domain-crossing (CDC) checks in the Cadence Encounter Conformal Constraint Designer provide a push-button way to ensure correct design across asynchronous clock boundaries.
During my time as a hands-on hardware designer, I do not recall any silicon re-spins due to bugs that I personally introduced in a chip and missed during verification. I was not as lucky as an engineering manager; I know of two cases where designers on my team had bugs that got away and caused re-spins. Both experiences were painful but instructive, and at least I can say that we did not make the same mistakes again. If I can recall enough details of the other re-spin, I will do a blog post on that bug as well. I'm willing, even eager, to drag some skeletons out of my closet if they will help communicate my message to "verify, verify and verify some more."
The truth is out there...sometimes it's in a blog.
I just read your response, tomacadance, of February 22nd. That description gives a little more information. The bottom line is that there was a violation in this design of the #1 rule (or at least one of the top three rules) of synchronizing between clock domains, and that is, "Synchronize only at one point in the design." If what you describe about the write signal is true, then the write signal was synchronized at two points and the resultant output of the two synchronizers will occasionally mismatch.
I have to agree with the comments that what is typically synchronized across the clock domains is the gray-coded address and not the read and write signals which should be synchronous to their respective clock domains. If the two empty signals from both FIFOs were logically OR'ed and the two full signals from both FIFOs were logically OR'ed, then this might have worked, but I would have to give this some thought. The better solution, of course, is a design where only one set of flip flops is synchronizing the gray-coded address, i.e. a FIFO design with a parameterized data width.
Dave and Muzaffer, this bug was quite a few years ago and I retained no confidential notes from that employer, so I'm relying only on imperfect memory. The read signal is indeed synchronous with the FIFO, but my recollection is that the write signal went directly into a synchronizer in the control logic of the 32-bit FIFO. Thus the write signal split into two synchronizers in the 64-bit FIFO, with the problem I described. But even if the problem occurred deeper in the control logic, the root issue is the same: at some point signals that are supposed to be identical cross the clock boundary and may no longer match. Eric is correct that metastability as well as routing differences could trigger the mismatch. Thanks to all for the comments!
Good post. On the issue, I agree w/John's comments. Parameterized designs are superior for library building blocks which would have avoided this issue by design.
One point of clarification: even if all signals are delivered w/identical timing, it still would have been broken functionally as no 2 flip-flops have identical metastability characteristics. Given there were 2 FIFOs so 2 sets of pointers, there were 2 bits changing for any given operation (write or read) for what design-wise was intended to be a grey-code (1 bit changing) for the entire datapath vs. 1 bit per half datapath. So the design was functionally flawed (the net pointer for the entire datapath was not a grey-code), independent of timing/silicon implementation characteristics.
I agree a formal verification tool should catch this issue easily.
Actually the description of the bug doesn't make too much sense to me. Read/write signals are synchronous to their respective side so they should be checked for timing and they're not propagated to the other side so they couldn't have been involved in the bug.
Most likely the problem was that the two gray-coded buffer pointers didn't move to the other side in lock step which is perfectly possible and expected as the clock incrementing them and synchronizing them on the other side are async to each other.
This bug doesn't make sense to me as described. The write or read operation on a given interface will occur on the clock edge for that interface, as the control signals are synchronous to the clock on that interface. I believe the hazard would lie in the fact that there are two sets of control logic, one for each FIFO, which maintain their own sets of pointers. This could create race conditions where the full/empty indication being used from one FIFO does not accurately represent the state of the second FIFO.
Twenty years ago I saw the exact same design flaw in a board design using two 4-bit FIFOs to move a byte of data around. The engineer used the flags from one FIFO and ignored them from the other. The product checked out in the lab, but once in production we started noticing problems crop up on some (but not all) cards. Since the engineer didn't add any pads for spare ICs (his design was 'perfect' and therefore should never require a wiring mod), we had to re-spin the card as dead bugs had to be added to logically AND the flags of the two FIFOs together.
Thanks for two good points. I believe that the gray-to-binary and binary-to-gray conversions can be parameterized as long as you use an algorithmic approach rather than a look-up table. And you are absolutely correct about the game of chicken. My team was not responsible for the entire chip and so I did not know for sure whether any of the other bugs would have also required a re-spin. But our bug could not be fixed without a re-spin and so we got the blame. And we deserved it.
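As a minimal sketch of what an algorithmic, width-parameterized conversion might look like (in Python rather than RTL, with hypothetical function names and a `width` argument assumed for illustration), the same code handles a pointer of any width with no look-up table:

```python
def binary_to_gray(value: int, width: int) -> int:
    """Binary to gray code for a pointer of the given bit width."""
    mask = (1 << width) - 1
    value &= mask
    return value ^ (value >> 1)


def gray_to_binary(gray: int, width: int) -> int:
    """Gray code back to binary using a prefix XOR, for any bit width."""
    mask = (1 << width) - 1
    binary = gray & mask
    shift = 1
    # After this loop, each bit is the XOR of itself and all higher bits.
    while shift < width:
        binary ^= binary >> shift
        shift <<= 1
    return binary


# Round-trip check for two different, arbitrarily chosen pointer widths.
round_trip_ok = all(
    gray_to_binary(binary_to_gray(v, w), w) == v
    for w in (4, 6)
    for v in range(1 << w)
)
print(round_trip_ok)
```

Changing the pointer width then only means passing a different `width`, which is the property a parameterized library FIFO would rely on.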
Lesson Learned: Do not hard code magic numbers into your design. If the original design had made WIDTH a parameter with a default of 32 then this never would have happened. The reuser would have simply bumped it up to 64 and everything would have worked fine.
Remember it is all about winning the game of "respin chicken". If you have an issue but can reconfigure the chip as a workaround, then you are not blamed for the respin. But when you have one must-have feature that is broken and no workaround, then you get the blame and everyone else gets to piggyback their fixes onto that respin.
Thanks; I'm glad that you enjoyed it. I'm trying to remember enough about the other bug to blog about that too.
Thanks for sharing this painful experience. Good clear description.