In a recent blog post, I talked about a public lesson I learned on the importance of software verification while an intern at Digital Equipment Corporation (DEC). Since I spent most of my early career as a logic designer, not a programmer, I figure that an example of a corner-case condition from that part of my life would also be worth sharing. This story will doubtless remind you of a well-known "divide bug" that appeared in a certain microprocessor in the mid-90s.

From 1985 to 1988, I worked at Cydrome, a mini-supercomputer startup whose very-long-instruction-word (VLIW) machine had quite a few novel aspects. I initially worked on the floating-point unit, with primary responsibility for the adder/subtractor. Anyone who has worked with floating-point numbers under the IEEE 754 standard knows that subtracting two numbers that are close in value can produce a result that is "denormalized," with leading zeros in the mantissa. The usual way of handling this situation is to shift the result mantissa left to eliminate the leading zeros while decrementing the exponent correspondingly.

It's also necessary before an add or subtract operation to align the two operands, generally by shifting the smaller operand right while increasing its exponent. My colleague Craig Nelson had the clever idea of merging the post-operation normalization into the pre-operation alignment to reduce overall latency. He developed a slick algorithm to predict when denormalization would occur, accurate to within one bit. Thus, we could replace the slow, complex result mantissa shifter and exponent decrementer with a fast, simple multiplexer.
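To make the two steps concrete, here is a minimal software sketch of alignment followed by post-subtraction normalization. This is purely illustrative (the `fp_sub` function, its unpacked exponent/mantissa representation, and the 8-bit mantissa width are my own hypothetical choices, not Cydrome's design):

```python
def fp_sub(a_exp, a_man, b_exp, b_man, width=8):
    """Subtract two positive floating-point values represented as
    (exponent, mantissa) pairs, mantissas normalized with the top
    bit of an 8-bit field set. Returns (exponent, mantissa, shift)."""
    # Ensure |a| >= |b| so the difference is non-negative.
    if (a_exp, a_man) < (b_exp, b_man):
        a_exp, a_man, b_exp, b_man = b_exp, b_man, a_exp, a_man
    # Pre-operation alignment: shift the smaller mantissa right
    # (conceptually increasing its exponent to match the larger one).
    b_man >>= (a_exp - b_exp)
    diff = a_man - b_man
    exp = a_exp
    # Post-operation normalization: when the operands were close in
    # value, the difference has leading zeros; shift left and
    # decrement the exponent until the top bit is set again.
    shift = 0
    while diff and not (diff >> (width - 1)) & 1:
        diff <<= 1
        exp -= 1
        shift += 1
    return exp, diff, shift
```

Subtracting two nearly equal operands exercises the large normalization shift that Craig's prediction algorithm let us fold into the alignment stage.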

Craig developed a proof for his algorithm that seemed solid to all of us who reviewed it, but of course it was still important to verify my logic implementation. This verification was even more important because one interesting aspect of the algorithm was that its implementation was non-intuitive, involving what seemed like random logic operations on random bits of the two operands. This is not always the case in logic design; for example, the following well-known equations for a four-bit carry-look-ahead adder have a clear pattern that can be verified by inspection:

C1 = G0 + P0 * C0

C2 = G1 + G0 * P1 + C0 * P0 * P1

C3 = G2 + G1 * P2 + G0 * P1 * P2 + C0 * P0 * P1 * P2

C4 = G3 + G2 * P3 + G1 * P2 * P3 + G0 * P1 * P2 * P3 + C0 * P0 * P1 * P2 * P3

In contrast, the following actual fragment of my gate-level adder schematic (this was before commercial RTL synthesis) has no discernible pattern in terms of which bits of Bus A and Bus B are combined in the various gates:

We took a two-step approach to verifying this unusual design. First, Craig rigged up a program that generated random floating-point values with random add and subtract operations. The resulting calculations were performed on the Apple Macintosh, one of the few commercial implementations of the IEEE standard available at that time, and compared against the results from a C implementation of the algorithm. I then took a subset of these tests and ran them against my implementation in logic simulation, using a simple testbench that fed in the operands and operations and then checked the results.
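What Craig built is what we would now call differential testing: drive two independent implementations with the same random stimulus and flag any disagreement. A minimal sketch of the idea (the function names, operand ranges, and trial count here are hypothetical, not the original tooling):

```python
import random

def differential_test(ref_fn, dut_fn, trials=10000, seed=1):
    """Feed identical random operands and operations to a trusted
    reference and to the implementation under test; return the first
    miscompare found, or None if every trial agrees."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.uniform(-1e6, 1e6)
        b = rng.uniform(-1e6, 1e6)
        op = rng.choice(['+', '-'])
        expect = ref_fn(a, b, op)
        got = dut_fn(a, b, op)
        if expect != got:
            return (a, b, op, expect, got)
    return None
```

As the rest of the story shows, the value of this style of testing comes from running far more trials than intuition says you need.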

Quite late in the process, at the point where I had fairly high confidence in the correctness of my implementation, logic simulation reported a miscompare with one expected result. After spending a couple of hours tracking the problem down, I found the bug -- a single mis-numbered "ripper" on a single bit of one bus on one of the eighteen pages with logic similar to the fragment above. I have to admit: that bug shook me up. A simple typo that I had missed on repeated visual inspection of the schematics had also slipped through a __lot__ of test cases. I was fortunate that the random tests happened to catch the bug, and that I had continued verification long enough for this catch to occur.

When the infamous microprocessor "divide bug" cropped up in the industry a few years later, I had a strong sense of *déjà vu*. As with my subtract bug, the vast majority of operands would work just fine, but every once in a while the answer would be wrong. We usually think of corner cases in terms of combinations of control signals, or of obvious data values such as min and max, but with some designs the corner cases are not at all intuitive. The only way to catch them, of course, is to verify, verify, and verify some more.

Tom A.

*The truth is out there...sometimes it's in a blog.*
