The last two days I have written about carry in mechanical calculating devices. See my posts Carry: From Logarithms to Mechanical Calculators and Carry: Babbage's Engines. We have similar problems in designing electronics. Most microprocessors do their arithmetic on 32-bit or 64-bit values, but specialized engines for handling encryption often do arithmetic on much larger numbers: 256-bit or 1024-bit.
All the mechanical mechanisms (except the Analytical Engine) used ripple carry. Each digit would have a mechanism to add one to the next digit up, and in turn, if that caused it to roll from 9 to 0, it would trigger a further carry, in the worst case, all the way to the highest order digit.
We have ripple carry in electronics, too. A full adder (for a single bit) takes three inputs: the two bits to be added, and the carry-in coming from the previous stage (except for the lowest-order bit). It produces two outputs: the sum of the two bits and the carry-out to pass to the next stage. To build, say, a 64-bit adder, 64 full adders are used, with the carry-in of each one linked to the carry-out of the previous stage. The low-order bit can use a half adder since there is (normally) no carry-in. I'm not going to draw a 64-bit adder, so here's a 4-bit one:
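To make the structure concrete, here is a minimal sketch of a ripple-carry adder in Python. The function names and the bit-list convention (least-significant bit first) are mine, purely for illustration:

```python
def full_adder(a, b, cin):
    """One full adder: two input bits plus carry-in -> (sum bit, carry-out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_add(a_bits, b_bits):
    """Chain full adders: each stage's carry-out feeds the next stage's carry-in."""
    carry = 0  # the low-order stage has no carry-in (a half adder would suffice)
    out = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

# 4-bit example, LSB first: 0111 (7) + 0001 (1) = 1000 (8), no carry-out
bits, cout = ripple_add([1, 1, 1, 0], [1, 0, 0, 0])
```

Note that the `for` loop mirrors the hardware's weakness: each iteration depends on the carry from the one before, just as each adder must wait for its neighbor.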
An adder is a combinational element, not a sequential (clocked) one, so performance is measured not in clock cycles but in gate delays. The problem with this scheme is that for, say, 64-bit numbers, the maximum delay is about 64 times the delay through a single adder, so 129 gate delays (the slightly odd number is because the first adder is a little different, with no carry-in). Normally, this means that the clock cycle of the overall design (or at least that block) can't be faster than that, or it risks latching the value coming out of the adder before the carry has finished propagating. This would make for a very slow overall design.
The carry-lookahead adder is the electronic equivalent of Babbage's anticipating carriage. Instead of generating a carry-out, each full adder generates two signals, propagate (P) and generate (G). The G signal indicates that the adder will generate a carry even if it doesn't receive a carry-in from the previous stage. That is only true when both A and B are 1. The P signal says that the carry-in will propagate through that bit, which is true if either the A or the B bit (or both) is a 1. The P and G signals are not connected from one adder to the next, but instead into the carry-propagate logic. So a 4-bit carry-lookahead adder looks like this, where multiple 4-bit instances are joined together to build bigger adders:
The important thing is that the P and G signals are calculated fast. Even in the worst case, where a carry coming in from the right propagates all the way through the four bits and out to the next group of four, this is known very early, without needing to ripple through all the adders. But then the real magic happens. If the carry-out from the left of this adder is going to propagate through the next four bits, that has already been calculated too, and so the carry moves four times as fast. The logic for doing this is simple for the rightmost bit but gets progressively more complex for higher-order bits, since they need to take account of all the lower-order bits' P and G signals as well.
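The unrolled carry equations can be written out explicitly. Here is a sketch of the 4-bit lookahead logic in Python (the function name and bit-list layout, LSB first, are my own invention); note that every carry is a flat expression of P, G, and the incoming carry, with no chain from one stage to the next:

```python
def cla_4bit(a_bits, b_bits, c0):
    """4-bit carry-lookahead add, LSB first: returns (sum_bits, carry_out)."""
    p = [a | b for a, b in zip(a_bits, b_bits)]  # propagate: a carry-in passes through
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate: carry-out regardless of carry-in
    # Each carry is computed directly from P, G, and c0 -- no rippling.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    # Group signals, passed up to the next level rather than to the next adder:
    group_p = p[0] & p[1] & p[2] & p[3]  # a carry would pass clean through all four bits
    group_g = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    c4 = group_g | (group_p & c0)        # carry-out for the next group of four
    sums = [a ^ b ^ c for a, b, c in zip(a_bits, b_bits, [c0, c1, c2, c3])]
    return sums, c4
```

You can see the growing complexity the text describes: `c1` is two terms, while `c3` and the group generate are four terms each.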
Four seems to be the traditional number of bits to use for each group of a carry-lookahead adder. The logic in the block at the bottom of the above diagram gets more and more complex as the number of bits is increased. But it is not just the complexity of the logic: the more bits in each group, the slower the carry is within the group (which doesn't benefit from the lookahead). Plus, in an on-chip implementation, the performance of a block deteriorates as it gets larger, and the speedup gained from the lookahead is counterbalanced by the lookahead calculation itself being slower.
It is possible to go further. Each carry-lookahead block has a signal that says whether it will propagate a carry all the way through, and these signals (not shown on the above diagram) can be further combined into supergroups (say four adders) that means a carry can propagate 16 times as fast...and you can have supersupergroups, too.
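The elegant part is that the group-level signals combine at the next level with exactly the same equations as the per-bit ones. A sketch, assuming four 4-bit groups whose (P, G) pairs are listed lowest group first (the function name is mine):

```python
def combine_groups(p, g):
    """Fold four (P, G) pairs into one supergroup (P, G), lowest group first."""
    # Same shape as the per-bit equations, one level up:
    P = p[0] & p[1] & p[2] & p[3]  # a carry would pass through all 16 bits
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return P, G
```

Applying the same fold once more to four supergroups gives the supersupergroup signals for a 64-bit adder.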
One of the most famous carry-lookahead adders was in the 74181 four-bit ALU. Texas Instruments introduced it in 1970. This was a bipolar part, long before MOS became dominant. Many minicomputers of the 1970s were built using this ALU as the basis, such as the PDP-11/70, the VAX-11/780, and the Xerox PARC Alto. It is just a few gates, as you can see from the schematic below, but it implemented all the logic and arithmetic functions and had a carry-lookahead scheme as I described above. Until levels of integration meant that high-performance microprocessors could be put on silicon, building processors out of building blocks like this was how computers were designed.
The four inputs at the top left, and the M input at the bottom left (which sets the ALU into either arithmetic or logical mode), determine what function is executed (add, subtract, xor, etc.). The inputs are on the left. The outputs are on the right. The complex-looking logic in the central column is the carry-lookahead logic, going from fairly simple for the first bit to pretty complicated for the fourth bit (at the top).
To build a larger ALU, 16-bit or 32-bit, the 74181 chips could simply be joined up, with the carry-out from one connected to the carry-in of the next. You can see the carry-in as the lowest signal on the bottom left of the schematic, and the carry-out as the second signal down on the top right (along with the P and G signals, although TI seems to call them F and G). Alternatively, the TI 74182 lookahead carry generator chip could be used to implement the supergroups that I mentioned above to build 16-bit ALUs, taking those P and G signals from the 4-bit ALUs as inputs. You can then do the same trick again with another TI 74182 to get to a 64-bit ALU, as in this diagram from TI's datasheet:
Another type of adder is the carry-save adder. If all you want to do is add two numbers, then this doesn't gain you anything. But if you are adding lots of numbers to accumulate a big sum, as happens a lot in digital signal processing or hardware implementations of many encryption schemes, then it gains a lot. The idea dates back to John von Neumann.
The idea is to represent each bit of the result with two bits in the intermediate register. If you add 1 and 1, then instead of generating 0 and a carry, the answer is recorded as 2 (10). The carry only gets propagated the next time an add is done, when it gets added in, too. But note that this is the carry from the previous add, not from the addition currently taking place. When the result is finally required, any carry propagation that has been delayed needs to be completed to convert from this redundant representation back into normal binary, but during the accumulation phase, the carries from previous additions move only one bit position per addition via the extra bits. The big advantage is that since the carries only move one bit at a time, which is fast, the clock frequency can be much higher.
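A sketch of carry-save accumulation in Python, with integers standing in for bit vectors (the helper names and the 16-bit width are my choices, not from the post). Each step is the classic 3:2 compressor: three inputs in, two outputs out, and no carry ever crosses more than one bit position per step:

```python
def csa_step(s, c, x, width=16):
    """Add x into the redundant (sum, carry) pair without propagating carries."""
    mask = (1 << width) - 1
    s2 = s ^ c ^ x                              # per-bit sum, no carry chain
    c2 = ((s & c) | (s & x) | (c & x)) << 1     # carries, deferred to the next step
    return s2 & mask, c2 & mask

def csa_accumulate(values, width=16):
    """Accumulate a list of numbers; only the final step propagates carries."""
    s, c = 0, 0
    for v in values:
        s, c = csa_step(s, c, v, width)
    # Final conversion back to normal binary: one full carry-propagating add.
    return (s + c) & ((1 << width) - 1)
```

The invariant is that `s + c` always equals the running total, so the expensive carry-propagating addition happens exactly once, at the end, rather than once per input.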
There are different flavors of carry-lookahead adders. For some reason, they all seem to be the creations of two people: the Kogge-Stone adder, the Brent-Kung adder, the Han-Carlson adder, and the Lynch-Swartzlander adder. They vary in their tradeoffs of performance and cost of implementation in different types of technology. I'm not enough of an expert to know which of these are actually used in practice. Sometimes theoretically optimal ideas are only practical in extreme cases, such as doing fast multiplication by using an FFT, then additions, and then an inverse FFT. This is very close to doing multiplication using logarithms, which is where we came in two days ago.
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.