What is forward error correction (FEC)?

It is the automatic correction of errors in transmitted network data at the receiver (and sometimes the detection of errors that are too severe to correct).

The reason it is important is that communication channels are noisy, and so occasional errors will occur. This is usually measured by the bit error rate (BER) of the channel. A communication channel might be two chips communicating by SERDES, or a fiber optic channel under the ocean, or a radio link to a space probe. (I wrote recently about the New Horizons space probe where the round-trip transit time to the probe is now over 12 hours.)

It is called “forward” error correction because it can correct errors even in the common situations where there is no backward channel (or where the round trip would be 12 hours, which is almost the same thing). The same approach is used with DRAM and disk drives, which by their nature have no backward channel---you can’t somehow get the database program that ran a week before to go back and re-write the disk block.

If you don't have FEC, then either you just have to accept errors, or you need to use a back channel of some sort to request retransmission after an error is detected. Both of these are used in the real world, too. An old analog TV didn't worry about errors; if a pixel on the screen was wrong, then the person watching wouldn't likely notice. TCP/IP, the communication protocol that underlies the internet, checks packets for validity and discards them if they are invalid; the sender then retransmits the bad packet when the expected acknowledgment fails to arrive.

But in many circumstances, there is either no way for the receiver to communicate with the transmitter, or it would be too slow to be of interest (like that space probe). It is especially important in high-speed computer networks and some chip-to-chip communication.

## Paper Tape

When I was in high school, one of the achievements of the head of the math department and me was to persuade the school to purchase an ASR-33 teletype, so we could prepare punched paper tape. The paper tape had eight holes (and a sprocket hole that was used either to mechanically advance the tape or to optically sense when a line of holes was under the reader). A few things made paper tape a little more workable. First, a character with no holes punched (except the sprocket hole) was ignored, so you could run blank tape at the start and end of the program (or even in the middle) without any problem. Similarly, a character with all holes punched was ignored, so you could correct a mistyped character by pressing a button that backed up the tape one space and then pressing the rubout key, which punched all the holes, thus erasing whatever was there before. The computer would ignore it when the tape was read.

But for this post, the most important thing was that the characters were encoded in the first seven bits, and the eighth bit was a parity bit, set to 1 or 0 to guarantee that there was an even number of holes punched in every character. If a hole didn’t punch properly (yes, dangling chads were a thing long before the Florida recount) then there would be an odd number of holes. I never ceased to be amazed when reading the paper tapes into the computer, which happened at 1,500 characters per second, that the reader would stop on an odd number of bits so fast that the character would still be under the line of light that was used to sense the holes. That is, the tape would shoot through the reader at over 10 feet per second, meaning that it actually arced gracefully through the air into a large bin, and then stop dead within a tenth of an inch.

With seven data bits and one parity bit, that’s the best you can do. You can’t correct the error, just detect it. And it only detects single bit errors: if two holes are mis-punched, the number of holes will still be even and the character will pass the parity test. It will, of course, detect 3-bit errors (or any odd number of errors).

Even this simple paper tape example exhibits another aspect that FEC has to cope with. The parity bits have to be transmitted over the same channel, with the same reliability. If the hole that was incorrectly punched was the parity bit itself, it would still leave an odd number of holes, and the computer would still stop on it.
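The even-parity scheme can be sketched in a few lines of Python (an illustration of the idea, not how the teletype hardware actually did it; the helper names are mine):

```python
def add_parity(char7: int) -> int:
    """Set bit 7 so the byte has an even number of 1 bits.

    Illustrative helper; the ASR-33 did this in hardware as it punched.
    """
    parity = bin(char7 & 0x7F).count("1") % 2
    return (char7 & 0x7F) | (parity << 7)

def parity_ok(byte: int) -> bool:
    """A punched character passes if its number of 1 bits is even."""
    return bin(byte & 0xFF).count("1") % 2 == 0

encoded = add_parity(ord("A"))        # 'A' = 0b1000001, two 1s, parity bit stays 0
assert parity_ok(encoded)
assert not parity_ok(encoded ^ 0b0000100)  # any single flipped bit is detected
assert parity_ok(encoded ^ 0b0000110)      # but a double error slips through
```

The last assertion is exactly the weakness described above: two mis-punched holes cancel out and the check passes.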

If you use more parity bits, then it is possible to correct some errors as well as just detect errors. This means that a mis-transmitted bit can be fixed up at the receiver without requiring the data to be retransmitted.

Here’s a way you could do it. Transmit each bit three times, and at the receiver, use majority voting to derive the value. This works, but it is inefficient, requiring two redundant bits for every data bit. When I say “inefficient”, this is measured by the code-rate, which is the number of data bits divided by the total number of bits transmitted. The transmit-three-times code has a code-rate of 1/3, which is not good.
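A sketch of the transmit-three-times code in Python (illustrative only; a real channel would do this in hardware, and the function names are mine):

```python
from collections import Counter

def encode_repeat3(bits):
    """Send every bit three times -- a code-rate of 1/3."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode_repeat3(received):
    """Majority-vote over each group of three received bits."""
    return [Counter(received[i:i + 3]).most_common(1)[0][0]
            for i in range(0, len(received), 3)]

data = [1, 0, 1, 1]
sent = encode_repeat3(data)
sent[4] ^= 1                          # corrupt one transmitted bit
assert decode_repeat3(sent) == data   # the single error is voted away
```

Note that two errors landing in the same group of three would out-vote the correct value, so even at a code-rate of 1/3 this only reliably corrects one error per group.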

## Hamming Codes

The science of error correcting codes is working out ways to do this that are much more parsimonious in how many parity bits are required. Codes are also designed with an eye on how they can be efficiently implemented in hardware; for example, by using linear-feedback shift registers.

The first reasonably good code was invented by Hamming in 1950, and a class of codes based on his original ideas are still called Hamming codes. They work well only when errors are rare and multiple errors essentially non-existent. This is pretty much the case with DRAM, and Hamming codes are used in ECC RAM. Hamming’s 1950 code used seven-bit codewords containing four data bits and three parity bits (so its code-rate was 4/7). It could correct any single bit error. If an additional parity bit is added (like the paper tape, making the number of 1s always even), then it can also detect any double bit error.
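Here is a minimal Python sketch of the (7,4) Hamming code, using the classic layout in which positions 1, 2, and 4 of the codeword hold the parity bits (the function names are mine, and this is for illustration, not production use):

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Positions 1, 2, 4 hold parity; positions 3, 5, 6, 7 hold data.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct any single-bit error, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # re-check positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # re-check positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # re-check positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # 0 means no error
    if syndrome:
        c[syndrome - 1] ^= 1         # syndrome is the 1-based bad position
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[5] ^= 1                         # flip any one of the seven bits
assert hamming74_decode(word) == data
```

The trick is that the three parity checks, read together as a binary number (the syndrome), spell out the position of the bad bit directly.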

Over the years, codes have been developed with different characteristics. For example, for disk drives, defects are likely to affect more than one bit, but confined to a group of adjacent bits. Some codes, known as burst error codes, can correct quite large numbers of errors like this, provided they are confined to a burst of less than a certain number of bits. Wireless channels also have characteristics like this, when electrical interference (a flash of lightning, say) will affect several sequential bits.

## The Shannon Limit

There is obviously a tradeoff in adding parity bits since they use up some of the bandwidth of the channel. This is really a tradeoff between reliability and the data-rate. If you increase the reliability to the max, then Shannon’s noisy channel coding theorem gives the limit on what the maximum data-rate can be. If you accept less reliability, then more of the bandwidth can be used for the real data, and so the data-rate goes up. You can go all the way to transmitting only the data, but no errors will be corrected.
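As a concrete illustration (the formula here is a standard result from information theory, not something from the discussion above): for a binary symmetric channel that flips each bit with probability p, the Shannon capacity is C = 1 − H(p), where H is the binary entropy function.

```python
from math import log2

def bsc_capacity(p: float) -> float:
    """Capacity, in data bits per transmitted bit, of a binary
    symmetric channel with bit error rate p: C = 1 - H(p)."""
    if p in (0.0, 1.0):
        return 1.0
    entropy = -p * log2(p) - (1 - p) * log2(1 - p)   # binary entropy H(p)
    return 1 - entropy

# A channel with a 1% bit error rate still has about 0.92 bits of
# capacity per transmitted bit, so in principle a good enough code
# needs less than 9% overhead to make transmission arbitrarily reliable.
print(round(bsc_capacity(0.01), 3))
```

No real code achieves exactly this; the theorem only says codes exist that get arbitrarily close.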

Shannon’s 1948 paper, which defined all of this completely out of the blue, basically created information theory and set its agenda for decades. It was theoretical and didn’t describe any practical coding scheme to get close to the limit; it took fifty years to get to that point. He called his limit the “channel capacity” but usually it is just called the “Shannon limit”.

## Galois Fields

Underlying many of the practical codes is a lot of mathematics known as finite fields, or Galois fields. A field (in mathematics; the physics term is something completely different) is a set with two operations, traditionally called addition and multiplication. There is always an additive identity (called 0, since any number plus 0 is unchanged) and a multiplicative identity (called 1, since any number times 1 is unchanged). All sorts of mathematics that was originally developed for numbers, like polynomials and matrices, works on fields, too.

Galois fields are very important in many encryption algorithms. You may have heard of elliptic curve cryptography. Schemes like this rely on the fact that in finite fields, it is very easy to calculate exponents (raise a number to a power, typically a very large one), but there is no known efficient algorithm to go backward and work out what the power was in the first place (known as the discrete logarithm problem).
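A toy Python illustration of that asymmetry (using ordinary modular arithmetic rather than elliptic curves, with numbers far too small for real cryptography; the names are mine):

```python
# Going forward is fast: Python's built-in three-argument pow() does
# modular exponentiation in a number of multiplications proportional
# to the number of bits in the exponent.
p = 2**127 - 1               # a Mersenne prime, used here only as an example
g, secret = 3, 123456789
public = pow(g, secret, p)   # computed almost instantly

# Going backward -- recovering `secret` from `public` -- is the
# discrete logarithm problem. Brute force tries exponents one at a time:
def brute_force_dlog(g, target, p, limit):
    value = 1
    for exponent in range(limit):
        if value == target:
            return exponent
        value = value * g % p
    return None              # hopeless for realistic key sizes
```

(Elliptic curve cryptography actually uses repeated point addition rather than modular exponentiation, but the easy-forward, hard-backward asymmetry is the same.)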

One particularly important field has just those two elements, 0 and 1. It is known as GF(2) for Galois Field 2. It might seem trivially small, but that’s a bit like saying “how can you build a 6B transistor microprocessor with just 0 and 1?” Similar fields with more than one bit are called GF(2^{n}).
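In GF(2), addition works out to be XOR and multiplication to be AND, which is exactly why these fields map so well onto hardware. Here is a small Python sketch, extended to GF(2^8), the 256-element field used by AES and by Reed-Solomon codes (the reduction polynomial 0x11B below is the one the AES standard uses; the function names are mine):

```python
# GF(2): addition is XOR, multiplication is AND.
def gf2_add(a, b):
    return a ^ b

def gf2_mul(a, b):
    return a & b

# GF(2^8): elements are bytes, addition is still XOR, and multiplication
# is polynomial multiplication reduced modulo an irreducible polynomial.
def gf256_mul(a, b, poly=0x11B):
    result = 0
    while b:
        if b & 1:
            result ^= a      # "add" a shifted copy of a
        b >>= 1
        a <<= 1
        if a & 0x100:        # overflowed past 8 bits: reduce
            a ^= poly
    return result

assert gf2_add(1, 1) == 0              # 1 + 1 = 0: each element is its own negative
assert gf256_mul(0x53, 0xCA) == 0x01   # a known multiplicative-inverse pair in AES
```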

For example, an important class of error-correcting codes in communication are called Reed-Muller codes. I won’t go into the details other than to quote Wikipedia that:

Reed-Muller codes are related to binary functions on the field GF(2^{m}) over the elements {0, 1}.

## Évariste Galois

If you study enough pure mathematics to get to group theory, fields, and rings, then you discover that finite fields are named in honor of the Frenchman Évariste Galois, who basically invented them. He also named the groups of group theory (although he was French, so he named them "groupes").

I should, perhaps, say finite fields were named *in memory* of Évariste Galois. In a story known to everyone who studies Galois fields, Galois was a brilliant mathematician. However, in 1832, when he was 20, he was challenged to a duel. Nobody knows why; it may have had something to do with a woman, but if so, it is unclear who. It is not even completely clear who his opponent was. He spent the night before the duel writing down most of the mathematics he had discovered, went out the next morning, was shot in the stomach, abandoned on the dueling ground, and found by a farmer; he died in the hospital the following day.

The story seems to be somewhat romanticized and exaggerated as if Galois had never written down any mathematics before. But who can resist the story of the mathematical artist in his garret creating an entirely new branch of mathematics and writing it all down the night before he died?

**Sign up for Sunday Brunch, the weekly Breakfast Bytes email.**