Erasure Correcting Codes
Erasure correcting codes (ECC) are commonly used to increase the reliability of unreliable systems. The two main usecases being communication and storage systems.
In the following we will mostly take the communication point of view and will therefore be speaking about networks and packets. However, we can apply similar reasoning about how ECC work in storage systems by exchanging packets with disks or blocks of data stored on disk. In both systems we are typically interested in protecting against data loss. However, as we will show at the very end of this page ECC, also has other usecases and applications.
Background
Many people in particular with a communication or storage background may already be familiar with the use of codes for different purposes. In general we have three different categories of codes:
Error detection codes: The purpose of this type of codes is to detect whether a data packet has been corrupted. This means that some of the bits (zeros and ones) describing the data has been flipped compared to their original value. These flipped bits represent an error. To detect this situation a checksum is calculated based on the data. The checksum is then appended to the data and we can at a later point in time recompute the checksum and compare it with the appended checksum to verify that the data is error free.
This category of codes includes the widespread Cyclic Redundancy Check (CRC) codes. Which are widely used in networks and storage systems today.
Error correction codes: The main purpose of codes of this type is to enable reconstruction of an errorfree data packet given some corrupted data. Typical codes from this category will compute some redundancy bits that described the original data. Using this redundancy a receiver can repair up to some amount of bit errors. This type of codes are typically used at the lower layers in the networking stack e.g. at the physical/data link layer.
Erasure correcting codes: The two previous categories of codes are in more widespread use than the codes from this category. Whereas the error detection and correction codes all typically work on the bitlevel of a single data packet. An erasure correction code works across data packets.
They can be used in situations where the error correction code fails. In which case the error detection code reports that data has been corrupted and the data packet must be discarded. This type of data loss is called an erasure. To protect against erasures we can use an erasure correcting code which creates redundancy packets that can be injected into the packet stream in a similar way as we inject redundancy bits into the individual packets using an error correcting code.
In the following we will take a closer look at how erasure correcting codes work.
Erasure correcting code simplified example
As mentioned in the previous section one of the main usecases for ECC is to deal with packet loss. To illustrate how this work we can imagine a simple setup where we want to transmit two packets A and B from a sender to a receiver as shown in Figure 1.
Figure 1: Example scenario with a source sending two packets A and B to a receiver.
In this case we want to protect the transmission from a single packet loss. To do this we need to create one redundancy packet (let’s call it R1) that we can transmit such that any two out of the three transmitted packets will be useful for decoding A and B.
This can be done using a very basic erasure correcting code. Where R1 is simply the XOR of A and B as shown in Figure 2.
Figure 2: Encoding and decoding packet A and B using a simple XOR. Any one out of the three packets can be lost and we will still be able to recover packet A and B.
To further dig into the benefits of erasure correcting codes we made this example where we are transmitting data to three computers over a lossy network.
Reliability
In the example with packet A, B and R1 one might ask is how much did the reliability of the system increase?
To answer this question we need to make some assumptions; let’s assume that every time we send a packet it is lost with 25% probability and that all transmissions are independent (i.e. the probability of successfully sending a packet is not dependent on previous transmissions). In the noncoding scenario the probability of getting both packet A and B is the probability that A goes through multiplied with the probability that B goes through. Doing the math yields $0.75\cdot 0.75 = 0.5625$ which is 56.25% probability of success. Let’s look at the case when sending R1. In this case we have three transmissions, of which at least two must go through. Without the use of any fancy mathematics we can simply enumerate all the cases (we denote s as a successful transmission and f as a failure): {fff,ffs,fsf,fss,sff,sfs,ssf,sss}. We see that there are 4 events that have at least two successes, now we just need to sum the probability of those. Remember s has a probability of 0.75 conversely f has a probability of 0.25. There are three cases where we have two successes and one failure those have a probability of finally we have one event with three successes that has the probability summing these we get which is 84% probability of getting both packet A and B at the receiver.
So we have increased the reliability of our system by 27.75%.
The cost of doing this is an increase in the bandwidth usage of our system as we now send three packets for every two data packets. So our bandwidth usage has gone up by 50%. Using more sophisticated codes we can tune the bandwidth vs. reliability tradeoff for a specific application or scenario.
However, what if we wanted to increase the reliability of our system even further?
Say we wanted to send four packets, where any two out of the four would produce packets A and B. To achieve this we would have to take a deeper look at the mathematics behind ECC. First lets look at the mathematics in our initial example: In initial example we used an XOR to produce the redundancy packet R1, in mathematical terms what happened was that we created an equation called over something mathematicians call a binary finite field. Lets take a closer look at how this works.
Finite Fields
Finite fields are the mathematical underpinnings of both ECC and many cryptographic systems. A simple example of a finite field can be created choosing a prime number, say 5. The elements of this field is {0,1,2,3,4} and any arithmetic operation we can perform on the field elements will be modulo our prime 5. So because modulo . In short any operation we do on the elements of our field will result in another element in our field. This means we can do computations on our data without any loss of precision. In practice almost all implementations are based on the binary finite field with the prime 2 (since these allow for fast computations on most computer hardware).
The binary finite field contains two field elements {0, 1} and to make R the sum of A and B we would set both c1 and c2 equal to 1, yielding the equation .
In general we can see that we can create four possible equations using elements from the binary finite field:
We see the first packet is just a zero packet it contains no information about neither packet A or B. The second packet is the sum of A and B, where the third and fourth packets are packet A and B respectively.
If you are familiar with linear algebra you may notice that any two out the three last equations are linear independent. This means that any two out of the three equations are sufficient to solve the equation system for packet A and B. For any ECC code we are trying to create equations describing our data packets that are linearly independent.
If we wanted to create additional useful (or linear independent) redundancy packets we would have to be able to create more equations. Which means increasing the amount of choice we have for the two coefficients c1 and c2 in our equations. From the mathematical perspective this means using a finite field with more elements in it. To do this we can use something called binary extension fields which allows us to use many of the operations from the binary finite field (keeping computations fast), but which increases the number of field elements we can use. Without diving too deep into the details we could for example use the binary extension field in which the coefficients each are three bits. We would now have eight choices for each coefficient. From this field we could generate a code that would allow us to decode packet A and B from any two out of up to seven equations.
In general it is possible to construct codes for almost any type of scenario.
Constructing erasure correcting codes
There are two key challenges involved in constructing such codes.

Making the finite field mathematical operations run fast enough to not degrade our network utilization. This can be achieved by carefully optimizing and tuning the finite field operations for usage on modern CPU architectures. This kind of optimizations is quite involved and are fairly complex. They are therefore often implemented in standalone libraries tailored for this purpose, one such example is our Fifi library.

The second challenge is how to select the coefficients used for the coding (i.e. designing how our equations are created). There are generally two approaches to this: one is called the algebraic approach and one called the probabilistic approach.
In the algebraic approach we construct the code using algebraic tools; this means that our code will be designed for a specific scenario (e.g. a network with a certain packet loss) for which it will perform extremely well. However, if there are dynamics in the system and e.g. the loss rate changes over time, then this type of coding may not work well. Traditional codes such as ReedSolomon fall into this category.
In the probabilistic approach we construct the code according to a chosen probability distribution. These types of codes are typically better suited for systems with dynamic behaviour. In addition newer codes in this category such as RLNC (random linear network codes) are very flexible in their design. This flexibility offer many advantages in terms of better utilization of the network, low latency applications and protection against packet loss.
In our Kodo library we provide implementations of both types of codes, which make it a convenient choice for most systems/applications.
Applications for erasure correcting codes
In this brief introduction to ECC we talked about how to create redundancy packets to deal with a lossy network. However, there are also other situations where ECC could be considered, the following list summarizes some of the situations where ECC can provide significant benefits and improve system performance.
 Live streaming: For streaming live data, there often is no time to go back and do retransmissions. In these situations we can use ECC to inject redundancy into the stream to deal with some amount of packet loss. How much redundancy to inject is a bandwidth vs. reliability tradeoff which typically depends on the network conditions and specific application.
 Blind sending: In certain applications no feedback channel is available from the receivers to the sender of the data. In those cases ECC can be used to increase the chances of sending useful data to the receiver(s).
 Peertopeer networks: ECC can be utilized in peertopeer networks to increase the probability that peers have useful data to exchange.
 Storage systems: In storage systems ECC is a potentially very valuable tool to ensure that data is not lost when spread over many disks in a data center.
 Highlatency systems. In communication systems with high latency, such as satellite systems, ECC can be used to avoid expensive retransmissions needed due to packet loss.