What about the latency, stupid
Today, low-latency communication is an important feature - and for good reason! Many applications require low-latency networks to function properly. But latency, i.e. the delay between when a data packet is sent and when it is received, is not an easy metric to improve.
In this post we look at how to achieve reliable, low-latency communication over an unreliable network using modern erasure correcting codes (ECC). We show how this approach compares to traditional systems through a set of interactive visualizations.
First, we discuss why low latency is such a hard goal to achieve, but if you want to jump straight to the visualizations you can use the links below:
It’s the latency, stupid
In his famous rant “It’s the latency, stupid”, published in 1996, Stuart Cheshire outlines two facts about networking that are still worth repeating today:
Making more bandwidth is easy: You can always buy more capacity. Upgrade your Internet connection, buy more satellite air time or lay down more fiber optic cables. Going from 10 Mbit/s to 10 Gbit/s is just a question of throwing more money at the problem.
Once you have bad latency you are stuck with it: Latency is limited by the speed of light plus the processing delays added by the software and hardware components used. The speed of light sets a fundamental lower limit on the latency between any two points. As an example, sending a bit between London and Sydney is going to take at least 80 ms, no matter how much capacity you have, because that is just how long it takes for light to travel the approximately 17000 km from London to Sydney through a fiber optic cable.
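To get a feel for the numbers, here is a quick back-of-the-envelope calculation - a sketch assuming light travels through fiber at roughly two thirds of its vacuum speed (silica fiber has a refractive index of about 1.5):

```python
# Back-of-the-envelope one-way propagation delay through optical fiber.
# Assumption: light in fiber travels at roughly c / 1.5.

SPEED_OF_LIGHT_KM_S = 300_000   # vacuum speed of light, approximate
FIBER_REFRACTIVE_INDEX = 1.5    # typical for silica fiber

def fiber_delay_ms(distance_km: float) -> float:
    """Minimum one-way propagation delay in milliseconds."""
    speed_km_s = SPEED_OF_LIGHT_KM_S / FIBER_REFRACTIVE_INDEX
    return distance_km / speed_km_s * 1000

print(fiber_delay_ms(17_000))   # London-Sydney: 85.0 ms, one way
```

No amount of extra capacity changes this number - only a shorter path (or a faster medium) would.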
To make matters worse, many applications that need low latency also require a reliable connection. This means that any lost data packets must be repaired before the application can use the incoming data. As we will see later, depending on which strategy we use to repair lost data, reliability can have a significant impact on latency.
The following table shows a few examples where low latency and reliability are key enablers:
| Industry | Application |
|---|---|
| Multimedia | Live streaming, video conferencing, remote desktop |
| Entertainment | Immersive entertainment, online gaming |
| Healthcare | Remote robotic surgery with haptic feedback, remote diagnosis with haptic feedback, emergency response in ambulance |
| Transport | Driver assistance applications, enhanced safety, self-driving cars, traffic management |
| Manufacturing | Motion control, remote control with AR applications |
Source: Business Case and Technology Analysis for 5G Low Latency Applications
The above-mentioned applications have different requirements for both latency and reliability. For example, according to Wowza Media Systems, 150 ms is the maximum tolerable latency for video conferencing - above that limit it becomes difficult to have a conversation. Reliability matters here too, but nobody will die if the picture breaks up once in a while. Compare this to self-driving cars, where Ericsson estimates (slide 8 of this presentation) that latency needs to stay below 5 ms with very high reliability - there, a failure can have fatal consequences.
Today, end-to-end transport protocols typically either guarantee full reliability, i.e. all packet losses are repaired (e.g. TCP), or make no explicit effort at correcting losses (e.g. UDP). For most of the applications mentioned above, the latter is unacceptable: we certainly want some, if not full, protection against packet loss. The problem is that to provide full protection against packet loss, transport protocols typically use a mechanism called Automatic Repeat Request (ARQ) - and this can be problematic in low latency scenarios.
ARQ Visualization: A Tutorial
To see how ARQ works, we’ve built a small step-by-step visualization. In the visualization we’ve made a few assumptions:
The link has a latency of 200 ms (in both directions).
A new packet is produced every 200 ms.
To be clear, the visualization is not meant to provide a realistic simulation - but rather to show the concepts and their trade-offs.
To get started press the “step forward” button below:
With ARQ there are fundamentally two ways for a receiver to notice that a packet has been lost:
It sees a packet with a higher packet ID/sequence number arrive, in which case the gap may indicate a loss (though it could also be caused by out-of-order delivery).
The receiver has an expectation for when the next packet should arrive and detects that no packet was received.
Regardless of the detection approach used, how quickly the receiver can send a retransmission request depends on the latency of the link.
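To make the first detection method concrete, here is a minimal sketch of receiver-side gap detection (the function name and structure are ours, purely for illustration; a real protocol would also use timers to catch losses at the tail of the stream):

```python
# Sketch: detect missing sequence numbers at the receiver.
# A gap may also be caused by out-of-order delivery, so a real
# implementation would wait a little before requesting a retransmission.

def detect_gaps(received_ids: list[int]) -> set[int]:
    """Return the sequence numbers that appear to be missing."""
    expected = set(range(min(received_ids), max(received_ids) + 1))
    return expected - set(received_ids)

print(detect_gaps([0, 1, 3, 4, 6]))  # {2, 5} look lost (or reordered)
```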
Example: if the link has a latency of 50 ms, the fastest possible reaction to a packet loss from the receiver would be right after the expected arrival of the packet, i.e. 50 ms after the packet was sent. The retransmission request will also need 50 ms to travel back to the sender. The sender can then schedule the retransmission, which will arrive after an additional 50 ms. In total, this sums up to a delay of 150 ms - at best.
To summarize we can say that:
Using ARQ the minimum latency penalty from a packet loss is 3x the link latency.
This means that if your application requires 20 ms latency and your link has 10 ms latency, ARQ will not be able to deliver the needed performance.
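The arithmetic behind this rule of thumb is simple enough to write down - a sketch assuming instantaneous loss detection and processing, i.e. the best case from the example above:

```python
# Best-case time to repair a single loss with ARQ: the receiver waits
# one link latency to detect the loss, the request takes one link
# latency to travel back, and the retransmission one more to arrive.

def min_arq_repair_latency_ms(link_latency_ms: float) -> float:
    detect = link_latency_ms      # wait for the packet that never arrives
    request = link_latency_ms     # retransmission request travels back
    retransmit = link_latency_ms  # retransmitted packet travels out again
    return detect + request + retransmit

print(min_arq_repair_latency_ms(50))  # 150.0 ms, i.e. 3x the link latency
print(min_arq_repair_latency_ms(10))  # 30.0 ms - blows a 20 ms budget
```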
So how can we address this? One solution is to throw more bandwidth at the problem. Using modern erasure correcting codes (ECC), we can use mathematics to introduce repair packets into the stream. Strictly speaking, ECC does not require more bandwidth than ARQ - a lost packet is a lost packet, and from a bandwidth point of view it makes no difference whether we fix it with a retransmission or with ECC. In practice, however, we will often generate more ECC packets than strictly needed in order to drive the latency down. The amount of extra bandwidth used will depend on the application and its latency requirements. But before we get lost in the details, let’s take a look at how ECC works.
ECC Visualization: A Tutorial
Before we dive into the ECC visualization, here is a quick overview of how ECC works:
An ECC algorithm takes data packets as input.
Based on the input, it produces additional ECC repair packets.
These ECC packets can be used to repair any of the input data packets, should they be missing.
Example: if we have two data packets $A$ and $B$ (100 bytes each), we can now generate an ECC packet $R$ (also of size 100 bytes). Receiving any two out of the three packets will allow us to decode $A$ and $B$. If for example $B$ is lost and we receive $A$ and $R$ we can still decode $B$.
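One of the simplest codes with exactly this property is a single XOR parity packet. The low latency algorithms discussed below are more sophisticated, but the recovery principle can be sketched like this:

```python
# Minimal erasure code: one XOR repair packet over two data packets.
# Receiving any two of the three packets is enough to decode both A and B.

def xor_bytes(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

A = bytes(100)             # a 100-byte data packet (all zeros for brevity)
B = bytes(range(100))      # another 100-byte data packet
R = xor_bytes(A, B)        # the repair packet, also 100 bytes

# Suppose B is lost but A and R arrive: XOR-ing them recovers B.
assert xor_bytes(A, R) == B
```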
There exists a wide range of ECC algorithms, and in recent years algorithms especially suited for low latency applications have started to appear. Going into detail on how these new algorithms work is beyond the scope of this post. But the general gist is that they no longer require a full block of data to be available before repair packets can be created - this makes them very well suited for low latency applications.
In the ARQ tutorial we showed how retransmissions could be used to repair packet loss - at the expense of added latency. In the following we will show how we can use ECC to generate repair packets and thereby avoid the retransmissions and lower the overall latency (of course, ARQ and ECC can also be mixed in protocols where it makes sense).
We use the same assumptions on link latency as in the ARQ tutorial. In addition, we generate one ECC packet for every two data packets. This yields a bandwidth overhead of 50%.
The ECC code used in the visualization is an ideal code, meaning that for every repair packet received we are able to decode one original packet. In practice such a code does not exist; we would therefore need to spend more bandwidth with a less-than-ideal code to obtain the target latency.
As shown in the above visualization, we can spend bandwidth to gain better latency properties. For highly latency-sensitive applications, this just might be the trade-off we are looking for.
Comparing this to the ARQ scenario:
Using ECC we can come arbitrarily close to the link latency at the cost of additional bandwidth usage.
Similarly, operating an application with a 20 ms latency requirement over a 10 ms link would be possible using ECC, given that we have enough capacity on the link to overcome the packet loss rate.
Another interesting property of ECC systems is that they can operate without a back channel (from receiver to sender). As long as the rate of ECC packets produced is higher than the rate of packet loss, ECC can guarantee delivery. We showed this in the visualization by removing the back channel. In a real application, the back channel could be used to adapt the ECC rate to the observed packet loss rate, or a hybrid ARQ/ECC protocol could be built.
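To get a feel for how much overhead is needed, here is a rough sketch that assumes independent packet losses and an ideal code: decoding a block of k data plus r repair packets succeeds whenever at most r of the k + r packets are lost (all parameter values below are illustrative):

```python
# Residual failure probability of an ideal erasure code over blocks of
# k data + r repair packets, with independent loss probability p.
from math import comb

def block_failure_prob(k: int, r: int, p: float) -> float:
    n = k + r
    # Decoding fails when more than r of the n packets are lost.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(r + 1, n + 1))

# 10% packet loss: one repair per two data packets vs. no repair at all.
print(block_failure_prob(k=2, r=1, p=0.1))  # ~0.028
print(block_failure_prob(k=2, r=0, p=0.1))  # ~0.19
```

Lowering the residual loss further is then a matter of raising the ECC ratio, i.e. spending bandwidth.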
Comparison: ARQ vs. ECC
Here we’ve aligned the two visualizations such that you can run them side by side and get a feeling for the relative performance. We’ve also added a few extra knobs to the visualization such that you can control:
The speed of the visualization (we send a total of 1000 packets, so the visualization will stop after that).
The round trip time (RTT) of the link in ms. Round trip time is the time it takes for a packet to travel from the sender to the receiver and back. If you have a link latency of 100 ms, the RTT will be 200 ms (assuming symmetric delay).
The error probability, i.e. the probability that a data packet is lost during transmission.
The ECC ratio, i.e. the ratio of repair packets to data packets sent on the link.
Conclusion
Low latency + reliability is going to be critical for a large number of new innovative applications (remote control of machines, AR/VR, etc.). For these applications ECC could be a key technology enabler to:
Avoid unpredictable latency spikes due to packet loss and ARQ protocols.
Allow applications to work over longer distances. If retransmissions are too costly in terms of latency, ECC algorithms can be a solution.
Improve the worst-case latency behavior. Using ECC we can cut the tail of the latency distribution.
For many latency sensitive applications, ECC would provide an elegant and efficient solution.
If you want to start testing ECC today, we’ve got you covered with high-performance, cross-platform software libraries ready for your application.
Get in touch and let’s discuss how you can start to integrate ECC today!