The Black Swan Events in Distributed Systems

Photo by Markus Spiske on Unsplash

Distributed systems are just what their name suggests: systems that are spatially separated and built using more than one computer.

There can be multiple reasons to build a distributed system. Some of them are…

I’m a daredevil!

Original image by KC Green

No no… no… just joking, umm maybe not… okay definitely joking.

Well, computers are physical things and like all physical things that do some work, computers also undergo wear and tear. Sooner or later they will break down.
If you build a system using just one computer and someday that computer breaks down, well, you are not going to feel very good about that day.

If, on the other hand, you have built your system with more than just one computer and one of them goes down, the overall system may still be okay and functioning well. You’ll be alright.

Of course, computers breaking down isn’t the only thing that can cause incidents in your distributed systems.
There might be a bug in the code, or the system may be under increased load.

When a system is asked to do more work than it possibly can, something is eventually going to fail. Maybe the CPU usage is so high that it has become the bottleneck, and user requests start to time out, suggesting to users that the system is down. Or maybe disk space has become the bottleneck and the system cannot store any more data.

In a normal system overload case, if the source of load, the trigger, is removed, the problem goes away and everything settles back to a normal, stable state.

For example, maybe the system is under a DDoS attack that eats up all of the bandwidth, causing actual user traffic to be dropped. The system is now in a vulnerable state, an undesirable state that causes system unavailability.

Once we put a blocklist in place to stop traffic from the offending IP addresses, the trigger goes away, load on the network returns to normal, user traffic begins to go through, and the system comes back to its stable state.
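As a rough illustration, here is a minimal sketch of what such a blocklist might look like if it were enforced in application code. The addresses and handler below are hypothetical, and in practice this filtering usually happens at the network edge (firewall, load balancer, or CDN) long before requests reach the application.

```python
# Minimal sketch of an application-level IP blocklist (hypothetical example).
# Real DDoS mitigation normally happens at the firewall, load balancer, or CDN.

BLOCKED_IPS = {"203.0.113.7", "203.0.113.8"}  # addresses identified as attack sources

def process(payload: str) -> str:
    return f"200 OK: processed {payload}"

def handle_request(client_ip: str, payload: str) -> str:
    """Drop traffic from blocked addresses so real users get the bandwidth."""
    if client_ip in BLOCKED_IPS:
        return "403 Forbidden"   # shed the attack traffic early
    return process(payload)      # normal path for legitimate users

if __name__ == "__main__":
    print(handle_request("203.0.113.7", "attack"))   # blocked
    print(handle_request("198.51.100.20", "hello"))  # served
```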

It’s hard to prevent such overload incidents, but it’s usually easy to recover from them.

There’s another class of overload incidents that are much harder to resolve, where the system does not recover to its stable state just by removing the initial trigger. These incidents can cause outages that last a long time, hours or even days in some cases. This class of incidents that keep going even after the initial trigger has been removed are called metastable failures.

We’ll get into what that actually means soon but first, let’s look at a real-life metastable failure.

This story happened way back in 2011, when I was still in high school. These types of events happen ever so rarely, but when they do happen they cause havoc. They are important enough that we should know about them and learn from them.

They are infamously ingrained in the history of distributed systems.

So, the issue affected EC2 customers in a single Availability Zone within the US East Region: a subset of EBS volumes became unable to service read and write operations. This caused service downtime for that subset, and customers were impacted for many hours, in some cases several days.

The trigger for this incident was a network configuration change that inadvertently routed all network traffic for a subset of EBS servers in one of the clusters to a low-bandwidth secondary network instead of the high-bandwidth primary network.

The secondary network acts only as a backup and was never designed to handle large amounts of traffic, so it was overloaded very quickly. This made it hard for the servers to talk to each other and led to read-write failures on the virtual disks in the impacted servers. So far, this looks like a normal overload failure.

The network change was quickly rolled back and traffic began to go through the high-bandwidth primary network again, removing the trigger of the overload.

So, EBS servers use a buddy system where every block of data is stored on two different servers for reliability. When a server loses the connection to its buddy, it assumes that its buddy has crashed and that it now holds the one and only copy of the customer data. The server then quickly tries to find a new buddy so it can replicate the customer data again. This process is called re-mirroring.

In this scenario, once the big network outage was resolved, there was a large group of EBS servers all looking for a buddy to mirror their customer data onto.

Maybe at this point you can already guess what happened next: what they called a re-mirroring storm.

Cool name, I must say.

Soon enough, all of the free disk space in the impacted cluster was used up, because each server was trying to make a copy of its data. We can imagine that this was a lot of data, requiring a lot of disk space and a lot of network bandwidth.

This left a lot of servers stuck in an aggressive loop: search for a buddy with free space, fail to find one, and search again.
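To get a feel for why this loop sustains itself, here is a toy simulation of my own (not how EBS actually works, and all the numbers are made up): each server that lost its buddy claims free space in the cluster for a new replica, and once the free space runs out, the remaining servers keep searching every round without ever making progress.

```python
# Toy simulation of a re-mirroring storm (illustrative sketch only,
# not how EBS actually works; all numbers are made up).
free_space_units = 100      # free capacity left in the cluster
REPLICA_SIZE = 5            # space one new replica consumes
servers_searching = 50      # servers that lost their buddy

for round_number in range(1, 11):
    re_mirrored = 0
    for _ in range(servers_searching):
        if free_space_units >= REPLICA_SIZE:
            free_space_units -= REPLICA_SIZE   # claim space for a new replica
            re_mirrored += 1
        # else: no space found, so this server will search again next round
    servers_searching -= re_mirrored
    print(f"round {round_number}: {re_mirrored} re-mirrored, "
          f"{servers_searching} still searching, {free_space_units} space left")

# The free space is exhausted in the very first round; after that the remaining
# servers keep searching every round and never make progress, so the load
# persists even though the network trigger is long gone.
```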

This search for buddies placed additional, region-wide load on the control plane servers, which coordinate requests for the EBS servers, in turn overloading them and knocking them offline for several hours.

The control plane has a regional pool of available threads it can use to service requests. When these threads were completely filled up by the large number of queued requests, the EBS control plane had no ability to service API requests and began to fail API requests for other Availability Zones in that Region as well.

— From AWS Documentation

Now, this was impacting not only requests from the degraded cluster but requests from other clusters as well.

What do we do when nothing seems to work? Turn it off and back on again.

Eventually, they cut network connectivity to the degraded EBS cluster and basically took it offline so that the control plane servers could recover and start serving requests from the healthy clusters.

To bring everything back to normal, additional capacity from other data centers in that region had to be added to the impacted cluster to increase the available disk space, so the re-mirroring storm could calm down.

These types of self-sustaining failures that feed themselves and continue to persist even after the initial source of load or trigger has been removed are known as metastable failures.

Metastable failures manifest themselves as black swan events; they are outliers because nothing in the past points to their possibility, have a severe impact, and are much easier to explain in hindsight than to predict.

— From the paper, Metastable Failures in Distributed Systems

The authors of the paper present a model describing the state transition of a system from stability to metastability.

We all want our systems to be in a stable state. Well, maybe not all of us, remember the guy from the beginning of the post who is (or was) a daredevil after all.

Anyway, the stability of a system may be impacted temporarily by an increase in load, transitioning the system from stable → vulnerable.

We hope that once the load is removed, the system goes back to its stable state.

As we saw in the EC2-EBS saga, that’s not what happened there: instead of the system going from vulnerable → stable after the removal of the initial load, the system went into a metastable state and remained there for a sustained period of time.

Since these types of failures are hard to predict, we can only form general observations from past incidents, and take those into consideration while designing our systems.

Put a Cap on Retrying Requests

If you are making a call to a downstream service and that call times out, in most cases it’s normal to give it a retry; it was probably just a blip in the network, and the retry will succeed. But if the call to the downstream service timed out because it was overloaded and you retry, you are adding load to an already overloaded system.
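A minimal sketch of a capped retry with exponential backoff and jitter might look like the following. The function name, the cap of three attempts, and the delay values are all illustrative choices, not prescriptions.

```python
import random
import time

def call_with_capped_retries(call, max_attempts=3, base_delay=0.1):
    """Retry a flaky call a bounded number of times with exponential backoff.

    `call` is any zero-argument function that may raise TimeoutError.
    The cap ensures we never keep piling load onto an overloaded service;
    the backoff and jitter spread the retries out over time.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up; the downstream service may be overloaded
            # exponential backoff with jitter: ~0.1s, ~0.2s, ~0.4s, ...
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, base_delay)
            time.sleep(delay)
```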

Here, a circuit breaker on the client side helps: it keeps track of request failures and notices that, okay, I have called this service 10 times and it keeps timing out, so maybe I’ll give it a break and throttle my requests to give it some time to recover from the load.
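A bare-bones version of that idea is sketched below. The threshold of 10 failures and the 30-second cooldown are just illustrative numbers; real circuit breaker libraries add refinements such as a half-open state that lets a few trial requests through before fully closing again.

```python
import time

class CircuitBreaker:
    """Toy client-side circuit breaker: after too many consecutive failures,
    stop calling the service for a cooldown period so it can recover."""

    def __init__(self, failure_threshold=10, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at = None              # time the breaker tripped, if it has

    def call(self, fn):
        # While the breaker is open, fail fast instead of hitting the service.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: giving the service a break")
            self.opened_at = None          # cooldown over, allow a trial call

        try:
            result = fn()
        except TimeoutError:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        else:
            self.consecutive_failures = 0  # any success resets the count
            return result
```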

Dampen The Feedback Loop

The feedback loop is the self-sustaining effect that keeps the system in the metastable state.

If the system is in an overloaded state that causes failures, the system may respond to those failures by changing its behavior. This change, however small, may put just enough extra load on the system to push it over the edge, where the system enters a self-sustaining feedback loop that puts ever more load on it, and it ends up in a metastable state.

If we can identify feedback loops ahead of time, or soon enough, we can try to limit their impact on the system. There can be countless reasons for a system to enter a metastable state, and therefore countless approaches to deal with them.
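One common dampening technique, not specific to the AWS incident above but widely used, is server-side load shedding: reject new work early when the backlog is already deep, instead of queueing it and letting every request time out later. A rough sketch, with a made-up threshold:

```python
from collections import deque

MAX_QUEUE_DEPTH = 100   # illustrative threshold; tuned per system in practice
queue = deque()

def accept_request(request):
    """Queue the request if there is headroom, otherwise shed it immediately.

    A fast, cheap rejection keeps the backlog bounded, which damps the
    failure -> retry -> more load feedback loop before it takes hold.
    """
    if len(queue) >= MAX_QUEUE_DEPTH:
        return False   # shed: the caller gets an immediate "try later" signal
    queue.append(request)
    return True
```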

The circuit breaker is a solid solution that is becoming very common in distributed systems to prevent systems from going into a metastable state.

Hospitals During a Surge

So, hospitals are built with efficiency in mind: if there’s load on one hospital, patients can be redirected to a nearby hospital, but hospitals are not designed to handle surges, which is what happened during COVID. An individual hospital does not have a lot of extra capacity beyond its normal load, but across hospitals the load can usually be managed.

As long as the load isn’t increasing everywhere, the global system can absorb it, but when the load increases everywhere, we enter a metastable state where the whole infrastructure seems to collapse.

JVM Garbage Collector

Normally, the garbage collector works just fine, but give it just a little bit too much garbage and suddenly the JVM is busy doing garbage collection and throwing out all the compiled code. It then has to recompile that code, which takes more memory, which requires more garbage collection, and so on and on until you kill it.

In this regard multi-threaded systems are approximately distributed systems.

Can you enter metastable failures if there’s no recovery strategy?

Well, I hope you enjoyed reading this as much as I enjoyed researching and writing about the topic.

Until next time. Thanks for reading.

Want to Connect? If you like my content, consider subscribing to my newsletter :)
