And why it can be so hard
Detecting whether a node is dead sounds like a very simple task. In reality, it is a hard problem, which is why we usually hand this "simple" feature off to third-party cloud services that monitor our nodes' health for us. If AWS only provided bare-metal servers, you would have to pay per request or per month for a separate node-detection service; it would be harsh of them not to offer it, but it would also be expensive for companies to pay for such a feature.
To tolerate faults, we first need to detect them. In this article, you will see why detecting node failure is so hard, and we will discuss a high-level architecture that detects node failure with the Phi Accrual failure detector.
Slowness in the network is like traffic congestion in Disneyland. Imagine you are waiting in line to ride Space Mountain. At the front of the queue, you see that the waiting time is ten minutes. You think to yourself, "Ten minutes is not a long time," so you wait in line. A few minutes pass by.
You are almost at the front of the queue when you realize there is another queue ahead of you that requires you to wait an additional 30 seconds. Latency works in a similar way.
When packets are sent from your machine to a destination machine, they travel through network switches, which queue them up and feed them into the destination network link one by one.
Once a packet arrives at the destination machine, if all CPU cores are currently busy, the operating system queues the incoming request until the application is ready to handle it.
TCP also performs flow control (backpressure), limiting the rate at which a node sends data so that it does not overwhelm the receiving node or the network link. This adds yet another layer of queueing, this time on the sending node, on top of the queues in the network switches.
If you are running a single program and one part of it stops working, the entire program usually crashes, and you get a stack trace you can inspect to understand why the system crashed.

Now imagine the program didn't crash, but it is slow and buggy, and there is no stack trace telling you which function or method is misbehaving. Failures like this are much harder to detect than a full crash. This sort of failure is called a partial failure.
Partial failures are much harder to detect because the system is neither fully working nor fully broken. There are numerous possible reasons why the program is "having a bad day."
Since distributed systems don’t have a shared state, partial failure happens all the time.
If you didn't get any response, that doesn't mean the node is dead. There are several possible reasons for getting no response (a short sketch follows the list):
- The message was sent to the network, but it got lost and the other side didn’t receive it.
- The message may be waiting in a queue and will be delivered later.
- The remote node may have failed.
- The remote node may have temporarily stopped responding because of garbage collection.
- The remote node may have processed the request, but the response is lost in the network.
- The remote node may have processed the request and responded, but the response was delayed and will be delivered later.
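To make this concrete, here is a minimal sketch in Python of why a failed call tells the caller so little. The health-check URL is a hypothetical placeholder, and every failure mode above collapses into the same exception on the caller's side:

```python
import urllib.request
import urllib.error

# Hypothetical health-check endpoint; the address is an assumption for illustration.
HEALTH_URL = "http://10.0.0.5:8080/health"

def check_node(timeout_seconds: float = 2.0) -> bool:
    """Return True if the node answered in time, False otherwise."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout_seconds) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        # At this point we cannot tell WHICH failure happened:
        # - the request was lost before reaching the node,
        # - the request is still queued somewhere,
        # - the node crashed,
        # - the node is paused (e.g., garbage collection),
        # - the node processed the request but the response was lost,
        # - the response is merely delayed and will arrive after we gave up.
        return False
```

All six situations look identical from here: the call simply didn't come back in time.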
If a network call doesn't get a response, the caller can never know the true state of the remote node. And you should expect this to happen: sooner or later, some calls will get no response. What should the load balancer or monitoring service do?
Usually, load balancers constantly send health checks to verify that the service is healthy. When a remote node is not responding, we can only guess that the packets were lost somewhere along the way.
The next action is either to retry or to wait until a timeout has elapsed. Retrying can be dangerous if the operation is not idempotent: taking further action when you got no response may cause unwanted side effects, such as double billing. Hence, a timeout is often the safer choice.
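As an illustration, here is a sketch of a retry made safe with an idempotency key, a common technique for payment APIs. The endpoint, header name, and payload format are assumptions for illustration, not a real API:

```python
import uuid
import urllib.request
import urllib.error

def charge_with_retry(amount_cents: int, retries: int = 3, timeout: float = 5.0):
    """Retry a charge safely by reusing one idempotency key across attempts."""
    idempotency_key = str(uuid.uuid4())  # same key reused on every retry
    for _ in range(retries):
        request = urllib.request.Request(
            "http://payments.internal/charge",  # hypothetical endpoint
            data=str(amount_cents).encode(),
            headers={"Idempotency-Key": idempotency_key},
            method="POST",
        )
        try:
            with urllib.request.urlopen(request, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            continue  # safe: the server deduplicates on the idempotency key
    raise TimeoutError("charge did not complete after retries")
```

With the key, a retried request that the server already processed can be recognized and not billed twice; without it, a lost response plus a retry means double billing.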
If we take the timeout approach, how long should the timeout be?
If it is too long, you leave the client waiting, resulting in a bad user experience.
If you make the timeout too short, you may get a false positive and mark a perfectly healthy node as dead. For example, the node may be alive but simply taking longer to process certain actions. Prematurely declaring the node dead and having another node take over may cause the operation to be executed twice.
Furthermore, once a node is declared dead, all of its tasks must be delegated to other nodes, putting more load on them. If those nodes are already under heavy load, this may cause a cascading failure.
The right timeout period depends on the application logic and business use case.
A service can declare an operation timed out after an x amount of time if users can tolerate waiting that long. For instance, a payment service can set seven minutes as the timeout period if seven minutes won't cause a bad user experience.
Many teams determine the timeout period experimentally, through trial and error. In this scenario, the timeout is usually a constant, for instance, five or seven minutes.
However, a smarter way is not to treat the timeout as a constant but to derive it from the observed variability of response times. If we measure the distribution of network round-trip times over an extended period and across many machines, we can determine the expected variability of delays.
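A minimal sketch of this idea, assuming we keep a sliding window of recent round-trip times: the timeout becomes the observed mean plus some multiple of the standard deviation, rather than a hard-coded constant. The `jitter_factor` here is a made-up tuning knob, not a standard value:

```python
import statistics

def adaptive_timeout_ms(samples: list[float], jitter_factor: float = 4.0) -> float:
    """Timeout = observed mean round-trip time plus an allowance for jitter.

    `samples` is a sliding window of recent round-trip times in milliseconds.
    """
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean + jitter_factor * stdev
```

A node whose network is consistently fast gets a tight timeout; a node with noisy round-trip times automatically gets more slack.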
In other words, the monitoring system gathers the average response time plus some variability (jitter) factor and automatically adjusts timeouts according to the observed response-time distribution. This approach to failure detection is implemented by the Phi Accrual failure detector, which is used by Akka and Cassandra.
The Phi Accrual failure detector samples heartbeat response times in a fixed-size window to estimate their distribution. Each time the monitor sends a heartbeat to a remote node, it writes the response time into the fixed window. The algorithm uses this window to compute the mean, variance, and standard deviation of the response times. If you are interested, here's the formula for computing phi.
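The original formulation by Hayashibara et al. models heartbeat inter-arrival times as roughly normally distributed and defines phi from the probability that a heartbeat will still arrive:

$$\varphi(t_{\text{now}}) = -\log_{10}\Bigl(P_{\text{later}}\bigl(t_{\text{now}} - T_{\text{last}}\bigr)\Bigr), \qquad P_{\text{later}}(t) = 1 - F(t)$$

where $T_{\text{last}}$ is the arrival time of the most recent heartbeat and $F$ is the cumulative distribution function of a normal distribution whose mean and variance are estimated from the window. A sketch in Python, with simplified window handling (Cassandra, for instance, approximates the distribution with an exponential instead):

```python
import math
import statistics

def phi(elapsed_ms: float, window: list[float]) -> float:
    """Suspicion level for a node, given the time since its last heartbeat.

    `window` holds recent heartbeat intervals in milliseconds; we fit a
    normal distribution to it, as in the Hayashibara et al. paper.
    """
    mean = statistics.mean(window)
    stdev = statistics.stdev(window) if len(window) > 1 else 1.0
    stdev = max(stdev, 1e-3)  # guard against a zero-variance window
    z = (elapsed_ms - mean) / stdev
    p_later = 0.5 * math.erfc(z / math.sqrt(2))  # 1 - normal CDF
    return -math.log10(max(p_later, 1e-15))  # clamp to avoid log(0)
```

The longer a node stays silent relative to its past behavior, the higher phi climbs; a low phi simply means the current silence is still consistent with the observed distribution.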
In the next section, we will briefly touch on the high-level design of node failure detection.
The node failure detection component consists of two parts: the interpreter and the monitor.
The interpreter's job is to interpret the suspicion level of each node. The monitor's job is to send heartbeats to each node and delegate the response times to the interpreter.
The monitor constantly sends a heartbeat to each remote node. Each time it sends a health check, it measures how long the response takes, then passes that response time to the interpreter, which determines the node's suspicion level.
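A minimal sketch of the monitor, under stated assumptions: `send_heartbeat` is a hypothetical transport call, and `Interpreter.record` is an assumed interface (sketched further below):

```python
import time

class Monitor:
    """Periodically heartbeats every node and reports timings to the interpreter."""

    def __init__(self, nodes, interpreter, interval_seconds: float = 1.0):
        self.nodes = nodes              # objects exposing a send_heartbeat() method
        self.interpreter = interpreter  # consumes (node_name, response_time_ms)
        self.interval = interval_seconds

    def run_forever(self):
        while True:
            for node in self.nodes:
                started = time.monotonic()
                ok = node.send_heartbeat()  # hypothetical transport call
                elapsed_ms = (time.monotonic() - started) * 1000
                # Delegate the measurement; a missing response is reported as None.
                self.interpreter.record(node.name, elapsed_ms if ok else None)
            time.sleep(self.interval)
```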
There are two ways of placing the interpreter: centralized and distributed.
The centralized way places the interpreter and the monitor together as their own service. That service interprets each node's health and sends a signal to the other nodes for further action. The result is a boolean value: suspected or not.
The distributed way places the interpreter in each application layer, giving the application the freedom to configure the suspicion levels and the action to take at each level.
The advantage of the centralized way is that nodes are easier to manage. The distributed approach, however, allows each node to be fine-tuned or optimized differently based on its own suspicion levels.
For the interpreter, we can use the Phi Accrual failure detector we discussed in the previous section. We set a threshold for phi: if the computed phi is higher than the threshold, we declare the remote node dead; if it is lower, the remote node is considered available.
While the monitor sends the request to the remote node, the interpreter starts timing the response. If the remote node takes longer than the threshold to respond, the interpreter can stop waiting and mark the node as suspected.
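Putting the pieces together, a sketch of the interpreter might look like this, reusing the `phi` function from earlier. The threshold of 8.0 and the window size are arbitrary example values, not recommended settings:

```python
import time
from collections import defaultdict, deque

class Interpreter:
    """Tracks heartbeat timings per node and answers: is this node suspected?"""

    def __init__(self, threshold: float = 8.0, window_size: int = 100):
        self.threshold = threshold
        self.windows = defaultdict(lambda: deque(maxlen=window_size))
        self.last_seen = {}  # node name -> timestamp (ms) of last heartbeat

    def record(self, node_name: str, response_time_ms):
        now_ms = time.monotonic() * 1000
        if response_time_ms is not None:
            self.windows[node_name].append(response_time_ms)
            self.last_seen[node_name] = now_ms

    def is_suspected(self, node_name: str) -> bool:
        window = self.windows[node_name]
        if node_name not in self.last_seen or len(window) < 2:
            return False  # not enough data to judge yet
        elapsed = time.monotonic() * 1000 - self.last_seen[node_name]
        return phi(elapsed, list(window)) >= self.threshold
```

In the centralized placement, this class lives inside the failure-detection service; in the distributed placement, each application embeds it with its own threshold.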
We rarely think about detecting node failure when designing an application because it comes as a built-in feature of every major cloud provider. However, detecting a dead node is not an easy operation. One reason is the lack of shared state in distributed systems: engineers need to design a reliable system on top of an unreliable network.
Most of the time, companies detect node failures through trial and error. However, instead of deriving a boolean dead-or-alive answer from a fixed timeout, we can embrace the variability of response times: the Phi Accrual failure detector models the distribution of response times and lets us set a threshold for the suspicion level.
Lastly, the high-level design for node failure detection can consist of two components: the monitor and the interpreter. The monitor constantly sends heartbeats to remote nodes and delegates the response times to the interpreter, which analyzes the suspicion level.
If a node reaches a certain suspicion threshold, the interpreter returns a boolean value to the calling service to indicate that additional action is needed.
Do you have any other ideas on how you can detect a node failure in a distributed system?