Application Self-Healing – DZone Performance

This is an article from DZone’s 2021 Application Performance Management Trends Report.

Today, automation is one of the main goals in IT projects. Many platforms run on cloud infrastructure and are automated at the platform and infrastructure level. Companies are moving ahead with automation and extending it to disaster recovery. As a result, many applications are designed to avoid failures and recover automatically. This is commonly called “self-healing”. In this article, we’ll walk through self-healing and review some common areas where apps experience failures.

What is a self-healing app?

A self-healing app is an app that detects failures and attempts to recover before they escalate into a bigger problem. For users, self-healing apps reduce system downtime. For developers, self-healing allows them to spend more time developing rather than fixing problems. When something fails, a self-healing application keeps running while it recovers. In short, it tries to restore the app to its default state. The primary task of the self-healing feature is to troubleshoot problems: a self-healing application can detect malfunctions and system errors and repair the system automatically. For the purposes of this article, when we say an application, we mean the entire system to which the application or platform belongs.

Self-healing levels

An app or platform is more than just developer code. The infrastructure on which you run the code and the third parties it connects to are also part of your application. It is very common for apps to depend on multiple third parties. On the one hand, this makes it easier to focus on the functionality of the main application and use third parties for other services. On the other hand, if a third party fails, it can lead to the failure of your application.

Then, you have to take action and fix the problem. Moreover, when you run your application on the cloud, you are dealing with a virtual infrastructure, and you are responsible for setting it up and for defining its recovery strategy. Although cloud service providers offer a Service Level Agreement (SLA) and pledge to keep their services running at the highest possible level, it is up to you to ensure that your application stays accessible no matter where the failure is.

Before thinking about how to create a self-healing mechanism, you need to identify the points of failure. To design a self-healing system, you must have a comprehensive overview to monitor your application. Make sure not to leave anything off your radar. Then, you can identify potential scenarios and act accordingly to keep your application running all the time. To get a better picture of what needs to be done on the monitoring side, let’s break monitoring into some sub-areas.

Observation level

Monitoring is one of the most important parts of any application. A monitoring solution lets you observe application behavior at runtime. In addition, it lets you check infrastructure performance, network transactions, and third-party availability. Setting up monitoring is not just about the application itself, but about everything related to it, from the infrastructure to the application components and third parties.

Smart alerts

When preparing alerts, engineers have to identify the warning and critical thresholds for each alert. However, people often start receiving notifications or emails as soon as they set up alerts. These notifications do not necessarily mean that there has been a failure; they may only say that the system has crossed a threshold. After a while, most engineers tend to ignore alert notifications that seem unimportant. This can lead to missing real issues among the many false positives. The right way to monitor is to fine-tune your alerts so that each one is something you should take seriously and act upon. Unimportant alerts should either be tuned or removed.
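One common way to cut down on false positives is to fire an alert only when a metric stays above a threshold for several consecutive samples. The sketch below illustrates that idea; the `AlertRule` class, its fields, and the CPU thresholds are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    """Hypothetical alert rule with warning and critical thresholds."""
    warning: float
    critical: float
    sustained_samples: int = 3  # consecutive samples required before firing

    def evaluate(self, samples: list[float]) -> str:
        """Return 'critical', 'warning', or 'ok' for the most recent samples."""
        recent = samples[-self.sustained_samples:]
        if len(recent) < self.sustained_samples:
            return "ok"  # not enough data to judge yet
        if all(value >= self.critical for value in recent):
            return "critical"
        if all(value >= self.warning for value in recent):
            return "warning"
        return "ok"


# Assumed thresholds for CPU usage in percent.
cpu_rule = AlertRule(warning=70.0, critical=90.0)
print(cpu_rule.evaluate([50, 95, 60, 65]))  # a single spike does not fire
print(cpu_rule.evaluate([75, 80, 85]))      # sustained above the warning level
print(cpu_rule.evaluate([92, 95, 97]))      # sustained above the critical level
```

Requiring sustained breaches trades a little detection latency for fewer noisy notifications; the right window depends on the metric.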

Log everything

Logging is not necessarily part of monitoring, but what makes it important is the data you collect. By logging, you can record every event with its time and date. When something fails, the logs are the valuable record that tells you what happened, when, and where. That is why it is necessary to log everything, so you can keep track of all the possible causes behind a failure. It is recommended to centralize logging in your monitoring system.

In most monitoring tools, you can connect the logging system to the monitoring setup, and the monitoring system will process the log data. Intelligent monitoring systems can determine the relationships between application components, infrastructure, and third parties. They can therefore generate a summary from the monitoring data and logs at the time of failure, which helps to find the root cause faster.
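As an illustration of recording events with a timestamp and a component name, here is a minimal sketch using Python's standard `logging` module to emit one JSON object per event. JSON output is an assumption made for easier ingestion by a central system; the field names and the `payment-service` component are hypothetical.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # When the event happened, in UTC.
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            # Where it happened: the logger name doubles as the component name.
            "component": record.name,
            # What happened.
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(component: str) -> logging.Logger:
    """Return a logger that writes JSON lines to standard error."""
    logger = logging.getLogger(component)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


log = build_logger("payment-service")  # hypothetical component name
log.error("database connection lost")
```

In a real setup the stream handler would typically be replaced or supplemented by whatever shipper forwards logs to the central system.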

Common failure areas

These are some of the common areas where we see application failures. Let’s take a look at each failure area individually and the solution associated with it.

Network connection lost

One of the most common failures is the loss of network connectivity. Connectivity can be lost for the entire application or between components within it, such as a lost database connection. The best method for self-healing these failures is a retry mechanism, which increases the chance of recovery within a short period. A good monitoring tool can help solve these issues: you can trigger the retry process from the monitoring alert, so the incident is both recorded and resolved automatically.
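A retry mechanism like the one described above is often implemented with exponential backoff, so that repeated attempts do not hammer a struggling dependency. The following is a minimal sketch under that assumption; the `retry` helper, its parameters, and the `flaky_query` function are hypothetical names used only for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 5, base_delay: float = 0.5,
          max_delay: float = 30.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying on ConnectionError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            # Delay doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("attempts must be at least 1")


# Simulated dependency that fails twice and then recovers.
calls = {"count": 0}


def flaky_query() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("database unreachable")
    return "rows"


# Sleeping is stubbed out here so the demonstration runs instantly.
print(retry(flaky_query, sleep=lambda seconds: None))
```

Only `ConnectionError` is retried in this sketch; errors that a retry cannot fix, such as invalid input, should fail immediately.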

Inability to scale

When an application receives more requests than it can handle, it starts to fail or is unable to process requests. The solution is to make the application scalable. Scaling can be designed and handled at various stages of application development, but the best time to think about scalability is during architecture design, when you lay out your app with all its components. You can choose technologies that handle scalability in an automated manner. One example is a container-based architecture with an orchestration tool such as Kubernetes, which deals with scalability at different levels.
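To make automated scaling concrete, the sketch below applies the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The function name, the clamping bounds, and the sample numbers are assumptions, and the real autoscaler also applies tolerances and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Return the replica count needed to bring the metric back to its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Scale proportionally to how far the observed metric is from the target.
    wanted = math.ceil(current_replicas * current_metric / target_metric)
    # Keep the result inside the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


# 4 replicas at 90% average CPU with a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
# 4 replicas at 30% average CPU with a 60% target.
print(desired_replicas(4, current_metric=30, target_metric=60))
```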

Long-running transactions

Failures also occur in long-running transactions, and without safeguards, the transaction must be started from the beginning after each failure. To make these transactions resilient, you can create checkpoints that record the stage at which the failure occurred. The system can then restart the transaction and continue where it left off.
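The checkpoint idea can be sketched as follows: after each completed step, the index of the next step is persisted, and a restart resumes from that index. The file-based checkpoint, the `run_with_checkpoints` helper, and the three sample steps are hypothetical; a production system would more likely store checkpoints in a database or a durable queue.

```python
import json
import os
import tempfile
from typing import Callable, Sequence


def run_with_checkpoints(steps: Sequence[Callable[[], None]], checkpoint_path: str) -> None:
    """Run steps in order, resuming after the last completed step on restart."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            start = json.load(handle)["next_step"]
    for index in range(start, len(steps)):
        steps[index]()
        # Persist progress only after the step has finished successfully.
        with open(checkpoint_path, "w") as handle:
            json.dump({"next_step": index + 1}, handle)
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)  # transaction complete, clear the checkpoint


executed = []
fail_once = {"armed": True}


def step_a() -> None:
    executed.append("a")


def step_b() -> None:
    if fail_once["armed"]:
        fail_once["armed"] = False
        raise RuntimeError("simulated crash")
    executed.append("b")


def step_c() -> None:
    executed.append("c")


path = os.path.join(tempfile.mkdtemp(), "transaction.checkpoint")
steps = [step_a, step_b, step_c]
try:
    run_with_checkpoints(steps, path)
except RuntimeError:
    run_with_checkpoints(steps, path)  # resumes at step_b; step_a is not repeated
print(executed)
```

A step that crashes midway is run again from its start, so individual steps should be safe to repeat.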

Instance failure

If an instance of an application cannot be accessed, the only solution is to fail over to another instance. This should be taken into account at the design stage, and instances should be added or removed based on need. If the instance is a database, it can be replicated to other instances for failover. If the instance is an application, you can use a load balancer or any traffic-distribution service and add instances behind it. The major cloud providers support this as part of their high-availability features, so it should be configured at the same time the infrastructure is created.
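At its simplest, failover means checking instances in priority order and routing to the first healthy one. The sketch below shows only that selection logic; the instance names and the `is_healthy` callback are hypothetical, and in practice a load balancer or the cloud provider's health checks would usually do this for you.

```python
from typing import Callable, Sequence


def pick_healthy(instances: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first instance that passes its health check."""
    for instance in instances:
        if is_healthy(instance):
            return instance
    raise RuntimeError("no healthy instance available")


# Simulated health state: the primary is down, the replica is up.
health = {"db-primary": False, "db-replica": True}
target = pick_healthy(["db-primary", "db-replica"], lambda name: health[name])
print(target)
```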

APIs overwhelmed

Sometimes sudden increases in traffic can put too much stress on APIs, preventing applications from handling requests properly. This can be prevented by using a queue to accept tasks and process them asynchronously.
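The queueing approach can be sketched with Python's standard `queue` and `threading` modules: requests are placed on a bounded queue and a worker drains it at its own pace. The `handle` and `drain_burst` functions and the queue size are hypothetical, and a real deployment would more likely use a message broker than an in-process queue.

```python
import queue
import threading
from typing import Iterable, List


def handle(request_id: int) -> str:
    """Stand-in for the real work done per request."""
    return f"processed request {request_id}"


def drain_burst(request_ids: Iterable[int], queue_size: int = 100) -> List[str]:
    """Accept a burst of requests and process them asynchronously on a worker."""
    tasks: queue.Queue = queue.Queue(maxsize=queue_size)
    results: List[str] = []

    def worker() -> None:
        while True:
            item = tasks.get()
            try:
                if item is None:  # sentinel: no more work
                    return
                results.append(handle(item))
            finally:
                tasks.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    for request_id in request_ids:
        # put() blocks when the queue is full, which applies backpressure
        # to producers instead of overloading the worker.
        tasks.put(request_id)
    tasks.put(None)
    thread.join()
    return results


print(drain_burst(range(3)))
```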


To design a self-healing application, first have a good architecture that covers all potential scenarios related to workload, scalability, resilience, and high availability. This includes the application, the infrastructure, and third parties. Next, establish a good monitoring system that identifies anomalies. AIOps solutions are a good option for covering your monitoring requirements: they can learn application behavior and find anomalies that fall outside the primary monitoring radar. Don’t forget to add logging to your monitoring so that all events are recorded.

Last but not least, test your application and infrastructure to ensure they remain well configured for your current workloads. Consider running load tests from time to time to simulate a greater workload and measure this. This is an ongoing activity that helps keep your app running and reliable.
