Resiliency With Effective Error Management

Building resilient systems requires comprehensive error management. Errors can occur in any part of the system or its ecosystem, and there are different ways to deal with them. For example:

  • Data Center – Data center failures where the full DC can become unavailable due to power outages, network connection failures, environmental disasters, etc. This is addressed through monitoring and redundancy: redundancy in power, grid, and cooling systems (and possibly everything else relevant), and redundancy by building additional data centers.
  • Hardware – Server/storage hardware and software errors such as disk failure, a full disk, other hardware failures, servers running out of resources, server software behaving abnormally, network connectivity issues within the DC, etc. The approach here is the same: monitor servers on different parameters and build in redundancy. Many highly available deployment patterns are already in use, and with the advent of containers combined with the power of DevOps, even more efficient patterns have emerged to solve this problem.

Architects and systems designers need to consider the availability aspects of their components while designing the system according to business needs and cost ramifications. In today’s world, cloud service providers usually take care of the rest.

  • Errors in applications – Regardless of whether the application is deployed in the cloud or on-premises, and regardless of its technology stack, this is something the individual application teams are responsible for. Cloud deployments will likely help reduce some error instances, and some technology stacks may be more mature than others, but bugs will occur and will need to be addressed. With distributed architectures based on microservices, this becomes even more interesting.

There are several steps to making applications resilient to errors:

  • Minimize errors by applying alternative architectural/design patterns. For example, asynchronous processing of user requests may help to avoid server overload and even provide a consistent experience for users (see the sketch after this list).
  • Graceful error handling by the application.
  • Raise an incident if needed – The important part here is to reliably raise an incident when required and not let user requests fall through the cracks. This is the fallback for applications when they are not able to handle errors themselves. While it can be used to address issues offline (and applications may not always choose it as a route to directly resolve the error at hand), more importantly it is a critical step for offline error analysis and for preventive measures against recurrence.
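As a minimal sketch of the first point above (the queue size, function names, and response shape are all illustrative, not taken from any specific framework), requests can be accepted onto a queue and processed asynchronously so that a burst of traffic does not overload the synchronous path:

  import queue
  import threading

  # Bounded in-memory queue; a real system would typically use a durable message broker.
  request_queue = queue.Queue(maxsize=1000)

  def accept_request(request):
      """Request path: enqueue the work and return immediately with a consistent response."""
      request_queue.put(request)          # blocks briefly (back-pressure) if the queue is full
      return {"status": "ACCEPTED"}

  def process(request):
      pass                                # placeholder for the real business logic

  def worker():
      """Background worker that drains the queue at a sustainable rate."""
      while True:
          request = request_queue.get()
          try:
              process(request)
          finally:
              request_queue.task_done()

  threading.Thread(target=worker, daemon=True).start()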

Brief note about patterns

There are multiple architectural styles for addressing application resiliency, and much depends on the functional requirements and NFRs. The resiliency approach in terms of design also depends on the architectural model of the application. If it is based on microservices, a good deal of the focus will be on errors related to inter-service integration dependencies. In event-based architectures, apart from normal error handling, the focus will also be on reliability: addressing operability issues and data loss when things go wrong. In synchronous API-based applications, while the application can simply return the error to the caller, some kind of monitoring/incident management can still be useful if the problem persists for longer. In batch-based components, the focus will be on being able to restart/resume a batch effectively.

Application Error Handling

Regarding the error handling part of applications, it is important to think it through carefully beforehand as part of the design process. Even if the details are left for later, an approach must be defined at a high level, and it may again vary depending on the use cases and application design.

Error codes

The way we identify error codes is also an important part of error handling. There are no general conventions or guidelines for error codes, and every application/system identifies them in its own way. However, error codes can help greatly if they are thought through and standardized across the organization. It is like having a common definition that everyone in the organization understands. Furthermore, clear and intuitive error codes can speed up resolution and support offline analysis, e.g. which errors occur most often across systems, which errors occur during peak loads, which systems are more susceptible to a particular type of error, and so on. This can go a long way toward engineering long-term mitigations for such errors, and it can become an important metric in an organization's overall DevOps operations.
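As an illustration only, a tiny helper can enforce one such convention; the dotted subsystem.component.category.resource.error shape is inferred from the sample error code used later in this article and is not an established standard:

  def make_error_code(subsystem, component, category, resource, error):
      """Compose a standardized, organization-wide error code string."""
      return ".".join([subsystem, component, category, resource, error])

  # Produces "subA.compA.DB.DatabaseA.Access_Error"
  code = make_error_code("subA", "compA", "DB", "DatabaseA", "Access_Error")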

Error handling

Below is an example of how to handle errors in an application based on an event-based architecture. Some of the steps mentioned may differ in other architectural styles.

Applications need to distinguish between retryable errors and non-retryable errors. If there is something wrong with the input message itself, there is usually no point in retrying such an error unless there is manual intervention. On the other hand, a database connection issue is worth retrying.
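A minimal sketch of this distinction (the exception names and helper functions are invented here for illustration):

  class NonRetryableError(Exception):
      """The input message itself is bad; retrying will not help without manual intervention."""

  class RetryableError(Exception):
      """A transient failure, such as a lost database connection, that is worth retrying."""

  def is_valid(message):
      return "payload" in message       # placeholder for real schema/content validation

  def save_to_database(message):
      pass                              # placeholder for the actual persistence call

  def process_message(message):
      if not is_valid(message):
          raise NonRetryableError("malformed input message")
      try:
          save_to_database(message)
      except ConnectionError as exc:
          raise RetryableError("database unavailable") from exc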

When applications retry errors, they can choose a uniform retry configuration across all retryable errors, or they can adjust it per error type with the help of an error retry configuration. For example, in the case of event-based services, an infrastructure component availability issue can be given more time before retrying than, say, a temporary concurrency issue. At the very least, infrastructure errors are worth retrying for longer: there is no point in abandoning the retry of the current event and consuming new events if the unavailability of the underlying infrastructure service will not allow those events to be processed either.
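A sketch of such an error retry configuration; the category names, attempt counts, and delays below are examples only and would normally come from the application's own configuration:

  import time

  RETRY_CONFIG = {
      "infrastructure": {"max_attempts": 30, "delay_seconds": 60},  # give infrastructure time to recover
      "concurrency":    {"max_attempts": 5,  "delay_seconds": 2},   # transient contention clears quickly
  }

  def retry(operation, category):
      """Run `operation` with the retry settings configured for its error category."""
      cfg = RETRY_CONFIG[category]
      for attempt in range(1, cfg["max_attempts"] + 1):
          try:
              return operation()
          except Exception:
              if attempt == cfg["max_attempts"]:
                  raise                              # all retries exhausted: escalate
              time.sleep(cfg["delay_seconds"])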

Raising incidents

Ultimately, when all retries fail, there must be a way to escalate the error and raise an incident whenever needed. There are cases where the problem can simply be surfaced to the user through a notification and the user has to resubmit the request, but this leads to a bad user experience if the problem is due to an internal technical issue. This is especially true in the case of event-based architectures. Asynchronous integration patterns usually use a DLQ (dead-letter queue) as another error handling pattern. However, a DLQ is only a temporary step in the overall process. Whether through a DLQ or by other means, if there is a way to reliably escalate the error so that it creates an operational alert/incident, that would be a desirable outcome. How can we design such an integration with the incident management system/alert management system? Here are a few options.

The first approach uses the logging capability available in all applications; it is the path of least resistance and the one most likely to report the error. When an application has exhausted all retries and is trying to escalate the error, we need to give it the most reliable path, where the chances of further failure are lowest. Logging fits those criteria well. However, we would like to separate these logs from all other error logs, or else the incident management system will be flooded with potentially unrelated errors. We call these logs "error alerts". These error alerts can be emitted by a custom library/component whose job is to format the error alert with the required amount of information and log it in the required format. For example:

  {
      "logType": "ErrorAlert",
      "errorCode": "subA.compA.DB.DatabaseA.Access_Error",
      "businessObjectId": "234323",
      "businessObjectName": "ACCOUNT",
      "inputDetails": "<Input object/ event object>",
      "inputContext": "any context info with the input",
      "datetime": "date time of the error",
      "errorDetails": "Error trace",
      "..other info as needed": "..."
  }
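A minimal sketch of such a library, reusing the field names from the example above (the function and logger names here are illustrative, not from any particular logging framework):

  import json
  import logging
  from datetime import datetime, timezone

  # Dedicated logger so error alerts can be separated from regular error logs.
  error_alert_logger = logging.getLogger("error-alerts")

  def log_error_alert(error_code, business_object_id, business_object_name,
                      input_details, input_context, error_details):
      """Format an error alert in the agreed structure and emit it as a single log entry."""
      alert = {
          "logType": "ErrorAlert",
          "errorCode": error_code,
          "businessObjectId": business_object_id,
          "businessObjectName": business_object_name,
          "inputDetails": input_details,
          "inputContext": input_context,
          "datetime": datetime.now(timezone.utc).isoformat(),
          "errorDetails": error_details,
      }
      error_alert_logger.error(json.dumps(alert))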

These logs are read by the log collector (which will already be in place as part of the log monitoring stack that most organizations use). The log collector routes these logs to a separate component whose job is to read these log events, read the configuration, and raise incidents/alerts as needed. DLQ processing is again required here in case things go wrong, and that will itself need monitoring and addressing.

Creating an incident/alert requires some configuration so that a meaningful and actionable incident can be generated. Below is an example of the required configuration attributes; the specifics may depend on the incident management system your organization is using. This configuration can also drive different types of actions. Since error codes follow a specific classification across the organization, this can become a central configuration if necessary.
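For illustration only, and not tied to any particular incident management product, the configuration for a given error code or error code pattern might look like this (all attribute names and values are examples):

  {
      "errorCodePattern": "subA.compA.DB.*",
      "action": "raiseIncident",
      "severity": "P2",
      "assignmentGroup": "subA-application-support",
      "incidentTitleTemplate": "Database access error for <businessObjectName>",
      "deduplicationWindowMinutes": 30,
      "notificationChannel": "owning team's alert channel"
  }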

The second approach is similar, but based on a DLQ.

The error alert dispatcher component writes the error alert to a DLQ instead of writing it to the logs. Everything else remains exactly the same.
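As a sketch, assuming a hypothetical queue_client wrapper with a send(queue_name, body) method (no specific messaging product's API is implied), the only change is where the alert payload goes:

  import json

  def dispatch_error_alert(alert, queue_client, dlq_name="error-alert-dlq"):
      """Publish the same error alert payload to a DLQ instead of logging it."""
      queue_client.send(dlq_name, json.dumps(alert))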


Which approach is better?

The log-based approach is more flexible from an implementation point of view, but it also has some shortcomings:

1. More moving parts/integrations before the error reaches the incident management system, all of which will need to be managed.

2. Risk of losing log data – This is something to check. If there is such a risk, this approach is not a good fit. In general, log data is not treated as highly critical, but if we are using it to raise incidents, it is worth verifying that it can be relied on. In one implementation we were involved in, we realized there was a risk of losing log data at peak volumes and had to discard this approach, but this may not be the case in all logging environments.

The DLQ-based approach has its own pros and cons:

1. The primary, or perhaps the only, drawback I see is the step of writing to the DLQ itself. Do we need some kind of DLQ over the DLQ on some other messaging platform as redundancy? This chain can be endless; it depends on how important the data is.

2. Another drawback may be the number of message routers that will need to communicate with the central bus to send error alerts if we integrate all the applications in the enterprise. Perhaps some kind of federation will be needed, but this is where the solution starts to get a little complicated, with additional opportunities for error.

3. Everything else seems fine. There are fewer components to integrate, and with bus-based integration, there is higher reliability in delivering error alert events.

Conclusion

A comprehensive error management approach is needed, and application error handling should be a part of it. It needs to integrate seamlessly into the organisation's overall IT error/problem management. While this article addresses the integration of application error handling with an incident management system, a similar approach should be applied to hardware issues as well. All of these should come together in one place in an automated way so that errors/problems can be correlated and a single solution can be applied across them.

Having said that, either approach depends on the incident management system's ability to integrate through modern means such as APIs or some type of SDK. This may vary from platform to platform and is a key dependency for this work. Another concern could be the creation of duplicate incidents, flooding the incident management system. Deduplication is something incident management systems should ideally offer out of the box, as they are the masters of incident data. Solutions can be built around this problem, but they can be complex and risky. Incident management systems are beginning to address this issue by supporting smarter deduplication of incidents.
