Understanding Canary Analysis – DZone DevOps

We covered the basics of canary deployment in the blog post What is Canary Deployment? It is one of the most widely used strategies because it reduces the risk of moving updates to production while also reducing the need for additional infrastructure.

Today, many applications run in a dynamic environment of microservices, which makes software integration and delivery complex. The only way to successfully roll out updates is to use powerful automation with tools like Spinnaker.

What is canary analysis?

Canary analysis is a two-step process in which we evaluate a canary against a selected list of metrics and logs to decide whether to promote or roll back the new release. We therefore need to make sure that we collect the right information (metrics and logs) during the testing process and analyze it rigorously. (From now on in this blog we will refer to metrics and logs simply as metrics.)

Two steps for canary analysis

  1. Selection and evaluation of metrics:
    This step involves defining the right metrics for monitoring the application and the health of the canary. We need to make sure we create a balanced set of metrics for evaluating canaries. Finally, if necessary, metrics should be grouped based on their correlation with each other. Fortunately, we only do this step once per pipeline.
  2. Canary hypothesis test:
    This step evaluates the values of the selected metrics to decide whether the canary should be promoted to production or rolled back.

Monitoring systems in microservices-based applications typically generate huge amounts of data, so we simply cannot analyze all metrics. Nor is selecting every metric necessary: once the high-impact metrics are selected, adding any others only increases our effort. But which ones should we choose? Read on!

Selection and evaluation criteria

  • Most important business metrics
    The most important metrics are those that are central to the purpose of the application. For example, in an online shopping checkout app, be sure to measure the number of transactions per minute, the failure rate, and so on. If any of these metrics falls outside the baseline, you will likely decide to roll back the canary. Most of these metrics will probably already be monitored in staging, so be sure to check them during the canary process as well.
  • Balanced combination of fast and slow metrics
    One metric is not enough for a meaningful canary process. Some important metrics are generated instantly, while others take time to build up because they depend on load, network traffic, high memory usage, or other factors. It is important to balance the metrics you choose between fast and slow ones. For example, server query time and response time checks may be examples of slow and fast metrics, respectively. Choose a balanced set of metrics that gives you a strong overall view of your canary's health.
  • Smoke-test metrics
    From the balanced set of metrics, we have to make sure that some directly indicate a problem with the canary. Although our goal is to assess the overall health of the canary, it is equally important to know whether these metrics reveal an underlying problem in it. For example, a 404 response or another unexpected HTTP return code (one outside the expected 200s and 300s) could mean that your test should be stopped immediately and debugged. CPU usage, memory footprint, HTTP return codes (200s, 300s, etc.), response time, and health are a good set of metrics, but HTTP return codes and response time indicate an actual issue affecting users and services. These are sometimes referred to as smoke-test metrics. For smoke testing, point-in-time usage metrics (e.g. point-in-time CPU usage or network bandwidth) are not as useful as the change in those metrics, and may only add noise to our analysis.
  • Metric standard ranges
    Each selected metric must have an acceptable operating range that is neither too strict nor too lenient. With an agreed-upon acceptable behavior for each metric, we can weed out bad canaries that would otherwise be incorrectly rated as good. Choosing these ranges is a challenging task that is usually completed incrementally over time. When in doubt, err on the side of being conservative: in other words, choose a narrow range of acceptable values. We do not want a canary that was evaluated as good to be promoted to production when it was actually faulty.
  • Metric correlation
    With a well-balanced set of metrics, we now have our basic toolkit ready for canary analysis. Many of these metrics will be correlated, so finding the correlations and grouping the metrics accordingly is critical. We need to group the metrics in such a way that the groups are isolated from each other. For example, a massive increase in CPU usage for the system as a whole may be a poor metric, because other processes such as database and batch queries may cause the increase. A better metric is the CPU time spent per request served.
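The criteria above can be sketched in code. The following is only an illustration, not the configuration format of Spinnaker or any other tool: the `MetricSpec` class, the metric names, the ranges, and the group labels are all assumptions made up for this example. Each metric carries an acceptable range, a flag marking it as a smoke-test metric, and a group that collects correlated metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    name: str
    low: float                 # lower bound of the acceptable range
    high: float                # upper bound of the acceptable range
    smoke_test: bool = False   # a violation stops the canary immediately
    group: str = "default"     # correlated metrics share one group


# Hypothetical metric set for a checkout service; names and ranges are made up.
SPECS = [
    MetricSpec("transactions_per_minute", 900.0, 1500.0, group="business"),
    MetricSpec("http_error_rate", 0.0, 0.01, smoke_test=True, group="errors"),
    MetricSpec("p95_response_ms", 0.0, 250.0, smoke_test=True, group="latency"),
    MetricSpec("cpu_ms_per_request", 0.0, 12.0, group="resources"),
]


def evaluate(specs, observed):
    """Check one observation window; return (verdict, violations by group)."""
    violations = {}
    abort = False
    for spec in specs:
        value = observed[spec.name]
        if not spec.low <= value <= spec.high:
            violations.setdefault(spec.group, []).append(spec.name)
            abort = abort or spec.smoke_test
    if abort:
        return "abort", violations
    return ("fail" if violations else "pass"), violations


window = {
    "transactions_per_minute": 1200.0,
    "http_error_rate": 0.002,
    "p95_response_ms": 310.0,   # outside the acceptable range
    "cpu_ms_per_request": 9.5,
}
print(evaluate(SPECS, window))  # ('abort', {'latency': ['p95_response_ms']})
```

Reporting violations per group rather than per metric keeps one underlying problem from being counted several times when correlated metrics move together.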

Canary hypothesis test

The last step of canary analysis is to evaluate the metric values. Doing so lets us decide whether the canary instance should be promoted to production or not.

Before each test, we define two hypotheses, and based on our data we will reject one of them.

H0: The canary is bad: roll back

H1: The canary is good: roll forward
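One way to make this decision concrete is a non-inferiority check, sketched below. This is an assumption of ours and not the method of any particular tool: H0 (bad canary) is rejected only when a bootstrap upper bound on the difference between canary and baseline stays below a tolerated margin. The function names, the 95% confidence level, and the `margin` parameter are hypothetical, and the sketch assumes a metric where higher values are worse, such as response time.

```python
import random
from statistics import mean


def upper_bound_of_difference(baseline, canary, confidence=0.95,
                              rounds=2000, seed=0):
    """Bootstrap upper confidence bound for mean(canary) - mean(baseline)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = []
    for _ in range(rounds):
        # Resample each group with replacement and record the difference.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(canary, k=len(canary))
        diffs.append(mean(c) - mean(b))
    diffs.sort()
    index = min(int(confidence * rounds), rounds - 1)
    return diffs[index]


def decide(baseline, canary, margin):
    """Reject H0 (bad canary) only if the canary is credibly within `margin`."""
    upper = upper_bound_of_difference(baseline, canary)
    return "roll forward" if upper < margin else "roll back"


# Illustrative response times in milliseconds (made-up numbers).
baseline_ms = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
healthy_ms = [101, 99, 100, 102, 98, 100, 101, 99, 100, 102]
degraded_ms = [150, 148, 155, 160, 145, 152, 149, 158, 151, 147]

print(decide(baseline_ms, healthy_ms, margin=10))   # roll forward
print(decide(baseline_ms, degraded_ms, margin=10))  # roll back
```

Because H0 is "the canary is bad", a lack of evidence leads to a rollback: with too few or too noisy samples the upper bound stays wide, and the canary is not promoted.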

A simple analysis might start by comparing the canary's results against the larger production baseline. However, this comparison usually leads to incorrect decisions.

A canary serving 10% of traffic may not behave abnormally, but there is no way to be confident that it will behave the same way at 30% (or 60% or 90%) of traffic. The comparison can also mislead for the following reasons.

  • The canary has just been released and has spent only a few minutes in production, while the metrics from the baseline version are probably weeks or months old. Comparing the two would be misleading.
  • We route only a small portion of the traffic to the canary. The load on the two systems is likely to be disproportionate.

This problem is solved by using an A/B testing strategy. Instead of comparing the canary with the production baseline, we split the canary infrastructure into two pieces, deploying the baseline system in one half and the new canary in the other half. This creates a level playing field on which we can measure whether the new canary compares favorably with the baseline system. Since the instances are small, the cost of running a separate A/B test inside the canary deployment will be minimal.
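The effect of the split on load can be checked with simple arithmetic. The sketch below is illustrative only: the `Group` class, the instance counts, and the traffic figures are assumptions invented for the example. It compares the per-instance load of a lone canary against a production fleet with the per-instance load of two equally sized groups that share the canary's slice of traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    name: str
    instances: int
    traffic_percent: float  # share of total traffic routed to this group

    def requests_per_instance(self, total_rps: int) -> float:
        return total_rps * self.traffic_percent / 100 / self.instances


def plan_ab_groups(canary_percent: float, instances_per_group: int):
    """Split the canary slice evenly between a fresh baseline and the canary."""
    half = canary_percent / 2
    return [
        Group("baseline", instances_per_group, half),
        Group("canary", instances_per_group, half),
    ]


TOTAL_RPS = 10_000  # hypothetical total traffic

# Naive setup: one canary instance compared with the whole production fleet.
production = Group("production", instances=30, traffic_percent=90)
lone_canary = Group("canary", instances=1, traffic_percent=10)
print(production.requests_per_instance(TOTAL_RPS))   # 300.0
print(lone_canary.requests_per_instance(TOTAL_RPS))  # 1000.0

# A/B setup: both groups are new, equally sized, and equally loaded.
for group in plan_ab_groups(canary_percent=10, instances_per_group=2):
    print(group.name, group.requests_per_instance(TOTAL_RPS))  # 250.0 each
```

Both A/B groups are also started at the same time, so their metrics cover the same window, which addresses the age mismatch described in the first bullet above.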


As software deployments grow in scale and speed, it becomes critical to have a proper system that evaluates new versions before they can harm the production environment, causing losses to the organization and angering customers. Canary analysis is an excellent technique for reducing the risk of introducing a defect into production; it is relatively low cost and does not slow down the process as long as proper automation is in place.

