Use Machine Learning to Observe API Metrics

This blog was co-created by Ricardo Ferreira (Elastic) and Viktor Gamov (Kong).

We love our microservices, but without a proper observability (O11y) strategy, they can quickly turn into cold, dark places full of broken or unknown features. O11y is one of those technologies best explained by causation: the only reason it exists is that other technologies created the need for it. O11y wouldn’t be needed if, say, our technologies hadn’t become so complex over the years.

Let’s say your microservices have turned into this dreadful dark place. In that case, finding the root cause of specific problems across several microservices might leave you as frustrated as Charlie Kelly in It’s Always Sunny in Philadelphia.

It’s hard to pinpoint the root cause of problems when the problem is scattered across different machines, virtual machines, and containers, and you have to dig through code written in different programming languages.

Monitoring systems: past and present

Twenty years ago, we had a very different approach to monitoring. Web applications were made up of a handful of HTTP servers, and we could easily log into them and scan the logs for issues. Now, we live in an era of massively distributed systems where we don’t know the number of servers making up our clusters off the top of our heads.

This is like the comparison of pets versus cattle. Dealing with a few pets that you know by name and whose behavior you understand is one thing: you know what to expect from them. Dealing with cattle is different. You don’t know their names, because new ones may appear every second, and you don’t know their behavior. It is as if you are traveling through unknown territory with every one of them.

[Image: pets vs. cattle metrics chart]

The most important aspect of a modern O11y strategy is to take all of this data, known as telemetry data, and aggregate it into a single platform that can harness its power by applying correlation and causation. Instead of treating each type of telemetry data as its own backend, as most people do, treat them as pipelines that carry the data to wherever it makes sense. Elastic is one platform that is capable of this.

Elastic Observability and the Kuma service mesh

This blog post will delve into the specific tasks you need to perform to send data from Kuma to Elastic Observability and enable machine learning features to analyze it. This will allow you to achieve two different goals.

First, it allows you to eliminate the chaos caused by juggling several monitoring systems. By default, Kuma sends metrics to Prometheus/Grafana, traces to Jaeger, and logs to Logstash. You can replace all of this with Elastic Observability. This simplifies the architecture, reduces the amount of plumbing, and shrinks the operational footprint required to maintain Kuma.

Then, once all of this data is in Elastic Observability, users can use the dashboards and built-in apps to analyze the data anytime they want. But they can also take advantage of the platform’s machine learning support to deploy jobs that do the work of constantly crunching numbers, freeing them to focus on the communication aspects of Kuma.

If you prefer the video, watch the full recording of the 2021 Kong Summit session here.

Enable Monitoring on Kuma

When we apply the Kuma Mesh resource to our cluster, we can enable various observability-related features.

[Image: Mesh API definition with logging metadata]

apiVersion: kuma.io/v1alpha1
kind: Mesh
metadata:
  name: default
spec:
  logging:
    # TrafficLog policies may leave the `backend` field undefined.
    # In that case the logs will be forwarded into the `defaultBackend` of that Mesh.
    defaultBackend: file
    # List of logging backends that can be referred to by name
    # from TrafficLog policies of that Mesh.
    backends:
      - name: logstash
        # Use `format` field to adjust the access log format to your use case.
        format: '{"start_time": "%START_TIME%", "source": "%KUMA_SOURCE_SERVICE%", "destination": "%KUMA_DESTINATION_SERVICE%", "source_address": "%KUMA_SOURCE_ADDRESS_WITHOUT_PORT%", "destination_address": "%UPSTREAM_HOST%", "duration_millis": "%DURATION%", "bytes_received": "%BYTES_RECEIVED%", "bytes_sent": "%BYTES_SENT%"}'
        type: tcp
        # Use the `conf` field to configure a TCP logging backend.
        conf:
          # Address of a log collector.
          address: 127.0.0.1:5000
      - name: file
        type: file
        # Use the `conf` field to configure a file-based logging backend.
        conf:
          path: /tmp/access.log
        # When `format` field is omitted, the default access log format will be used.

This configuration prepares the logs to be sent to a TCP endpoint running on localhost on port 5000. We can then use the Elastic Stack to spin up Logstash instance(s) that expose that endpoint. Logstash, in turn, will be responsible for ingesting those logs into Elastic Observability.
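For reference, a minimal Logstash pipeline for that setup might look like the sketch below. It is not part of the original configuration: the TCP port matches the `logstash` backend defined in the Mesh above, while the Elasticsearch output settings are placeholders you would replace with your own cluster endpoint and credentials.

# logstash.conf - minimal sketch of a pipeline for the Kuma TCP logging backend
input {
  tcp {
    port => 5000        # same port as the `logstash` backend in the Mesh definition
    codec => json       # the access log format above emits one JSON document per line
  }
}

output {
  elasticsearch {
    hosts => ["https://cloud.elastic.co:443"]   # placeholder endpoint; use your own cluster
    # add cloud_id / api_key (or user and password) here for a real deployment
  }
}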

Loading Metrics into Elastic Observability

When it comes to moving metrics from Kuma to Elastic, Kuma is like the In-N-Out Burger – there are so many options to choose from.

Elastic Metricbeat

The first option is Elastic Metricbeat, which periodically scrapes the Prometheus endpoint that Kuma exposes. Metricbeat then reads the data and stores it in Elasticsearch in ECS format, which enables data analysis.

[Image: Scraping the Prometheus service endpoint into ECS metrics]

We can deploy Metricbeat on bare metal, Kubernetes, or Docker, and we can get it running quickly using the configuration below.

metricbeat.config.modules:
  path: ${path.config}/modules.d/*.yml

metricbeat.modules:
  - module: prometheus
    period: 10s
    metricsets: ["collector"]
    hosts: ["localhost:9090"]
    metrics_path: /metrics

setup.template.settings:
  index.number_of_shards: 1
  index.codec: best_compression

output.elasticsearch:
  hosts: ["https://cloud.elastic.co:443"]

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~

Metricbeat also offers options to handle load and avoid scraping altogether. Alternatively, we can have Kuma’s metrics pushed to an endpoint exposed by Metricbeat, which takes advantage of a Prometheus feature called remote_write. This option presents an interesting technique for scaling because it is easier and faster to scale out the metric collection layer than Kuma itself.
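As a sketch of that push-based setup: Metricbeat’s Prometheus module has a remote_write metricset that listens for incoming samples, and the Prometheus server scraping Kuma is pointed at it with a remote_write block. The hostnames and port below are illustrative, not taken from the original post.

# metricbeat.yml - receive pushed samples instead of scraping (sketch)
metricbeat.modules:
  - module: prometheus
    metricsets: ["remote_write"]
    host: "localhost"
    port: "9201"

# prometheus.yml - point the Prometheus server scraping Kuma at the Metricbeat listener
remote_write:
  - url: "http://localhost:9201/write"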

OpenTelemetry Collector

Another option is the Collector from the OpenTelemetry project, which does the same thing as Metricbeat: it scrapes the Prometheus endpoint exposed by Kuma. The main difference is that OpenTelemetry sends this data to Elastic in OTLP format, OpenTelemetry’s native protocol. Once Elastic Observability receives the data via OTLP, it converts it to ECS. This is possible because Elastic Observability supports OpenTelemetry natively.

[Image: OpenTelemetry Collector sending OTLP data to Elastic Observability]

Like Metricbeat, the OpenTelemetry Collector can be deployed alongside your workloads, whether on Kubernetes, Docker, or bare metal. We can tune several options to keep up with the load, so it’s highly configurable. It’s a great option for those who want to stick to an open standard.

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "prometheus"
          scrape_interval: 15s
          static_configs:
            - targets: ["0.0.0.0:9100"]

processors:
  batch:

exporters:
  otlp:
    endpoint: "https://cloud.elastic.co:443"
    headers:
      "Authorization": "Bearer <BEARER_TOKEN>"

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]

It’s important to know that OpenTelemetry does not yet support remote_write, so if we’re looking for a solution that allows us to push metrics to Elastic Observability while also handling load, Metricbeat is a much better option. This may change in the future, as the OpenTelemetry project is developing rapidly and catching up in the observability space.

Enabling Machine Learning in Elastic Observability

How can we take advantage of the machine learning features in Elastic Observability for the Kuma service mesh?

Step 1: Enable Machine Learning (ML)

The first step is to enable ML in our Elasticsearch cluster. Elasticsearch is the data store behind Elastic Observability, and it supports ML natively. We can enable ML features with a setting in the configuration file of each Elasticsearch node. It is also necessary to size our nodes to handle ML workloads, which tend to be CPU and memory intensive. Alternatively, we can enable autoscaling for Elastic Observability via Elastic Cloud. This lets the cluster grow and shrink dynamically as our load requirements change.
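On a self-managed cluster, that per-node setting might look like the snippet below; on Elastic Cloud, ML nodes and autoscaling are configured from the deployment settings instead. The role list is an example, not a prescription.

# elasticsearch.yml - sketch of a node that is allowed to run ML jobs
node.roles: [ master, data, ingest, ml ]
xpack.ml.enabled: true   # true by default in the default distribution; shown here for clarity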

[Image: Autoscaling a deployment to handle machine learning]

Step 2: Optionally Normalize the Data

We may also want to normalize the data, because the observability data coming from Kuma in Prometheus format may not be in the best shape for our ML analysis. In the Elastic world, we can massage the data with transforms. We can use transforms to build entity-centric indices that better represent our dataset.
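As an illustration of such an entity-centric transform, the request below pivots the raw Kuma access logs by destination service and rolls up a few statistics. The index names are hypothetical, and it assumes duration_millis and bytes_sent are indexed as numbers and destination as a keyword; the field names themselves come from the access log format defined earlier.

PUT _transform/kuma-service-stats
{
  "source": { "index": "kuma-access-logs-*" },
  "dest":   { "index": "kuma-service-stats" },
  "pivot": {
    "group_by": {
      "destination": { "terms": { "field": "destination" } }
    },
    "aggregations": {
      "avg_duration_millis": { "avg": { "field": "duration_millis" } },
      "total_bytes_sent":    { "sum": { "field": "bytes_sent" } }
    }
  },
  "frequency": "5m",
  "sync": { "time": { "field": "@timestamp", "delay": "60s" } }
}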

[Image: Creating a transform configuration]

Step 3: Identify Key Indicators

One of the great things about ML support in Elastic Observability is the set of built-in machine learning algorithms. They range from simple classifications to complex regressions. Since there are so many, you may feel unsure about which one to choose. We can use the Data Visualizer tool to load a sample of the dataset and see how each algorithm would shape and analyze our data. This is useful because it happens before any ML jobs are deployed.

[Image: Kibana Data Visualizer with the sample flights dataset]

Step 4: What kind of job do we want?

Ultimately, what we want to do is enable the actual algorithms, such as outlier detection. The ML jobs will do the work for us: they can classify anomalies, run regressions, and group the results into buckets so we can tell the different kinds of anomalies apart. Our analysis becomes much easier if we configure the algorithms to stream their results into easy-to-view categories.
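For example, a simple anomaly detection job over the ingested access logs might flag destinations whose average latency drifts above their learned baseline. The job name, index fields, and bucket span below are illustrative; a datafeed pointing the job at the source index (and a call to start it) completes the setup, and both can also be created from the Kibana ML UI.

PUT _ml/anomaly_detectors/kuma-latency-anomalies
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "high_mean",
        "field_name": "duration_millis",
        "partition_field_name": "destination"
      }
    ]
  },
  "data_description": { "time_field": "@timestamp" }
}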

[Image: Anomaly Explorer in Elastic]

Step 5: Monitor the Metrics with Alerts

Finally, we can monitor our results automatically. This means that we can configure Elastic Observability to look for specific patterns in our dataset, such as when the time spent on a request becomes higher than normal over the past hour, and find them for us. Instead of watching dashboards all day, we can automatically trigger an email, a PagerDuty incident, or even a message to the on-call team’s Slack channel. We call this alerting.
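One way to express that kind of check as code is an Elasticsearch watch; the sketch below compares the average request duration over the last hour against a fixed threshold and sends an email when it is exceeded. The index, threshold, and recipient are placeholders, the email action assumes an email account is configured on the cluster, and a Kibana alerting rule with a Slack or PagerDuty connector would work just as well.

PUT _watcher/watch/kuma-high-latency
{
  "trigger": { "schedule": { "interval": "1h" } },
  "input": {
    "search": {
      "request": {
        "indices": ["kuma-access-logs-*"],
        "body": {
          "size": 0,
          "query": { "range": { "@timestamp": { "gte": "now-1h" } } },
          "aggs": { "avg_duration": { "avg": { "field": "duration_millis" } } }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.aggregations.avg_duration.value": { "gt": 500 } }
  },
  "actions": {
    "notify_team": {
      "email": {
        "to": "oncall@example.com",
        "subject": "Kuma request latency is above the expected threshold"
      }
    }
  }
}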

[Image: Selecting a connector in Elastic]

Distributed Tracing from Kuma to Elastic Observability

We talked before about how to enable logs and metrics on Kuma. But there is a third type of signal we can enable on Kuma that helps with our O11y strategy: traces. Traces help us analyze the interactions between the different services and systems that communicate through Kuma. We can quickly enable tracing inside the Mesh specification.

spec:
  tracing:
    defaultBackend: jaeger-collector
    backends:
    - name: jaeger-collector
      type: zipkin
      sampling: 100.0
      conf:
        url: http://jaeger-collector.kuma-tracing:9411/api/v2/spans

In this case, we enable a Jaeger tracing backend and send the traces to its HTTP endpoint. We also configure the sampling strategy to collect 100% of the traces. This means that a trace will be collected and emitted for every interaction within Kuma. That can get a little chatty, depending on how many data planes have been deployed.

This is not usually a problem at the Kuma layer, but it can become an issue while transferring this data to Elastic – both networking and storage can become a bottleneck. Fortunately, we can solve this at the collector level, as we’ll see in the next step.

To send traces from Kuma to Elastic Observability, we can use the OpenTelemetry Collector. There, we can configure the Jaeger receiver and expose it on the specific host and port that Kuma is configured to send data to. We also configure an exporter to send the traces to Elastic Observability. Optionally, we can configure a processor that batches or throttles transmission so that Elastic Observability receives data at a pace it can handle, as sketched after the receiver configuration below. This is important if, for example, we enable the sampling strategy in Kuma at 100% but do not want to flood the backend with all of that data at once.

[Image: OpenTelemetry Collector shipping Kuma traces to Elastic Observability]

receivers:
  jaeger:
    protocols:
      grpc:
  jaeger/withendpoint:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14260

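Putting the pieces together, a traces pipeline for this collector might look roughly like the sketch below, with a batch processor smoothing out bursts before the OTLP exporter ships spans to Elastic Observability. The endpoint and token are placeholders, mirroring the metrics example earlier.

# otel-collector traces pipeline (sketch); reuses the receivers defined above
processors:
  batch:

exporters:
  otlp:
    endpoint: "https://cloud.elastic.co:443"
    headers:
      "Authorization": "Bearer <BEARER_TOKEN>"

service:
  pipelines:
    traces:
      receivers: [jaeger, jaeger/withendpoint]
      processors: [batch]
      exporters: [otlp]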
Once traces, metrics, and logs are integrated with Elastic, they can be linked and correlated with each other, and the algorithms become more accurate. For example, if we know that a trace carries information about the affected transactions, we can quickly determine the root cause of the problem and where it occurred.

Conclusion

We are now ready to change the status quo. We can work smarter by combining the flexibility of Kuma with the power of Elastic Observability to ingest, store, and analyze massive amounts of monitoring data. We learned how to collect metrics from Kuma via Prometheus, bring those metrics into Elasticsearch using Metricbeat, and create machine learning jobs that look for anomalies and alert us when something interesting is going on.

Look out for a future Kong Builders episode on this topic, where we’ll dive deeper into real use cases on how to use Kuma and Elastic together to work smarter rather than harder. Tell us what you’d like to see on Twitter @riferrei or @gAmUssA.

