Trace Context Propagation With OpenTelemetry | by Dmitry Kolomiets | Apr, 2022

Music on Tracing

Practical notes on using AWS Distro for OpenTelemetry

Table of Contents

  • Summary
  • Trace context propagation
  • End to end trace
  • Dangling AWS Lambda traces

Until this point I was talking about OpenTelemetry and ADOT in general, without discussing specific programming languages. The following post is based on AWS managed Lambda layer for ADOT Python. I believe that conclusions are generally the same for other languages.

The aim for this post is exactly as outlined in the previous one — demonstrate how we can manually pass the trace context around when AWS services do not support this natively (for example, with AWS Kinesis streams). This time we will do this with OpenTelemetry, not AWS X-Ray (we are going to use X-Ray console to review the resulting traces though).

Now that you know about OpenTelemetry collectors and distributions, and can tell the difference between OpenTelemetry and ADOT — we can finally talk about practical bits.

Review Approaching OpenTelemetry if you need a refresher on these topics.

To demonstrate the trace context propagation with OpenTelemetry I am going to use ADOT Python Lambda layer and OpenTelemetry Python SDK for the following reasons:

  • I used Python in the previous post to demonstrate trace context propagation with X-Ray SDK. I figured it would be beneficial to see what the same application looks like with OpenTelemetry
  • Python is one of the languages (Java is another) that support automatic instrumentation. It is important to highlight the differences between automatic and manual instrumentation to understand what is happening behind the scenes.

We will use the same application architecture we examined before:

The distributed architecture we want to trace with OpenTelemetry

Producer Lambda

We start with Producer lambda — the most straightforward one. Let me show you the code first:

Producer lambda implementation (automatic instrumentation)

This suspiciously simple implementation not only captures the traces for the lambda function, but also instruments and captures traces from the boto3 library, handles interactions with the ADOT Collector, and exports the traces to AWS X-Ray. This is the level of service that ADOT Lambda layers provide when automatic instrumentation is enabled!

More specifically, you need to do the following to enable automatic instrumentation:

  • Add ADOT Lambda layer to your function — this installs OpenTelemetry SDK, ADOT Collector, AWS-specific extensions, etc.
  • Enable auto instrumentation by adding the AWS_LAMBDA_EXEC_WRAPPER environment variable with the value /opt/otel-instrument

Even without understanding everything that happens behind the scenes, this looks pretty good already.

Automated instrumentation — trace structure

As a sanity check, we can invoke the Producer function (with any payload) and see the resulting trace in the X-Ray console:

Let’s review the structure of the captured X-Ray trace — there will be an important difference when we move on to the manual instrumentation later.

X-Ray service uses the concepts of segment and subsegment to denote parent/child relationships between units of work that comprise the trace. OpenTelemetry uses a concept of a span for the same purpose — there may be parent and child spans.
Even though we use X-Ray console to view the resulting traces, I am going to use OpenTelemetry terms in this post.

The first two spans are added to the trace by the AWS Lambda service. It is important to understand that these two spans are captured and exported to X-Ray even before our Producer lambda is invoked.

AWS Lambda exports these two spans directly to X-Ray. At the time of writing there is no way to configure AWS Lambda to use another telemetry backend such as Jaeger, Splunk, etc.
An implication — if OpenTelemetry is configured with another telemetry backend, AWS Lambda-specific spans will not be exported there.

The next producer-function span is the span captured by OpenTelemetry — this is the span that is created by the automatic instrumentation startup script (remember that /opt/otel-instrument script we set as AWS_LAMBDA_EXEC_WRAPPER environment variable?).

If we select the producer-function span, we will see the child producing_messages span we manually created in our lambda function. There are two additional spans for AWS SQS and AWS Kinesis services as well — these are captured from the boto3 library. When automatic instrumentation is used, the startup script instruments the most common Python libraries to automatically capture spans from them. Note that instrumentors should be installed and packaged with your lambda function in order for this to work. You can find the comprehensive list of supported libraries on GitHub.

Producer function trace

Hope you have a better idea of what is happening when automatic instrumentation is used. Let’s move on to the next lambda.

Consumer SQS Lambda function

This function needs to extract the trace context from the SQS message and ensure that new spans generated by the function “continue” the original trace for the message. Again, we are using automatic instrumentation for this lambda. Here is the implementation:

As with the Producer lambda, we do not need to bother with OpenTelemetry configuration too much. We extract traceId and spanId fields from SQS message — this part is identical to our previous trace propagation example with X-Ray. There are a few things worth mentioning though.

Server span kind
One interesting detail is the special SERVER span kind we assign to the consuming_sqs span. This is necessary to be able to see this span as a separate node in the X-Ray service map. According to the documentation:

Note that only spans of kind Server are converted into X-Ray segments, all other spans are converted into X-Ray subsegments.

Keep this obscure detail in mind if you plan to export OpenTelemetry traces to AWS X-Ray. If you use another telemetry backend — other span kinds can be useful.

Span links
Note that we add an optional links parameter when we create the consuming_sqs span.

The idea is simple — each lambda invocation may contribute to multiple traces and it may be useful to have a way to correlate them. In our case, we have a lambda function triggered by SQS service. Here are the traces involved when the function is triggered with a batch of N messages:

  • Lambda invocation trace. When SQS service triggers the lambda function, an implicit trace is created. This is the “default” trace — the trace that you normally see in the AWS X-Ray console. There will be two spans that AWS Lambda service emits by default (we discussed them above when we examined automatic instrumentation trace structure). This trace may also include spans for the lambda cold start or anything that happens before we start processing SQS messages
  • SQS message traces — at most N of them. Remember that each SQS message may belong to a different trace and we want to “continue” that original trace when we process the message, not create a new one

Therefore, when we process SQS message (adding spans to the original trace of the message) we would like to keep the link to the Lambda invocation trace — it may be particularly useful for troubleshooting scenarios.

Links between spans/traces may be a powerful way to express causality. It will be difficult to demonstrate these links in the AWS X-Ray console as X-Ray does not support links (yet?). In the next post, I will show how the same OpenTelemetry trace looks in a different telemetry backend (Jaeger) and will demonstrate the usefulness of the links then.

Consumer Kinesis Lambda function

The last Kinesis consumer lambda is the trickiest one, mainly due to an issue in the AWS managed Lambda layer for ADOT Python that forced me to abandon automatic instrumentation (delete the AWS_LAMBDA_EXEC_WRAPPER environment variable). This is a good thing, as I can demonstrate how to add OpenTelemetry support without ADOT magic. Beware, there are many things to absorb:

Let’s cover OpenTelemetry initialization step by step:

Initialize Tracer Provider

The first thing we need to do is to initialize the Tracer Provider — this is the object that handles the collection of the traces and export to the OpenTelemetry Collector. When we initialize the provider we specify an ID generator compatible with X-Ray (not necessary if X-Ray support is not needed) and other attributes we would like to attach to the spans. For this example, we set an explicit service name and capture other AWS Lambda-specific attributes (function ARN, name, memory/CPU allocation, etc.)

The Tracer Provider can be initialized only once — if you try to set another provider, the second call will be ignored.

Let’s unwrap the next line:

  • add_span_processor — registers a span processor with the Tracer Provider. A processor is a construct OpenTelemetry introduces to pre-process data before it is exported (e.g. modify attributes or sample) or to help ensure that data makes it through a pipeline successfully (e.g. batch/retry). A good summary of OpenTelemetry processors can be found in the Collector GitHub repository.
  • BatchSpanProcessor — the batch processor accepts spans and places them into batches. Batching helps better compress the data and reduce the number of outgoing connections required to transmit the data
  • OTLPSpanExporter — exports spans in OpenTelemetry protocol (OTLP) format. By default, the exporter connects to the local ADOT Collector instance running alongside your lambda code. It is the responsibility of the ADOT Collector to forward the spans further — by default, the collector is configured to pass them on to AWS X-Ray service.

To recap — Tracer Provider manages the collection and export of the traces to ADOT Collector using BatchSpanProcessor and OTLPSpanExporter. With automatic instrumentation enabled all these low-level details are hidden, but now we have complete control over the processing and exporting of the spans. I will use this approach in future posts to demonstrate how we can extend the capabilities provided by ADOT lambda layers and add support for additional telemetry backends.

Moving on:

This line enables instrumentation for boto3 API. With manual instrumentation it is our responsibility to instrument the libraries we are interested in.

Finally (the intended pun), you may have noticed this block of code:

This is necessary to ensure that all spans are properly exported to the ADOT Collector even in case of an unhandled exception. This is particularly important when BatchSpanProcessor is used — remember that it batches multiple spans together before it forwards them to the collector. Without the force_flush call, some spans may be lost (and, of course, they will be the most important ones).

This is it! Now you should have a decent mental model of OpenTelemetry initialization and basic configuration and understand the differences between automatic and manual instrumentation modes provided by ADOT lambda layer.

Before we wrap up we should run the Producer lambda function and check the resulting end-to-end trace:

Unlike my previous post, I am not aware of any “hacks” or unsupported APIs I used in this example of trace propagation. This is a proper implementation according to OpenTelemetry specification. Even better — we can get end-to-end X-Ray traces with OpenTelemetry.

We briefly touched upon spans exported by AWS Lambda service above. These spans are clearly visible when we look at the complete service map in X-Ray:

Consumer lambdas emit OpenTelemetry spans that “continue” our main end-to-end trace. At the same time, we still have traces emitted by AWS Lambda service for these functions. These traces are detached from the main end-to-end trace or “dangling”. They are visible in X-Ray console because of the direct integration between AWS Lambda and AWS X-Ray services. Keep in mind that if you are going to use a different telemetry backend with OpenTelemetry, these AWS Lambda spans will not be exported.

We have covered a lot of ground in this post by showing how to integrate OpenTelemetry into event-driven AWS architectures and get complete end-to-end traces. We are still using AWS X-Ray as our main telemetry backend, but we don’t have to — with OpenTelemetry we have a way to export traces to another backend or even to multiple backends at the same time. This is what we are going to do in the next post — I will show you how to integrate with Jaeger, a popular open-source distributed tracing backend.
