OpenTelemetry in Practice: Distributed Tracing Across Message Queues


You’ve instrumented your services, you’ve got beautiful traces flowing through Grafana Tempo, your on call engineers can finally see what’s happening across the stack, and then someone introduces an SQS queue between two services and your traces just… stop. Two orphaned traces where there should be one connected flow. That beautiful Grafana dashboard now looks like a jigsaw puzzle with the most interesting pieces missing.

I’ve spent the better part of the past year building observability stacks for event driven architectures, and this is the single most common gap I see. Synchronous HTTP tracing is a solved problem — the W3C Trace Context standard and OpenTelemetry’s auto-instrumentation handle it beautifully. But the moment you introduce asynchronous messaging — SQS, RabbitMQ, Kafka — context propagation becomes your problem to solve.

This post walks through the reality of maintaining distributed traces across message queue boundaries using OpenTelemetry: real code, real Collector configuration, and the pitfalls I’ve hit so you don’t have to.

Why Distributed Tracing Matters

If you’re already sold on observability, feel free to skip ahead. But for anyone still on the fence, let me frame it this way.

Logs tell you what happened. Metrics tell you how much. Traces tell you why and where.

When a customer reports that their order wasn’t processed, logs give you a needle in a haystack search across 8 different services. Metrics tell you that error rates spiked at 2:47am. But a trace? A trace lets you follow a single request from the initial HTTP call, through every service hop, queue message, and database query, straight to the point where things fell apart. It’s the difference between “something broke” and “the payment validation service timed out waiting for a response from the fraud check queue, which was backed up because a bad deployment doubled consumer processing time.” One of those gets you a root cause in minutes. The other lands you in a war room, a Teams thread involving 40 people, and a suggestion to “just restart everything.”

The challenge is that modern architectures are increasingly event driven. Services communicate through message queues for good reasons — decoupling, resilience, scalability. But these async boundaries are exactly where traditional tracing breaks down.

As of early 2026, OpenTelemetry’s tracing APIs are stable across all major languages (Python, Java, Go, .NET, JavaScript, Ruby, and more). This isn’t experimental anymore. The CNCF Annual Survey 2024 reported OTel at 49% adoption — the second highest among non-graduated CNCF projects — and Grafana Labs found that 85% of respondents are actively investing in OpenTelemetry.

OpenTelemetry Fundamentals: A Quick Refresher

If you’re already comfortable with OTel concepts, skip to the next section. For everyone else, here’s the 60 second version.

A trace represents an end to end request flowing through your system. It’s a tree of spans, where each span represents a unit of work — an HTTP handler, a database query, a message publish, a message consume.

Every span has a trace ID (shared across the entire trace), a span ID (unique to this span), a parent span ID (linking it to the span that started this work), timing information, and attributes (key/value metadata like http.method, db.statement, messaging.system).

The glue that holds this together is context propagation. When Service A calls Service B over HTTP, the OpenTelemetry SDK injects a traceparent header into the request. This header carries the trace ID and the current span ID. Service B extracts it, creates a child span linked to the parent, and the trace continues. This follows the W3C Trace Context standard, and it happens automatically with auto-instrumentation for HTTP libraries.
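
For reference, a traceparent header is a single value with four dash-separated fields: version, trace ID, parent span ID, and flags. A quick sketch pulling one apart (the IDs are the example values from the W3C Trace Context spec):

```python
# Example traceparent value (IDs taken from the W3C Trace Context spec)
header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

version, trace_id, span_id, flags = header.split("-")
assert len(trace_id) == 32  # 128-bit trace ID, hex encoded
assert len(span_id) == 16   # 64-bit span ID, hex encoded
assert flags == "01"        # the "sampled" flag is set
```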

The SDK pipeline looks like this: you configure a TracerProvider with a Resource (identifying your service), add a SpanProcessor (usually BatchSpanProcessor for production), and connect it to an exporter (OTLP to your Collector). Then you get a Tracer and start creating spans.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "order-processor",
    "service.version": "1.4.2",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

Simple enough. If only the rest of the story were this straightforward…

The Problem: Traces That Die at the Queue

Here’s what happens with synchronous HTTP calls: Service A makes a request to Service B. The OTel SDK automatically injects traceparent into the HTTP headers. Service B’s instrumentation extracts it, creates a child span, and the trace continues seamlessly. You didn’t write a single line of propagation code.

Now consider what happens with a message queue. Service A publishes a message to SQS. The message sits in the queue for anywhere from milliseconds to hours. Service B polls the queue, picks up the message, and processes it. But here’s the thing — there was no HTTP call between A and B. The SDK had no opportunity to inject headers. The consumer has no idea that this message was part of a larger trace.

The result? Two disconnected traces. The producer side shows a span for “publish message” that ends abruptly. The consumer side shows a span for “process message” that appears to have no parent. If you’re looking at this in Grafana, you see two separate, orphaned traces where there should be one connected story.

This happens because message queues break the assumption that services communicate synchronously. HTTP auto-instrumentation works by intercepting the request/response lifecycle. There’s no equivalent lifecycle to intercept for a message sitting in a queue. The message is just… vibing. Parked somewhere inside SQS, completely unaware that anyone cares about its provenance.

To make it worse:

  • Messages can be batched — one poll might return 10 messages, each from different traces
  • Messages can be retried — the same message processed multiple times
  • Messages can be dead-lettered — failures route to a DLQ with no trace context
  • The producer and consumer have no direct connection — they don’t know about each other

Context Propagation Through Message Attributes

The fix is conceptually simple: inject the trace context into the message when you publish it, and extract it when you consume it. The mechanism varies by messaging system.

For SQS, you use MessageAttributes. For RabbitMQ, it’s message headers. For Kafka, it’s record headers. The principle is identical across all three: serialise the W3C Trace Context (traceparent and tracestate) into the message metadata on the producer side, and deserialise it on the consumer side.

Let’s look at the SQS implementation in Python.

The Producer

import json
from opentelemetry import trace
from opentelemetry.propagate import inject
import boto3

tracer = trace.get_tracer(__name__)
sqs = boto3.client("sqs", region_name="ap-southeast-2")

QUEUE_URL = "https://sqs.ap-southeast-2.amazonaws.com/123456789012/order-events"


def publish_order_event(order_id: str, event_data: dict) -> None:
    with tracer.start_as_current_span(
        "publish order_event",
        kind=trace.SpanKind.PRODUCER,
        attributes={
            "messaging.system": "aws_sqs",
            "messaging.destination.name": "order-events",
            "messaging.operation.type": "publish",
            "order.id": order_id,
        },
    ) as span:
        # Inject trace context into a carrier dict
        carrier: dict[str, str] = {}
        inject(carrier)

        # Convert the carrier to SQS MessageAttributes format
        message_attributes = {
            key: {"DataType": "String", "StringValue": value}
            for key, value in carrier.items()
        }

        response = sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(event_data),
            MessageAttributes=message_attributes,
        )

        span.set_attribute(
            "messaging.message.id",
            response.get("MessageId", ""),
        )

The key line is inject(carrier). This takes the current span context and serialises it into the carrier dictionary as traceparent and tracestate key/value pairs. We then convert those into SQS MessageAttributes.

The Consumer

import json
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)


def process_message(message: dict) -> None:
    # Extract trace context from SQS MessageAttributes
    carrier = {
        attr_name: attr_value["StringValue"]
        for attr_name, attr_value in message.get("MessageAttributes", {}).items()
        if attr_value.get("DataType") == "String"
    }

    # Restore the parent context from the message
    parent_context = extract(carrier)

    with tracer.start_as_current_span(
        "process order_event",
        context=parent_context,
        kind=trace.SpanKind.CONSUMER,
        attributes={
            "messaging.system": "aws_sqs",
            "messaging.destination.name": "order-events",
            "messaging.operation.type": "process",
        },
    ) as span:
        event_data = json.loads(message["Body"])
        span.set_attribute("order.id", event_data.get("order_id", ""))

        # Your business logic here
        handle_order(event_data)

The extract(carrier) call does the reverse — it reads traceparent from the carrier and reconstructs the parent context. When we create our consumer span with context=parent_context, it becomes a child of the original producer span. One connected trace, across an async boundary.

Beyond traceparent and tracestate, OpenTelemetry also supports Baggage — a mechanism for propagating arbitrary key/value pairs (like tenant IDs or feature flags) alongside trace context. While span attributes are local to a single span, baggage propagates across service boundaries automatically. It travels in the same carrier as trace context, so inject and extract work identically. It’s useful when you need business context to follow a request through every service without manually adding attributes at each hop.

SQS has a hard limit of 10 message attributes per message. OTel propagation uses up to 2 of those slots (traceparent, plus tracestate when present). If you’re already using 8+ custom attributes, you’ll need to be strategic about which ones to keep. I’ve seen teams consolidate multiple attributes into a single JSON-encoded attribute to free up slots.
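
One way to do that consolidation, sketched as a plain helper (the app_metadata attribute name is made up for illustration):

```python
import json


def build_message_attributes(custom: dict[str, str], carrier: dict[str, str]) -> dict:
    # Pack all custom metadata into one JSON-encoded attribute so that
    # app metadata plus trace propagation together use only a few of
    # the 10 available slots.
    attributes = {
        "app_metadata": {
            "DataType": "String",
            "StringValue": json.dumps(custom),
        }
    }
    # traceparent/tracestate from inject() still get their own slots
    for key, value in carrier.items():
        attributes[key] = {"DataType": "String", "StringValue": value}
    return attributes


attrs = build_message_attributes(
    {"source": "checkout", "priority": "high", "region": "ap-southeast-2"},
    {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"},
)
# Two attribute slots used instead of four
```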

A Note on AWSTraceHeader

If you’re using AWS Lambda’s builtin X-Ray integration (without the OTel SDK), trace context propagation uses the AWSTraceHeader system attribute — that’s your only option, and it’s X-Ray format rather than W3C. However, if your Lambda uses the OpenTelemetry SDK or Lambda layer, you can inject and extract traceparent via custom MessageAttributes just like any other consumer. This gives you W3C-standard propagation regardless of whether the consumer is Lambda, ECS, or EC2. The confusion comes from mixing up “Lambda’s native X-Ray tracing” with “a Lambda function that happens to use OTel” — they’re different beasts, and AWS’s documentation doesn’t go out of its way to clarify the distinction.

Auto-Instrumentation vs Manual: Pick Your Battles

“Do I really need to write all that manually?” No, not always. The opentelemetry-instrumentation-botocore package automatically instruments SQS send_message calls, injecting trace context into message attributes. However, for full end to end trace linking — including proper context extraction on the consumer side with per-message span handling — you’ll want the dedicated opentelemetry-instrumentation-boto3sqs package as well. Think of botocore as the general AWS SDK instrumentation and boto3sqs as the SQS specialist who actually understands message iteration.

from opentelemetry.instrumentation.botocore import BotocoreInstrumentor
from opentelemetry.instrumentation.boto3sqs import Boto3SQSInstrumentor

# General AWS SDK tracing (API calls, DynamoDB, S3, etc.)
BotocoreInstrumentor().instrument()

# SQS-specific producer-consumer trace linking
Boto3SQSInstrumentor().instrument()

With both in place, your send_message() calls inject trace context and your message iteration on the consumer side properly extracts it — creating correctly linked child spans.

Use auto-instrumentation when:

  • You’re using standard SQS send/receive patterns
  • Single message processing (not batch)
  • You want baseline visibility with minimal effort

Use manual instrumentation when:

  • You’re processing messages in batches and need per-message spans
  • You want to add business specific attributes (order IDs, customer segments, processing stages)
  • You have custom message routing or fan-out patterns
  • You need fine grained control over span naming and structure

The sweet spot, in my experience, is a hybrid approach: enable auto-instrumentation for the baseline SQS operations, then add manual spans for your business logic within the consumer. You get the plumbing for free and add detail where it matters.

This pattern extends beyond SQS. For RabbitMQ, there’s opentelemetry-instrumentation-pika. For Kafka, opentelemetry-instrumentation-confluent-kafka. Same principle, different transport.

Collector Configuration

The OpenTelemetry Collector sits between your instrumented services and your tracing backend. It receives spans, processes them (batching, filtering, enriching), and exports them to wherever you store traces.

You can deploy it as a sidecar (per-container), an agent (per-host), or a gateway (centralised). For most setups, the agent pattern — one Collector per host — is the sweet spot. It keeps network hops minimal while still centralising processing configuration. If you’re wondering which to choose, start with agent and upgrade to gateway when your Collector configs start looking like they need their own version controlled repository. (They will.)

Here’s a production ready Collector configuration:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Always first — prevents the Collector from running out of memory
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  batch:
    timeout: 5s
    send_batch_size: 1024

  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert

  # Drop health check spans — they're noisy and useless
  filter:
    error_mode: ignore
    traces:
      span:
        - attributes["url.path"] == "/health"
        - attributes["url.path"] == "/readiness"

  # Redact sensitive data before it leaves the Collector
  transform:
    trace_statements:
      - context: span
        statements:
          - replace_pattern(attributes["db.statement"], "email = '[^']*'", "email = '***'")

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: false
      cert_file: /certs/collector.crt
      key_file: /certs/collector.key

  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes, filter, transform]
      exporters: [otlp/tempo]

Processor ordering matters. The memory_limiter should always come first in your pipeline. If the Collector runs out of memory, nothing else matters — you’ll lose spans silently. The batch processor should come early too, as it significantly reduces export overhead by grouping spans together.

Visualising Traces: Grafana Tempo

You need somewhere to store and query your traces. I use Grafana Tempo, and if you’re already running Grafana for metrics and logs, it’s the natural choice.

Why Tempo Over Jaeger

Jaeger has been the go to open source tracing backend for years, and Jaeger v2 (rebuilt on the OTel Collector architecture in late 2024) is a genuine step forward — it now supports multiple storage backends including Cassandra, Elasticsearch, ClickHouse, and Badger (an embedded store requiring no external database). Still, for production at scale, you’ll likely run Cassandra or ClickHouse, which adds operational overhead. You end up managing indexing, capacity planning, and retention policies for what is ultimately a database you’d rather not think about.

Tempo takes a different approach. It writes traces directly to object storage (S3, GCS, Azure Blob) with a lightweight index. No Cassandra cluster to babysit. No Elasticsearch nodes to tune. The operational simplicity is significant, especially for smaller teams who’d rather spend their time building features than babysitting a tracing database.

Tempo originally focused on trace ID lookup — find the ID from a log line or metric exemplar, then pull up the full trace. But TraceQL has since turned it into a proper query engine. You can now search by span attributes, filter by duration or status, run structural queries (find traces by shape), and even generate metrics directly from trace data with aggregate functions like rate() and quantile_over_time(). Combined with Grafana’s correlation features, you get a powerful tracing backend with the operational simplicity of “put it in S3 and forget about it.” For most operational workflows, this is exactly what you want.
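
A few TraceQL queries in that spirit, using the messaging attributes from earlier (exact attribute names depend on your instrumentation):

```
# Slow SQS consumer spans
{ span.messaging.system = "aws_sqs" && duration > 500ms }

# Traces containing both a publish span and an errored span
{ span.messaging.operation.type = "publish" } && { status = error }

# Metrics from traces: span rate per service
{ span.messaging.system = "aws_sqs" } | rate() by (resource.service.name)
```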

The Logs-to-Traces Workflow

This is where the real power lives. With Grafana, Loki (for logs), and Tempo (for traces) configured together, you get a seamless workflow:

  1. An alert fires based on an error rate metric
  2. You click through to the correlated logs in Loki
  3. The log line contains a trace ID (because your application logs include it)
  4. You click the trace ID and Grafana jumps directly to the trace in Tempo
  5. The trace shows the full request flow, including the async queue hops

When your cross-queue traces are properly connected, you see the producer span, a gap representing the queue latency, and then the consumer span as a child — all under a single trace ID. That gap is useful information too; if it’s growing, your queue is backing up.

Common Pitfalls and Hard-Won Lessons

I’ll save you some of the debugging sessions I’ve endured. Consider this the “mistakes I’ve already made so you don’t have to” section — which is probably the only reason anyone reads blog posts like these.

Batch consumers need per-message spans. When you poll SQS and get 10 messages back, each message has its own trace context from its own producer. A common mistake is creating a single span for the entire batch. You need to extract context and create a span per message. The batch poll itself can be a parent span, with individual message processing spans as children.

Dead letter queues need tracing too. When a message gets DLQ’d after max retries, the trace context travels with it in the message attributes. Your DLQ consumer should extract that context just like your primary consumer does. This gives you an unbroken trace from the original producer, through the failed processing attempts, to the DLQ handler. Without this, your DLQ messages are untraceable black holes — and debugging a DLQ without traces is like trying to solve a murder mystery where all the evidence was thrown away. One subtlety: each retry creates a new receive_message event, so make sure your consumer extracts trace context from the message attributes (not from the SQS receive operation itself) to maintain trace continuity across retries.

Sampling decisions must be consistent. If you’re sampling traces (and at scale, you should be), the decision needs to happen at the trace root — the first service that creates the trace. If the producer samples at 10% and the consumer independently samples at 10%, you’ll end up with only 1% of traces having both sides. Use head-based sampling at the root, and propagate the sampling decision through the trace context. An alternative worth considering is tail-based sampling at the Collector level — instead of deciding at the root, the Collector buffers completed traces and makes sampling decisions based on outcome. Always keep error traces and slow traces (say, >500ms), while probabilistically sampling normal traffic at 10%. You never randomly miss an error again. The trade off is memory overhead in the Collector (it must buffer traces until all spans arrive), but for most production workloads, the visibility gain is well worth it.
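
The Collector side of that tail-based policy can be sketched with the tail_sampling processor (thresholds and policy names are illustrative; a trace is kept if any one policy matches):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer window before a trace is decided
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```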

Clock skew is normal; don’t panic. Distributed services have slightly different clocks, even with NTP. You’ll occasionally see consumer spans that appear to start before the producer span ends by a few milliseconds. This is expected — your services haven’t invented time travel, they just have slightly different opinions about what time it is. If the skew is more than a few seconds, check your NTP configuration.

Keep sensitive data out of span attributes. It’s tempting to add customer IDs, email addresses, or order details as span attributes for debugging convenience. Don’t. Trace data often has different retention and access controls than your application databases, and nobody wants to explain to the security team why customer email addresses are sitting in a Tempo bucket with no encryption at rest. Use the Collector’s transform processor to redact sensitive patterns before they reach your backend.

Start manual, then automate. I’ve found the best approach is to manually instrument one service end to end first, verify the traces look correct in your tracing UI, and then introduce auto-instrumentation. Starting with auto-instrumentation and hoping for the best leads to confusing gaps where you’re not sure if the missing spans are a configuration issue or an instrumentation limitation. Ask me how I know. Well… it involved three days and a growing suspicion that the Collector was gaslighting me.


Distributed tracing across message queues isn’t magic, but it does require effort. The combination of OpenTelemetry’s propagation APIs, proper message attribute handling, and a well configured Collector gives you the end to end visibility that event driven architectures desperately need.

If you’re running an event driven architecture without cross-queue tracing, you’re debugging with one eye closed. The tooling is mature enough, the standards are stable enough, and the operational overhead is low enough that there’s no good reason not to do this. The hardest part is the initial setup — once the plumbing is in place, every new service you instrument extends the trace graph automatically. And the first time you pull up a trace that flows seamlessly from an API gateway, through three queues, and into a DLQ handler, you’ll wonder how you ever debugged distributed systems without it. (The answer is “badly.” We debugged them badly.)

OpenTelemetry’s profiling support is coming along nicely too, which will eventually give us the ability to correlate traces with CPU and memory profiles. Imagine clicking on a slow span and seeing exactly which function was consuming resources. But that’s a post for another day.

Want to dig deeper? Check out the OpenTelemetry documentation for the full specification, the Collector configuration reference for all available processors and exporters, the Grafana Tempo docs for backend setup, and the OTel boto3/botocore instrumentation for SQS auto-instrumentation details.

Last updated: 1 March 2026