The Mystery of the Missing Data
It was 2 AM on a Tuesday when I got the call. Our production microservices were experiencing mysterious performance degradation, and the observability team was baffled. Traces were showing up for some requests but not others, and when they did appear, they were incomplete. The worst part? The CPU usage was through the roof.
I had been working with OpenTelemetry for about six months at that point, integrating it across our Python microservices architecture. We run a platform processing millions of requests daily, so every millisecond matters. That night, I learned a lesson that fundamentally changed how I think about distributed tracing.
The Setup: What We Were Doing
Our system had three main services:
- API Gateway - Entry point for all requests
- Auth Service - Validates user credentials
- Data Service - Processes and stores data
We had configured OpenTelemetry with what I thought was a sensible setup:
- ProbabilitySampler set to 10% on the API Gateway (sample 1 in 10 requests)
- AlwaysOnSampler on the Auth and Data services (sample everything)
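In Python SDK terms, the setup effectively amounted to something like this (a sketch from memory; the current SDK spells the probability sampler TraceIdRatioBased, and, as becomes important below, the downstream "always on" behaved like the parent-respecting default):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ParentBased, TraceIdRatioBased

# API Gateway: keep roughly 1 in 10 traces at the entry point
gateway_provider = TracerProvider(sampler=TraceIdRatioBased(0.1))

# Auth / Data services: what I thought of as "sample everything";
# in effect it was the SDK default, ParentBased(ALWAYS_ON)
downstream_provider = TracerProvider(sampler=ParentBased(root=ALWAYS_ON))
```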
The reasoning seemed sound: we don’t want to process every trace, so sample at the entry point and let downstream services decide independently. Simple, right? Wrong.
The Problem Reveals Itself
I pulled up the Jaeger dashboard and noticed something odd. When the API Gateway decided not to sample a request (the 90% that didn’t make the cut), the downstream services were still creating spans, but they were being created as NonRecording spans. These spans had zero visible effect: they didn’t show up in our traces, they didn’t export data, and they contributed nothing to the traces we wanted to analyze.
But here’s the catch: these NonRecording spans were still doing something. Each one was still being created, still propagating context, still going through span processors. It felt like we were paying a cost for spans we weren’t even using.
I started digging into the OpenTelemetry specification and the Python SDK source code. That’s when I discovered the distinction that would change everything.
Recording Spans vs. NonRecording Spans: The Epiphany
Recording Spans are the spans you actually want. They:
- Capture all your SetAttribute() calls, AddEvent() calls, and status information
- Get processed by span processors and eventually exported to Jaeger, Datadog, or wherever
- Have a meaningful overhead because they’re storing real data
- Show up in your traces and dashboards
```python
# A Recording Span - everything gets captured
with tracer.start_as_current_span("database_query") as span:
    span.set_attribute("query_type", "SELECT")
    span.set_attribute("table", "users")
    span.add_event("query_started")
    # ... do work ...
    span.set_attribute("rows_returned", 42)
```
NonRecording Spans, on the other hand, are the spans that don’t make the cut:
- The sampler decided they shouldn’t be recorded
- All calls to SetAttribute(), AddEvent(), etc. are instant no-ops
- They’re never exported anywhere
- They still propagate trace context to child services (crucial for distributed tracing)
- They should be virtually free from a performance perspective
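You can see the no-op behavior directly from the API. Below is a quick hand-built sketch (in real code the SDK creates NonRecording spans for you whenever the sampler says no; the trace and span IDs here are arbitrary placeholders):

```python
from opentelemetry.trace import NonRecordingSpan, SpanContext, TraceFlags

# A span context whose sampled flag is off, as it would arrive from a parent
# that decided not to record this trace
ctx = SpanContext(
    trace_id=0x1234567890ABCDEF1234567890ABCDEF,
    span_id=0x1234567890ABCDEF,
    is_remote=True,
    trace_flags=TraceFlags(TraceFlags.DEFAULT),  # sampled bit not set
)

span = NonRecordingSpan(ctx)
print(span.is_recording())               # False
span.set_attribute("key", "value")       # no-op: nothing is stored
span.add_event("query_started")          # no-op: nothing is exported
print(span.get_span_context().trace_id)  # ...but the context still propagates
```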
But here’s where I got it wrong: I assumed NonRecording spans cost nothing, when in reality a NonRecording span was being created in every service for every single request whose parent wasn’t recorded.
The Ah-Ha Moment
The issue was in how I’d configured the samplers. The API Gateway sampled at 10%, which meant 90% of requests got NonRecording spans. Those NonRecording spans then propagated the “don’t record this” decision downstream through the trace context headers.
When the Auth Service received a request whose trace context didn’t carry the sampled flag, it respected the parent’s decision, even though I thought I had configured it to always sample: in effect it was running the SDK default of ParentBased(ALWAYS_ON), which applies the always-on rule only to root spans and defers to the parent everywhere else. So it created a NonRecording span too.
The cascade continued to the Data Service. Same result.
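To make that concrete, here is a simplified sketch of the decision a ParentBased sampler makes. This is the idea, not the SDK source (the real implementation lives in opentelemetry.sdk.trace.sampling):

```python
# Conceptual sketch of ParentBased sampling, not the actual SDK code
def parent_based_decision(parent_span_context, root_sampler_decision):
    if parent_span_context is not None and parent_span_context.is_valid:
        # A parent exists: follow its sampled flag from the traceparent header
        if parent_span_context.trace_flags.sampled:
            return "RECORD_AND_SAMPLE"  # real, exportable span
        return "DROP"                   # NonRecording span; context still propagates
    # No parent: this is a root span, so the configured root sampler decides
    return root_sampler_decision
```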
So here’s what was happening:
- Request arrives at API Gateway
- 90% of the time: NonRecording span created (no-op)
- Request goes to Auth Service
- Auth Service sees “don’t record this trace” in the context
- NonRecording span created (no-op)
- Request goes to Data Service
- Same story: NonRecording span created
Three NonRecording spans per request, 90% of the time. Millions of requests per day. That’s a lot of no-ops.
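To put a hypothetical number on it: at, say, 3 million requests a day, that is 3,000,000 × 0.9 × 3 ≈ 8.1 million NonRecording spans per day that exist only to carry context and then be thrown away.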
But wait—NonRecording spans are supposed to be free. So why was CPU spiking?
The Real Culprit: Premature Optimization Gone Wrong
Then I looked at my span processor code. That was the problem. I had written a custom span processor that was doing everything unconditionally:
```python
from opentelemetry.sdk.trace import SpanProcessor

class MyCustomSpanProcessor(SpanProcessor):
    def on_start(self, span, parent_context=None):
        # I was doing expensive operations on EVERY span
        detailed_info = get_detailed_system_info()   # OUCH!
        user_context = lookup_user_from_database()   # OUCH!
        span.set_attribute("system_info", detailed_info)
        span.set_attribute("user", user_context)

    def on_end(self, span):
        # More expensive work
        pass
```
The problem: this processor was running on NonRecording spans too! Even though the span would never be exported, my code was computing expensive information for every single one.
It took me a few hours of profiling to realize that most of the CPU cost was coming from this processor, not from the span creation itself.
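If you want to reproduce that kind of analysis, the standard library's cProfile pointed at the request path is enough to surface a hot span processor. A sketch (handle_request and sample_request are hypothetical stand-ins for your own handler and a representative request):

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    handle_request(sample_request)  # hypothetical: your own request handler
profiler.disable()

# Sort by cumulative time; in my case the custom on_start sat near the top
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```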
The Fix: One Simple Check
The fix was almost embarrassingly simple. I just needed to check if the span was recording before doing expensive work:
```python
class MyCustomSpanProcessor(SpanProcessor):
    def on_start(self, span, parent_context=None):
        # Only do expensive operations if we're actually recording
        if span.is_recording():
            detailed_info = get_detailed_system_info()
            user_context = lookup_user_from_database()
            span.set_attribute("system_info", detailed_info)
            span.set_attribute("user", user_context)

    def on_end(self, span):
        if span.is_recording():
            # Only process spans that will actually be exported
            pass
```
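For completeness, the processor stays attached to the tracer provider exactly as before; only the guard inside it changed (a sketch, exporter setup omitted):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

provider = TracerProvider()
provider.add_span_processor(MyCustomSpanProcessor())
trace.set_tracer_provider(provider)
```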
But there’s more. I also added a guard in my instrumented code:
```python
@app.route("/api/users/<user_id>")
def get_user(user_id):
    with tracer.start_as_current_span("fetch_user_details") as span:
        # Only compute expensive data if we're recording
        if span.is_recording():
            user_data = fetch_from_database(user_id)
            span.set_attribute("user_data", serialize(user_data))
        else:
            # Still fetch data for the response, just don't add to span
            user_data = fetch_from_database(user_id)
        return user_data
```
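One small note on that handler: since the database fetch is needed for the response either way, the same guard can be written without the duplicated call, keeping only the attribute work conditional:

```python
@app.route("/api/users/<user_id>")
def get_user(user_id):
    with tracer.start_as_current_span("fetch_user_details") as span:
        user_data = fetch_from_database(user_id)  # needed for the response regardless
        if span.is_recording():
            # Only pay for serialization when the span will actually be exported
            span.set_attribute("user_data", serialize(user_data))
        return user_data
```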
I deployed this fix on a Tuesday afternoon. By Wednesday morning, CPU usage had dropped by 35%. Let me repeat that: 35% CPU reduction just by adding a single if span.is_recording() check.
What Changed
Here’s what I learned that night (well, morning… it was a long debugging session):
- NonRecording spans are truly free if you treat them correctly - If you don’t do anything in your code that requires actually recording data, a NonRecording span has virtually zero overhead. The span context propagation is handled at the framework level.
- The sampler makes the decision at span creation time - You don’t get to decide later whether a span is recording. The sampler decides when the span is created based on your sampling strategy and parent context.
- Sampling strategy matters downstream - Using ParentBasedSampler (the default) means downstream services respect the parent’s sampling decision. This ensures consistency but also means a single sampling decision at your entry point affects the entire trace tree.
- Check before expensive operations - Always guard expensive operations with span.is_recording(). This is a pattern I now follow religiously.
- NonRecording spans still propagate context - Even though they’re not recorded, they still pass trace context to child services. This is essential for distributed tracing to work correctly across your system.
The Architecture Lesson
Looking back, I realized my sampling strategy was the deeper issue. I had pushed the sampling decision to the entry point without thinking through what that decision would do downstream.
Here’s what I changed:
Before (naive approach):
- API Gateway: 10% sample rate
- Auth Service: Always sample (but respects parent)
- Data Service: Always sample (but respects parent)
- Result: 90% of traces were silently not recorded, leading to suspicious NonRecording spans everywhere
After (thoughtful approach):
- API Gateway: 100% sample rate (it’s the entry point; we make the decision here)
- Auth Service: Respect parent (no sampler specified, uses default ParentBasedSampler)
- Data Service: Respect parent (same as above)
- Result: Clear recording decisions, no unnecessary NonRecording spans, cleaner mental model
OR, if we really wanted 10% sampling:
- API Gateway: 10% sample rate
- Auth Service: 10% sample rate (don’t respect the parent; be explicit)
- Data Service: 10% sample rate (don’t respect the parent; be explicit)
- Result: Consistent sampling across all services, because the trace-ID-ratio sampler reaches the same decision for the same trace ID everywhere, and no surprises from NonRecording spans
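In SDK terms, the two options look roughly like this (a sketch using the current sampler names):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ParentBased, TraceIdRatioBased

# Option 1: decide once at the entry point, everyone else follows the parent
gateway = TracerProvider(sampler=ALWAYS_ON)
downstream = TracerProvider(sampler=ParentBased(root=ALWAYS_ON))  # the SDK default

# Option 2: explicit 10% everywhere; the ratio sampler keys off the trace ID,
# so every service reaches the same decision for the same trace
gateway_10pct = TracerProvider(sampler=TraceIdRatioBased(0.1))
downstream_10pct = TracerProvider(sampler=TraceIdRatioBased(0.1))
```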
The Lesson Sticks
That incident taught me that OpenTelemetry’s design, while powerful, requires you to understand the underlying concepts deeply. NonRecording spans aren’t a flaw—they’re a feature. They allow efficient sampling by creating no-op spans that still propagate context. But you need to know how to work with them.
Now, whenever I instrument code, I follow these principles:
- Understand your sampler - Know why spans are being sampled the way they are
- Use is_recording() as a guard - Before expensive operations, check if the span is recording
- Configure samplers consciously - Think about where you want sampling decisions made
- Profile before and after - Don’t assume anything; measure the impact
Code Example: The Right Way
Here’s the pattern I now use in all my instrumented code:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_request(request):
    with tracer.start_as_current_span("process_request") as span:
        # Cheap operations - always do these
        span.set_attribute("request_id", request.id)
        span.set_attribute("method", request.method)

        # Expensive operations - guard with is_recording()
        if span.is_recording():
            span.set_attribute("request_headers", dict(request.headers))
            span.set_attribute("user_context", get_user_context(request))

        # Your business logic
        result = do_work(request)

        # Record result only if we're sampling
        if span.is_recording():
            span.set_attribute("result_status", result.status)
            span.set_attribute("processing_time_ms", result.time_ms)

        return result
```
Conclusion
OpenTelemetry’s distinction between recording and non-recording spans reflects a deep understanding of observability at scale. It’s not enough to just add tracing to your code; you need to understand how tracing decisions cascade through your system.
That 2 AM incident, the mysterious CPU spike, and the hours spent debugging—they all led me to appreciate the elegance of this design. NonRecording spans let you sample efficiently without sacrificing distributed trace context propagation. But you have to use them wisely.
If you’re working with OpenTelemetry in Python, I’d recommend taking the time to understand recording vs. non-recording spans deeply. Add some profiling to see where the overhead is coming from. And always, always check span.is_recording() before doing expensive work.
Your CPU (and your SRE team) will thank you.