name: debug-missing-spans
description: >
  Troubleshoot when expected OpenTelemetry spans don't reach the backend. Walks the chain top-to-bottom (code → SDK init → processor → exporter → network → backend ingest) with concrete tests at each step. Covers head sampling, ctx.waitUntil drops on Cloudflare, init-order races, runtime detection failures, propagation breaks, exporter auth errors, and silent rate limits.
license: MIT
Debug missing spans
When a span you expect isn't in the backend, the cause is somewhere in this chain:
code → SDK init → head sampler → processor → exporter → network → backend ingest → backend index
This skill walks each link in order with a quick check you can run. Don't skip steps — the cause is rarely where you'd guess.
Step 0: Reproduce locally with the pretty exporter
Before chasing remote backends, confirm the span exists at all:
init({
  service: 'my-app',
  debug: 'pretty', // hierarchical colourised output to stdout
});
If you see the span in stdout, the SDK and sampler are fine; skip to Step 5 (Exporter) and Step 6 (Network / TLS). If you don't, keep reading.
Step 1: Is the SDK actually initialised?
Common failure: import order causes init() to run after the first request has already been handled.
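import { isSpanContextValid, trace } from '@opentelemetry/api';
const tracer = trace.getTracer('autotel-debug');
// Class names are unreliable here (the API hands out proxy objects and bundlers
// mangle names), so probe behaviour instead: a no-op tracer yields an all-zero trace ID.
const probe = tracer.startSpan('autotel-debug-probe', { root: true });
console.log(
  '[autotel-debug] tracer is no-op:',
  !isSpanContextValid(probe.spanContext()),
);
probe.end();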
If true, init() ran too late. Move it to the very top of the entry file (or to instrumentation.ts for Next.js).
Step 2: Head sampler
Print the effective head rate:
import { getActiveConfig } from 'autotel-edge';
console.log('[autotel-debug] sampling:', getActiveConfig()?.sampling);
Common gotchas:
- sampling.rates: { server: 5 } is 5 %, so 95 % of spans never start.
- Inheriting OTEL_TRACES_SAMPLER_ARG=0.01 from the environment via the OTel default sampler.
- Your test happens to hit the unsampled branch. Instrument with sampling: { rates: { server: 100 } } while reproducing.
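The third gotcha is easier to see once you know that ratio samplers typically decide from the trace ID alone, so a given trace is either always kept or always dropped. A minimal sketch of that idea follows. It is not autotel's actual sampler: `isHeadSampled` and the choice of reading the first 8 hex characters are assumptions made for illustration.

```typescript
// Hypothetical helper: decide keep/drop from the trace ID alone, the way
// ratio-based head samplers commonly do. Not autotel's implementation.
function isHeadSampled(traceId: string, percent: number): boolean {
  if (percent >= 100) return true;
  if (percent <= 0) return false;
  // Interpret the first 8 hex characters as an unsigned 32-bit number...
  const bucket = parseInt(traceId.slice(0, 8), 16);
  // ...and keep the trace only if it falls in the lowest `percent` of the range.
  return bucket < (percent / 100) * 2 ** 32;
}

const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(isHeadSampled(traceId, 5)); // false
console.log(isHeadSampled(traceId, 100)); // true
```

With a scheme like this, retrying the same request with the same trace ID never changes the outcome, which is why raising the rate to 100 while reproducing is the reliable option.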
To force sampling for one request, send a traceparent with the sampled flag set:
traceparent: 00-<traceid>-<spanid>-01
(-01 at the end = sampled.) autotel's parent-based sampler will respect it.
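To build that header by hand while reproducing, something like the following should work. It is a sketch of the W3C Trace Context layout (version `00`, 32-hex trace ID, 16-hex span ID, flags byte); `buildTraceparent` is a hypothetical helper, not part of autotel.

```typescript
// Hypothetical helper: format a W3C traceparent header value.
function buildTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  if (!/^[0-9a-f]{32}$/.test(traceId) || /^0+$/.test(traceId)) {
    throw new Error('traceId must be 32 lowercase hex chars and not all zeros');
  }
  if (!/^[0-9a-f]{16}$/.test(spanId) || /^0+$/.test(spanId)) {
    throw new Error('spanId must be 16 lowercase hex chars and not all zeros');
  }
  // Flags byte: 01 = sampled, 00 = not sampled.
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

const header = buildTraceparent('4bf92f3577b34da6a3ce929d0e0e4736', '00f067aa0ba902b7', true);
console.log(header); // 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```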
Step 3: Cloudflare Workers — ctx.waitUntil
The single biggest cause of missing spans on the edge: the response returned before the exporter flushed.
If you're using addEventListener('fetch', …) or a hand-rolled fetch in a module worker without wiring ctx.waitUntil(…) to the export call, async drains drop silently.
Fix — switch to defineWorkerFetch or wrapModule, both of which wire waitUntil automatically:
import { defineWorkerFetch } from 'autotel-cloudflare';
export default defineWorkerFetch(
  { service: { name: 'edge' } },
  async (request, env, ctx, log) => {
    // log.set / spans here are flushed via ctx.waitUntil, which keeps the
    // Worker alive until the export finishes, even after the response is sent
    return new Response('ok');
  },
);
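If you can't switch wrappers right away, the wiring those helpers perform can be approximated by hand. The sketch below assumes a hypothetical `flush` function that resolves once pending spans are exported; `respondThenFlush` and `WaitUntilContext` are illustrative names, not autotel APIs.

```typescript
// Minimal shape of the Workers execution context used here.
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}

// Hypothetical helper: run the handler, then register the export with
// ctx.waitUntil so the runtime keeps the Worker alive until it settles.
async function respondThenFlush<T>(
  ctx: WaitUntilContext,
  handler: () => Promise<T>,
  flush: () => Promise<void>,
): Promise<T> {
  try {
    return await handler();
  } finally {
    // Registered even when the handler throws, so error spans are exported too.
    ctx.waitUntil(flush());
  }
}
```

In a module worker, `handler` would build the `Response` and `flush` would call whatever force-flush your exporter exposes.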
Step 4: Processor pipeline
Print what's wired:
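import { trace } from '@opentelemetry/api';
// trace.getTracerProvider() returns the API's proxy; the SDK provider is its delegate.
const proxy = trace.getTracerProvider() as any;
const provider = proxy.getDelegate?.() ?? proxy;
console.log('[autotel-debug] provider:', provider.constructor.name);
console.log(
  '[autotel-debug] processors:',
  // Private field; its name varies across SDK versions, so undefined here is inconclusive.
  provider._registeredSpanProcessors?.map(
    (p: any) => p.constructor.name,
  ),
);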
Common issues:
- A FilteringSpanProcessor excludes your span. Check the include/exclude predicates.
- A TailSamplingProcessor dropped the trace (no error, no slow root, no debug header).
- A composePostProcessors step returns [] for your span.
To bisect, temporarily strip post-processors:
init({
  service: 'my-app',
  exporter: { url: process.env.OTLP_ENDPOINT! },
  // no postProcessor, no tail sampler, no filter
});
If the span shows up now, add back the processors one at a time.
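The same bisect can be done offline if you can call each step directly. A sketch, assuming each post-processor can be modelled as a function from a list of spans to a list of spans; the `SpanLike` and `Step` types and `findDroppingStep` are hypothetical and are not autotel's types.

```typescript
interface SpanLike { name: string; }
interface Step { label: string; run(spans: SpanLike[]): SpanLike[]; }

// Feed the span through the steps in order and report the first one
// whose output no longer contains a span with that name.
function findDroppingStep(steps: Step[], span: SpanLike): string | undefined {
  let current: SpanLike[] = [span];
  for (const step of steps) {
    current = step.run(current);
    if (!current.some((s) => s.name === span.name)) return step.label;
  }
  return undefined; // every step let the span through
}

const steps: Step[] = [
  { label: 'redact', run: (spans) => spans },
  { label: 'drop-health-checks', run: (spans) => spans.filter((s) => s.name !== 'GET /health') },
];
console.log(findDroppingStep(steps, { name: 'GET /health' })); // drop-health-checks
console.log(findDroppingStep(steps, { name: 'GET /orders' })); // undefined
```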
Step 5: Exporter
Tail the SDK's diagnostic log:
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);
Look for:
@opentelemetry/api: ... OTLPExporter: failed to send 4 traces, status: 401, error: ...
Common exporter errors:
| Status | Meaning | Fix |
|---|---|---|
| 401 | Bad / missing auth header | Check OTLP_HEADERS / vendor token name |
| 403 | Token has no write scope | Issue a token with the right scope |
| 404 | Wrong endpoint URL | Check region (api.honeycomb.io vs api.eu1.honeycomb.io) |
| 413 | Batch too big | Lower BatchSpanProcessor maxExportBatchSize |
| 429 | Rate-limited | Reduce head/tail rates; honour retry-after |
| 502/503/504 | Upstream unhealthy | Often transient; add retries; check backend status |
| Network error | DNS / firewall | curl -v <url> from the same network |
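Only some of these statuses are worth retrying. The OTLP/HTTP specification lists 429, 502, 503 and 504 as retryable; the others need a configuration fix. A sketch of that split, with `isRetryableStatus` and `retryDelayMs` as hypothetical helpers and the 1-second backoff base an arbitrary choice:

```typescript
const RETRYABLE = new Set([429, 502, 503, 504]);

function isRetryableStatus(status: number): boolean {
  return RETRYABLE.has(status);
}

// Honour a numeric Retry-After header (seconds) when present,
// otherwise fall back to exponential backoff starting at 1 s.
function retryDelayMs(retryAfter: string | null, attempt: number): number {
  const usable = retryAfter !== null && retryAfter.trim() !== '';
  const seconds = usable ? Number(retryAfter) : NaN;
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  return 1000 * 2 ** attempt;
}

console.log(isRetryableStatus(401)); // false
console.log(isRetryableStatus(429)); // true
console.log(retryDelayMs('5', 0)); // 5000
console.log(retryDelayMs(null, 2)); // 4000
```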
Step 6: Network / TLS
For self-hosted Collectors:
curl -v -X POST "$OTLP_ENDPOINT" \
  -H 'content-type: application/json' \
  -H "$AUTH_HEADER" \
  -d '{"resourceSpans":[]}'
Should return 200. If it doesn't, the problem is between you and the Collector — not autotel.
For Cloudflare Workers, run wrangler tail and look for OTLPExporter errors.
Step 7: Backend ingest — silent rejection
Some backends accept the request with a 200 but drop the events:
- Honeycomb: dataset must exist and the API key must have write access to it. Mismatched key/dataset → silent drop.
- Datadog: check service is set (resource attribute service.name); they ignore spans without it.
- Sentry: SDK version mismatch on envelope → 200 but events disappear.
- Grafana Cloud Tempo: spans without service.name go to a fallback service called unknown_service.
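Because a missing `service.name` is behind several of these cases, it can help to inspect a captured OTLP/JSON payload before blaming the backend. A sketch, with `resourcesMissingServiceName` a hypothetical helper and the types trimmed to the fields it reads:

```typescript
interface KeyValue { key: string; value: { stringValue?: string }; }
interface OtlpTracePayload {
  resourceSpans: { resource?: { attributes?: KeyValue[] } }[];
}

// Return the indexes of resourceSpans entries with no non-empty service.name.
function resourcesMissingServiceName(payload: OtlpTracePayload): number[] {
  const missing: number[] = [];
  payload.resourceSpans.forEach((entry, index) => {
    const attrs = entry.resource?.attributes ?? [];
    const named = attrs.some(
      (a) => a.key === 'service.name' && (a.value.stringValue ?? '') !== '',
    );
    if (!named) missing.push(index);
  });
  return missing;
}

const payload: OtlpTracePayload = {
  resourceSpans: [
    { resource: { attributes: [{ key: 'service.name', value: { stringValue: 'my-app' } }] } },
    { resource: { attributes: [] } },
  ],
};
console.log(resourcesMissingServiceName(payload)); // [ 1 ]
```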
For each backend, the dataset / index / project where you'd expect the span:
| Backend | Where the span lands |
|---|---|
| Honeycomb | dataset = service.name (auto-created) |
| Datadog | service:<name> filter |
| Grafana Tempo | search by traceId |
| Jaeger | service dropdown = service.name |
| Sentry | project linked to the DSN |
Step 8: Backend index lag
After a 200, expect ingestion lag of:
| Backend | Typical lag |
|---|---|
| Honeycomb | < 5 s |
| Datadog | 30–60 s |
| Grafana Tempo | 10–30 s |
| Sentry | 30–120 s |
| Self-hosted Jaeger | < 1 s |
Don't conclude the span is missing until you've waited > 2× the expected lag.
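Applying the 2× rule to the upper bound of each row gives a minimum wait before declaring a span missing. A sketch using the figures from the table above; `minWaitSeconds` is a hypothetical helper.

```typescript
// Upper bound of the typical lag per backend, in seconds (from the table).
const TYPICAL_LAG_SECONDS: Record<string, number> = {
  'Honeycomb': 5,
  'Datadog': 60,
  'Grafana Tempo': 30,
  'Sentry': 120,
  'Self-hosted Jaeger': 1,
};

function minWaitSeconds(backend: string): number {
  const lag = TYPICAL_LAG_SECONDS[backend];
  if (lag === undefined) throw new Error(`no lag figure for ${backend}`);
  return 2 * lag;
}

console.log(minWaitSeconds('Datadog')); // 120
console.log(minWaitSeconds('Sentry')); // 240
```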
Step-by-step checklist
[ ] Span shows in `debug: 'pretty'` stdout
[ ] Tracer is not a no-op (SDK initialised)
[ ] Head rate is high enough to allow the request
[ ] Workers handler uses defineWorkerFetch / wrapModule
[ ] No post-processor / tail sampler / filter strips it
[ ] Exporter logs no 4xx/5xx
[ ] Curl to OTLP endpoint returns 200
[ ] Backend has the right service.name / dataset / project
[ ] Waited 2× expected ingest lag
When the trace partially shows up
Some spans land, some don't:
- Trace context broken between services: outbound HTTP calls aren't propagating traceparent. Confirm autotel's global fetch instrumentation is on (instrumentation.instrumentGlobalFetch: true, default).
- Async boundary loses context: a setTimeout / queue callback ran outside the AsyncLocalStorage scope. Wrap with trace() or use context.with().
- Cross-runtime call: Node service → Workers → browser; verify traceparent arrives at each leg via response headers / network panel.
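To find the leg where propagation breaks, record the `traceparent` seen at each leg (the outbound header for the first service, the inbound header for the rest) and compare trace IDs. A sketch, assuming you collect the header values by hand, for example from logs or the network panel; `firstBrokenLeg` is a hypothetical helper.

```typescript
interface Leg { service: string; traceparent?: string; }

const TRACEPARENT = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/;

// Return the first service whose traceparent is missing, malformed,
// or carries a different trace ID than the previous leg.
function firstBrokenLeg(legs: Leg[]): string | undefined {
  let expected: string | undefined;
  for (const leg of legs) {
    const match = leg.traceparent ? TRACEPARENT.exec(leg.traceparent) : null;
    if (!match) return leg.service;
    const traceId = match[1];
    if (expected !== undefined && traceId !== expected) return leg.service;
    expected = traceId;
  }
  return undefined; // the trace ID survived every hop
}

const legs: Leg[] = [
  { service: 'node-api', traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01' },
  { service: 'edge-worker', traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01' },
  { service: 'browser' },
];
console.log(firstBrokenLeg(legs)); // browser
```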
When the SDK itself crashes
TypeError: Cannot read properties of undefined (reading 'startActiveSpan')
Usually means the API version (@opentelemetry/api) and SDK version (@opentelemetry/sdk-trace-base) drifted. Run:
pnpm why @opentelemetry/api
There should be exactly one resolved version. If there are two, dedup via pnpm.overrides.
Anti-patterns to fix as you debug
| Anti-pattern | Why it loses spans |
|---|---|
| init() after the first import that uses tracing | Spans before init() are no-ops |
| addEventListener('fetch', …) on Workers | Pre-module-worker style; no ctx.waitUntil to wire |
| Single OTLP_ENDPOINT env var with ? chars URL-encoded | Auth gets parsed as part of the path |
| Importing both @sentry/tracing and autotel | Double-instrumentation eats spans |
| process.exit(0) immediately after the work | The exporter never flushed; call await provider.shutdown() first |
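For the process.exit case, the ordering matters more than the exact API. A sketch, assuming a provider object with a `shutdown()` method that resolves once pending spans are exported; `runThenExit` and the injected `exit` callback are hypothetical names.

```typescript
interface Shutdownable { shutdown(): Promise<void>; }

// Run the work, flush via provider.shutdown(), and only then exit.
async function runThenExit(
  work: () => Promise<void>,
  provider: Shutdownable,
  exit: (code: number) => void,
): Promise<void> {
  let code = 0;
  try {
    await work();
  } catch {
    code = 1;
  } finally {
    // Swallow shutdown errors so a failing exporter cannot block the exit.
    await provider.shutdown().catch(() => undefined);
    exit(code);
  }
}
```

In a Node script this would be called with `process.exit` as the `exit` argument, so the process ends only after the flush has settled.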