debug-missing-spans

name: debug-missing-spans
description: >
  Troubleshoot when expected OpenTelemetry spans don't reach the backend. Walks the chain top to bottom (code → SDK init → processor → exporter → network → backend ingest) with concrete tests at each step. Covers head sampling, ctx.waitUntil drops on Cloudflare, init-order races, runtime detection failures, propagation breaks, exporter auth errors, and silent rate limits.
license: MIT

Debug missing spans

When a span you expect isn't in the backend, the cause is somewhere in this chain:

code → SDK init → head sampler → processor → exporter → network → backend ingest → backend index

This skill walks each link in order with a quick check you can run. Don't skip steps — the cause is rarely where you'd guess.

Step 0: Reproduce locally with the pretty exporter

Before chasing remote backends, confirm the span exists at all:

init({
  service: 'my-app',
  debug: 'pretty', // hierarchical colourised output to stdout
});

If you see the span in stdout, the SDK and head sampler are fine: skip to Step 5 (Exporter) and Step 6 (Network / TLS). If you don't, keep reading.

Step 1: Is the SDK actually initialised?

Common failure: init() runs after the first request because of import order.

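import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('autotel-debug');
// Until a provider is registered, the API hands out a ProxyTracer that
// delegates to a no-op tracer, so check for both class names.
console.log(
  '[autotel-debug] tracer is no-op:',
  ['NoopTracer', 'ProxyTracer'].includes(tracer.constructor.name),
);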

If true, init() ran too late. Move it to the very top of the entry file (or to instrumentation.ts for Next.js).

Step 2: Head sampler

Print the effective head rate:

import { getActiveConfig } from 'autotel-edge';
console.log('[autotel-debug] sampling:', getActiveConfig()?.sampling);

Common gotchas:

sampling.rates: { server: 5 } is 5%, which means roughly 95% of traces are never recorded.
Inheriting OTEL_TRACES_SAMPLER_ARG=0.01 from the environment via the OTel default sampler.
Your test happens to hit the unsampled branch — instrument with sampling: { rates: { server: 100 } } while reproducing.
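To judge how many requests a reproduction needs at a given head rate, a rough calculation helps. The sketch below assumes each request is sampled independently at the configured percentage, which is a simplification. Both helpers are hypothetical and not part of autotel.

```typescript
// Probability that at least one of `requests` independent requests is sampled,
// given a head sampling rate expressed as a percentage (0 to 100).
function chanceOfAtLeastOneSampled(ratePercent: number, requests: number): number {
  if (ratePercent < 0 || ratePercent > 100) {
    throw new RangeError('ratePercent must be between 0 and 100');
  }
  const p = ratePercent / 100;
  // Every request misses with probability (1 - p); subtract the all-miss case from 1.
  return 1 - Math.pow(1 - p, requests);
}

// Requests needed to reach a target confidence (between 0 and 1, exclusive)
// of seeing at least one sampled trace.
function requestsNeeded(ratePercent: number, confidence: number): number {
  const p = ratePercent / 100;
  if (p <= 0) return Infinity; // nothing is ever sampled
  if (p >= 1) return 1; // every request is sampled
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - p));
}
```

At a 5% rate, ten requests give only about a 40% chance of a sampled trace, and it takes 59 requests to be 95% sure of seeing one. Raising the rate to 100 while reproducing is usually the simpler option.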

To force sampling for one request, send a traceparent with the sampled flag set:

traceparent: 00-<traceid>-<spanid>-01

(-01 at the end = sampled.) autotel's parent-based sampler will respect it.
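A minimal sketch for generating such a header and checking the sampled bit, assuming the version-00 W3C Trace Context format shown above. The helper names are hypothetical, and `Math.random` is used only because these IDs are for a debugging session.

```typescript
// Random lowercase hex string of `bytes` bytes.
function randomHex(bytes: number): string {
  let out = '';
  for (let i = 0; i < bytes; i++) {
    out += ('0' + Math.floor(Math.random() * 256).toString(16)).slice(-2);
  }
  // All-zero trace and span IDs are invalid in W3C Trace Context; retry if hit.
  return /^0+$/.test(out) ? randomHex(bytes) : out;
}

// Build a version-00 traceparent whose trace-flags byte has the sampled bit set.
function sampledTraceparent(): string {
  return `00-${randomHex(16)}-${randomHex(8)}-01`;
}

// True when the header is well-formed and the sampled bit (0x01) is set.
function isSampled(traceparent: string): boolean {
  const m = /^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-([0-9a-f]{2})$/.exec(
    traceparent.trim(),
  );
  return m !== null && (parseInt(m[1], 16) & 0x01) === 0x01;
}
```

Send the result as the `traceparent` request header, then search the backend for the trace ID (the second field) to confirm the request was recorded.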

Step 3: Cloudflare Workers — `ctx.waitUntil`

The single biggest cause of missing spans on the edge: the response returned before the exporter flushed.

If you're using addEventListener('fetch', …) or a hand-rolled fetch in a module worker without wiring ctx.waitUntil(…) to the export call, async drains drop silently.

Fix — switch to defineWorkerFetch or wrapModule, both of which wire waitUntil automatically:

import { defineWorkerFetch } from 'autotel-cloudflare';

export default defineWorkerFetch(
  { service: { name: 'edge' } },
  async (request, env, ctx, log) => {
    // log.set / spans here are flushed via ctx.waitUntil, which keeps the
    // Worker alive until the export finishes, even after the response returns
    return new Response('ok');
  },
);
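If switching wrappers isn't an option, hand-wiring `ctx.waitUntil` in a module worker looks roughly like the sketch below. `withFlush` and the `flush` callback are hypothetical names (autotel's own flush entry point isn't shown in this document), and the handler types are kept generic so the snippet runs outside Workers.

```typescript
// Minimal shape of the Workers execution context used here.
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}

type Handler<Req, Env, Res> = (
  request: Req,
  env: Env,
  ctx: WaitUntilContext,
) => Promise<Res>;

// Wrap a module-worker fetch handler so the export is always handed to
// ctx.waitUntil, which keeps the Worker alive until the flush settles.
function withFlush<Req, Env, Res>(
  handler: Handler<Req, Env, Res>,
  flush: () => Promise<void>, // hypothetical: whatever drains your exporter
): Handler<Req, Env, Res> {
  return async (request, env, ctx) => {
    try {
      return await handler(request, env, ctx);
    } finally {
      // Runs on success and on throw. Flush errors are swallowed so they
      // never surface as unhandled rejections.
      ctx.waitUntil(flush().catch(() => undefined));
    }
  };
}
```

The `finally` block matters: error responses are often the spans you most want to see, so the flush is registered whether the handler returns or throws.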

Step 4: Processor pipeline

Print what's wired:

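import { trace } from '@opentelemetry/api';

// The global provider is a proxy; unwrap it to reach the SDK provider.
const proxy = trace.getTracerProvider() as any;
const provider = proxy.getDelegate?.() ?? proxy;
console.log('[autotel-debug] provider:', provider.constructor.name);
console.log(
  '[autotel-debug] processors:',
  // Private field: its name can differ between SDK versions, so undefined
  // here does not prove that no processors are registered.
  provider._registeredSpanProcessors?.map(
    (p: any) => p.constructor.name,
  ),
);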

Common issues:

A FilteringSpanProcessor excludes your span. Check the include / exclude predicates.
A TailSamplingProcessor dropped the trace (no error, no slow root, no debug header).
A composePostProcessors step returns [] for your span.

To bisect, temporarily strip post-processors:

init({
  service: 'my-app',
  exporter: { url: process.env.OTLP_ENDPOINT! },
  // no postProcessor, no tail sampler, no filter
});

If the span shows up now, add back the processors one at a time.
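While adding processors back, a pass-through "tap" can show which spans survive a given stage. This is only a sketch: `TapProcessor` is a hypothetical name, its methods mirror the OpenTelemetry `SpanProcessor` interface (`onStart`, `onEnd`, `forceFlush`, `shutdown`) using local stand-in types so the snippet has no dependencies, and how you register it depends on your autotel setup.

```typescript
// Local stand-in for the part of an ended span that the tap reads.
interface EndedSpanLike {
  name: string;
  spanContext(): { traceId: string; spanId: string };
}

// Pass-through processor: records every span that reaches onEnd and changes nothing.
class TapProcessor {
  readonly seen: string[] = [];
  private readonly label: string;

  constructor(label: string) {
    this.label = label;
  }

  onStart(_span: unknown, _parentContext: unknown): void {
    // Nothing to do at start; the tap only cares about finished spans.
  }

  onEnd(span: EndedSpanLike): void {
    this.seen.push(span.name);
    console.log(`[autotel-debug] ${this.label} saw`, span.name, span.spanContext().traceId);
  }

  forceFlush(): Promise<void> {
    return Promise.resolve();
  }

  shutdown(): Promise<void> {
    return Promise.resolve();
  }
}
```

The tap only tells you something if it sits downstream of the stage under suspicion. A processor registered alongside a filter, rather than behind it, will typically see every span regardless of what the filter drops.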

Step 5: Exporter

Tail the SDK's diagnostic log:

import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

Look for:

@opentelemetry/api: ... OTLPExporter: failed to send 4 traces, status: 401, error: ...

Common exporter errors:

Status	Meaning	Fix
`401`	Bad / missing auth header	Check `OTLP_HEADERS` / vendor token name
`403`	Token has no write scope	Issue a token with the right scope
`404`	Wrong endpoint URL	Check region (`api.honeycomb.io` vs `api.eu1.honeycomb.io`)
`413`	Batch too big	Lower `BatchSpanProcessor` `maxExportBatchSize`
`429`	Rate-limited	Reduce head/tail rates; honour `retry-after`
`502/503/504`	Upstream unhealthy	Often transient; add retries; check backend status
Network error	DNS / firewall	`curl -v <url>` from the same network
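The table separates statuses where a retry can help (429, 502/503/504) from those where it cannot (401, 403, 404, 413). A generic sketch of that decision, including `retry-after` handling, is below. It is not autotel's or the OTLP exporter's implementation; the function name and the `baseMs` / `maxMs` defaults are assumptions chosen for illustration.

```typescript
// Decide whether an export failure is worth retrying and how long to wait.
// Returns null for statuses where a retry cannot help (auth, bad URL, oversized batch).
function retryDelayMs(
  status: number,
  retryAfter: string | null, // raw retry-after header value, if any
  attempt: number, // 0 for the first retry
  baseMs = 500,
  maxMs = 30000,
): number | null {
  const retryable = status === 429 || status === 502 || status === 503 || status === 504;
  if (!retryable) return null;

  if (retryAfter !== null && retryAfter.trim() !== '') {
    // retry-after is either a number of seconds or an HTTP date.
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds)) {
      return Math.min(Math.max(seconds, 0) * 1000, maxMs);
    }
    const dateMs = Date.parse(retryAfter);
    if (!Number.isNaN(dateMs)) {
      return Math.min(Math.max(dateMs - Date.now(), 0), maxMs);
    }
  }
  // No usable header: exponential backoff, capped at maxMs.
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}
```

A `null` result is the signal to stop retrying and fix configuration instead: a 401 will fail the same way on every attempt.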

Step 6: Network / TLS

For self-hosted Collectors:

curl -v -X POST "$OTLP_ENDPOINT" \
  -H 'content-type: application/json' \
  -H "$AUTH_HEADER" \
  -d '{"resourceSpans":[]}'

Should return 200. If it doesn't, the problem is between you and the Collector — not autotel.
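Before blaming the network, it can be worth checking the endpoint string itself. The sketch below flags two common URL mistakes. `lintOtlpEndpoint` is a hypothetical helper. The `/v1/traces` check reflects the default OTLP/HTTP traces path; whether your exporter appends that path for you is not covered here, so treat that warning as a prompt to check rather than as a definite error.

```typescript
// Return human-readable warnings about an OTLP/HTTP traces endpoint URL.
function lintOtlpEndpoint(endpoint: string): string[] {
  const warnings: string[] = [];
  let url: URL;
  try {
    url = new URL(endpoint);
  } catch {
    return ['not a valid absolute URL'];
  }
  if (url.protocol !== 'https:' && url.protocol !== 'http:') {
    warnings.push(`unexpected protocol ${url.protocol}`);
  }
  // An encoded "?" means the query string became part of the path.
  if (/%3f/i.test(url.pathname)) {
    warnings.push('path contains %3F: the query string is URL-encoded into the path');
  }
  // OTLP/HTTP's default traces path is /v1/traces; ignore trailing slashes.
  if (!url.pathname.replace(/\/+$/, '').endsWith('/v1/traces')) {
    warnings.push('path does not end in /v1/traces');
  }
  return warnings;
}
```

An empty array means the URL passed these checks only; it says nothing about DNS, TLS, or auth, which the curl call above exercises.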

For Cloudflare Workers, run wrangler tail and look for OTLPExporter errors.

Step 7: Backend ingest — silent rejection

Some backends accept the request with a 200 but drop the events:

Honeycomb: dataset must exist and the API key must have write access to it. Mismatched key/dataset → silent drop.
Datadog: check service is set (resource attribute service.name) — they ignore spans without it.
Sentry: SDK version mismatch on envelope → 200 but events disappear.
Grafana Cloud Tempo: spans without service.name go to a fallback service called unknown_service.

For each backend, the dataset / index / project where you'd expect the span:

Backend	Where the span lands
Honeycomb	dataset = `service.name` (auto-created)
Datadog	`service:<name>` filter
Grafana Tempo	search by `traceId`
Jaeger	service dropdown = `service.name`
Sentry	project linked to the DSN

Step 8: Backend index lag

After a 200, expect ingestion lag of:

Backend	Typical lag
Honeycomb	< 5 s
Datadog	30–60 s
Grafana Tempo	10–30 s
Sentry	30–120 s
Self-hosted Jaeger	< 1 s

Don't conclude the span is missing until you've waited > 2× the expected lag.
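The 2× rule can be turned into a small lookup for scripted checks. The numbers below are the upper bounds from the table above; `minWaitSeconds` is a hypothetical helper.

```typescript
// Upper bound of the typical ingest lag per backend, in seconds.
const typicalLagSeconds: Record<string, number> = {
  'Honeycomb': 5,
  'Datadog': 60,
  'Grafana Tempo': 30,
  'Sentry': 120,
  'Self-hosted Jaeger': 1,
};

// Minimum seconds to wait before concluding a span is missing: 2x the upper bound.
function minWaitSeconds(backend: string): number {
  const lag = typicalLagSeconds[backend];
  if (lag === undefined) {
    throw new Error(`unknown backend: ${backend}`);
  }
  return lag * 2;
}
```

For Datadog this gives a wait of at least two minutes, and for Sentry four, before a missing span counts as evidence of a problem.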

Step-by-step checklist

[ ] Span shows in `debug: 'pretty'` stdout
[ ] Tracer is not a no-op: the Step 1 check prints `false` (SDK initialised)
[ ] Head rate is high enough to allow the request
[ ] Workers handler uses defineWorkerFetch / wrapModule
[ ] No post-processor / tail sampler / filter strips it
[ ] Exporter logs no 4xx/5xx
[ ] Curl to OTLP endpoint returns 200
[ ] Backend has the right service.name / dataset / project
[ ] Waited 2× expected ingest lag

When the trace partially shows up

Some spans land, some don't:

Trace context broken between services — outbound HTTP calls aren't propagating traceparent. Confirm autotel's global fetch instrumentation is on (instrumentation.instrumentGlobalFetch: true, default).
Async boundary loses context — a setTimeout / queue callback ran outside the AsyncLocalStorage scope. Wrap with trace() or use context.with().
Cross-runtime call — Node service → Workers → browser; verify traceparent arrives at each leg via response headers / network panel.
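To pin down where propagation breaks, record the `traceparent` value seen at each leg and compare trace IDs. A minimal sketch, assuming you have collected the header values by hand (from logs, response headers, or the network panel); the `Leg` type and both helpers are hypothetical.

```typescript
interface Leg {
  name: string; // e.g. "node-api", "edge-worker", "browser"
  traceparent: string | undefined; // header value observed at this leg
}

// Extract the 32-hex-character trace ID from a traceparent header, or null if malformed.
function traceIdOf(traceparent: string | undefined): string | null {
  if (traceparent === undefined) return null;
  const m = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/.exec(
    traceparent.trim(),
  );
  return m ? m[1] : null;
}

// Return the name of the first leg whose trace ID is missing or differs from
// the first leg's, or null when every leg shares one trace.
function findBrokenLeg(legs: Leg[]): string | null {
  if (legs.length === 0) return null;
  const expected = traceIdOf(legs[0].traceparent);
  if (expected === null) return legs[0].name;
  for (const leg of legs.slice(1)) {
    if (traceIdOf(leg.traceparent) !== expected) return leg.name;
  }
  return null;
}
```

The span ID is expected to change at every hop, so only the trace ID is compared. The leg this returns is the receiver that saw a wrong or missing header, which usually points at the caller just before it.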

When the SDK itself crashes

TypeError: Cannot read properties of undefined (reading 'startActiveSpan')

Usually means the API version (@opentelemetry/api) and SDK version (@opentelemetry/sdk-trace-base) drifted. Run:

pnpm why @opentelemetry/api

There should be exactly one resolved version. If there are two, dedup via pnpm.overrides.

Anti-patterns to fix as you debug

Anti-pattern	Why it loses spans
`init()` after the first import that uses tracing	Spans before `init()` are no-ops
`addEventListener('fetch', …)` on Workers	Service-worker syntax has no `ctx`; nothing flushes unless you wire `event.waitUntil` by hand
`OTLP_ENDPOINT` with its `?` URL-encoded as `%3F`	The query string, including any auth token in it, is parsed as part of the path
Importing both `@sentry/tracing` and `autotel`	Double-instrumentation eats spans
`process.exit(0)` immediately after the work	The exporter never flushed; call `await provider.shutdown()` first
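For the last row, a bare `await provider.shutdown()` can hang if the exporter is stuck, so scripts often bound the wait. A minimal sketch, assuming only that the provider exposes `shutdown(): Promise<void>`; the helper name and the 5-second default are assumptions.

```typescript
// Minimal shape of a tracer provider for this sketch.
interface Shutdownable {
  shutdown(): Promise<void>;
}

// Wait for the provider to flush and shut down, but give up after `timeoutMs`
// so a hung exporter cannot block process exit forever.
function shutdownWithTimeout(
  provider: Shutdownable,
  timeoutMs = 5000,
): Promise<'flushed' | 'timeout' | 'error'> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve('timeout'), timeoutMs);
    provider.shutdown().then(
      () => {
        clearTimeout(timer);
        resolve('flushed');
      },
      () => {
        clearTimeout(timer);
        resolve('error');
      },
    );
  });
}
```

Call it as the last step of the script, for example `shutdownWithTimeout(provider).then(() => process.exit(0))`. A `'timeout'` or `'error'` result is itself a clue and points back to Step 5.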