name: analyze-crash
description: Native crash log analysis for dd-trace-py
argument-hint:
disable-model-invocation: false
context: fork
agent: general-purpose
allowed-tools:
- Bash
- Read
- Grep
- Glob
- TodoWrite
- WebSearch
Native Crash Stack Trace Analysis for dd-trace-py
You are analyzing a native crash captured by dd-trace-py's crashtracker. Perform a comprehensive investigation to help engineers understand and triage the crash. The crashtracker captures all process-level native crashes, so the crashing code may or may not be in dd-trace-py itself — investigate the code and patterns thoroughly before making the attribution call.
Input Processing
The user will provide crash data in one of two ways:
Option A: Crash UUID (preferred)
The user provides a crash UUID from Datadog crash telemetry. Use the fetch script bundled with this skill to retrieve and process the full crash log:
cd .claude/skills/analyze-crash && dd-auth .venv/bin/python fetch_crash.py <UUID> [--tracer-version X.Y.Z] [--days 14]
This fetches the crash log from Datadog, processes the stack frames (demangling C++ symbols,
resolving addresses against /proc/self/maps), and outputs structured JSON with:
- Metadata: tracer version, runtime version, service, org, platform, architecture, signal
- Processed stack: each frame with demangled symbol, binary file, offset, source file/line
- Binary mappings: summary of loaded shared objects
- Datadog log link: direct URL to the crash event
If the user provides only a UUID with no other context, ask:
What is the crash UUID? Any additional filters (tracer version, org ID, language)?
Setup (one-time): run cd .claude/skills/analyze-crash && uv sync to create the venv and
install dependencies.
Authentication: use dd-auth as a command wrapper — it injects Datadog API credentials
into the environment for the wrapped process.
Option B: Pasted crash log or stack trace
The user pastes a raw crash log or stack trace directly. The format is typically:
0x7f6e81c51fb7 PyUnicode_AsUTF8AndSize
0x7f6e769d7850 pycbc::Connection::open_bucket(_object*)
0x7f6e81ccd60c
0x7f6e81c53239 _PyObject_Call
...
Frames may have: hex address only, hex address + symbol, or symbol + .so name + offset. Parse
all available information. If the paste includes full crash JSON (with error.stack and
files./proc/self/maps), extract the metadata and processed stack from it.
Determine the source revision
The crash may come from a released wheel or a branch other than main. GitHub links must point
at the exact revision that built the crashing binary so that line numbers match.
Ask the user: Before starting analysis, ask:
Do you know the commit SHA or release tag for the dd-trace-py version that crashed? (e.g.,
v2.18.0,abc1234). If not, I'll usemain.
Resolution order (use the first that succeeds):
- User-provided value — commit SHA or tag supplied in the prompt or in response to the
question above. Use it verbatim as
GIT_REF. - Crash metadata — if the crash was fetched via UUID and includes a
tracer_versionfield (e.g.,2.18.0), try the corresponding git tag (v2.18.0). Verify it exists:git rev-parse --verify "v2.18.0" 2>/dev/null. - Version string in the stack trace — if the wheel filename or ddtrace version appears in
the trace (e.g.,
ddtrace-2.18.0), try the corresponding git tag. - Current checkout — if the local repo is not on
main, usegit rev-parse HEADto get the current SHA. - Fallback — use
main.
Store the resolved value as GIT_REF and use it in all GitHub links for the rest of the
analysis. State which ref you are using and why at the top of the report (in Additional Context).
Analysis Workflow
GitHub Link Generation
When referencing files in dd-trace-py, always provide clickable GitHub links using the resolved
GIT_REF (see "Determine the source revision" above).
Format:
[filename:line](https://github.com/DataDog/dd-trace-py/blob/{GIT_REF}/path/to/file#Lline)
Examples (assuming GIT_REF=v2.18.0):
- Single line:
[_threads.cpp:226](https://github.com/DataDog/dd-trace-py/blob/v2.18.0/ddtrace/internal/_threads.cpp#L226) - Range:
[_threads.cpp:226-252](https://github.com/DataDog/dd-trace-py/blob/v2.18.0/ddtrace/internal/_threads.cpp#L226-L252)
Path construction: base URL is https://github.com/DataDog/dd-trace-py/blob/{GIT_REF}/,
then append the repo-relative path (strip any build-specific prefixes like /home/runner/work/,
/usr/local/lib/, etc.).
Phase 1: Parse & Classify Stack Frames
Extract all stack frames and classify each one using the categories below.
Frame Classification
CPython Runtime — the Python interpreter itself:
.soname:python3.X,libpython3.X.so,libpython3.X.so.1.0- Symbols:
_PyEval_EvalFrameDefault,PyGILState_Ensure,take_gil,new_threadstate,PyEval_RestoreThread,PyEval_SaveThread,Py_RunMain,PyObject_Call*,_Py_*,PyThread_exit_thread,_PyRuntime,_PyThreadState_*
dd-trace-py Native (_threads) — periodic thread / GIL management:
.soname:_threads.cpython-*.so- Symbols:
PeriodicThread_start,PeriodicThread_stop,PeriodicThread_join,PeriodicThread__periodic,PeriodicThread__before_fork,PeriodicThread__after_fork,PeriodicThread__on_shutdown,PeriodicThread_awake,PeriodicThread_dealloc,GILGuard,AllowThreads,PyRef,execute_native_thread_routine
dd-trace-py Native (_native / Rust) — crashtracker, data pipeline, tracer utilities:
.soname:_native.cpython-*.so(underddtrace/internal/native/)- Symbols:
crashtracker_*,TraceExporterPy::send,pyo3::*,rust_begin_unwind,rust_panic,libdd_crashtracker::*,datadog_profiling_ffi::*,ffe::*,ddsketch::*
dd-trace-py Native (_memalloc) — memory profiler:
.soname:_memalloc.cpython-*.so- Symbols:
memalloc_*,Datadog::Sample*(when called from memalloc context)
dd-trace-py Native (_ddup / _stack / libdd_wrapper) — profiling upload/sampling:
.sonames:_ddup.cpython-*.so,_stack.cpython-*.so,libdd_wrapper.cpython-*.so- Symbols:
ddup_*,Datadog::Profile*,Datadog::Sample*,Datadog::Uploader*,Datadog::Sampler*,Datadog::EchionSampler*,Datadog::StackRenderer*,echion_*
dd-trace-py IAST/AppSec Native — security instrumentation:
.soname:_native.cpython-*.so(underddtrace/appsec/),_stacktrace.cpython-*.so- Symbols:
taint_*,TaintEngine*,tainted_ops::*
dd-trace-py Cython — Python-level profiling helpers:
.sonames:_threading.cpython-*.so,_exception.cpython-*.so,_sampler.cpython-*.so,_lock.cpython-*.so,_task.cpython-*.so,_encoding.cpython-*.so- Symbols: Cython-generated (often
__pyx_*prefixes or module function names)
Third-Party Native (NOT dd-trace-py) — known third-party libs:
libev-*.so/libev.so.4— gevent's embedded libev (NOT dd-trace-py)_native__lib.cpython-*.so— Pyroscope's native extension (path containspyroscope/) Note: Pyroscope crashes appear in dd-trace-py crash telemetry because the crashtracker captures all process-level crashes._native__lib(double underscore, inpyroscope/) is ALWAYS Pyroscope, never dd-trace-py. dd-trace-py's module is_native(single underscore, inddtrace/internal/native/).anyhow::error::object_drop,py_spy::*,goblin::*— Pyroscope/py-spy Rust code- Customer application code, third-party Python extensions
OS/libc — system libraries:
libc.so.6,libpthread.so,libm.so- Symbols:
gsignal,abort,pthread_exit,pthread_mutex_lock,__clone3,start_thread,cfree,free,malloc
Create a classification table:
| # | Type | Symbol / Location | Description |
|---|---|---|---|
| 0 | {type} | {symbol} | {brief description} |
| ... |
Phase 2: Locate dd-trace-py Source Code
For each dd-trace-py frame that can be mapped to source code:
Map
.soname to source file:_threads.cpython-*.so→ddtrace/internal/_threads.cpp_native.cpython-*.so(inddtrace/internal/native/) →src/native/(Rust)_memalloc.cpython-*.so→ddtrace/profiling/collector/_memalloc.cpp_ddup.cpython-*.so→ddtrace/internal/datadog/profiling/ddup/_stack.cpython-*.so→ddtrace/internal/datadog/profiling/stack/libdd_wrapper.cpython-*.so→ddtrace/internal/datadog/profiling/dd_wrapper/
Find the function in the source file using Glob/Bash grep.
Read code with context: find the function body, read 10-15 lines before the relevant line, mark the crash point, read 5-10 lines after. Show enough to understand what the code is doing.
Format:
### Frame N: {symbol} ([{file}:{line}](GitHub link))
```cpp
// {file}:{start}-{end}
{code with relevant line marked: // >>> CRASH POINT <<<}
```
**Analysis**: {what this code does and why it may have failed}
Phase 3: Apply General Crash Heuristics
dd-trace-py maintains two complementary native code safety references:
docs/native-code-review.md— full incident-derived rules with historical PR links (10 sections).cursor/rules/native-code.mdc— compact triage checklist with trigger symbols and stop conditions
Read both files now. Use their framework to classify the crash. The categories below summarize the key stack-trace signals; refer to the docs for full rationale and PR history.
3.1 — GIL Lifecycle During Finalization
Stack signals: PyGILState_Ensure, PyEval_RestoreThread, take_gil, new_threadstate,
PyThread_exit_thread, pthread_exit, abort, gsignal, __forced_unwind,
abi::__cxa_throw, std::terminate — especially combined with thread-entry frames like
execute_native_thread_routine or Rust extern "C" boundaries.
What to check:
- Is there a
py_is_finalizing()/is_finalizing()check immediately before each GIL acquire/restore call? These checks are inherently TOCTTOU — finalization can begin in the window between the check and the call. - Does the crash involve C++ RAII unwinding?
pthread_exiton glibc throwsabi::__forced_unwind; any C++ destructor orcatch(...)that does not re-throw it will causestd::terminate→SIGABRT. On musl (Alpine Linux),pthread_exituseslongjmpinstead —catch(abi::__forced_unwind&)never fires. - For Rust/pyo3 code:
__forced_unwindcannot propagate throughextern "C"boundaries. Rust code that callsPyEval_RestoreThread(or usespy.detach()) needs its own finalization check — the C++ caller'stry/catchcannot protect it.
CPython version behavior (from docs/native-code-review.md):
| CPython | Behavior when take_gil() detects finalization |
|---|---|
| 3.12 (all) | PyThread_exit_thread() → pthread_exit() → crash |
| 3.13.0–3.13.7 | Same as 3.12 |
| 3.13.8+ | PyThread_hang_thread() → pause() forever (hang, not crash) |
| 3.14+ | Hangs; PyGILState_Ensure itself hangs safely in 3.15+ |
Key source locations: ddtrace/internal/_threads.cpp (GILGuard, AllowThreads,
PeriodicThread_start), ddtrace/internal/threads.py (atexit handler),
src/native/data_pipeline/mod.rs (Rust pyo3 GIL release/restore).
3.2 — Fork Safety
Stack signals: pthread_mutex_lock, pthread_cond_wait, __lll_lock_wait deep in a
thread that recently went through _after_fork, __clone3 / start_thread in a crash that
occurs shortly after a fork event.
What to check:
- Any lock, mutex, or condition variable held by a parent thread is in an undefined state in
the child after
fork(). Child path must recreate primitives; parent path must clear them (other threads still hold references). - Were all periodic threads stopped and joined before the fork? Was
_before_fork()called? - Did a thread fail to stop (blocked in I/O) and get inherited into the child while still holding a lock?
Key source locations: ddtrace/internal/_threads.cpp (_before_fork, _after_fork,
safe_reset_thread), ddtrace/internal/threads.py (_before_fork, _after_fork_child).
3.3 — Object Lifetime Across Thread Boundaries (Use-After-Free)
Stack signals: SIGSEGV or SIGBUS at a CPython refcount operation (Py_INCREF,
Py_DECREF, _Py_Dealloc), or inside a PyObject* dereference inside a native thread
shortly after the thread started or while it was shutting down.
What to check:
- Was
Py_INCREFcalled beforestd::threadcreation? If it happens inside the lambda, there is a window where the object can be freed before the thread increments the refcount. - Is
Py_DECREFcalled after the thread has signaled completion (e.g., afterstopped_event->set())? On CPython 3.14+,_Py_Deallocdereferenceststateimmediately —Py_DECREFwith a NULL tstate (during finalization) crashes. - Does
PyRefcorrectly defer the decrement until outside the finalization window?
Key source locations: ddtrace/internal/_threads.cpp
3.4 — Memory Safety in Allocator Hooks
Stack signals: Crash inside _memalloc.cpython-*.so with CPython allocation functions
(PyMem_Malloc, PyObject_Malloc) or frame-walking functions (PyThreadState_GetFrame,
PyFrame_GetBack, PyFrame_GetCode, PyUnicode_AsUTF8AndSize) in the trace — especially
if the crash is a heap corruption SIGABRT or a SIGSEGV inside a seemingly unrelated function.
What to check:
- Is the crashing code executing inside a
malloc/free/reallochook? Any Python C API call that can allocate or trigger GC is unsafe there — it causes reentrant hook calls that corrupt internal state. - Safe functions inside hooks:
PyThread_get_thread_ident(),Py_INCREF, direct struct field reads. Unsafe:Py_DECREF,PyFrame_GetBack(),PyUnicode_AsUTF8AndSize(),PyObject_CallObject(). - Is GC explicitly disabled? Verify it is actually compiled in (a previous fix was silently
dead code due to a missing
#include).
Key source locations: ddtrace/profiling/collector/_memalloc.cpp,
ddtrace/profiling/collector/_memalloc_tb.cpp
3.5 — FFI Exception and Unwind Propagation
Stack signals: rust_begin_unwind, rust_panic, __rust_start_panic in frames from
_native.cpython-*.so; or abi::__cxa_throw / __cxa_rethrow crossing between .so
files; or std::terminate called from inside Rust code.
What to check:
abi::__forced_unwind(frompthread_exit) cannot propagate through Rustextern "C"boundaries — Rust aborts on foreign exceptions.- Does any Rust function call
PyEval_RestoreThreador usepy.detach()without a finalization check? It needs its own guard; the surrounding C++try/catchcannot help. - Does a C++
catch(...)exist that swallows__forced_unwindwithout re-throwing? That callsstd::terminate. - Does a native function call
getenv()or (in Rust)std::env::var()? These are not thread-safe when another thread modifies the environment concurrently — can cause SIGSEGV.
Key source locations: src/native/data_pipeline/mod.rs, ddtrace/internal/_threads.cpp
(thread lambda catch blocks).
3.6 — Unbounded Loops and Recursion
Stack signals: Deep or repeating identical frames (same function appearing many times),
deadlock-style hangs with a lock held (visible in thread dumps as pthread_mutex_lock with
no progress), or SIGABRT from std::terminate after a recursive call depth is exceeded.
What to check:
- Is there a loop over a CPython internal linked list (interpreters, threads, greenlets) with no upper bound? Corrupted or cyclic structures cause infinite loops.
- Can any function called from instrumentation code acquire a lock that is also instrumented (e.g., a lock profiler that logs, logging that acquires a lock)?
- Is there an explicit iteration limit?
Key source locations: Any loop over interp->next, greenlet chains, or stack chunks in
ddtrace/internal/datadog/profiling/stack/.
3.7 — Cython Silent Exception Swallowing
Stack signals: Incorrect/unexpected behavior rather than a hard crash; or a crash inside
a Cython-generated .so (__pyx_* frames) where a NULL pointer is dereferenced after a
function returned 0/NULL with no exception set.
What to check:
- Does every
cdeffunction that can raise declareexcept *,except -1, orexcept? -1? Without this, exceptions are silently discarded and the caller proceeds with corrupt state. - Do
nogilblocks avoid all Python C API calls andPyObject*access? Only pure C/C++ operations are safe insidenogil— any Python API call there is undefined behaviour. - Are
PyUnicode_AsUTF8AndSizeresults stored asstring_view/ rawconst char*while the source Python object may have been garbage collected?
Key source locations: ddtrace/profiling/collector/_exception.pyx,
ddtrace/profiling/collector/_lock.pyx, ddtrace/internal/_encoding.pyx.
3.8 — Stripped Binary Misattribution
Stack signals: Symbol names that don't match the .so they appear in, or a crash at an
address that doesn't correspond to any named symbol in the frame (address-only frames at the
top of a crashing .so).
What to check:
- In stripped release builds, the topmost named frame may be the next exported symbol after the real crash site, not the function that actually crashed. The true crash may be in an inlined or stripped function above it.
- When comparing a crash address against source, verify against the pinned library version
(
src/native/Cargo.tomlfor libdatadog,setup.pyfor libddwaf) — not the latest. API and symbol layouts differ between versions. - If the reported function seems inconsistent with the rest of the stack, treat it as approximate and look at the surrounding frames for the true call site.
Phase 4: Attribute the Crash
Now that you have investigated the source code (Phase 2) and matched against known patterns (Phase 3), determine attribution: Is this a dd-trace-py bug?
Use these signals:
Frame ownership: Are any dd-trace-py native frames in the crashing thread? If the only dd-trace-py frames are in other threads (not the crashing one), or absent entirely, dd-trace-py may not be the root cause.
Known third-party library:
_native__lib.cpython-*.soin apyroscope/path → Pyroscope buglibev-*.soin a gevent-using app → gevent/libev interaction- Customer
.so(e.g.,pycbc,grpc) → third-party extension bug
Pattern match from Phase 3: Did the crash match a known dd-trace-py pattern?
Code evidence from Phase 2: Does the source code reveal a bug in dd-trace-py, or does it show dd-trace-py code operating correctly while a third-party or CPython component fails?
Web search for upstream issues: If the crash appears to involve CPython internals or a third-party library, use
WebSearchto check whether this is a known upstream bug:- Search for the crashing symbol + "crash" or "segfault" + the library/CPython version
- Check the CPython issue tracker (
github.com/python/cpython/issues) - Check the third-party library's issue tracker
- Search for related GitHub issues in
DataDog/dd-trace-py
Example searches:
"take_gil" crash CPython 3.12 pthread_exit site:github.com/python/cpythonPyThread_exit_thread finalization crash cpython issuesite:github.com/DataDog/dd-trace-py/issues SIGSEGV _memalloc
State the attribution clearly: dd-trace-py bug, third-party bug (name the library), CPython bug (cite the upstream issue if found), interaction/race between components, or unknown.
Phase 5: Reconstruct the Crash Flow
Build a narrative explaining the execution flow leading to the crash:
- Thread identity: What kind of thread is this? (main thread, PeriodicThread, OS thread)
- Entry point: Where did execution start? (thread start function, signal handler, etc.)
- Key operations: What was the thread trying to do?
- Critical transitions: Where did control flow between components?
- Failure point: Where and why did it crash?
- Crash type: null dereference, use-after-free, race condition (TOCTTOU),
abortviapthread_exit/C++ unwinding, double-free, assertion failure, etc.
Write this as a clear step-by-step narrative. Focus on HOW the crash happened based on the stack trace and code evidence. Do not prescribe specific fixes.
Phase 6: Find Related Code and History
Use Bash + git to find context. When GIT_REF is not main, scope history queries to that
ref so results reflect the code that actually shipped in the crashing binary.
- Recent commits to affected files (up to the crashing revision):
git log --oneline -10 {GIT_REF} -- {file} - Search for relevant changes:
git log --grep="GIL" --grep="finalize" --grep="crash" --oneline -20 {GIT_REF}(use keywords relevant to the crash area) - For each relevant commit, extract PR numbers from the message (e.g.,
(#7659)) and link:https://github.com/DataDog/dd-trace-py/pull/{number} - Grep for
AIDEV-comments near the crash site:grep -n "AIDEV" {file}— these are embedded rationale notes left specifically for AI and developers explaining non-obvious invariants. Read any that are near relevant functions.
Prioritize PR links — PRs have descriptions and discussion context that commit messages lack.
The historical examples in docs/native-code-review.md also contain PR links for past crashes
of each type — cross-reference them when a heuristic from Phase 3 matches.
Output Format
# Crash Analysis Report
**Generated**: {ISO 8601 timestamp}
**Source ref**: [`{GIT_REF}`](https://github.com/DataDog/dd-trace-py/tree/{GIT_REF}) {— reason, e.g. "from crash metadata v2.18.0" or "fallback to main"}
## Crash Metadata
{If fetched via UUID, include this section. Omit if the user pasted a raw stack trace.}
| Field | Value |
|-------|-------|
| Tracer version | {tracer_version} |
| Runtime version | {runtime_version} |
| Service | {service} |
| Platform / Arch | {platform} / {architecture} |
| Signal | {signal, e.g. SIGSEGV, SIGABRT} |
| Org ID | {org_id} |
| Log link | [{log_id}]({log_url}) |
## Executive Summary
{2-3 sentences: what crashed, which component, what the crashing thread was doing.
De-mystify the hex addresses — translate them to human-readable description of the crash.}
## Stack Trace Classification
| # | Type | Symbol / Location | Description |
|---|------|------------------|-------------|
| 0 | {type} | {symbol} | {brief desc} |
| ... | | | |
{Note any other threads if provided, but focus on the crashed thread.}
## Code Context
{For each critical dd-trace-py frame:}
### Frame N: {symbol} ([{file}:{line}](GitHub link))
```cpp
// {file}:{start}-{end}
{code with // >>> CRASH POINT <<< marker}
```
**Analysis**: {what the code does and why it crashed}
## Pattern Match
{If a known pattern from Phase 3 matched, name it and explain why.
If no known pattern matched, say so.}
## Attribution
**{dd-trace-py bug | third-party bug ({library name}) | CPython bug ({link to issue if found}) | interaction/race | unknown}**
{2-3 sentence justification referencing evidence from code investigation, pattern matching,
and web search results. Explain WHY this attribution was chosen.}
## Crash Flow Reconstruction
{Step-by-step narrative}
**Crash type**: {e.g., TOCTTOU race → null dereference in new_threadstate}
**How it happened**: {clear explanation based on stack trace + code}
## Related Code
**Relevant PRs and commits**:
- [#{number}](https://github.com/DataDog/dd-trace-py/pull/{number}): {title} — {why relevant}
**Upstream issues** (if any):
- [{issue title}]({url}) — {relevance}
**Related code locations**:
- [{file}:{line}](GitHub link) — {description}
## Additional Context
{Environment details, Python version, platform, feature flags in use, or other relevant background.}
---
*Analysis generated by Claude Code /analyze-crash*
*This analysis is intended for triage. Engineers should review before acting on it.*
Output File Management
- Create output directory:
mkdir -p .claude/skills/analyze-crash/reports - Generate filename:
crash-analysis-{YYYYMMDD-HHMMSS}.md - Save file: Write the markdown analysis to
.claude/skills/analyze-crash/reports/{filename} - Tell the user the path where the analysis was saved
Important Guidelines
- Investigate before attributing: The crashtracker captures all process crashes — many will be in third-party code (Pyroscope, customer extensions, gevent). Investigate the code and apply heuristics (Phases 2-3) before making the attribution call (Phase 4). The evidence you gather significantly improves attribution accuracy.
- Native frames are opaque: Stack frames often have only a hex address and symbol name, no
file:line. Use the symbol name +
.soname to classify and find source. _native__libvs_native:_native__lib(double underscore, inpyroscope/) is ALWAYS Pyroscope._native(single underscore, inddtrace/internal/native/) is dd-trace-py. Never confuse them.- Focus on triage and understanding: Explain HOW the crash happened; do not prescribe fixes.
- Describe crash types: Identify the category (null deref, race, abort, use-after-free) but stop before suggesting specific code changes.
- Continue with missing info: If a symbol can't be found in source, note it and continue.
- Mark uncertainties: If something is speculative, say so explicitly.
- Always include GitHub links: Every file:line reference should be a clickable link using the
resolved
GIT_REF— never hard-codemainunless it is the resolved ref. - Prefer PR links over commits: Extract
(#NNNN)from commit messages and link to the PR.
Now Analyze
Obtain the crash data (fetch by UUID or parse the pasted log), resolve the source revision, and follow the workflow above to generate a comprehensive crash analysis.