rr-debug

name: rr-debug description: Time-travel debugging with rr (Linux) or alternatives (macOS) - record test execution, replay deterministically, step backwards to find root cause of flaky tests and crashes license: MIT metadata: platform: linux,macos category: debugging

Time-Travel Debugging with rr

rr records program execution deterministically. When a test fails, you can replay it unlimited times, step backwards, and set watchpoints to find exactly when/where state changed.

Note: rr is Linux-only. For macOS, see the "macOS Alternatives" section at the end.

Setup (Linux)

# Install rr
apt install rr

# Enable perf events (required)
echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid

# Verify
rr --version

Quick Start - Valkyria Tests

# Record a single C test
RR=1 make test-c TEST=test_networking

# Record a single Valk test  
RR=1 make test-valk TEST=test/test_http_integration.valk

# Run until failure (for flaky tests)
make test-rr-until-fail TEST=test_networking MAX=100

# Replay the recording
rr replay

Recording Options

Basic Recording

# Record with default options
rr record ./build/test_networking

# Record with chaos mode (randomizes scheduling to expose races)
rr record --chaos ./build/test_networking

# Disable fork following (for forked test children)
VALK_TEST_NO_FORK=1 rr record ./build/test_networking

Recording Flaky Tests

# Loop until failure, keeping the failing recording
for i in {1..100}; do
  echo "Attempt $i..."
  if ! rr record --chaos ./build/test_networking; then
    echo "Failed on attempt $i - recording saved"
    break
  fi
done

Replay and Debug

Start Replay

# Replay most recent recording
rr replay

# Replay specific recording
rr replay ~/.local/share/rr/test_networking-42/

# Replay with specific GDB
rr replay -d /usr/bin/gdb

Essential GDB Commands in rr

# Run forward to crash/end
continue

# Run BACKWARDS to previous stop
reverse-continue

# Step backwards (reverse single step)
reverse-step
reverse-stepi
reverse-next
reverse-nexti

# Go to specific event number
run 12345

# Show current event number  
when

# Set hardware watchpoint (stops when value changes)
watch -l some_variable
# Then reverse-continue to find who changed it

# Find when a memory location was written
watch *(int*)0x7fff5678
reverse-continue

# Checkpoints (save/restore points in time)
checkpoint
restart 1

Common Debugging Workflows

Find Who Corrupted Memory

# 1. Run to crash
continue

# 2. Inspect corrupted state
print *corrupted_ptr

# 3. Set watchpoint on the memory
watch -l *corrupted_ptr

# 4. Go backwards to find the write
reverse-continue

# 5. You're now at the exact instruction that corrupted it
bt

Find Race Condition

# 1. Run to failure point
continue

# 2. Note the unexpected value
print shared_counter

# 3. Watch the variable
watch -l shared_counter

# 4. Reverse to each modification
reverse-continue  # First write
reverse-continue  # Second write (the race!)
info threads
bt

Find Deadlock

# 1. Interrupt hung process
Ctrl+C

# 2. See all threads
info threads
thread apply all bt

# 3. Find lock state
print mutex_variable

# 4. Watch the lock
watch -l mutex_variable.__data.__lock

# 5. Reverse to see acquisition order
reverse-continue

rr Tips

Performance

rr adds ~1.5x slowdown (much less than Valgrind)
Recording size: ~1GB per minute of execution
Keep recordings small: test minimal reproducer

Limitations

Linux only (no macOS, no Windows)
x86/x86_64 only (no ARM)
No GPU/hardware access
Some syscalls not supported

Chaos Mode

# Randomize scheduling to expose races
rr record --chaos ./test

# More aggressive chaos
rr record --chaos --wait ./test

Environment Variables

# Disable ASLR (sometimes needed)
setarch $(uname -m) -R rr record ./test

# Record with specific CPU features disabled
rr record --disable-cpuid-features=0xffffffff ./test

Valkyria-Specific Notes

Fork-Based Tests

The test framework forks for isolation. rr follows forks by default, but this can complicate debugging:

# Option 1: Disable fork in tests
VALK_TEST_NO_FORK=1 rr record ./build/test_networking

# Option 2: Let rr follow forks (default)
rr record ./build/test_networking
# In replay, switch to child process:
rr replay -f <child-pid>

GC/Threading Issues

For GC coordination bugs:

# Watch GC phase transitions
watch -l valk_gc_coord.phase
reverse-continue

# Watch thread pause count
watch -l valk_gc_coord.threads_paused

Async/Event Loop Issues

For libuv event loop bugs:

# Break on event loop iteration
break uv_run

# Watch pending handle count
watch -l loop->active_handles

Recordings Location

# Default location
ls ~/.local/share/rr/

# List recordings
rr ls

# Delete old recordings
rr rm test_networking-42
rr rm --all  # Delete everything

Integration with CI

For flaky tests in CI, record failures for later analysis:

#!/bin/bash
# ci-test-with-rr.sh

if ! rr record --chaos ./build/$TEST; then
  # Upload recording as artifact
  tar czf "rr-$TEST-$(date +%s).tar.gz" ~/.local/share/rr/latest-trace/
  exit 1
fi

macOS Alternatives

macOS doesn't have rr. Use these approaches instead:

Core Dumps with LLDB

# Enable core dumps
sudo sysctl kern.coredump=1

# Core dumps go to /cores/
ls -la /cores/

# Debug a core dump
lldb -c /cores/core.12345 build/test_networking

Run Until Failure + LLDB

# Loop until failure
for i in {1..100}; do
  echo "Iteration $i..."
  build/test_networking || { echo "Failed on $i"; break; }
done

# Then debug interactively
lldb -- build/test_networking

Instruments for Profiling

# Time Profiler (find CPU hotspots)
xcrun xctrace record --template 'Time Profiler' --launch -- build/test_networking

# System Trace (syscalls, scheduling)
xcrun xctrace record --template 'System Trace' --launch -- build/test_networking

# Open in Instruments.app
open *.trace

LLDB Watchpoints (Manual Time-Travel)

lldb -- build/test_networking

# Set watchpoint on variable
watchpoint set variable some_global

# When it triggers, examine state
bt
frame variable

# Unfortunately no reverse-continue on macOS
# But you can re-run with conditional breakpoints
breakpoint set -n some_function -c 'counter > 50'

Address Sanitizer

ASAN works on both Linux and macOS and is often sufficient:

make test-c-asan TEST=test_networking
make test-asan-abort  # Generates core dump on failure