kernel-debug-orchestrator - SKILL.md Agent Skill

name: kernel-debug-orchestrator description: Orchestrate iterative Linux kernel debugging in learn_os style workspaces by chaining QEMU bring-up, targeted instrumentation, build, test execution, and evidence-driven next-iteration decisions. Use this when tasks are kernel debugging loops (for example writeback/f2fs/fscrypt crashes), and you need a repeatable cycle that depends on f2fs-qemu-agent-pipeline plus kernel-log-instrumentor.

Kernel Debug Orchestrator

Use this skill for end-to-end kernel debug loops where you must repeatedly:

start/verify guest runtime,
add or refine kernel logs,
build,
run reproduction tests,
analyze logs,
update instrumentation and repeat.

Hard dependencies (fixed)

Always load these skills in this order:

f2fs-qemu-agent-pipeline
kernel-log-instrumentor

Optional third skill based on test type:

xfstests-qga-ubuntu when the reproduction is xfstests/QGA-only.

Do not skip the first two dependencies for this skill.

Scope and assumptions

Workspace resembles learn_os with .vars.sh and QEMU helpers under .agents/tools/.
Kernel source is a git repo (for temporary debug branch + one reversible commit).
Guest access may use SSH or QGA; choose per f2fs-qemu-agent-pipeline policy.

Workflow Contract

Main Workflow

Normalize context and identify the source tree, O= build tree, target subsystem, and reproducer.
Verify runtime prerequisites that gate the target path before spending time on workload debugging.
Decide whether this round is sample-first, static-first, or log-first, and record that choice in the iteration output.
Apply or refine instrumentation, then build the changed kernel.
Run the reproducer and capture bounded evidence.
Convert long-running status and notable transitions into compact event-layer artifacts before broad analysis.
Correlate evidence and choose the smallest next iteration.

Decision Table

Phase	Trigger / Symptom	Action	Verify	On Failure	Workflow Effect
Preflight	Repro depends on post-write fs-verity enablement, verity reads, or verity-tagged writeback coverage	In the guest, verify `CONFIG_FS_VERITY=y` from `/proc/config.gz` or `/boot/config-*`, then run a fresh plain-file `f2fs_io set_verity` probe on a fresh `mkfs.f2fs -O encrypt,verity` image mounted with the intended options	Guest config shows `CONFIG_FS_VERITY=y` and the plain-file probe succeeds	Treat the run as blocked by kernel config/runtime prerequisites; enable the config in the host build output, rebuild, reboot, and re-run preflight before touching workload logic	block

Output Contract

phase reached:
evidence mode:
decision path taken:
verification evidence:
checkpoint path:
fallback used:
unresolved blocker:
next workflow step:

Orchestration loop

Step 0: Normalize context

Source .vars.sh.
Identify source tree (e.g. $BASE/f2fs) and O= build tree (e.g. $BASE/f2fs_upstream).
Confirm target function(s), subsystem, and reproduction script.

Step 1: Boot and verify VM readiness

Start QEMU in non-blocking/reusable way per f2fs-qemu-agent-pipeline.
Verify process, control plane reachability (QGA/SSH), required mounts, and test paths.
Record command + status + evidence path.

Step 2: Instrumentation planning

Use kernel-log-instrumentor rules:

create a temporary debug branch,
keep logs in one commit,
default to pr_debug + dynamic_debug,
include __func__ and stable log prefix.
if the user needs to follow one shared object across many threads, explicitly switch to table-friendly k=v logging and plan the query commands up front.

Guardrail: new log lines must not introduce fresh pointer dereference risk. Prefer printing raw pointers/flags/scalars first.

Step 3: Apply instrumentation

Patch minimal callsites around suspected failure path (entry, branch decision, error path, state transition).
For concurrency/state counters, log before/after updates with clear labels.

Step 4: Build verification

Object-level checks first (for changed .c files, use kobj from .vars.sh).
Then full image build (bash $SCRIPT/make_upstream.sh).
Report exact build log path and pass/fail.

Pixel/Slider variant:

If the workspace uses private/google-modules/soc/gs/build_slider.sh, run it from the pixel/ repo root, not from private/google-modules/soc/gs/.
Reason: the script executes tools/bazel via a cwd-relative path; invoking it from the subdirectory fails before compilation with tools/bazel: No such file or directory.
If using a repo-root lane wrapper such as ./build_slider.sh --lane my_dec, first run ./build_slider.sh --lane my_dec --dry-run and record workspace=, common=, output_root=, and dist=.
For my_dec, the source should resolve through out/workspaces/slider_my_dec/common -> common_my_dec, while final images land under out/workspaces/slider_my_dec/out/slider/dist, not the root out/slider/dist.
If Bazel dies during server startup with channel not registered to an event loop, classify it as a build-environment blocker first, not a source compile result. Capture the out/bazel/.../server/jvm.out path in the report.
For compile-only sanity checks, prefer tools/bazel build ... (not build_slider.sh, which is a bazel run wrapper). Reference: references/pixel-slider-bazel-build.md.
For “make a new Kconfig symbol land in boot.img via Kleaf fragments”, follow: references/pixel-kleaf-config-fragment-bootimg.md.

Step 5: Run reproduction

Support two reproduction modes:

Existing script mode:
- run the known test script (e.g. rw_matrix.sh) with deterministic env.
On-demand script mode:
- create a minimal one-off reproducer under $TEST when no usable script exists.

For long tests via QGA:

redirect output to guest file,
tail/log-scan separately,
never equate QGA timeout with test failure until process/log state is checked.
if the run is long enough to survive context compression or handoff, write a compact checkpoint before expanding analysis.

Step 6: Collect and correlate evidence

Collect at minimum:

test stdout/stderr log,
kernel console log (guest_console.log),
filtered debug lines by stable prefix,
first failure stack trace and surrounding window.

Correlation requirement:

identify the last N debug lines before first Oops/BUG,
map to function + decision branch + key state fields.
when logs are table-friendly, query them as structured rows before doing free-form reading:
- generic: python3 /home/nzzhao/.agents/skills/kernel-log-instrumentor/scripts/kernel_log_kv_query.py <log> ...
- existing F2FS WBDBG/sysrq logs: bash /home/nzzhao/.agents/skills/kernel-log-instrumentor/scripts/f2fs_log_field_query.sh <log> ...
when the local workspace has the phase-1 event-layer helpers, prefer:
- python3 /home/nzzhao/learn_os/scripts/kernel_log_chain_packet.py ...
- python3 /home/nzzhao/learn_os/scripts/kernel_debug_emit_job_event.py ...
- python3 /home/nzzhao/learn_os/scripts/kernel_debug_write_checkpoint.py ...
if a background job already emits heartbeat.json, event.json, or checkpoint.md, consume those first and only escalate to raw streams when the packet/checkpoint reports a missing edge.
when sysrq dumps are present, run pid/ino correlation script:
- /home/nzzhao/learn_os/scripts/f2fs_pid_ino_correlate.sh <kernel_stream.txt> 3 40
- use output to align blocked pid with nearby [WBDBG] pid/ino activity.
- prioritize comm-cluster evidence (PackageManager*, android.bg, android.io) when same-pid WBDBG is sparse.

Step 7: Decide next iteration

Classify and act:

insufficient signal: add/refine logs and repeat from Step 2.
clear root-cause candidate: propose fix patch and validate with same repro.

References

/home/nzzhao/.agents/skills/kernel-debug-orchestrator/references/f2fs-largefolio-gc-porting-playbook.md
references/pixel-kleaf-config-fragment-bootimg.md for wiring a new Kconfig symbol into boot.img in Pixel/Kleaf.
references/git-cross-repo-porting-format-patch.md for porting a commit across unrelated repos with git format-patch / git am.
non-deterministic: tighten repro and add ordering/state logs.

Keep each iteration bounded; avoid broad logging expansion without evidence.

Output contract per iteration

Use this structure every loop:

iteration: integer
goal: what this round tries to prove/disprove
evidence mode: sample-first / static-first / log-first
instrumentation: files/functions and why
build: command + status + log path
repro: command + status + log path
event artifacts: heartbeat/event/checkpoint/log-chain-packet paths when they exist
findings: concrete evidence and inference
next action: smallest high-value next step

Stop conditions

Stop loop only when one is true:

root cause is evidenced with high confidence and patch direction is clear, or
current instrumentation cannot progress and a specific external dependency is missing.

If blocked, report exact blocker and minimal unblock command.

Reference playbooks

references/f2fs-write-end-io-playbook.md for writeback/compression/folio-private style crashes.
references/f2fs-pid-ino-correlation-playbook.md for sysrq blocked-stack and WBDBG pid/ino correlation.

Lessons learned (QGA and debug safety)

QGA command channel is effectively serialized for this workflow.
- Do not launch multiple long qga_exec.py commands in parallel.
- Treat one long-running QGA command as owning the channel until it exits.
For long repros, always run in guest background and poll.
- Start once with guest-side redirection to a fixed log file.
- Persist exit code to a separate file (for example /tmp/<case>.rc).
- Poll with short QGA queries (ps, tail, rc-file check).
Avoid host-shell expansion bugs in QGA payloads.
- Use escaped guest variables (\$log, \$?) or fixed file names.
- Validate generated command string before dispatch.
Logging instrumentation must be null-safe by construction.
- Never add debug prints that dereference pointers before null/ERR checks.
- For bounce/compress pointer transitions, guard with IS_ERR_OR_NULL before any field access.
A crash after adding logs may be caused by the log path itself.
- If PC lands in __dynamic_pr_debug callsite area, suspect log argument dereference first.
- Harden logs, rebuild, and re-run before escalating root-cause claims.
QEMU launch reliability: verify process, not launcher text.
- nohup qemu_start_ori.sh ... may return config banner but still fail to leave a live qemu-system-aarch64 process.
- Always confirm with ps and fallback to a persistent PTY-backed launcher session when needed.
Pixel slider build entrypoint is cwd-sensitive.
- private/google-modules/soc/gs/build_slider.sh must be launched from the pixel/ repo root.
- A failure at tools/bazel: No such file or directory is an invocation-path issue, not a kernel build failure.
- For lane builds, check the wrapper's --dry-run output and inspect that lane's declared dist= directory; do not compare against root out/slider/dist unless the active lane is the root/debug workspace.
Existing query/preserve tooling must survive resume and compression.
- If the current round already has stream capture, inode watchers, field-query helpers, or event-layer artifacts, record them in the checkpoint and reload that checkpoint before widening the search.
- Do not fall back to broad raw-log search merely because the session forgot which helper was already active.