name: kernel-debug-orchestrator
description: Orchestrate iterative Linux kernel debugging in learn_os-style workspaces by chaining QEMU bring-up, targeted instrumentation, build, test execution, and evidence-driven next-iteration decisions. Use this when tasks are kernel debugging loops (for example writeback/f2fs/fscrypt crashes), and you need a repeatable cycle that depends on f2fs-qemu-agent-pipeline plus kernel-log-instrumentor.
Kernel Debug Orchestrator
Use this skill for end-to-end kernel debug loops where you must repeatedly:
- start/verify guest runtime,
- add or refine kernel logs,
- build,
- run reproduction tests,
- analyze logs,
- update instrumentation and repeat.
Hard dependencies (fixed)
Always load these skills in this order:
- `f2fs-qemu-agent-pipeline`
- `kernel-log-instrumentor`
Optional third skill based on test type:
- `xfstests-qga-ubuntu` when the reproduction is xfstests/QGA-only.
Do not skip the first two dependencies for this skill.
Scope and assumptions
- Workspace resembles `learn_os` with `.vars.sh` and QEMU helpers under `.agents/tools/`.
- Kernel source is a git repo (for temporary debug branch + one reversible commit).
- Guest access may use SSH or QGA; choose per `f2fs-qemu-agent-pipeline` policy.
Workflow Contract
Main Workflow
- Normalize context and identify the source tree, `O=` build tree, target subsystem, and reproducer.
- Verify runtime prerequisites that gate the target path before spending time on workload debugging.
- Decide whether this round is sample-first, static-first, or log-first, and record that choice in the iteration output.
- Apply or refine instrumentation, then build the changed kernel.
- Run the reproducer and capture bounded evidence.
- Convert long-running status and notable transitions into compact event-layer artifacts before broad analysis.
- Correlate evidence and choose the smallest next iteration.
Decision Table
| Phase | Trigger / Symptom | Action | Verify | On Failure | Workflow Effect |
|---|---|---|---|---|---|
| Preflight | Repro depends on post-write fs-verity enablement, verity reads, or verity-tagged writeback coverage | In the guest, verify `CONFIG_FS_VERITY=y` from `/proc/config.gz` or `/boot/config-*`, then run a fresh plain-file `f2fs_io set_verity` probe on a fresh `mkfs.f2fs -O encrypt,verity` image mounted with the intended options | Guest config shows `CONFIG_FS_VERITY=y` and the plain-file probe succeeds | Treat the run as blocked by kernel config/runtime prerequisites; enable the config in the host build output, rebuild, reboot, and re-run preflight before touching workload logic | block |
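The config half of the preflight row can be scripted. The following is a minimal sketch, assuming the guest's config file has already been copied or read out of the guest; `read_config` and `config_enabled` are illustrative helper names, not part of any skill referenced here.

```python
import gzip
from pathlib import Path


def read_config(path):
    """Return kernel config text from a plain or gzip-compressed file."""
    raw = Path(path).read_bytes()
    if raw[:2] == b"\x1f\x8b":  # gzip magic, as used by /proc/config.gz
        raw = gzip.decompress(raw)
    return raw.decode("utf-8", errors="replace")


def config_enabled(config_text, symbol):
    """True only when the symbol is built in (=y), not =m or unset."""
    wanted = f"{symbol}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


sample = "CONFIG_F2FS_FS=y\n# CONFIG_FS_VERITY is not set\n"
print(config_enabled(sample, "CONFIG_FS_VERITY"))  # False -> treat as blocked
```

A passing config check does not replace the plain-file `f2fs_io set_verity` probe; the row requires both before the workload is touched.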
Output Contract
- phase reached:
- evidence mode:
- decision path taken:
- verification evidence:
- checkpoint path:
- fallback used:
- unresolved blocker:
- next workflow step:
Orchestration loop
Step 0: Normalize context
- Source `.vars.sh`.
- Identify the source tree (e.g. `$BASE/f2fs`) and the `O=` build tree (e.g. `$BASE/f2fs_upstream`).
- Confirm target function(s), subsystem, and reproduction script.
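When the orchestration is driven from a script rather than an interactive shell, the values from `.vars.sh` can be captured once and checked before later steps use them. This is a rough sketch that assumes `bash` and GNU `env -0` are available on the host and that `.vars.sh` assigns plain variables such as `BASE`, `SCRIPT`, and `TEST`; `load_vars` and `missing_vars` are hypothetical helper names.

```python
import subprocess


def load_vars(vars_path):
    """Source a shell vars file in bash and return the resulting environment."""
    # set -a exports every variable the file assigns so env can see it.
    script = 'set -a; source "$1" >/dev/null && env -0'
    out = subprocess.run(
        ["bash", "-c", script, "load_vars", vars_path],
        check=True,
        capture_output=True,
    ).stdout
    env = {}
    for item in out.split(b"\0"):
        key, sep, value = item.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def missing_vars(env, required=("BASE", "SCRIPT", "TEST")):
    """Names the later steps rely on that the vars file did not define."""
    return [name for name in required if not env.get(name)]
```

Pass an absolute path to `load_vars`, since `source` searches `PATH` for names without a slash. Shell functions defined in the file (such as `kobj`) are not captured by this approach and still need a real shell.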
Step 1: Boot and verify VM readiness
- Start QEMU in a non-blocking, reusable way per `f2fs-qemu-agent-pipeline`.
- Verify process, control plane reachability (QGA/SSH), required mounts, and test paths.
- Record command + status + evidence path.
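The process check in this step can be done without trusting launcher output. Below is a minimal sketch that scans `/proc` on the host; the function name `find_processes` is illustrative, and the binary name `qemu-system-aarch64` is taken from the launch notes later in this document.

```python
from pathlib import Path


def find_processes(binary_name, proc_root="/proc"):
    """Return PIDs whose argv[0] basename equals binary_name.

    Reads /proc/<pid>/cmdline rather than comm, because comm is truncated
    to 15 characters and would cut off a name like qemu-system-aarch64.
    """
    root = Path(proc_root)
    if not root.is_dir():
        return []
    pids = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            argv = (entry / "cmdline").read_bytes().split(b"\0")
        except OSError:
            continue  # process exited or is not readable
        if argv[0] and Path(argv[0].decode(errors="replace")).name == binary_name:
            pids.append(int(entry.name))
    return sorted(pids)


if __name__ == "__main__":
    pids = find_processes("qemu-system-aarch64")
    print("qemu alive:", pids if pids else "no live process")
```

An empty result after a launch attempt means the VM is not up, regardless of what the launcher printed. Control-plane reachability (QGA/SSH) still has to be verified separately.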
Step 2: Instrumentation planning
Use kernel-log-instrumentor rules:
- create a temporary debug branch,
- keep logs in one commit,
- default to `pr_debug` + dynamic_debug,
- include `__func__` and a stable log prefix,
- if the user needs to follow one shared object across many threads, explicitly switch to table-friendly `k=v` logging and plan the query commands up front.
Guardrail: new log lines must not introduce fresh pointer dereference risk. Prefer printing raw pointers/flags/scalars first.
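Because the default is `pr_debug` with dynamic_debug, the new callsites stay silent until they are enabled in the guest. The sketch below builds the enable/disable command for the dynamic_debug control file. It assumes the guest kernel has `CONFIG_DYNAMIC_DEBUG` and debugfs mounted at `/sys/kernel/debug`. `dyndbg_command` is an illustrative helper name, and `f2fs_write_end_io` is used only as an example target.

```python
import shlex

CONTROL = "/sys/kernel/debug/dynamic_debug/control"


def dyndbg_command(selector, value, enable=True, flags="p"):
    """Build a guest shell command that toggles pr_debug callsites.

    selector is a dynamic_debug match keyword: "func", "file", or "module".
    """
    if selector not in {"func", "file", "module"}:
        raise ValueError(f"unsupported selector: {selector}")
    query = f"{selector} {value} {'+' if enable else '-'}{flags}"
    return f"echo {shlex.quote(query)} > {CONTROL}"


print(dyndbg_command("func", "f2fs_write_end_io"))
# echo 'func f2fs_write_end_io +p' > /sys/kernel/debug/dynamic_debug/control
```

Enabling by `func` keeps the log volume bounded to the suspected path; widening to `file` or `module` is a decision for a later iteration, once evidence calls for it.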
Step 3: Apply instrumentation
- Patch minimal callsites around suspected failure path (entry, branch decision, error path, state transition).
- For concurrency/state counters, log before/after updates with clear labels.
Step 4: Build verification
- Object-level checks first (for changed `.c` files, use `kobj` from `.vars.sh`).
- Then full image build (`bash $SCRIPT/make_upstream.sh`).
- Report exact build log path and pass/fail.
Pixel/Slider variant:
- If the workspace uses `private/google-modules/soc/gs/build_slider.sh`, run it from the `pixel/` repo root, not from `private/google-modules/soc/gs/`.
- Reason: the script executes `tools/bazel` via a cwd-relative path; invoking it from the subdirectory fails before compilation with `tools/bazel: No such file or directory`.
- If using a repo-root lane wrapper such as `./build_slider.sh --lane my_dec`, first run `./build_slider.sh --lane my_dec --dry-run` and record `workspace=`, `common=`, `output_root=`, and `dist=`.
- For `my_dec`, the source should resolve through `out/workspaces/slider_my_dec/common -> common_my_dec`, while final images land under `out/workspaces/slider_my_dec/out/slider/dist`, not the root `out/slider/dist`.
- If Bazel dies during server startup with `channel not registered to an event loop`, classify it as a build-environment blocker first, not a source compile result. Capture the `out/bazel/.../server/jvm.out` path in the report.
- For compile-only sanity checks, prefer `tools/bazel build ...` (not `build_slider.sh`, which is a `bazel run` wrapper). Reference: `references/pixel-slider-bazel-build.md`.
- For “make a new Kconfig symbol land in boot.img via Kleaf fragments”, follow: `references/pixel-kleaf-config-fragment-bootimg.md`.
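The cwd sensitivity described above can be caught before the build is launched. This is a small sketch of such a guard; `slider_invocation_problems` is a hypothetical helper, and it only checks for the two paths named in this section.

```python
from pathlib import Path

BUILD_SCRIPT = "private/google-modules/soc/gs/build_slider.sh"


def slider_invocation_problems(cwd):
    """List reasons why launching build_slider.sh from cwd would fail early."""
    root = Path(cwd)
    problems = []
    if not (root / "tools" / "bazel").exists():
        problems.append("tools/bazel not found relative to cwd; cd to the pixel/ repo root")
    if not (root / BUILD_SCRIPT).exists():
        problems.append(f"{BUILD_SCRIPT} not found relative to cwd")
    return problems


if __name__ == "__main__":
    issues = slider_invocation_problems(Path.cwd())
    print("ok to launch" if not issues else "\n".join(issues))
```

Reporting these as invocation-path problems keeps them out of the build pass/fail column, which should only reflect actual compile results.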
Step 5: Run reproduction
Support two reproduction modes:
- Existing script mode:
  - run the known test script (e.g. `rw_matrix.sh`) with deterministic env.
- On-demand script mode:
  - create a minimal one-off reproducer under `$TEST` when no usable script exists.
For long tests via QGA:
- redirect output to guest file,
- tail/log-scan separately,
- never equate QGA timeout with test failure until process/log state is checked.
- if the run is long enough to survive context compression or handoff, write a compact checkpoint before expanding analysis.
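The long-test rules above can be sketched as two small helpers: one that builds the guest-side background command, and one that turns polled state into a status. The names `background_command` and `classify`, and the guest paths in the example, are assumptions for illustration. How the command string is delivered to the guest (QGA or SSH) is left to the pipeline skill.

```python
import shlex


def background_command(script, log_path, rc_path):
    """Guest-side command: run script detached, log to a file, persist rc."""
    inner = (
        f"{shlex.quote(script)} > {shlex.quote(log_path)} 2>&1; "
        f"echo $? > {shlex.quote(rc_path)}"
    )
    # Quoting the whole inner script keeps $? for the guest shell to expand.
    return f"nohup sh -c {shlex.quote(inner)} > /dev/null 2>&1 &"


def classify(process_alive, rc_text):
    """Map polled guest state to a status; a QGA timeout alone decides nothing."""
    if process_alive:
        return "running"
    if rc_text is None or not rc_text.strip():
        return "unknown"  # no rc file yet: inspect the log before concluding
    return "passed" if rc_text.strip() == "0" else "failed"


# Hypothetical guest paths, shown only to illustrate the generated string.
print(background_command("/root/test/rw_matrix.sh", "/tmp/rw_matrix.log", "/tmp/rw_matrix.rc"))
```

The `unknown` state is deliberate: a missing rc file with no live process means the evidence is incomplete, so the log has to be read before the run is called a failure.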
Step 6: Collect and correlate evidence
Collect at minimum:
- test stdout/stderr log,
- kernel console log (`guest_console.log`),
- filtered debug lines by stable prefix,
- first failure stack trace and surrounding window.
Correlation requirement:
- identify the last N debug lines before first Oops/BUG,
- map to function + decision branch + key state fields.
- when logs are table-friendly, query them as structured rows before doing free-form reading:
  - generic: `python3 /home/nzzhao/.agents/skills/kernel-log-instrumentor/scripts/kernel_log_kv_query.py <log> ...`
  - existing F2FS WBDBG/sysrq logs: `bash /home/nzzhao/.agents/skills/kernel-log-instrumentor/scripts/f2fs_log_field_query.sh <log> ...`
- when the local workspace has the phase-1 event-layer helpers, prefer:
  - `python3 /home/nzzhao/learn_os/scripts/kernel_log_chain_packet.py ...`
  - `python3 /home/nzzhao/learn_os/scripts/kernel_debug_emit_job_event.py ...`
  - `python3 /home/nzzhao/learn_os/scripts/kernel_debug_write_checkpoint.py ...`
- if a background job already emits `heartbeat.json`, `event.json`, or `checkpoint.md`, consume those first and only escalate to raw streams when the packet/checkpoint reports a missing edge.
- when `sysrq` dumps are present, run the pid/ino correlation script:
  - `/home/nzzhao/learn_os/scripts/f2fs_pid_ino_correlate.sh <kernel_stream.txt> 3 40`
  - use output to align blocked `pid` with nearby `[WBDBG] pid/ino` activity.
  - prioritize comm-cluster evidence (`PackageManager*`, `android.bg`, `android.io`) when same-pid WBDBG is sparse.
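The correlation requirement (last N prefixed debug lines before the first Oops/BUG, read as structured rows) can be illustrated with a short sketch. It is not the implementation of the query scripts listed above. The log lines and the `step=` field are made-up samples, and `debug_window` and `to_row` are hypothetical names.

```python
import re

OOPS_MARKERS = (
    "Unable to handle kernel",
    "Internal error: Oops",
    "kernel BUG at",
    "BUG:",
)
KV = re.compile(r"(\w+)=(\S+)")


def debug_window(lines, prefix, n):
    """Return the last n prefixed debug lines seen before the first Oops/BUG."""
    seen = []
    for line in lines:
        if any(marker in line for marker in OOPS_MARKERS):
            break
        if prefix in line:
            seen.append(line.rstrip("\n"))
    return seen[-n:] if n > 0 else []


def to_row(line):
    """Turn the k=v fields of one log line into a dict (values stay strings)."""
    return dict(KV.findall(line))


# Made-up sample stream; real field names come from the instrumentation.
log = [
    "[   10.1] [WBDBG] f2fs_write_end_io: pid=101 ino=7 step=enter\n",
    "[   10.2] [WBDBG] f2fs_write_end_io: pid=101 ino=7 step=private_null\n",
    "[   10.3] Unable to handle kernel NULL pointer dereference at virtual address 0\n",
    "[   10.4] [WBDBG] f2fs_write_end_io: pid=102 ino=9 step=enter\n",
]
for row in map(to_row, debug_window(log, "[WBDBG]", 2)):
    print(row)
```

Stopping at the first failure marker matters: debug lines emitted after the first Oops often come from unrelated threads and can mislead the mapping to function, branch, and state fields.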
Step 7: Decide next iteration
Classify and act:
- `insufficient signal`: add/refine logs and repeat from Step 2.
- `clear root-cause candidate`: propose fix patch and validate with same repro.
- `non-deterministic`: tighten repro and add ordering/state logs.
Keep each iteration bounded; avoid broad logging expansion without evidence.
References
- /home/nzzhao/.agents/skills/kernel-debug-orchestrator/references/f2fs-largefolio-gc-porting-playbook.md
- `references/pixel-kleaf-config-fragment-bootimg.md` for wiring a new Kconfig symbol into `boot.img` in Pixel/Kleaf.
- `references/git-cross-repo-porting-format-patch.md` for porting a commit across unrelated repos with `git format-patch` / `git am`.
Output contract per iteration
Use this structure every loop:
- `iteration`: integer
- `goal`: what this round tries to prove/disprove
- `evidence mode`: `sample-first` / `static-first` / `log-first`
- `instrumentation`: files/functions and why
- `build`: command + status + log path
- `repro`: command + status + log path
- `event artifacts`: heartbeat/event/checkpoint/log-chain-packet paths when they exist
- `findings`: concrete evidence and inference
- `next action`: smallest high-value next step
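One way to keep every loop in this shape is to fill a small record and render it, so that a missing field or an unknown evidence mode is caught early. This is a sketch under that assumption; the `IterationReport` class is hypothetical and not part of the skill.

```python
from dataclasses import dataclass, fields

EVIDENCE_MODES = ("sample-first", "static-first", "log-first")


@dataclass
class IterationReport:
    iteration: int
    goal: str
    evidence_mode: str
    instrumentation: str
    build: str
    repro: str
    event_artifacts: str
    findings: str
    next_action: str

    def render(self):
        """Render the fields in contract order, one 'name: value' per line."""
        if self.evidence_mode not in EVIDENCE_MODES:
            raise ValueError(f"unknown evidence mode: {self.evidence_mode}")
        return "\n".join(
            f"{f.name.replace('_', ' ')}: {getattr(self, f.name)}"
            for f in fields(self)
        )


# Placeholder values only; a real report carries concrete paths and evidence.
report = IterationReport(
    iteration=1,
    goal="<what this round proves or disproves>",
    evidence_mode="log-first",
    instrumentation="<files/functions and why>",
    build="<command> / <status> / <log path>",
    repro="<command> / <status> / <log path>",
    event_artifacts="<paths, if any>",
    findings="<evidence and inference>",
    next_action="<smallest next step>",
)
print(report.render())
```

Since the dataclass requires every field at construction, an iteration cannot be reported with a silently missing build or repro entry.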
Stop conditions
Stop loop only when one is true:
- root cause is evidenced with high confidence and patch direction is clear, or
- current instrumentation cannot progress and a specific external dependency is missing.
If blocked, report exact blocker and minimal unblock command.
Reference playbooks
- `references/f2fs-write-end-io-playbook.md` for writeback/compression/folio-private style crashes.
- `references/f2fs-pid-ino-correlation-playbook.md` for sysrq blocked-stack and WBDBG pid/ino correlation.
Lessons learned (QGA and debug safety)
- QGA command channel is effectively serialized for this workflow.
  - Do not launch multiple long `qga_exec.py` commands in parallel.
  - Treat one long-running QGA command as owning the channel until it exits.
- For long repros, always run in guest background and poll.
  - Start once with guest-side redirection to a fixed log file.
  - Persist exit code to a separate file (for example `/tmp/<case>.rc`).
  - Poll with short QGA queries (`ps`, `tail`, rc-file check).
- Avoid host-shell expansion bugs in QGA payloads.
  - Use escaped guest variables (`\$log`, `\$?`) or fixed file names.
  - Validate generated command string before dispatch.
- Logging instrumentation must be null-safe by construction.
  - Never add debug prints that dereference pointers before null/ERR checks.
  - For bounce/compress pointer transitions, guard with `IS_ERR_OR_NULL` before any field access.
- A crash after adding logs may be caused by the log path itself.
  - If PC lands in `__dynamic_pr_debug` callsite area, suspect log argument dereference first.
  - Harden logs, rebuild, and re-run before escalating root-cause claims.
- QEMU launch reliability: verify process, not launcher text.
  - `nohup qemu_start_ori.sh ...` may return a config banner but still fail to leave a live `qemu-system-aarch64` process.
  - Always confirm with `ps` and fall back to a persistent PTY-backed launcher session when needed.
- Pixel slider build entrypoint is cwd-sensitive.
  - `private/google-modules/soc/gs/build_slider.sh` must be launched from the `pixel/` repo root.
  - A failure at `tools/bazel: No such file or directory` is an invocation-path issue, not a kernel build failure.
  - For lane builds, check the wrapper's `--dry-run` output and inspect that lane's declared `dist=` directory; do not compare against root `out/slider/dist` unless the active lane is the root/debug workspace.
- Existing query/preserve tooling must survive resume and compression.
  - If the current round already has stream capture, inode watchers, field-query helpers, or event-layer artifacts, record them in the checkpoint and reload that checkpoint before widening the search.
  - Do not fall back to broad raw-log search merely because the session forgot which helper was already active.