colab-pipeline-inspection - SKILL.md Agent Skill

name: colab-pipeline-inspection
description: Use when running pipeline inspection with the real generator model on a Colab GPU, for example when iterating D prompts, testing unit-split strategies, or trying different G checkpoints (any inspection that needs the full model on GPU, not the API proxy used by scripts/pipeline_inspection.py).

Colab Pipeline Inspection

The local scripts/pipeline_inspection.py uses an API proxy (fast, cheap, inspection-only). To inspect with the real generator model on a GPU — whichever model is under investigation — use colab-cli to provision a T4 session, load the model once, then iterate live without restarting.

G is never hardcoded: the setup script builds it from config.generator, so swapping the model under test means editing config.generator["model_id"] and rebuilding the generator, the same live-edit pattern used for D prompts below.

Critical Constraints

| Rule | Why |
| --- | --- |
| `--timeout 600` for model load | Default `exec` timeout is 10 s; model takes 3-5 min |
| Never use `colab repl` / `colab console` interactively | Both require a TTY and hang in agent context |
| Pipe stdin for all iterative code: `echo "..." \| colab exec` | Only mode that works headlessly |
| Kernel state persists across `exec` calls | Load G once; rebuild D fast with stdin snippets |
| `colab upload` requires parent dir to exist on VM | Create dir first via kernel exec if needed |
| Always `colab stop` when done | Idle VMs burn compute units |
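
The stdin-piping rule can also be followed from a local Python driver instead of `echo`. The sketch below is one possible wrapper and is not part of the repo: `build_exec_cmd` and `colab_exec` are hypothetical helper names, and the only CLI flags used are the ones that already appear in this document (`-s`, `--timeout`). It assumes `colab exec` reads the code to run from stdin, as the table above describes.

```python
import subprocess


def build_exec_cmd(session: str, timeout_s: int) -> list[str]:
    """Build the argv for a headless `colab exec` call (flags as used in this document)."""
    return ["colab", "exec", "-s", session, "--timeout", str(timeout_s)]


def colab_exec(code: str, session: str = "inspect", timeout_s: int = 600) -> str:
    """Send `code` to the persistent kernel over stdin and return its stdout.

    No TTY is involved: the snippet is passed via `input=`, which mirrors
    `echo "..." | colab exec` from the shell without any shell quoting.
    """
    result = subprocess.run(
        build_exec_cmd(session, timeout_s),
        input=code,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

A call such as `colab_exec("print(len(questions))", timeout_s=15)` would then stand in for the equivalent `echo ... | colab exec -s inspect --timeout 15` line.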

Phase 1: Setup (one-time, ~7 min)

# Provision T4 GPU session
colab new -s inspect --gpu T4

# Inject secrets from local env (never hardcode)
echo "import os; os.environ['HF_TOKEN']='${HF_TOKEN}'; os.environ['GEMINI_API_KEY']='${GEMINI_API_KEY}'" \
  | colab exec -s inspect --timeout 30
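
The `echo` one-liner above pastes each secret into a single-quoted Python literal, so a value containing a quote or backslash would break the snippet, and an unset variable is silently injected as an empty string. A minimal local sketch that sidesteps both, assuming the generated source is then piped to `colab exec` over stdin (`build_secret_snippet` is a hypothetical helper, not something the repo provides):

```python
import os


def build_secret_snippet(names, env=None):
    """Return Python source that sets each named variable in the kernel's os.environ."""
    env = os.environ if env is None else env
    lines = ["import os"]
    for name in names:
        if name not in env:
            # Fail locally instead of injecting an empty value into the kernel
            raise KeyError(f"{name} is not set in the local environment")
        # repr() yields a valid Python string literal whatever the value contains
        lines.append(f"os.environ[{name!r}] = {env[name]!r}")
    return "\n".join(lines) + "\n"
```

The result of `build_secret_snippet(["HF_TOKEN", "GEMINI_API_KEY"])` can be written to the stdin of `colab exec -s inspect --timeout 30`. The secrets still travel as source code, so the same care about logs and history applies as with the shell version.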

# Run setup script — clones repo, installs deps, loads G + config + questions + D
# After this finishes, kernel holds: generator, config, questions, decision_model, run_pipeline_inspection
colab exec -s inspect -f scripts/colab_inspect_setup.py --timeout 600

Verify setup completed:

echo "print(f'G={config.generator[\"model_id\"]}, Q={len(questions)}, D-prompt={config.decision[\"prompt_version\"]}')" \
  | colab exec -s inspect --timeout 15
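
If the verification line raises a `NameError`, it only reports the first missing object. The sketch below lists every missing name at once; the expected names are taken from the setup comment above, while `missing_names` itself is a hypothetical helper that would be sent to the kernel like any other stdin snippet.

```python
# Names the setup script is documented to leave in the kernel
EXPECTED_NAMES = (
    "generator",
    "config",
    "questions",
    "decision_model",
    "run_pipeline_inspection",
)


def missing_names(namespace, expected=EXPECTED_NAMES):
    """Return the expected names that are absent from `namespace`, in order."""
    return [name for name in expected if name not in namespace]
```

Inside the kernel, `print(missing_names(globals()))` printing `[]` suggests setup finished; any listed name points at the step that failed.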

Swap the G model under test (no file edits):

echo "
# Point the config at the new checkpoint, then rebuild G (this reloads model weights, hence the long timeout)
config.generator['model_id'] = 'some-org/some-other-model'
generator = create_generator_from_config(config.generator, config.generation, max_units_per_batch=2)
print('G ready:', config.generator['model_id'])
" | colab exec -s inspect --timeout 600

Phase 2: Iterative Inspection (fast, no model reload)

Run inspection on a question

mkdir -p output/pipeline_inspection
echo "
rows = run_pipeline_inspection(
    question=questions[3],
    generator=generator,
    decision_model=decision_model,
    config=config,
)
" | colab exec -s inspect --timeout 180 | tee output/pipeline_inspection/q3_base.txt

Swap D prompt and re-run

# Edit prompt locally, then upload (parent dir already exists from git clone)
colab upload -s inspect prompts/my-new-prompt.txt /content/reasoning-pruning/prompts/my-new-prompt.txt

# Rebuild D — fast, no model reload
echo "
config.decision['prompt_version'] = 'my-new-prompt'
decision_model = create_decision_model_from_config(
    config.decision, config.pruning, prompts_dir='/content/reasoning-pruning/prompts'
)
print('D ready:', config.decision['prompt_version'])
" | colab exec -s inspect --timeout 30

# Run inspection and capture
echo "
rows = run_pipeline_inspection(question=questions[3], generator=generator, decision_model=decision_model, config=config)
" | colab exec -s inspect --timeout 180 | tee output/pipeline_inspection/q3_my-new-prompt.txt

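Once a baseline and a new-prompt run have both been captured with `tee`, a line diff shows what the prompt change actually altered. This is a local sketch using the standard-library `difflib`; `diff_text` and `diff_files` are hypothetical helper names, and the file names in the usage note are the ones used in this section. If G samples its output, some differences may come from sampling rather than from the prompt.

```python
import difflib
from pathlib import Path


def diff_text(base: str, new: str, base_name: str = "base", new_name: str = "new") -> str:
    """Return a unified diff between two captured inspection outputs ('' if identical)."""
    return "".join(
        difflib.unified_diff(
            base.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=base_name,
            tofile=new_name,
        )
    )


def diff_files(base_path, new_path) -> str:
    """Same as diff_text, but reads the two captured files from disk."""
    base = Path(base_path).read_text(encoding="utf-8")
    new = Path(new_path).read_text(encoding="utf-8")
    return diff_text(base, new, str(base_path), str(new_path))
```

For example, `print(diff_files("output/pipeline_inspection/q3_base.txt", "output/pipeline_inspection/q3_my-new-prompt.txt"))`.
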
Change unit-split strategy

echo "
from dataclasses import replace
config = replace(config, unit_split_strategy='clauses')
print('unit_split_strategy:', config.unit_split_strategy)
" | colab exec -s inspect --timeout 15

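`dataclasses.replace` builds a new object rather than mutating the old one, which is why the snippet above rebinds `config`. The toy example below illustrates the behavior; `ToyConfig` and the `"sentences"` default are invented placeholders, not the real config class or its real default. Whether the real config is frozen is not stated here, and `replace` works either way.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical stand-in for the real config object."""

    unit_split_strategy: str = "sentences"  # placeholder default, not from the repo
    decision: dict = field(default_factory=dict)


base = ToyConfig(decision={"prompt_version": "base"})
variant = replace(base, unit_split_strategy="clauses")

print(base.unit_split_strategy)           # sentences (original is untouched)
print(variant.unit_split_strategy)        # clauses
print(variant.decision is base.decision)  # True: nested dicts are shared, not copied
```

Because the nested dict is shared, an in-place edit such as `config.decision['prompt_version'] = ...` is visible through both the old and the new object; keep a deep copy if an untouched baseline config is needed.
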
Run multiple questions

echo "
for qi in [1, 3, 5, 7]:
    print(f'\n=== Q{qi} ===')
    run_pipeline_inspection(question=questions[qi], generator=generator, decision_model=decision_model, config=config)
" | colab exec -s inspect --timeout 600 | tee output/pipeline_inspection/multi_run.txt

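In the loop above, an exception on one question aborts the remaining ones. A hedged variant that keeps going and reports failures at the end is sketched below; `run_many` is a hypothetical helper, and `run_one` is whatever callable wraps `run_pipeline_inspection` in the kernel.

```python
def run_many(indices, run_one):
    """Call run_one(qi) for each index; collect failures instead of stopping."""
    failures = {}
    for qi in indices:
        print(f"\n=== Q{qi} ===")
        try:
            run_one(qi)
        except Exception as exc:  # one bad question should not abort the batch
            failures[qi] = repr(exc)
            print(f"Q{qi} failed: {exc!r}")
    return failures
```

In the kernel this could be called as `run_many([1, 3, 5, 7], lambda qi: run_pipeline_inspection(question=questions[qi], generator=generator, decision_model=decision_model, config=config))`, with the returned dict printed at the end so the failures also land in the `tee` capture.
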
Phase 3: Save Output

Stdout from colab exec is the inspection output — always tee to output/pipeline_inspection/.

To download a file written on the VM:

colab download -s inspect /content/reasoning-pruning/output/result.json output/pipeline_inspection/result.json

Export full session history as markdown:

colab log -s inspect -o output/pipeline_inspection/session.md

Phase 4: Cleanup

colab stop -s inspect
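
When the whole workflow is driven from a local Python script, a context manager is one way to make sure `colab stop` runs even if an inspection step raises. This is a sketch under the assumption that `colab new` and `colab stop` behave as shown in this document; `colab_session` is a hypothetical helper, and the `run` parameter exists only so the command runner can be swapped out.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def colab_session(name="inspect", gpu="T4", run=subprocess.run):
    """Provision a session on entry and always stop it on exit."""
    run(["colab", "new", "-s", name, "--gpu", gpu], check=True)
    try:
        yield name
    finally:
        # Runs on normal exit, on exceptions, and on KeyboardInterrupt
        run(["colab", "stop", "-s", name], check=True)
```

Everything from secret injection to the last inspection run would go inside `with colab_session() as session:`. If provisioning itself fails, nothing was started, so no stop is attempted.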

Relationship to the notebook

scripts/colab_inspect_setup.py is the headless equivalent of the browser notebook's setup cells (notebooks/data_creation_playground.ipynb): both load the config, build G and D, load questions, and call the same loop entry point run_pipeline_inspection / build_rows_for_question. Each colab exec stdin snippet is the equivalent of running one notebook cell against the persistent kernel. Humans use the notebook in a browser; agents use this script + stdin snippets. Keep the two in sync when the library API changes (Notebook Alignment Rule).

(The setup script builds G config-driven via create_generator_from_config so any model works; the notebook still constructs its generator inline — making it config-driven too is a worthwhile follow-up for full model-variety parity.)

Troubleshooting

| Problem | Fix |
| --- | --- |
| `exec` times out immediately | Add `--timeout 600`; the default is 10 s |
| "Session not found" | Run `colab sessions` to check; re-run Phase 1 if the session was pruned |
| `repl` / `console` hangs | Both need a TTY; always pipe stdin instead |
| Kernel deadlocked | `colab restart-kernel -s inspect`, then re-run the secrets and setup commands from Phase 1 (a kernel restart clears all kernel state) |
| Upload 500 error | Parent dir doesn't exist on VM; create it first via `exec` |
| `create_decision_model_from_config` not defined | Run setup or import it: `echo "from reasoning_pruning.clients import create_decision_model_from_config" \| colab exec -s inspect` |
| G produces wrong output | Verify HF_TOKEN: `echo "import os; print(os.environ.get('HF_TOKEN','MISSING')[:8])" \| colab exec -s inspect` |