context-clip - SKILL.md Agent Skill

name: context-clip description: Create context-limited versions of evaluations. Use when you need to simulate lower context windows or compare model performance at different context limits. Keywords - context, clip, truncate, window, limit, simulate, 8k, 12k, 16k.

Context Clip Skill

You are helping the user create context-limited versions of existing evaluations using the cohort.py tool.

Workflow

When the user invokes /context-clip <cohort-or-pattern> <eval-pattern> <contexts>, follow these steps:

Important: Use the AskUserQuestion tool for all user confirmations to maintain smooth flow without breaking execution.

1. Parse User Input

Extract from the command:

Cohort pattern: Cohort path or pattern (e.g., Qwen3-Next-80B, data/m12x/Qwen*, or empty for all)
Eval pattern: Description of which eval to process (e.g., "thinking 16k", "fp16", "awq")
Target contexts: One or more context limits (e.g., "12k", "8k", "12k and 8k", "8192", "12288")

Input flexibility:

Cohort can be just the model name (prefix data/m12x/ if needed)
Cohort can be a glob pattern for multiple cohorts
Eval pattern can be any substring or keywords
Contexts can be in K notation (8k, 12k) or raw tokens (8192, 12288)

2. List Available Evals

Discover what evals exist using cohort.py list:

# If cohort specified
python cohort.py list data/m12x/<CohortName>

# If glob pattern
python cohort.py list 'data/m12x/<Pattern>*/'

# If no cohort specified, list all
python cohort.py list

Parse the markdown table output to extract cohort, eval_id, label, and groups for each eval.

3. Match Eval Pattern

Filter the evals list using the user's pattern:

Case-insensitive substring match on: label, cohort, or group values
Show all matches to the user

Use AskUserQuestion:

If 1 match: "Confirm eval?" with the matched eval as default option
If 2-4 matches: Present all as options, multiSelect=false
If 5+ matches: Show list and ask user to be more specific, then re-run

Question format:

Which eval should be context-clipped?
Options:
  [cohort] [eval_id] Label

4. Parse Target Contexts

Convert user's context specifications to token counts:

"8k" → 8192
"12k" → 12288
"16k" → 16384
"32k" → 32768
Already numeric → use as-is

If user specified multiple contexts (e.g., "12k and 8k"), process all of them.

Use AskUserQuestion to confirm:

Create context-limited versions at: 12288, 8192 tokens?
Options:
  - Yes, proceed
  - Modify context values

If "Modify context values", ask again with editable input.

5. Check for Existing Context Variants

Before processing, check if context-limited versions already exist:

python cohort.py list <cohort-path> | grep "ctx"

If found, use AskUserQuestion:

Existing context-limited versions found:
- [eval_id] Label (8k ctx)
- [eval_id] Label (12k ctx)

Options:
  - Skip existing, create missing only
  - Recreate all (will skip if folders exist)
  - Cancel

6. Execute Context Clipping

For each target context, run:

source venv/bin/activate
python cohort.py context <cohort-path> \
  --eval-id <eval_id> \
  --context <context_tokens>

Show progress for each context:

Processing 12288 token limit...
✓ Created 3 result folders, clipped X samples
✓ Updated evals.json

Processing 8192 token limit...
✓ Created 3 result folders, clipped Y samples
✓ Updated evals.json

7. Verify Results

After processing, list the cohort again to show the new evals:

python cohort.py list <cohort-path>

Show the user:

Original eval
New context-limited eval(s)
Their eval_ids for future reference

8. Suggest Next Steps

Inform the user:

Context-limited evaluations created successfully!

To analyze, you'll need to:
  1. Add the cohort to a dataset config (data/*.json)
  2. Rebuild the dataset database
  3. Run analysis commands

Example workflow:
  # If cohort already in a dataset, just rebuild:
  python evaluate.py --dataset data/<dataset>.json

  # Then compare performance across contexts:
  python analyze.py cluster data/<dataset>.json \
    --filters '{"eval_id": ["<original>", "<ctx12k>", "<ctx8k>"]}' \
    --stack sampler

  # Or examine truncation patterns:
  python analyze.py surface data/<dataset>.json \
    --filters '{"eval_id": ["<ctx8k>"], "base_task": "arithmetic"}' \
    --output-dir research/<project>/

Edge Cases

No matches found: Ask user to check pattern or list all evals
Multiple target contexts: Process sequentially, report progress
Multiple cohorts match pattern: Ask user to specify which cohort
Already processed: Skip with warning, or confirm recreation
Source eval has lower context than target: Error - cannot clip to higher context than source
Invalid context value: Must be between 1024 and 131072 (1k-128k)
Cohort not in any dataset: Inform user they'll need to add to a dataset config for analysis

Examples

Example 1: Simple Case

User: /context-clip Qwen3-Next-80B "thinking 16k fp16" 12k

1. Cohort: data/m12x/Qwen3-Next-80B
2. List evals: python cohort.py list data/m12x/Qwen3-Next-80B
3. Found 1 match:
   [bc8eef] Qwen3-Next-80B Thinking (FP16, 16k)
4. Confirm: Yes
5. Target contexts: 12288 tokens
6. Execute:
   python cohort.py context data/m12x/Qwen3-Next-80B --eval-id bc8eef --context 12288
7. ✓ Created 3 result folders, clipped 335+835+1423 samples
8. ✓ New eval_id: 221ec5

Example 2: Multiple Contexts

User: /context-clip Qwen3-Next-80B fp16 "12k and 8k"

1. Cohort: data/m12x/Qwen3-Next-80B
2. List evals: python cohort.py list data/m12x/Qwen3-Next-80B
3. Found 2 matches:
   [bc8eef] Qwen3-Next-80B Thinking (FP16, 16k)
   [5a2b3c] Qwen3-Next-80B Instruct (FP16, 16k)
4. Ask user to pick → [bc8eef]
5. Target contexts: 12288, 8192 tokens
6. Confirm: Yes
7. Execute:
   12k: ✓ Created [221ec5] (335+835+1423 samples clipped)
   8k: ✓ Created [8fc62e] (1314+3717+5998 samples clipped)
8. Show both new eval_ids and suggest next steps

Example 3: Pattern Matching Multiple Cohorts

User: /context-clip "Qwen*" fp16 12k

1. List all: python cohort.py list 'data/m12x/Qwen*/'
2. Found evals in multiple cohorts:
   - Qwen3-32B
   - Qwen3-Next-80B
   - Qwen3-14B
3. Ask user which cohort → Qwen3-32B
4. List evals in that cohort... found 3 fp16 variants
5. Ask user to be more specific
6. User refines: "qwen3-32b fp16 thinking"
7. Found 1 match → confirm → execute

Context Value Guidelines

Common context limits and use cases:

Context	Tokens	Use Case
2k	2048	Extreme resource constraints
4k	4096	Mobile/edge deployment
8k	8192	Standard constrained deployment
12k	12288	Moderate context tasks
16k	16384	Extended reasoning tasks

Guidelines:

Source eval should have higher context than target
Common pattern: 16k → 12k, 8k for comparative analysis
Avoid creating too many variants (stick to 2-3 key points)

Notes

Always work from the ReasonScape root directory
Activate venv before running cohort.py: source venv/bin/activate
Context clipping is non-destructive (creates new folders)
Each context limit gets a unique eval_id
Original result folders are never modified
The tool updates evals.json automatically
New result folders include postprocess metadata in experiment_config.yaml
Cohorts can be used across multiple datasets
After clipping, you may need to add the cohort to a dataset config for analysis
Be explicit about what you're doing at each step
If uncertain about anything, ask the user before proceeding