raven-research-looped-reasoning-eval - SKILL.md Agent Skill

name: raven-research-looped-reasoning-eval description: Scaffold a research-grade experiment that tests whether a Looped / Recurrent-Depth Transformer (RDT) is better than a parameter-matched vanilla transformer at vulnerability discovery, when both are continued-pretrained on a security corpus. Use when the user says "test the looped-transformer hypothesis", "run the OpenMythos RDT experiment", "compare loop count to trial budget", "does inference-time loops help security reasoning", or asks for a reproducible research pipeline around the Mythos architecture hypothesis. The skill scaffolds the experiment — it does NOT make claims about Mythos itself, and it explicitly bins OpenMythos as a community speculative reconstruction, not Anthropic's actual model.

Raven — Looped-Reasoning Vulnerability-Discovery Research Eval

This skill is a research scaffold, not a production capability. It exists to answer one question, honestly:

Does inference-time recurrence in a Recurrent-Depth Transformer (RDT) improve vulnerability-discovery quality over a parameter-matched fixed-depth transformer, when both are continued-pretrained on the same security corpus?

The answer matters because the OpenMythos community speculation (kyegomez/OpenMythos) and the Mythos Preview public record both suggest depth-via-looping as a likely contributor to Mythos's strong code-reading. If the hypothesis holds at small scale on a defender-relevant benchmark, Raven gets a defensible architectural bet. If it fails, Raven has a publishable negative result and stops chasing the wrong rabbit.

This skill is honest about three boundaries:

OpenMythos is an independent reconstruction (its own README disclaimer). The experiment does not validate Mythos. It validates a property the reconstruction predicts.
The experiment runs at small scale (3B params). Negative results at small scale do not refute large-scale claims. Positive results at small scale do not prove them either — they are evidence of a trend.
No checkpoint is publicly trained yet. The skill scaffolds pretraining, continued pretraining, and evaluation from scratch.

When to use

Trigger when the user asks any of:

"Run the looped-transformer research experiment"
"Test OpenMythos at 3B on a security corpus"
"Does loop count behave like a trial budget"
"Compare RDT vs vanilla on vulnerability-discovery quality"
"Scaffold the architecture-bet experiment for Raven"

Do not trigger for: production runs (this is research-only), grant deliverables (use raven-d3fend/positioning.md instead), or claims that imply Raven actually uses a Mythos-class model.

Honest framing — what this experiment can and cannot tell you

Question	Can this experiment answer?
Does more inference-time loops help vulnerability reasoning at 3B scale?	Yes
Does the loop-count budget behave like a trial budget (Cybergym pattern)?	Yes
Is RDT-3B better than vanilla-3B at the same FLOPs per forward?	Yes
Does this prove Mythos uses looped reasoning?	No — it is consistent evidence, not proof
Does this scale to 100B+?	Unknown — Parcae paper provides scaling laws, but does not prove them at frontier
Does this make Raven Mythos-class?	No — Raven is the harness, the model is one component

Inputs

Required:

Compute budget — gpu_hours (default 200 H100-hours for full experiment), max_train_tokens (default 30B per model)
Security corpus paths — must point at workspace files; the skill validates corpus integrity before training
Output dir — where checkpoints, eval logs, and the final research report land

Optional:

Model variant — open_mythos_3b (default), open_mythos_1b (smoke test), open_mythos_10b (if compute permits)
Vanilla baseline — llama_3b or qwen_3b parameter-matched architecture
Loop counts to evaluate — default [1, 2, 4, 8, 16, 32]
Random seeds — default [42, 1337, 2026] for replication

Phase 0 — Reproducibility & honest-bookkeeping setup (mandatory)

Before any training, the skill writes a manifest.json to the output dir containing:

Git commit SHAs of: this skill, OpenMythos (pinned), the corpus build scripts
SHA-256 of every corpus file
GPU hardware (model + driver version + CUDA + PyTorch versions)
Random seeds
All hyperparameters (read directly from MythosConfig)
Expected wall-clock and compute cost
Explicit statement: "This experiment does not validate Claude Mythos Preview. It tests a property predicted by the OpenMythos community reconstruction."

A run without manifest is invalid. The skill refuses to start training if the manifest cannot be written or if any corpus file's SHA does not match the pre-registered value.

Phase 1 — Security corpus assembly

The corpus is what makes the result defender-relevant. FineWeb-Edu (OpenMythos's default training set) is generic web text — useless for security reasoning. The skill assembles a layered security corpus:

Layer	Source	Approximate tokens	License
L1 — Base	FineWeb-Edu `sample-10BT`	10B	ODC-BY
L2 — Code	The Stack v2 (filtered to C, C++, Rust, Solidity, Python, Go)	8B	Various — verify per-language
L3 — Vuln narratives	NVD CVE descriptions + linked references + advisory text 2010–2026	0.5B	CC-BY / NIST public
L4 — Security commits	Linux kernel + OpenSSL + curl + sqlite + nginx security-tagged commit messages + diffs	1.5B	GPL / various — research use
L5 — Smart-contract audits	Public Solana / EVM audit reports (Trail of Bits, OpenZeppelin, etc.)	0.3B	Per-publisher CC-BY where stated; skip otherwise
L6 — Detector rules	YARA, Semgrep, ARES, Sigma rule packs as plain text	0.1B	Various FOSS

The skill enforces:

Per-source license check; entries without verifiable license are dropped, not used.
Deduplication via SimHash at 64-bit threshold 4.
Per-layer SHA-256 manifest.
No CTF writeups, no exploit-code-with-shellcode (Raven defender edition). Filter via regex blocklist and a 𝓜 classifier from raven/ml/zero_day_detector.py running in inverse mode (block exploit-content).

Total ~20B tokens at first build. Optionally extend to 50B by widening L2.

Phase 2 — Pretrain / continued-pretrain matrix

Three model configurations, all parameter-matched at 3B active params:

Config	Architecture	Loop iterations (train)	Loop iterations (eval)
A — RDT-mythos	OpenMythos 3B (`mythos_3b()` config)	mean 8, sampled per Parcae	swept `[1, 2, 4, 8, 16, 32]`
B — RDT-static	Same OpenMythos 3B, ACT halting disabled, fixed loops	fixed 8	swept `[1, 2, 4, 8, 16, 32]`
C — Vanilla	LLaMA-3 / Qwen-3 architecture, 3B params, depth tuned for FLOP parity per forward at loop=8	n/a (fixed depth)	n/a

Why three configs:

A vs C — does looped beat vanilla at all?
A vs B — does Parcae adaptive halting help over fixed loops?
B vs C — does looping itself help, even without ACT?

Train each config on the assembled corpus for 30B tokens, 3 seeds each. Total 9 runs. Use the training/3b_fine_web_edu.py script from OpenMythos as a starting point and modify the dataset loader to consume the layered security corpus.

Mandatory training-time invariants:

Log spectral radius ρ(A) every 100 steps for configs A and B. Run is invalidated if ρ(A) ≥ 1 at any point — this is the Parcae stability constraint the OpenMythos README emphasizes.
Log loss every 10 steps; the skill auto-detects loss spikes (>3σ above rolling mean) and saves checkpoints around them for forensic analysis.
Save checkpoints at 1B, 5B, 10B, 20B, 30B tokens for every run.

Phase 3 — Evaluation suite

Five defender-relevant evals. The skill explicitly does NOT include offensive evals — no CTF execution, no exploit construction. Defender edition.

Eval 3.1 — CVE-aware code completion accuracy (𝓜-graded)

Hold-out 500 NVD entries with linked patches from 2025–2026 (not in training). For each:

Show the model the pre-patch code
Ask: "Identify the vulnerability class (CWE) and the location"
Grade with a 𝓜 classifier that compares model output to ground truth CWE + line range

Metric: accuracy at CWE-class + within-10-line localization. Compute curve over loop count for A and B.

Eval 3.2 — Variant analysis recall (𝒯-graded)

500 pairs of (seed CVE, target codebase containing a variant). The model is asked to identify the variant. Grade with raven/ml/variant_analyzer.py as the tool oracle.

Metric: recall@k for k ∈ {1, 5, 10}. Curve over loop count.

Eval 3.3 — Threat-model spirit-vs-letter recovery (𝓛-graded with falsification)

100 hand-crafted PR diffs that pass naive lint but violate an explicit threat-model invariant. The model gets the diff + the THREAT_MODEL.yaml. Grade by whether it surfaces a letter_to_spirit_recovery finding as defined in raven-devsec-mythos-class.

Metric: recovery rate. This is the key defender-relevant test — it directly measures the XBOW-documented Mythos weakness (XBOW Mythos evaluation) and whether loops help mitigate it.

Eval 3.4 — Cost vs accuracy frontier (Cybergym-aligned)

Following the Anthropic AI for Cyber Defenders post Cybergym methodology: pick a fixed accuracy target, ask each (config, loop_count) pair what it costs to reach it. Plot pareto frontier.

The hypothesis to test: loop count behaves like trial count — i.e., the curve for config A at loop=8 should look like the curve for config C at trials=8. If it does, Raven gets a unified budget abstraction.

Eval 3.5 — Severity-agreement against human experts

A small held-out set of 50 findings with senior-security-engineer severity labels. Each config's top-1 verdict is compared against expert label. Match the Anthropic red team writeup's 89%/98% metric format.

Phase 4 — Statistical analysis

For every metric:

Compute mean ± 1.96 × stderr across 3 seeds.
Run paired bootstrap (10k resamples) for every config-pair comparison.
Apply Benjamini–Hochberg correction across the 5 evals × 3 config-pair comparisons = 15 hypotheses (FDR=0.05).
Report effect sizes (Cohen's d), not just p-values.

A "positive" result requires: BH-corrected significance AND effect size d ≥ 0.5 AND the loop-count curve is monotonically improving up to the saturation point predicted by Parcae.

Anything less is reported as "consistent with no effect" — the skill REFUSES to spin marginal results as positive.

Phase 5 — Honest-report generation

Auto-generate report.md with:

# Looped-Reasoning Vulnerability-Discovery Eval — <date>

## Headline result
- Verdict: <SUPPORTS_HYPOTHESIS | INCONCLUSIVE | REFUTES_HYPOTHESIS>
- Confidence: <high | medium | low>
- One-line summary: <verbatim from the analyst>

## Honest disclaimers (mandatory)
1. OpenMythos is a community reconstruction, not Anthropic's Claude Mythos Preview.
2. This experiment was run at 3B scale. Results do not necessarily generalize to frontier scale.
3. We do not claim Raven is "Mythos-class" based on this result. Raven is a defender harness; the model is one component.

## Compute & reproducibility
- Total GPU-hours: <n>
- Wall clock: <h>
- Random seeds: [42, 1337, 2026]
- Commit SHAs: <list>
- Corpus SHA: <hash>
- All checkpoints: <s3 path / disk path>

## Eval results
<5 sections, one per eval, with table + curve description>

## Statistical defense
<bootstrap CI, effect sizes, BH-corrected p-values>

## What this means for Raven
- If SUPPORTS: candidate integration into CDP — loop count becomes a unified budget primitive across `raven-zero-day-hunter` trial count, `raven-devsec-mythos-class` verification depth, and `raven-zero-day-detection` sensitivity.
- If INCONCLUSIVE: park architecture bet; revisit at 10B scale or with larger corpus.
- If REFUTES: publish negative result, drop looped-reasoning from Raven's roadmap.

## What this means for the OpenAI grant submission
The grant submission does NOT depend on this result. Per raven-d3fend/positioning.md, Raven's grant pitch is CDP grounding + D3FEND OWL + defender-first — architectural choices are research-track, not pitch-track.

The skill refuses to ship a report that omits the "Honest disclaimers" section.

Phase 6 — Publishable artifact

If the result is publishable (SUPPORTS or REFUTES with high confidence), the skill scaffolds:

An arXiv-ready LaTeX file with the eval methodology and results
The full corpus manifest (SHA-256 + license per file)
A reproducibility kit: dockerfile + requirements pin + a single make all to re-run the experiment
An OpenReview submission template targeting a security venue (USENIX Security, NDSS, IEEE S&P) with a defender-applications track

The skill explicitly does NOT scaffold a "Raven is Mythos-class" marketing piece. That would be dishonest.

Refusal rules

Refuse to start training without a complete manifest + corpus SHA verification.
Refuse to ship a report without the Honest disclaimers section verbatim.
Refuse to claim the experiment validates Claude Mythos Preview itself.
Refuse to include CTF/exploit-construction evals. Defender edition.
Refuse to spin marginal results as positive — INCONCLUSIVE is a valid honest verdict.
Refuse to gate any Raven production behavior on this experiment until it has been independently replicated at ≥ 10B scale.
Refuse to skip Eval 3.3 (spirit-vs-letter) — it is the defender-relevant test, and the experiment is meaningless without it.

Compute budget reference

A rough estimate for the full 9-run experiment at 3B × 30B tokens:

Item	Cost
Training (9 runs × 30B tokens × 3B params)	~150 H100-hours
Evaluation (5 evals × 3 configs × 6 loop counts × 3 seeds)	~30 H100-hours
Buffer for restarts, corpus rebuild, debugging	~20 H100-hours
Total	~200 H100-hours

At commercial rates (Lambda, Voltage Park, etc., May 2026) this is in the low-thousands-USD range. The skill calls this out explicitly so the user can decide whether the architectural bet is worth that spend.

Related skills & docs

raven-devsec-mythos-class — production skill that the spirit-vs-letter eval (3.3) targets. Improvements there are the practical payoff if the experiment supports the hypothesis.
hermes-mythos-persona.md — persona overlay that uses the behavioral properties of Mythos at the prompt level, independent of architecture. Always available, no experiment required.
raven-d3fend/positioning.md — companion grant-positioning document; explicitly chooses Opsi C (skip OpenMythos from grant submission) and explains why this experiment is research-track only.
OpenMythos repo — kyegomez/OpenMythos, MIT, ~12.7K stars as of May 14 2026, community speculative reconstruction.

Provenance — sources this skill is built on

kyegomez/OpenMythos repo + README — architecture being tested
Parcae paper — Scaling Laws for Stable Looped Language Models — stability via ρ(A) < 1, scaling laws
Saunshi et al. 2025 — Reasoning with Latent Thoughts on the Power of Looped Transformers — theoretical basis for loop-as-CoT
Anthropic red team Mythos writeup — methodology to match in evals 3.5
Anthropic AI for Cyber Defenders / Cybergym — trial-budget methodology for eval 3.4
XBOW Mythos evaluation — letter-vs-spirit weakness; defines eval 3.3
imiel.dev system card breakdown — Mythos's 89%/98% severity-agreement format