autoresearch-pipeline-for-ai-safety-research

Stars: 2

Skill for AI agent capabilities

By hiyenwong · Updated 6/3/2026

name: autoresearch-pipeline-for-ai-safety-research
description: Skill for AI agent capabilities

Autoresearch Pipeline for AI Safety Research

Overview

  • Source: arXiv:2603.24511v1 (Claudini)
  • Utility: 0.95
  • Topic: LLM agent autonomous research for discovering new algorithms
  • Key Contribution: Autoresearch pipeline achieves SOTA results in adversarial attack discovery

Activation Keywords

  • autoresearch pipeline
  • LLM agent autonomous research
  • automated AI safety research
  • iterative algorithm discovery
  • Claude Code research automation

Core Innovation

Problem

  • AI safety research is often manual and slow
  • Existing methods provide good starting points but still need optimization
  • Dense quantitative feedback is available but not leveraged

Solution

Autoresearch Pipeline:

  1. Start from existing implementations - Strong baseline (e.g., GCG)
  2. LLM agent iteration - Claude Code explores modifications
  3. Quantitative evaluation - Attack success rate (ASR) feedback
  4. Discover new algorithms - SOTA results achieved

Key Results

| Target Model | New Algorithm | Best Baseline | Improvement (percentage points) |
|---|---|---|---|
| GPT-OSS-Safeguard-20B | 40% ASR | ≤10% ASR | ≥ +30 |
| Meta-SecAlign-70B | 100% ASR | 56% ASR | +44 |
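
The improvement column is the difference between two rates, so it reads most naturally in percentage points rather than as a relative change. A small sketch of that arithmetic follows, using only the figures from the table above; the helper name `improvement_pp` is illustrative and does not come from the paper. Because the GPT-OSS-Safeguard-20B baseline is reported only as an upper bound (≤10%), the computed gap for that row is a lower bound.

```python
def improvement_pp(new_asr_pct: float, baseline_asr_pct: float) -> float:
    """Absolute gain in percentage points between two rates given in percent."""
    return new_asr_pct - baseline_asr_pct


# Figures copied from the Key Results table.
# The first baseline is reported as "at most 10%", so its gain is a lower bound.
gpt_oss_gain = improvement_pp(40, 10)
secalign_gain = improvement_pp(100, 56)
print(gpt_oss_gain, secalign_gain)
```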

Pipeline Architecture

Existing Methods → LLM Agent Exploration → Iterative Refinement → Evaluation → New Discovery
      ↓                    ↓                      ↓                ↓           ↓
   Baseline          Code Modification      Algorithm Changes    ASR Test    SOTA Results

Implementation Framework

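import logging

logger = logging.getLogger(__name__)


def apply_modification(algorithm, mod):
    # Assumption: each modification is a callable mapping an algorithm to a new one
    return mod(algorithm)


def log_discovery(mod, score):
    logger.info("accepted modification %r with score %.4f", mod, score)


class AutoresearchPipeline:
    def __init__(self, baseline_method, evaluation_fn, agent):
        self.baseline = baseline_method
        self.evaluate = evaluation_fn  # returns a score where higher is better (e.g., ASR)
        self.agent = agent  # Claude Code-like agent

    def run(self, n_iterations=100):
        current_algorithm = self.baseline
        # Score the baseline first so every candidate has something to beat
        best_score = self.evaluate(current_algorithm)

        for _ in range(n_iterations):
            # Agent explores modifications
            modifications = self.agent.suggest_modifications(current_algorithm)

            # Try each modification
            for mod in modifications:
                new_algorithm = apply_modification(current_algorithm, mod)
                score = self.evaluate(new_algorithm)

                # Greedy acceptance: keep a change only if it improves the score
                if score > best_score:
                    current_algorithm = new_algorithm
                    best_score = score
                    log_discovery(mod, score)

        return current_algorithm, best_score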
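
To see the loop above run end to end without any real agent or model, here is a self-contained toy version. Everything in it is an assumption made for illustration: `ToyAgent` proposes random numeric tweaks instead of code changes, `toy_score` is a synthetic objective, and the "algorithm" is just a dictionary of two made-up parameters. It is meant only to show the greedy accept-if-better mechanics, not the paper's actual setup.

```python
import random


class ToyAgent:
    """Hypothetical stand-in for a coding agent: proposes small numeric tweaks."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def suggest_modifications(self, algorithm: dict) -> list[dict]:
        # Each suggestion nudges one parameter by a small random amount.
        suggestions = []
        for name in algorithm:
            delta = self.rng.uniform(-0.5, 0.5)
            suggestions.append({name: algorithm[name] + delta})
        return suggestions


def toy_score(algorithm: dict) -> float:
    """Synthetic objective (higher is better), peaking at step_size=1, momentum=0."""
    return -((algorithm["step_size"] - 1.0) ** 2) - algorithm["momentum"] ** 2


def hill_climb(baseline: dict, evaluate, agent, n_iterations: int = 50):
    """Greedy loop: keep a modification only if it improves the score."""
    current, best = dict(baseline), evaluate(baseline)
    for _ in range(n_iterations):
        for mod in agent.suggest_modifications(current):
            candidate = {**current, **mod}
            score = evaluate(candidate)
            if score > best:
                current, best = candidate, score
    return current, best


baseline = {"step_size": 3.0, "momentum": 2.0}
found, score = hill_climb(baseline, toy_score, ToyAgent(seed=0))
print(found, score)
```

Because a candidate is kept only when it scores strictly higher, the returned score can never fall below the baseline's score; how far it climbs depends on the proposals the agent happens to make.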

Key Principles

1. Strong Starting Points

  • Existing methods provide foundation
  • Don't start from scratch
  • Leverage prior research

2. Dense Quantitative Feedback

  • Clear optimization objective
  • Measurable outcomes (ASR, accuracy, etc.)
  • Direct feedback drives improvement
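
As a concrete instance of a measurable outcome, attack success rate is the fraction of evaluation cases judged successful. The sketch below assumes each trial has already been reduced to a boolean; how that judgment is made is outside its scope, and the function name is illustrative.

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """Fraction of trials flagged as successful, in [0, 1]."""
    if not outcomes:
        raise ValueError("need at least one trial outcome")
    return sum(outcomes) / len(outcomes)


# Two successes out of five trials.
outcomes = [True, False, False, True, False]
print(attack_success_rate(outcomes))  # 0.4
```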

3. Agent Capabilities

  • Code generation/modification
  • Literature understanding
  • Creative exploration

4. Iterative Refinement

  • Many small modifications
  • Gradual improvement accumulation
  • Exploration vs exploitation balance
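
One simple way to express the exploration vs exploitation balance is an epsilon-greedy style acceptance rule: always accept an improvement, and occasionally accept a non-improving candidate so the search can leave a local optimum. This is a common heuristic offered here as an illustration; the source does not say which strategy the pipeline uses, and the function name and parameters are assumptions.

```python
import random


def accept(candidate_score: float, best_score: float,
           epsilon: float, rng: random.Random) -> bool:
    """Accept improvements always; accept non-improvements with probability epsilon."""
    if candidate_score > best_score:
        return True
    return rng.random() < epsilon


rng = random.Random(0)
# epsilon = 0 is pure exploitation: only improvements are accepted.
print(accept(0.3, 0.5, epsilon=0.0, rng=rng))  # False
# epsilon = 1 always explores: any candidate is accepted.
print(accept(0.3, 0.5, epsilon=1.0, rng=rng))  # True
```

When non-improving candidates can be accepted, the best-scoring algorithm seen so far should be tracked separately from the current one so that exploration never discards it.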

Application Domains

| Domain | Starting Point | Objective | Suitability |
|---|---|---|---|
| Adversarial Attacks | GCG, AutoPrompt | ASR maximization | ✅ Excellent |
| Prompt Optimization | Base prompts | Task performance | ✅ Good |
| Architecture Search | Known architectures | Accuracy | ✅ Good |
| Hyperparameter Tuning | Default configs | Validation score | ✅ Good |
| Algorithm Discovery | Existing algorithms | Benchmark scores | ✅ Excellent |

Safety Considerations

⚠️ Important: This pipeline can be used for both defensive and offensive research.

Defensive Applications

  • Discover robust defense mechanisms
  • Identify vulnerabilities before attackers
  • Stress-test safety systems

Offensive Applications

  • Create new attack algorithms
  • Jailbreak safety measures
  • Prompt injection optimization

Recommended Use

  • Prioritize defensive research
  • Use for authorized security testing only
  • Follow ethical guidelines
  • Report findings responsibly

Relation to Self-Evolution

| Self-Evolution Concept | Autoresearch Pipeline |
|---|---|
| Learn → Apply → Reflect → Improve | Baseline → Modify → Evaluate → Discover |
| Delegation to Specialists | Agent handles code exploration |
| Dense Feedback | Quantitative ASR metrics |
| Ship or It Doesn't Count | Published SOTA algorithms |

Implementation for OpenClaw

Potential Applications

  1. Skill Optimization

    • Start from existing skills
    • Agent modifies instructions
    • Evaluate on task performance
  2. Agent Improvement

    • Optimize agent behaviors
    • Discover new workflows
    • Quantitative success metrics
  3. Workflow Discovery

    • Find better processes
    • Optimize existing workflows
    • Task completion metrics

Example: Skill Autoresearch

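class SkillAutoresearch:
    # Assumption: the agent and the two helper callables are injected by the caller
    def __init__(self, agent, apply_suggestion, evaluate_skill):
        self.agent = agent  # must provide analyze_skill(skill) -> list of suggestions
        self.apply_suggestion = apply_suggestion  # (skill, suggestion) -> modified skill
        self.evaluate_skill = evaluate_skill  # (skill, tasks) -> score, higher is better

    def optimize_skill(self, base_skill, evaluation_tasks, n_iterations=10):
        current_skill = base_skill
        # Score the unmodified skill so candidates have a baseline to beat
        best_performance = self.evaluate_skill(current_skill, evaluation_tasks)

        for _ in range(n_iterations):
            # Agent suggests skill modifications
            suggestions = self.agent.analyze_skill(current_skill)

            for suggestion in suggestions:
                modified_skill = self.apply_suggestion(current_skill, suggestion)

                # Evaluate on tasks
                performance = self.evaluate_skill(modified_skill, evaluation_tasks)

                # Keep the modified skill only if it performs better
                if performance > best_performance:
                    current_skill = modified_skill
                    best_performance = performance

        return current_skill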
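
The example above leaves the scoring function open. One minimal possibility, sketched here under stated assumptions, is to treat a skill as a callable from input text to output text and score it by the fraction of tasks it answers exactly. The `Task` format and the exact-match criterion are both hypothetical; a real skill evaluation would likely need a richer judgment.

```python
from typing import Callable, Sequence

Task = tuple[str, str]  # (input text, expected output); hypothetical task format


def evaluate_skill(skill: Callable[[str], str], tasks: Sequence[Task]) -> float:
    """Return the fraction of tasks whose output matches the expected answer."""
    if not tasks:
        raise ValueError("need at least one evaluation task")
    passed = sum(1 for prompt, expected in tasks if skill(prompt) == expected)
    return passed / len(tasks)


# str.upper stands in for a skill; the last task expects "Ok", so it fails.
tasks = [("abc", "ABC"), ("Hi", "HI"), ("x1", "X1"), ("ok", "Ok")]
print(evaluate_skill(str.upper, tasks))  # 0.75
```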

Best Practices

  1. Define Clear Objectives - Measurable success metrics
  2. Set Constraints - Safety boundaries, computational limits
  3. Document Discoveries - Track all improvements
  4. Validate Transfers - Test generalization to other contexts
  5. Report Responsibly - Ethical disclosure for security findings
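
The "computational limits" part of setting constraints can be enforced with a small budget guard that the search loop consults before each evaluation. The `Budget` class below is a hypothetical helper, not something the source defines; the specific limits are placeholders.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Budget:
    """Hypothetical guard that caps evaluations and wall-clock time."""
    max_evaluations: int
    max_seconds: float
    used: int = 0
    started: float = field(default_factory=time.monotonic)

    def spend(self) -> bool:
        """Record one evaluation; return False once either limit is reached."""
        if self.used >= self.max_evaluations:
            return False
        if time.monotonic() - self.started >= self.max_seconds:
            return False
        self.used += 1
        return True


budget = Budget(max_evaluations=3, max_seconds=60.0)
completed = 0
while budget.spend():
    completed += 1  # a real loop would evaluate one candidate here
print(completed)  # 3
```

Checking the budget before each evaluation, rather than once per outer iteration, keeps a large batch of suggestions from overshooting the limit.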

Description

Autoresearch Pipeline for AI Safety Research

Tools Used

  • read - Read documentation and references
  • web_search - Search for related information
  • web_fetch - Fetch paper or documentation

Instructions for Agents

Follow these steps when applying this skill:

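Step 1: Start from existing implementations. Pick a strong baseline (e.g., GCG) rather than starting from scratch.

Step 2: LLM agent iteration. Have the agent (e.g., Claude Code) explore modifications to the baseline.

Step 3: Quantitative evaluation. Score each modification with a measurable objective such as attack success rate (ASR).

Step 4: Discover new algorithms. Keep the modifications that improve the score and document them.

Step 5: Skill optimization. Apply the same loop to existing skills, evaluating on task performance.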

Examples

Example 1: Basic Application

User: I need to apply Autoresearch Pipeline for AI Safety Research to my analysis.

Agent: I'll help you apply autoresearch-pipeline. First, let me understand your specific use case...

Context: Apply the methodology

Example 2: Advanced Scenario

User: Complex analysis scenario

Agent: Based on the methodology, I'll guide you through the advanced application...

Example 3: Advanced Application

User: What are the key considerations for autoresearch-pipeline?

Agent: Let me search for the latest research and best practices...

References


Created: 2026-03-28
Source: arXiv:2603.24511v1 - "Claudini: Autoresearch Discovers SOTA Adversarial Attack Algorithms"

⚠️ Note: Focus on research methodology, not attack details. Use for defensive research only.

Install via CLI
npx skills add https://github.com/hiyenwong/ai_collection --skill autoresearch-pipeline-for-ai-safety-research
Repository Details
Stars: 2
Forks: 0
Branch: main
Path: SKILL.md