inspect-ai - SKILL.md Agent Skill

name: inspect-ai description: Analyze Inspect AI evaluation logs, understand EvalLog structure, extract samples, events, and scoring data using dataframes user-invocable: false

Inspect AI Log Analysis Reference

Use this knowledge when working with Inspect AI evaluation logs (.eval or .json files).

File Format

.eval files are binary (compressed) format containing JSON data. They are essentially zip archives. To read them programmatically, use the Inspect Python API.

Core Data Structures

EvalLog (Top-level)

class EvalLog:
    version: int                    # File format version (currently 2)
    status: str                     # "started", "success", or "error"
    eval: EvalSpec                  # Task, model, creation time
    plan: EvalPlan                  # Solvers and generation config
    results: EvalResults            # Aggregate scorer metrics
    stats: EvalStats                # Token usage statistics
    error: EvalError | None         # Error details if status="error"
    samples: list[EvalSample]       # Individual sample records
    reductions: list[EvalSampleReduction]  # Multi-epoch reductions

Always check log.status == "success" before analyzing results.

EvalSample (Per-sample data)

class EvalSample:
    id: int | str                   # Unique sample identifier
    epoch: int                      # Epoch number (for multi-epoch runs)
    input: str | list[ChatMessage]  # The prompt/task given to model
    target: str | list[str]         # Expected answer(s)
    choices: list[str] | None       # Multiple choice options if applicable

    # Execution results
    messages: list[ChatMessage]     # Full conversation history
    output: ModelOutput             # Final model output
    scores: dict[str, Score] | None # Scoring results by scorer name

    # Events and state
    events: list[Event]             # Complete transcript of all events
    store: dict[str, Any]           # State at end of execution
    attachments: dict[str, str]     # Referenced content (images, etc.)

    # Metadata
    metadata: dict[str, Any]        # Custom key-value pairs from dataset
    sandbox: SandboxEnvironmentSpec | None  # Sandbox config
    files: list[str] | None         # Files provided to sandbox
    setup: str | None               # Setup script run in sandbox

    # Timing
    started_at: datetime | None
    completed_at: datetime | None
    total_time: float | None        # Wall clock time
    working_time: float | None      # Active processing time

    # Token usage
    model_usage: dict[str, ModelUsage]  # Tokens by model

    # Error/limit info
    error: EvalError | None         # Error that halted sample
    error_retries: list[EvalError] | None  # Retried errors
    limit: EvalSampleLimit | None   # Limit that halted sample (context/time/message/token)

Event Types (for behavioral analysis)

Events are the core of behavioral analysis. Each sample has an events list containing:

Event = Union[
    SampleInitEvent,    # Sample initialization
    SampleLimitEvent,   # Limit reached
    SandboxEvent,       # Sandbox operations (exec, read_file, write_file)
    StateEvent,         # State changes
    StoreEvent,         # Store updates
    ModelEvent,         # LLM API calls
    ToolEvent,          # Tool invocations
    ApprovalEvent,      # Human approval events
    InputEvent,         # User input
    ScoreEvent,         # Scoring events
    ScoreEditEvent,     # Score modifications
    ErrorEvent,         # Errors
    LoggerEvent,        # Log messages
    InfoEvent,          # Info messages
    SpanBeginEvent,     # Span start (for timing)
    SpanEndEvent,       # Span end
    StepEvent,          # Solver step events
    SubtaskEvent,       # Subtask events
]

ModelEvent (LLM calls)

class ModelEvent:
    event: "model"
    model: str                      # Model name
    input: list[ChatMessage]        # Messages sent to model
    tools: list[ToolInfo]           # Available tools
    tool_choice: ToolChoice         # Tool selection directive
    config: GenerateConfig          # Generation parameters
    output: ModelOutput             # Model response
    retries: int | None             # API retries
    error: str | None               # Error if failed
    cache: "read" | "write" | None  # Cache hit/miss
    timestamp: datetime             # When call started
    completed: datetime | None      # When call finished
    working_time: float | None      # Processing time

ToolEvent (Tool calls)

class ToolEvent:
    event: "tool"
    id: str                         # Unique tool call ID
    function: str                   # Tool/function name
    arguments: dict[str, JsonValue] # Arguments passed
    result: ToolResult              # Return value
    error: ToolCallError | None     # Error if failed
    truncated: tuple[int, int] | None  # If output was truncated
    timestamp: datetime             # When call started
    completed: datetime | None      # When call finished
    working_time: float | None      # Processing time
    agent: str | None               # Agent name if handoff
    failed: bool | None             # Hard failure flag

Score

class Score:
    value: float | str | int | bool | list  # The score value
    answer: str | None              # Model's answer extracted
    explanation: str | None         # Explanation of score
    metadata: dict[str, Any] | None # Additional scoring metadata
    history: list[ScoreEdit]        # Edit history (history[0] = original)

Dataframe API (Primary Analysis Method)

The inspect_ai.analysis module provides functions to convert logs into Pandas dataframes.

evals_df() - One row per evaluation

from inspect_ai.analysis import evals_df

df = evals_df("logs")  # Read all logs in directory
df = evals_df(["path/to/file1.eval", "path/to/file2.eval"])  # Specific files

Default columns (~51):

eval_id - Unique evaluation identifier
log - URI of source file
task, task_version, task_file, task_arg_* - Task info
model, model_args, generate_config_* - Model info
status, error - Completion status
score_<scorer>_<metric> - All scores expanded as columns
samples_completed, samples_total
created, git_commit, tags, metadata_*

Pre-built column groups:

from inspect_ai.analysis import (
    EvalInfo,      # created, tags, metadata, git
    EvalTask,      # task name, file, args, solver
    EvalModel,     # model name, args, generation config
    EvalDataset,   # dataset name, location, sample IDs
    EvalConfig,    # epochs, approval, sample limits
    EvalResults,   # status, errors, samples completed
    EvalScores,    # all scores as separate columns
    EvalColumns,   # all of the above (~50 columns)
)

samples_df() - One row per sample

from inspect_ai.analysis import samples_df, SampleSummary, SampleScores, SampleMessages

# Fast read (summaries only, 12 columns)
df = samples_df("logs")

# With detailed scores
df = samples_df("logs", columns=SampleSummary + SampleScores)

# With message content (slower, loads full samples)
df = samples_df("logs", columns=SampleSummary + SampleMessages)

SampleSummary columns (default, 12 columns):

sample_id - Globally unique identifier
eval_id - Links to evaluation
id, epoch - Sample ID within eval and epoch number
input, target - Task input and expected output
metadata_* - Expanded metadata dictionary
score_* - Score values only
model_usage - Token counts
total_time, working_time - Timing data
error, limit, retries - Failure info
log - Source file URI

SampleScores adds:

Score answer, explanation, metadata

SampleMessages adds:

Full message content (requires loading full sample)

messages_df() - One row per message

from inspect_ai.analysis import messages_df

# All messages
df = messages_df("logs")

# Filter by role
df = messages_df("logs", filter=["assistant"])
df = messages_df("logs", filter=["user", "assistant"])

# Custom filter function
df = messages_df("logs", filter=lambda msg: "error" in msg.content.lower())

Default columns:

sample_id, eval_id - Links to sample and evaluation
event_id - Unique message identifier
role - user, assistant, system, tool
content - Message text
source - Origin of message
tool_calls - Formatted function calls
tool_call_id, tool_call_function, tool_call_error
log - Source file URI

events_df() - One row per event

from inspect_ai.analysis import (
    events_df,
    EventInfo,           # event type, span ID
    EventTiming,         # start/end times
    ModelEventColumns,   # model event data
    ToolEventColumns,    # tool event data
)

# Must specify columns (events are heterogeneous)
df = events_df("logs", columns=EventInfo + EventTiming)

# Filter to specific event types
df = events_df("logs", columns=EventInfo + ToolEventColumns,
               filter=lambda e: e.event == "tool")

EventInfo columns:

event_type - Type of event (model, tool, sandbox, etc.)
span_id - Span identifier for grouping

EventTiming columns:

timestamp - When event started
completed - When event finished
working_time - Active processing time

Joining Dataframes

Use eval_id and sample_id to join across dataframes:

# Join evals with samples
merged = samples.merge(evals, on='eval_id')

# Join samples with messages
merged = messages.merge(samples, on='sample_id')

# DuckDB integration
import duckdb
con = duckdb.connect()
con.register('evals', evals_df("logs"))
con.register('samples', samples_df("logs"))
con.execute("""
    SELECT e.model, AVG(s.score_accuracy)
    FROM samples s JOIN evals e ON s.eval_id = e.eval_id
    GROUP BY e.model
""")

Data Preparation Functions

from inspect_ai.analysis import prepare, model_info, task_info, frontier

# Add model metadata columns
df = prepare(df, model_info())
# Adds: model_organization_name, model_display_name, model_snapshot,
#       model_release_date, model_knowledge_cutoff_date

# Map task names to display names
df = prepare(df, task_info({"gpqa_diamond": "GPQA Diamond"}))

# Add frontier indicator (requires model_info first)
df = prepare(df, frontier())
# Adds boolean column: was model top-scoring at release date?

Low-level Log Reading API

from inspect_ai.log import (
    read_eval_log,
    read_eval_log_sample,
    read_eval_log_samples,
    read_eval_log_sample_summaries,
    list_eval_logs,
)

# Read complete log
log = read_eval_log("path/to/file.eval")

# Read just header (no samples) - fast for large files
log = read_eval_log("path/to/file.eval", header_only=True)

# Stream samples one at a time (memory efficient)
for sample in read_eval_log_samples("path/to/file.eval"):
    process(sample)

# Get lightweight summaries for filtering
summaries = read_eval_log_sample_summaries("path/to/file.eval")

# Read specific sample
sample = read_eval_log_sample("path/to/file.eval", id="sample_id", epoch=1)

# List all logs in directory
logs = list_eval_logs("./logs", recursive=True)

CLI Commands

# List logs with filtering
inspect log list --json
inspect log list --status success

# Dump log as JSON
inspect log dump path/to/file.eval

# Convert between formats
inspect log convert file.json --to eval --output-dir ./converted

Common Analysis Patterns

QA Verification

# Check all evaluations completed successfully
evals = evals_df("logs")
failed = evals[evals['status'] != 'success']

# Find samples with errors
samples = samples_df("logs")
errored = samples[samples['error'].notna()]

# Find samples that hit limits
limited = samples[samples['limit'].notna()]

Behavioral Analysis

# Get tool usage patterns
events = events_df("logs", columns=EventInfo + ToolEventColumns,
                   filter=lambda e: e.event == "tool")
tool_counts = events.groupby(['eval_id', 'function']).size()

# Analyze message patterns
messages = messages_df("logs")
msg_counts = messages.groupby(['eval_id', 'role']).size().unstack()

# Compare successful vs failed attempts
samples = samples_df("logs")
successful = samples[samples['score_accuracy'] == 1.0]
failed = samples[samples['score_accuracy'] == 0.0]

Cross-model Comparison

evals = evals_df("logs")
by_model = evals.groupby('model').agg({
    'score_accuracy_mean': 'mean',
    'samples_completed': 'sum'
})

Performance Tips

Use SampleSummary (default) for fast reads - only loads headers
Use parallel=True for large datasets: samples_df("logs", parallel=True)
Use header_only=True with read_eval_log() when you don't need samples
Stream with read_eval_log_samples() for memory-constrained environments
Use strict=False to get partial results: df, errors = evals_df("logs", strict=False)