inspect-ai


Analyze Inspect AI evaluation logs, understand EvalLog structure, extract samples, events, and scoring data using dataframes

By UKGovernmentBEIS. Updated 2/3/2026

name: inspect-ai
description: Analyze Inspect AI evaluation logs, understand EvalLog structure, extract samples, events, and scoring data using dataframes
user-invocable: false

Inspect AI Log Analysis Reference

Use this knowledge when working with Inspect AI evaluation logs (.eval or .json files).

File Format

.eval files are a compressed binary format: essentially zip archives containing JSON data. To read them programmatically, use the Inspect Python API.
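
Because the container is a zip archive, Python's standard zipfile module can list or peek at its members when the Inspect API is not at hand. The sketch below builds a small stand-in archive in memory so that it runs on its own; the member names (header.json, samples/1_epoch_1.json) are illustrative assumptions, not a specification of the real layout.

```python
import io
import json
import zipfile


def list_json_members(archive) -> list:
    """Return the names of all .json members in a zip archive (path or file object)."""
    with zipfile.ZipFile(archive) as zf:
        return sorted(n for n in zf.namelist() if n.endswith(".json"))


def read_json_member(archive, name: str):
    """Decode a single JSON member of the archive."""
    with zipfile.ZipFile(archive) as zf:
        return json.loads(zf.read(name))


# Build a stand-in archive in memory; these member names are hypothetical.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("header.json", json.dumps({"status": "success"}))
    zf.writestr("samples/1_epoch_1.json", json.dumps({"id": 1, "epoch": 1}))

members = list_json_members(buf)
header = read_json_member(buf, "header.json")
```

For real analysis the Inspect API remains the safer route, since it handles format versions and validation.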

Core Data Structures

EvalLog (Top-level)

class EvalLog:
    version: int                    # File format version (currently 2)
    status: str                     # "started", "success", "cancelled", or "error"
    eval: EvalSpec                  # Task, model, creation time
    plan: EvalPlan                  # Solvers and generation config
    results: EvalResults | None     # Aggregate scorer metrics (may be None if the run did not finish)
    stats: EvalStats                # Token usage statistics
    error: EvalError | None         # Error details if status="error"
    samples: list[EvalSample] | None       # Individual sample records (None when read with header_only=True)
    reductions: list[EvalSampleReduction] | None  # Multi-epoch reductions

Always check log.status == "success" before analyzing results.
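
The status check can be wrapped in a small guard so that every analysis entry point applies it the same way. This is a minimal sketch: results_usable and skip_reason are hypothetical helper names, and the SimpleNamespace objects only mimic the EvalLog attribute names listed above (it also assumes the error object exposes a message field).

```python
from types import SimpleNamespace


def results_usable(log) -> bool:
    """True when the run succeeded and aggregate results are present."""
    return log.status == "success" and log.results is not None


def skip_reason(log) -> str:
    """Short explanation of why a log should be excluded from analysis."""
    if log.status == "error" and log.error is not None:
        return f"error: {log.error.message}"
    return f"status: {log.status}"


# Stand-in objects with the same attribute names as EvalLog (illustration only).
ok_log = SimpleNamespace(status="success", results=object(), error=None)
bad_log = SimpleNamespace(
    status="error",
    results=None,
    error=SimpleNamespace(message="rate limit exceeded"),
)

usable = [results_usable(ok_log), results_usable(bad_log)]
reason = skip_reason(bad_log)
```

With a real log, the same helpers would take the object returned by read_eval_log("path/to/file.eval", header_only=True).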

EvalSample (Per-sample data)

class EvalSample:
    id: int | str                   # Unique sample identifier
    epoch: int                      # Epoch number (for multi-epoch runs)
    input: str | list[ChatMessage]  # The prompt/task given to model
    target: str | list[str]         # Expected answer(s)
    choices: list[str] | None       # Multiple choice options if applicable

    # Execution results
    messages: list[ChatMessage]     # Full conversation history
    output: ModelOutput             # Final model output
    scores: dict[str, Score] | None # Scoring results by scorer name

    # Events and state
    events: list[Event]             # Complete transcript of all events
    store: dict[str, Any]           # State at end of execution
    attachments: dict[str, str]     # Referenced content (images, etc.)

    # Metadata
    metadata: dict[str, Any]        # Custom key-value pairs from dataset
    sandbox: SandboxEnvironmentSpec | None  # Sandbox config
    files: list[str] | None         # Files provided to sandbox
    setup: str | None               # Setup script run in sandbox

    # Timing
    started_at: datetime | None
    completed_at: datetime | None
    total_time: float | None        # Wall clock time
    working_time: float | None      # Active processing time

    # Token usage
    model_usage: dict[str, ModelUsage]  # Tokens by model

    # Error/limit info
    error: EvalError | None         # Error that halted sample
    error_retries: list[EvalError] | None  # Retried errors
    limit: EvalSampleLimit | None   # Limit that halted sample (context/time/message/token)
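
The error, limit, and scores fields together describe how a sample ended, so it can help to collapse them into one label before aggregating. The sketch below is one possible way to do that; sample_outcome and score_values are hypothetical helpers, the stand-in objects only mimic EvalSample attribute names, and it assumes the limit object exposes a type field.

```python
from types import SimpleNamespace


def sample_outcome(sample) -> str:
    """Classify how a sample ended: error, limit:<type>, unscored, or scored."""
    if sample.error is not None:
        return "error"
    if sample.limit is not None:
        return f"limit:{sample.limit.type}"
    if not sample.scores:
        return "unscored"
    return "scored"


def score_values(sample) -> dict:
    """Flatten scores to {scorer_name: value}; empty when the sample was not scored."""
    return {name: score.value for name, score in (sample.scores or {}).items()}


# Stand-in samples using the EvalSample attribute names (illustration only).
scored = SimpleNamespace(
    error=None, limit=None, scores={"match": SimpleNamespace(value="C")}
)
limited = SimpleNamespace(
    error=None, limit=SimpleNamespace(type="token"), scores=None
)

outcomes = [sample_outcome(scored), sample_outcome(limited)]
values = score_values(scored)
```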

Event Types (for behavioral analysis)

Events are the core of behavioral analysis. Each sample has an events list containing:

Event = Union[
    SampleInitEvent,    # Sample initialization
    SampleLimitEvent,   # Limit reached
    SandboxEvent,       # Sandbox operations (exec, read_file, write_file)
    StateEvent,         # State changes
    StoreEvent,         # Store updates
    ModelEvent,         # LLM API calls
    ToolEvent,          # Tool invocations
    ApprovalEvent,      # Human approval events
    InputEvent,         # User input
    ScoreEvent,         # Scoring events
    ScoreEditEvent,     # Score modifications
    ErrorEvent,         # Errors
    LoggerEvent,        # Log messages
    InfoEvent,          # Info messages
    SpanBeginEvent,     # Span start (for timing)
    SpanEndEvent,       # Span end
    StepEvent,          # Solver step events
    SubtaskEvent,       # Subtask events
]
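
Each event carries an event field naming its type ("model" and "tool" appear in the class listings in this reference), which makes a quick tally of a transcript straightforward. A minimal sketch with stand-in objects follows; the other type strings used here ("sample_init", "score") are assumptions for illustration.

```python
from collections import Counter
from types import SimpleNamespace


def event_type_counts(events) -> Counter:
    """Count events by their `event` type string."""
    return Counter(e.event for e in events)


def events_of_type(events, kind: str) -> list:
    """Return only the events whose type string equals `kind`."""
    return [e for e in events if e.event == kind]


# Stand-in transcript; real code would pass sample.events instead.
transcript = [
    SimpleNamespace(event=kind)
    for kind in ("sample_init", "model", "tool", "model", "score")
]

counts = event_type_counts(transcript)
model_events = events_of_type(transcript, "model")
```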

ModelEvent (LLM calls)

class ModelEvent:
    event: "model"
    model: str                      # Model name
    input: list[ChatMessage]        # Messages sent to model
    tools: list[ToolInfo]           # Available tools
    tool_choice: ToolChoice         # Tool selection directive
    config: GenerateConfig          # Generation parameters
    output: ModelOutput             # Model response
    retries: int | None             # API retries
    error: str | None               # Error if failed
    cache: Literal["read", "write"] | None  # "read" = served from cache, "write" = stored to cache
    timestamp: datetime             # When call started
    completed: datetime | None      # When call finished
    working_time: float | None      # Processing time
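
The timestamp, completed, and cache fields are enough to estimate per-call latency and cache reuse. The sketch below uses stand-in objects with those attribute names; model_call_seconds and cache_hits are hypothetical helpers, not part of Inspect.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def model_call_seconds(events) -> list:
    """Wall-clock seconds for each finished model call (completed - timestamp)."""
    return [
        (e.completed - e.timestamp).total_seconds()
        for e in events
        if e.event == "model" and e.completed is not None
    ]


def cache_hits(events) -> int:
    """Number of model calls whose response was read from the cache."""
    return sum(1 for e in events if e.event == "model" and e.cache == "read")


# Stand-in model events (illustration only).
start = datetime(2026, 1, 1, 12, 0, 0)
calls = [
    SimpleNamespace(event="model", timestamp=start,
                    completed=start + timedelta(seconds=2.5), cache=None),
    SimpleNamespace(event="model", timestamp=start,
                    completed=start + timedelta(seconds=0.5), cache="read"),
    SimpleNamespace(event="model", timestamp=start, completed=None, cache=None),
]

durations = model_call_seconds(calls)
hits = cache_hits(calls)
```

Calls with no completed time are skipped rather than counted as zero, so an interrupted call does not drag the average down.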

ToolEvent (Tool calls)

class ToolEvent:
    event: "tool"
    id: str                         # Unique tool call ID
    function: str                   # Tool/function name
    arguments: dict[str, JsonValue] # Arguments passed
    result: ToolResult              # Return value
    error: ToolCallError | None     # Error if failed
    truncated: tuple[int, int] | None  # If output was truncated
    timestamp: datetime             # When call started
    completed: datetime | None      # When call finished
    working_time: float | None      # Processing time
    agent: str | None               # Agent name if handoff
    failed: bool | None             # Hard failure flag
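
A common first question about tool use is which tools fail most often. The following sketch computes a per-function error rate from the function and error fields; tool_error_rates is a hypothetical helper and the events are stand-ins that mimic ToolEvent attribute names.

```python
from collections import defaultdict
from types import SimpleNamespace


def tool_error_rates(events) -> dict:
    """Map each tool function name to the fraction of its calls that errored."""
    calls = defaultdict(int)
    errors = defaultdict(int)
    for e in events:
        if e.event != "tool":
            continue
        calls[e.function] += 1
        if e.error is not None:
            errors[e.function] += 1
    return {fn: errors[fn] / n for fn, n in calls.items()}


# Stand-in events (illustration only); the model event is ignored.
events = [
    SimpleNamespace(event="tool", function="bash", error=None),
    SimpleNamespace(event="tool", function="bash", error="timeout"),
    SimpleNamespace(event="tool", function="python", error=None),
    SimpleNamespace(event="model"),
]

rates = tool_error_rates(events)
```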

Score

class Score:
    value: float | str | int | bool | list  # The score value
    answer: str | None              # Model's answer extracted
    explanation: str | None         # Explanation of score
    metadata: dict[str, Any] | None # Additional scoring metadata
    history: list[ScoreEdit]        # Edit history (history[0] = original)
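
Because value can be a number, a bool, or a string, numeric comparisons are only safe after normalizing. Inspect's built-in scorers commonly use the letter grades "C" (correct), "I" (incorrect), "P" (partial), and "N" (no answer); the sketch below is a simplified stand-in for that conversion, not the library's own implementation, and score_to_float is a hypothetical name.

```python
from typing import Optional

# Letter grades mapped to numbers; an assumption modelled on Inspect's defaults.
GRADES = {"C": 1.0, "P": 0.5, "I": 0.0, "N": 0.0}


def score_to_float(value) -> Optional[float]:
    """Convert a score value to a float, or None if it has no numeric reading."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        if value in GRADES:
            return GRADES[value]
        try:
            return float(value)
        except ValueError:
            return None
    return None  # lists and other structured values are left to the caller


converted = [score_to_float(v) for v in ("C", "I", "P", True, 3, "0.25", "maybe")]
```

Returning None for unrecognized values keeps them visible as missing data instead of silently scoring them as zero.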

Dataframe API (Primary Analysis Method)

The inspect_ai.analysis module provides functions to convert logs into Pandas dataframes.

evals_df() - One row per evaluation

from inspect_ai.analysis import evals_df

df = evals_df("logs")  # Read all logs in directory
df = evals_df(["path/to/file1.eval", "path/to/file2.eval"])  # Specific files

Default columns (~51):

  • eval_id - Unique evaluation identifier
  • log - URI of source file
  • task_name, task_version, task_file, task_arg_* - Task info
  • model, model_args, model_generate_config - Model info
  • status, error_message, error_traceback - Completion status
  • score_<scorer>_<metric> - All scores expanded as columns
  • completed_samples, total_samples
  • created, git_commit, tags, metadata_*

Pre-built column groups:

from inspect_ai.analysis import (
    EvalInfo,      # created, tags, metadata, git
    EvalTask,      # task name, file, args, solver
    EvalModel,     # model name, args, generation config
    EvalDataset,   # dataset name, location, sample IDs
    EvalConfig,    # epochs, approval, sample limits
    EvalResults,   # status, errors, samples completed
    EvalScores,    # all scores as separate columns
    EvalColumns,   # all of the above (the default, ~51 columns)
)

samples_df() - One row per sample

from inspect_ai.analysis import samples_df, SampleSummary, SampleScores, SampleMessages

# Fast read (summaries only, 12 columns)
df = samples_df("logs")

# With detailed scores
df = samples_df("logs", columns=SampleSummary + SampleScores)

# With message content (slower, loads full samples)
df = samples_df("logs", columns=SampleSummary + SampleMessages)

SampleSummary columns (default, 12 columns):

  • sample_id - Globally unique identifier
  • eval_id - Links to evaluation
  • id, epoch - Sample ID within eval and epoch number
  • input, target - Task input and expected output
  • metadata_* - Expanded metadata dictionary
  • score_* - Score values only
  • model_usage - Token counts
  • total_time, working_time - Timing data
  • error, limit, retries - Failure info
  • log - Source file URI

SampleScores adds:

  • Score answer, explanation, metadata

SampleMessages adds:

  • Full message content (requires loading full sample)

messages_df() - One row per message

from inspect_ai.analysis import messages_df

# All messages
df = messages_df("logs")

# Filter by role
df = messages_df("logs", filter=["assistant"])
df = messages_df("logs", filter=["user", "assistant"])

# Custom filter function
# (msg.text is used because msg.content may be a list of content parts, not a str)
df = messages_df("logs", filter=lambda msg: "error" in msg.text.lower())

Default columns:

  • sample_id, eval_id - Links to sample and evaluation
  • message_id - Unique message identifier
  • role - user, assistant, system, tool
  • content - Message text
  • source - Origin of message
  • tool_calls - Formatted function calls
  • tool_call_id, tool_call_function, tool_call_error
  • log - Source file URI

events_df() - One row per event

from inspect_ai.analysis import (
    events_df,
    EventInfo,           # event type, span ID
    EventTiming,         # start/end times
    ModelEventColumns,   # model event data
    ToolEventColumns,    # tool event data
)

# Pass the column groups for the event types you need (events are heterogeneous)
df = events_df("logs", columns=EventInfo + EventTiming)

# Filter to specific event types
df = events_df("logs", columns=EventInfo + ToolEventColumns,
               filter=lambda e: e.event == "tool")

EventInfo columns:

  • event_type - Type of event (model, tool, sandbox, etc.)
  • span_id - Span identifier for grouping

EventTiming columns:

  • timestamp - When event started
  • completed - When event finished
  • working_time - Active processing time

Joining Dataframes

Use eval_id and sample_id to join across dataframes:

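evals, samples, messages = evals_df("logs"), samples_df("logs"), messages_df("logs")

# Join evals with samples (both carry a log column, so disambiguate with suffixes)
merged = samples.merge(evals, on='eval_id', suffixes=('', '_eval'))

# Join samples with messages
merged = messages.merge(samples, on='sample_id', suffixes=('', '_sample'))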

# DuckDB integration
import duckdb
con = duckdb.connect()
con.register('evals', evals_df("logs"))
con.register('samples', samples_df("logs"))
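# score_accuracy stands in for your scorer's column (score_<scorer>)
result = con.execute("""
    SELECT e.model, AVG(s.score_accuracy) AS mean_score
    FROM samples s JOIN evals e ON s.eval_id = e.eval_id
    GROUP BY e.model
""").df()  # materialize the query result as a pandas DataFrame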

Data Preparation Functions

from inspect_ai.analysis import prepare, model_info, task_info, frontier

# Add model metadata columns
df = prepare(df, model_info())
# Adds: model_organization_name, model_display_name, model_snapshot,
#       model_release_date, model_knowledge_cutoff_date

# Map task names to display names
df = prepare(df, task_info({"gpqa_diamond": "GPQA Diamond"}))

# Add frontier indicator (requires model_info first)
df = prepare(df, frontier())
# Adds boolean column: was model top-scoring at release date?

Low-level Log Reading API

from inspect_ai.log import (
    read_eval_log,
    read_eval_log_sample,
    read_eval_log_samples,
    read_eval_log_sample_summaries,
    list_eval_logs,
)

# Read complete log
log = read_eval_log("path/to/file.eval")

# Read just header (no samples) - fast for large files
log = read_eval_log("path/to/file.eval", header_only=True)

# Stream samples one at a time (memory efficient)
for sample in read_eval_log_samples("path/to/file.eval"):
    process(sample)

# Get lightweight summaries for filtering
summaries = read_eval_log_sample_summaries("path/to/file.eval")

# Read specific sample
sample = read_eval_log_sample("path/to/file.eval", id="sample_id", epoch=1)

# List all logs in directory
logs = list_eval_logs("./logs", recursive=True)
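
The summary and single-sample readers combine into a two-pass pattern: scan the lightweight summaries first, then load full samples only for the ones that match. In this sketch load_samples_matching is a hypothetical helper that takes the two reader functions as parameters, so it can be exercised here with stand-ins; it assumes each summary exposes id, epoch, and error.

```python
from types import SimpleNamespace


def load_samples_matching(path, predicate, read_summaries, read_sample) -> list:
    """Two-pass read: filter on summaries, then load full samples for matches only."""
    wanted = [(s.id, s.epoch) for s in read_summaries(path) if predicate(s)]
    return [read_sample(path, id=sample_id, epoch=epoch) for sample_id, epoch in wanted]


# Stand-in readers so the sketch runs without a log file (illustration only).
def fake_summaries(path):
    return [
        SimpleNamespace(id="a", epoch=1, error=None),
        SimpleNamespace(id="b", epoch=1, error="boom"),
    ]


def fake_sample(path, id, epoch):
    return {"path": path, "id": id, "epoch": epoch}


errored = load_samples_matching(
    "path/to/file.eval", lambda s: s.error is not None, fake_summaries, fake_sample
)

# With Inspect installed, pass the real readers instead:
#   from inspect_ai.log import read_eval_log_sample, read_eval_log_sample_summaries
#   load_samples_matching(path, predicate,
#                         read_eval_log_sample_summaries, read_eval_log_sample)
```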

CLI Commands

# List logs with filtering
inspect log list --json
inspect log list --status success

# Dump log as JSON
inspect log dump path/to/file.eval

# Convert between formats
inspect log convert file.json --to eval --output-dir ./converted

Common Analysis Patterns

QA Verification

# Check all evaluations completed successfully
evals = evals_df("logs")
failed = evals[evals['status'] != 'success']

# Find samples with errors
samples = samples_df("logs")
errored = samples[samples['error'].notna()]

# Find samples that hit limits
limited = samples[samples['limit'].notna()]

Behavioral Analysis

# Get tool usage patterns
events = events_df("logs", columns=EventInfo + ToolEventColumns,
                   filter=lambda e: e.event == "tool")
# ToolEventColumns are prefixed tool_event_ (e.g. tool_event_function)
tool_counts = events.groupby(['eval_id', 'tool_event_function']).size()

# Analyze message patterns
messages = messages_df("logs")
msg_counts = messages.groupby(['eval_id', 'role']).size().unstack()

# Compare successful vs failed attempts
samples = samples_df("logs")
successful = samples[samples['score_accuracy'] == 1.0]
failed = samples[samples['score_accuracy'] == 0.0]

Cross-model Comparison

evals = evals_df("logs")
by_model = evals.groupby('model').agg({
    'score_accuracy_mean': 'mean',   # substitute your score_<scorer>_<metric> column
    'completed_samples': 'sum'
})

Performance Tips

  • Use SampleSummary (default) for fast reads - it reads per-sample summaries rather than full samples
  • Use parallel=True for large datasets: samples_df("logs", parallel=True)
  • Use header_only=True with read_eval_log() when you don't need samples
  • Stream with read_eval_log_samples() for memory-constrained environments
  • Use strict=False to get partial results: df, errors = evals_df("logs", strict=False)

Install via CLI

npx skills add https://github.com/UKGovernmentBEIS/sandbox_escape_bench --skill inspect-ai