read-eval-logs

star 527

View and analyse Inspect evaluation log files using the Python API. Trigger whenever you need to look at a .eval file yourself without using pre-written scripts.

UKGovernmentBEIS By UKGovernmentBEIS schedule Updated 3/6/2026

name: read-eval-logs description: View and analyse Inspect evaluation log files using the Python API. Trigger whenever you need to look at a .eval file yourself without using pre-written scripts.

Analysing Eval Log Files

This skill covers how to view and analyse Inspect evaluation log files using the Python API and CLI commands.

Quick Reference

CLI Commands

# List all logs in the default log directory (./logs or INSPECT_LOG_DIR)
uv run inspect log list --json

# List logs with specific status
uv run inspect log list --json --status success
uv run inspect log list --json --status error

# List retryable logs (error/cancelled without subsequent success)
uv run inspect log list --json --retryable

# Dump a log file as JSON (works with any format: .eval or .json)
uv run inspect log dump <log_file_path>

# Convert between log formats
uv run inspect log convert source.json --to eval --output-dir log-output
uv run inspect log convert logs/ --to eval --output-dir logs-eval

# Get JSON schema for log files
uv run inspect log schema

Interactive Log Viewer

# Start the interactive log viewer (updates automatically)
uv run inspect view

Python API

Key Imports

from inspect_ai.log import (
    # Listing and reading
    list_eval_logs,
    read_eval_log,
    read_eval_log_sample,
    read_eval_log_samples,
    read_eval_log_sample_summaries,
    
    # Writing
    write_eval_log,
    
    # Utilities
    retryable_eval_logs,
    recompute_metrics,
    resolve_sample_attachments,
    
    # Types
    EvalLog,
    EvalLogInfo,
    EvalSample,
    EvalSampleSummary,
)

Listing Logs

# List all logs in default directory
logs = list_eval_logs()

# List logs in a specific directory
logs = list_eval_logs(log_dir="./experiment-logs")

# Filter by format
logs = list_eval_logs(formats=["eval"])  # Only .eval files

# Filter with a custom function (receives header-only EvalLog)
logs = list_eval_logs(filter=lambda log: log.status == "success")

# Non-recursive listing
logs = list_eval_logs(recursive=False)

Reading Logs

# Read full log
log = read_eval_log("path/to/logfile.eval")

# Read header only (fast, excludes samples)
log = read_eval_log("path/to/logfile.eval", header_only=True)

# Read with attachments resolved
log = read_eval_log("path/to/logfile.eval", resolve_attachments=True)

Reading Samples

# Read a single sample by ID and epoch
sample = read_eval_log_sample("path/to/logfile.eval", id=42, epoch=1)

# Read a single sample by UUID
sample = read_eval_log_sample("path/to/logfile.eval", uuid="sample-uuid")

# Stream all samples (memory efficient - one at a time)
for sample in read_eval_log_samples("path/to/logfile.eval"):
    process(sample)

# Stream samples from incomplete logs
for sample in read_eval_log_samples("path/to/logfile.eval", all_samples_required=False):
    process(sample)

# Read sample summaries (fast, includes scoring info)
summaries = read_eval_log_sample_summaries("path/to/logfile.eval")

Filtering Samples

# Read only samples with errors using summaries for filtering
errors = []
for summary in read_eval_log_sample_summaries(log_file):
    if summary.error is not None:
        errors.append(
            read_eval_log_sample(log_file, summary.id, summary.epoch)
        )

EvalLog Structure

The EvalLog object contains:

Field Type Description
version int File format version (currently 2)
status str "started", "success", "cancelled", or "error"
eval EvalSpec Task, model, creation time, config
plan EvalPlan Solvers and generation config
results EvalResults Aggregate scores and metrics
stats EvalStats Runtime, model usage statistics
error EvalError Error info if status == "error"
samples list[EvalSample] Individual samples (if not header_only)
reductions list[EvalSampleReductions] Multi-epoch reductions
location str URI where log was read from

Always Check Status

log = read_eval_log("path/to/logfile.eval")
if log.status == "success":
    # Safe to analyse results
    for score in log.results.scores:
        print(f"{score.name}: {score.metrics}")

EvalSample Structure

Each sample contains:

Field Type Description
id int | str Unique sample ID
epoch int Epoch number
input str | list[ChatMessage] Sample input
target str | list[str] Expected target(s)
messages list[ChatMessage] Full conversation history
output ModelOutput Model's output
scores dict[str, Score] Scores from scorers
metadata dict[str, Any] Sample metadata
store dict[str, Any] State at end of execution
events list[Event] Transcript events
error EvalError Error if sample failed
total_time float Total sample runtime
model_usage dict[str, ModelUsage] Token usage

Common Analysis Patterns

Get Aggregate Metrics

log = read_eval_log(log_file, header_only=True)
if log.results:
    for score in log.results.scores:
        print(f"Scorer: {score.name}")
        for metric_name, metric in score.metrics.items():
            print(f"  {metric_name}: {metric.value}")

Analyse Failed Samples

log = read_eval_log(log_file)
if log.samples:
    failed = [s for s in log.samples if s.error is not None]
    for sample in failed:
        print(f"Sample {sample.id}: {sample.error.message}")

Extract Model Usage

log = read_eval_log(log_file, header_only=True)
for model, usage in log.stats.model_usage.items():
    print(f"{model}: {usage.input_tokens} in, {usage.output_tokens} out")

Compare Multiple Runs

logs = list_eval_logs(filter=lambda l: l.eval.task == "my_task")
for log_info in logs:
    log = read_eval_log(log_info, header_only=True)
    if log.results and log.results.scores:
        accuracy = log.results.scores[0].metrics.get("accuracy")
        print(f"{log.eval.model}: {accuracy.value if accuracy else 'N/A'}")

Find Retryable Logs

all_logs = list_eval_logs()
retryable = retryable_eval_logs(all_logs)
for log_info in retryable:
    print(f"Can retry: {log_info.name}")

Log File Formats

Type Description
.eval Binary format, ~1/8 size of JSON, fast incremental access
.json Text format, human-readable, slower for large files

Both formats are fully supported by the API and can be intermixed.

Working with Large Logs

For logs too large to fit in memory:

  1. Use .eval format - supports compression and incremental access
  2. Read header only - read_eval_log(log_file, header_only=True)
  3. Stream samples - read_eval_log_samples() yields one at a time
  4. Use summaries - read_eval_log_sample_summaries() for quick overview

Modifying Logs

# Read, modify, and write back
log = read_eval_log(log_file)
# ... modify log ...
write_eval_log(log)  # Uses log.location automatically

# Write to a new location
write_eval_log(log, location="new_path.eval")

# Recompute metrics after score edits
recompute_metrics(log)
write_eval_log(log)

Environment Variables

Variable Description
INSPECT_LOG_DIR Default log directory (default: ./logs)
INSPECT_EVAL_LOG_FILE_PATTERN Log filename pattern (e.g., {task}_{model}_{id})

Related Tools

  • Inspect Scout - Transcript analysis
  • Inspect Viz - Data visualization
  • Log Dataframes - Extract pandas DataFrames from logs (see inspect_ai.analysis)
Install via CLI
npx skills add https://github.com/UKGovernmentBEIS/inspect_evals --skill read-eval-logs
Repository Details
star Stars 527
call_split Forks 349
navigation Branch main
article Path SKILL.md
More from Creator
UKGovernmentBEIS
UKGovernmentBEIS Explore all skills →