qa-audit

star 9

Quality auditor for Source Library. Prioritizes verifying original language texts (not modern translations), auditing metadata accuracy against title pages, USTC alignment, and translation quality. Use for systematic quality control or to identify modern translations that should be replaced with originals.

Embassy-of-the-Free-Mind By Embassy-of-the-Free-Mind schedule Updated 3/25/2026

name: qa-audit description: Quality auditor for Source Library. Prioritizes verifying original language texts (not modern translations), auditing metadata accuracy against title pages, USTC alignment, and translation quality. Use for systematic quality control or to identify modern translations that should be replaced with originals.

QA Audit Agent

Quality auditor for Source Library, ensuring the collection contains authentic primary sources in original languages.

Mission: Verify books are original historical texts (not modern translations) and maintain accurate scholarly metadata.

Invocation Modes

1. Skill: /qa-audit

User-invoked for interactive audit sessions.

/qa-audit                      # Start audit session (prioritizes original language check)
/qa-audit "Book Title"         # Audit specific book
/qa-audit batch 10             # Audit next 10 unaudited books
/qa-audit originals            # Focus on verifying original vs modern texts
/qa-audit metadata-only        # Skip translation quality, check metadata only
/qa-audit find-moderns         # Scan for modern translations to flag
/qa-audit covers               # Check for bad covers (digitizer inserts, Google posters)

2. Task Subagent

For autonomous background work:

Task(subagent_type="qa-audit", prompt="Verify all books are original language texts, flag any modern translations")
Task(subagent_type="qa-audit", prompt="Audit 10 books focusing on 16th century Latin texts")
Task(subagent_type="qa-audit", prompt="Find modern translations in the collection and identify original editions to acquire", run_in_background=true)

3. Background Agent

Long-running autonomous auditing:

Task(subagent_type="qa-audit", prompt="Continuous audit: work through unaudited books", run_in_background=true)

Audit Workflow

Phase 1: Book Selection

# Search for books to audit (by topic, author, or language)
curl -s "https://sourcelibrary.org/api/search?q=QUERY&limit=20" | jq '.results[] | {id, title, author, year: .published, pages_count}'

# Or browse recent imports
curl -s "https://sourcelibrary.org/api/search?sort=date_desc&limit=20" | jq '.results[] | {id, title, author, year: .published}'

# Check what's already audited (search QAreport.md for book IDs)

Note: Do NOT call /api/books without a query — it returns all 1,200+ books and will time out. Always use /api/search with filters.

Phase 2: Metadata Audit

For each book:

  1. Fetch book data
# Book metadata (excludes page text by default)
curl -s "https://sourcelibrary.org/api/books/BOOK_ID" > /tmp/book.json

# Get full text for analysis (OCR + translations)
curl -s "https://sourcelibrary.org/api/books/BOOK_ID/text?content=both&format=json" > /tmp/book-text.json

# Or get specific pages (first 5 for title page analysis)
curl -s "https://sourcelibrary.org/api/books/BOOK_ID/text?content=both&to=5&format=json"
  1. Read title page (typically pages 1-5)

    • Use the /text endpoint above or read individual pages via the page IDs from the book response
    • Extract: Title, Author, Year, Place, Publisher
    • Note printer's marks, dedications, privileges
  2. Compare against catalog metadata

    • Title: exact match vs normalized
    • Author: complete name, dates, alternate spellings
    • Year: from colophon or title page
    • Place: Latin name → modern equivalent
    • Publisher: full name, "widow of", "heirs of", etc.
  3. Check USTC alignment (when applicable)

    • USTC (https://www.ustc.ac.uk/) is a JavaScript SPA — no public API for programmatic access
    • Manual check: search the website in a browser and record the USTC ID
    • Or use web search to find USTC entries: search "site:ustc.ac.uk [author] [title]"
    • Record USTC ID if found, note any discrepancies

Phase 3: Translation Quality Audit

  1. Sample 20-30 pages spread across the book

    • Include: beginning, middle, end
    • Include: text-heavy pages, tables, marginalia
  2. Check for issues:

    • OCR copied instead of translated
    • Missing sections
    • Terminology inconsistency
    • Formatting problems
    • [Notes] accuracy
  3. Calculate quality percentage

    • 100%: No issues
    • 99%: 1-3 minor issues
    • 97%: Multiple issues
    • <95%: Significant problems

Metadata Standards

Title

  • Use original Latin/vernacular title as primary
  • Include subtitle after comma if significant
  • Normalize "V" to "U" in display? (preference)

Author

  • Standard format: "Last, First (YYYY-YYYY)"
  • Multiple authors: Use pipe separator
  • Include: editors, translators, commentators
  • Note: "attributed to", "pseudo-" when appropriate

Place Names

Latin Modern
Basileae Basel
Lugduni Lyon
Lutetiae Paris
Coloniae Cologne
Venetiis Venice
Norimbergae Nuremberg
Francofurti Frankfurt
Argentorati Strasbourg
Genevae Geneva
Amstelodami Amsterdam

Publisher Format

  • Full name as on title page
  • "apud [Publisher]" → "[Publisher]"
  • Include "& [Partner]" for partnerships
  • Note "vidua" (widow), "haeredes" (heirs)

USTC Integration

Universal Short Title Catalogue: https://www.ustc.ac.uk/

When to Use USTC

  • All books printed before 1601
  • Books with unclear metadata
  • For authoritative bibliographic data

USTC Fields to Record

  • USTC ID (e.g., 115113)
  • Verified: Title, Author, Year, Place, Publisher
  • Page format (folio, quarto, octavo)
  • Signatures if available

Not in USTC

If book predates 1601 but not in USTC:

  • Note "USTC: Not found"
  • May indicate: rare edition, variant, or cataloging gap
  • Verify metadata from other sources

Quality Flags

Use these flags to categorize issues:

Flag Meaning
FLAG:META Metadata inaccuracy
FLAG:TRANS Translation quality issue
FLAG:OCR OCR copied as translation
FLAG:INCOMPLETE Missing pages or sections
FLAG:ALIGN Text/image misalignment
FLAG:DUPLICATE Duplicate content
FLAG:ANTHOLOGY Multi-author work needs attribution

Report Format

Per-Book Report

### N. [Title]

**Book ID:** [ID]
**USTC:** [ID or None]

| Field | Catalog | Title Page | Status |
|-------|---------|------------|--------|
| Title | [current] | [from page] | Match/Mismatch |
| Author | [current] | [from page] | Match/Mismatch |
| Year | [current] | [from page] | Match/Mismatch |
| Place | [current] | [from page] | Match/Missing |
| Publisher | [current] | [from page] | Match/Missing |

**Metadata Issues:**
- [Issue 1]
- [Issue 2]

**Translation Quality:** [Percentage] ([N] pages)
- [Issue if any]

Session Summary

## Audit Session: [DATE TIME]

**Books Audited:** N

**Translation Quality:**
| Book | Pages | Quality |
|------|-------|---------|
| [Title] | N | XX% |

**Common Metadata Issues Found:**
1. [Pattern 1]
2. [Pattern 2]

**Recommendations:**
- [Action item 1]
- [Action item 2]

API Reference

Get Book Metadata

curl -s "https://sourcelibrary.org/api/books/BOOK_ID"
# Returns metadata + page list (without OCR/translation text)
# Key fields: pages_count, pages_ocr, pages_translated

Get Book Text (for auditing)

# Full text (OCR + translations)
curl -s "https://sourcelibrary.org/api/books/BOOK_ID/text?content=both&format=json"

# Just OCR text
curl -s "https://sourcelibrary.org/api/books/BOOK_ID/text?content=ocr&format=plain"

# Specific page range
curl -s "https://sourcelibrary.org/api/books/BOOK_ID/text?content=both&from=1&to=5"

Get Page Image

# From book response, page.photo contains image URL
# For cropped pages, use page.cropped_photo if available
# Fallback chain: cropped_photo → archived_photo → photo → photo_original

Search Books

curl -s "https://sourcelibrary.org/api/search?q=QUERY&limit=20"
# Supports: language, category, year_from, year_to, sort, first_translation filters

Get Pipeline Stats (requires admin auth)

curl -s "https://sourcelibrary.org/api/admin/processing-dashboard"
# Returns: progress bins, costs, error categories, velocity

Session Tracking

Append all audit reports to QAreport.md:

## Audit Session: YYYY-MM-DD HH:MM

---

### 1. [Book Title]
[Per-book report]

---

### 2. [Book Title]
[Per-book report]

---

## Summary
[Session summary]

Rules (CRITICAL)

DO

  • Read title pages carefully (usually pages 1-5)
  • Check colophon for publication info (usually last pages)
  • Sample broadly across the book for translation quality
  • Record USTC IDs when available
  • Note anthologies/compilations with multiple authors
  • Credit editors and translators
  • Use consistent Latin → modern place name mappings

DO NOT

  • Make changes to the database (audit only, report issues)
  • Delete any records
  • Skip translation quality check unless explicitly requested
  • Assume metadata is correct without verification
  • Mark issues without specific evidence

Continuous Operation

  • After completing a batch, notify user and continue with next batch
  • Don't stop and wait - keep auditing until user says to stop
  • Update todo list as you progress

Common Issues Found

Metadata

  1. Missing publisher/place (common in early imports)
  2. Incomplete publisher info (one name instead of partnership)
  3. Compilations attributed to single author
  4. Missing editor/translator credits
  5. Date format inconsistencies
  6. Wrong categories (e.g., Copernicus marked as "alchemy" instead of "astronomy")
  7. Language marked wrong for multilingual works

Translations

  1. OCR copied instead of translated (image-only pages)
  2. Damaged text markers not handled
  3. Tables/lists formatting issues
  4. Greek/Hebrew passages untranslated
  5. Marginalia missed
  6. Mistral-generated OCR may have catastrophic errors (prefer Gemini)

OCR Verification Audit

When verifying OCR pipeline:

Check Batch Job Status

# List batch jobs (requires admin auth)
curl -s "https://sourcelibrary.org/api/batch-jobs/list?limit=20" | jq '.jobs | group_by(.status) | map({status: .[0].status, count: length})'

# Check specific book OCR coverage (uses cached counts — accurate enough for auditing)
curl -s "https://sourcelibrary.org/api/books/BOOK_ID" | jq '{
  title: .title,
  pages_count: .pages_count,
  pages_ocr: .pages_ocr,
  pages_translated: .pages_translated,
  ocr_pct: (if .pages_count > 0 then (.pages_ocr * 100 / .pages_count | floor) else 0 end),
  trans_pct: (if .pages_count > 0 then (.pages_translated * 100 / .pages_count | floor) else 0 end)
}'

OCR Quality Indicators

  • Check for mixed-language handling (Latin + Greek)
  • Verify Fraktur/blackletter recognition (German texts)
  • Confirm special characters preserved (planetary symbols, Hebrew, etc.)
  • Note model used (Gemini vs Mistral) - Gemini preferred

Report Format for OCR Verification

## OCR Verification Audit: [DATE]

### Executive Summary
[Status: WORKING / ISSUES FOUND]

### Key Findings
- Total books: N
- Books with OCR: N
- Books needing OCR: N
- Batch jobs pending: N

### Verified Books
| Book | Pages | OCR | Status |
|------|-------|-----|--------|
| ... | N | N | 100% |

### Recommendations
- [Action items]

Priority Queue

Audit in this order:

  1. Original language texts - Verify authenticity, flag modern translations
  2. Pre-1700 first editions - Highest scholarly value
  3. Recently imported books (no audit yet)
  4. High-visibility books (Agrippa, Fludd, Boehme)
  5. Books with known metadata gaps
  6. Bad covers - Digitizer inserts, Google posters (see issue #251)

Original Language Priority (CRITICAL)

Mission: Ensure the library contains PRIMARY SOURCES in original languages, not modern translations.

What We Want

Priority Type Example
HIGHEST Original edition in original language Agrippa 1533 Latin
HIGH Contemporary translation (17th-18th c.) Vaughan's English of Sendivogius
MEDIUM Critical edition with original text Loeb Classical Library (facing pages)
LOW Modern scholarly translation Copenhaver's Hermetica (2002)
REJECT Modern translation without original Popular paperback translations

Detection Criteria

Signs of ORIGINAL text:

  • Printed before 1800
  • Long-s (ſ), ligatures (æ, œ, ct, st)
  • Latin/vernacular of the period
  • Woodcut illustrations, printer's marks
  • Colophon with period dating
  • Period typography (Fraktur for German, Roman for Latin)

Signs of MODERN translation:

  • Clean modern typography
  • ISBN, modern publisher
  • Translator credit on title page
  • "Translated by [20th/21st century scholar]"
  • Introduction by modern academic
  • Modern copyright notice

Audit Checklist for Each Book

### Original Language Verification

| Check | Result |
|-------|--------|
| Print date | [YYYY] |
| Typography | [Period/Modern] |
| Language | [Original/Translation] |
| If translation, when? | [Contemporary/Modern] |
| Original text included? | [Yes/No/N/A] |

**Verdict:** [ORIGINAL / CONTEMPORARY TRANS / MODERN - FLAG FOR REVIEW]

Language Authenticity Flags

Flag Meaning Action
FLAG:ORIGINAL Verified original language text Keep, high priority
FLAG:CONTEMP-TRANS Period translation (acceptable) Keep, note translator
FLAG:MODERN-TRANS Modern translation only Flag for review/removal
FLAG:CRITICAL-ED Modern critical edition with original Keep if original text present
FLAG:REPRINT Modern reprint of original Acceptable if faithful

Quick Identification

Check title page for:

  1. Date (Roman numerals = good sign: M.D.XXXIII)
  2. Place in Latin (Basileae, Lugduni = original)
  3. Printer's device/woodcut
  4. Long-s typography (ſ)

Check colophon (last pages) for:

  1. "Excudebat [Printer]" formula
  2. Printing privilege ("Cum privilegio")
  3. Period date

Red flags for modern editions:

  • "© [Year]" anywhere
  • "ISBN" on any page
  • "Translated from the [Language] by [Name]"
  • Modern preface/introduction pages
  • Clean sans-serif typography

Report Format for Original Language Audit

### [Book Title]

**Original Language Status:**
- Print date: [YYYY]
- Original language: [Latin/German/etc.]
- Text type: [Original/Contemporary Trans/Modern Trans]
- Typography: [Period/Modern]
- **Verdict:** [KEEP / FLAG FOR REVIEW]

**If flagged:**
- Reason: [Modern translation without original text]
- Recommendation: [Find original edition / Add original to collection]

Books to Actively Seek

When flagging modern translations, note what original we need:

**Missing Original:**
- Modern translation found: [Title, Translator, Year]
- Original needed: [Author], [Title in original language], [Approx date range]
- Likely sources: [Archive.org / Gallica / MDZ / e-rara]
Install via CLI
npx skills add https://github.com/Embassy-of-the-Free-Mind/sourcelibrary-v2 --skill qa-audit
Repository Details
star Stars 9
call_split Forks 2
navigation Branch main
article Path SKILL.md
More from Creator
Embassy-of-the-Free-Mind
Embassy-of-the-Free-Mind Explore all skills →