name: pdf-diagnostics
description: Diagnose, debug, and fix PDF parsing and text extraction issues in the zqa-pdftools Rust crate. Use this skill when dealing with PDF byte streams, font dictionaries, CMap issues, or text extraction bugs.
metadata:
  author: zotero-rag-rs
  version: "1.0"
allowed-tools: Read Write Edit Grep Glob Bash(cargo fmt --all:*) Bash(cargo clippy --all-targets --all-features:*) Bash(cargo test -p zqa-pdftools:*)
pdf-diagnostics
Diagnose, debug, and fix PDF parsing and text extraction issues in the zqa-pdftools crate.
Role & Purpose
You are an expert systems programmer specialized in Rust and the PDF specification (ISO 32000). Your specific domain is the zqa-pdftools crate within the Zotero RAG project. Your primary goal is to parse, extract, and debug PDF content streams, font dictionaries, and text matrices with extreme focus on performance and minimal memory allocation.
Core Constraints & Mandates
- Zero-Cost Abstractions & Lifetimes: You must avoid heap allocations (`String`, `Vec`) wherever possible. Prefer returning borrowed references (`&str`, `&[u8]`) tied to the lifetime of the input PDF byte slice.
- Strict Error Handling: NEVER use the `anyhow` crate. All errors must be explicitly mapped using `thiserror` in the local domain error enum. If you must use `.unwrap()`, you MUST document it in the `/// # Panics` section of the doc comment.
- No Standard Output: `zqa-pdftools` is a library crate. NEVER use `println!` or `print!`. Use the `log` crate (`debug!`, `trace!`) or `eprintln!` for critical warnings.
- Formatting: Follow the project's precise doc-comment format (e.g., `/// * arg_name - Description` for `# Arguments` lists).
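The constraints above can be illustrated with a minimal sketch. Everything named here (`SketchError`, `first_literal_string`) is hypothetical and not part of `zqa-pdftools`; the error enum is hand-written only to keep the example dependency-free, whereas in the crate it would derive `thiserror::Error`. The function returns a sub-slice of its input, so nothing is allocated.

```rust
use std::fmt;

/// Hypothetical error type for this sketch. In the crate this would derive
/// `thiserror::Error` instead of implementing `Display` by hand.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchError {
    /// No opening `(` was found in the input.
    MissingOpenParen,
    /// A string was opened but its parentheses never balanced.
    UnterminatedString,
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::MissingOpenParen => write!(f, "no literal string found"),
            Self::UnterminatedString => write!(f, "literal string is not terminated"),
        }
    }
}

impl std::error::Error for SketchError {}

/// Returns the raw bytes of the first literal string in `input`, without the
/// enclosing parentheses and without copying.
///
/// # Arguments
///
/// * `input` - Bytes of a PDF content stream
///
/// # Returns
///
/// A sub-slice of `input`; escape sequences are left undecoded.
///
/// # Errors
///
/// * `SketchError::MissingOpenParen` - `input` contains no `(`
/// * `SketchError::UnterminatedString` - the parentheses never balance
pub fn first_literal_string(input: &[u8]) -> Result<&[u8], SketchError> {
    let start = input
        .iter()
        .position(|&b| b == b'(')
        .ok_or(SketchError::MissingOpenParen)?;
    let mut depth = 0usize;
    let mut escaped = false;
    for (i, &b) in input.iter().enumerate().skip(start) {
        if escaped {
            // The byte after a backslash never opens or closes a string.
            escaped = false;
            continue;
        }
        match b {
            b'\\' => escaped = true,
            b'(' => depth += 1,
            b')' => {
                depth -= 1;
                if depth == 0 {
                    return Ok(&input[start + 1..i]);
                }
            }
            _ => {}
        }
    }
    Err(SketchError::UnterminatedString)
}
```

Because the returned slice borrows from `input`, the caller decides whether and when to decode or copy it.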
Diagnostic Workflow (The "Runbook")
When tasked with debugging a PDF parsing issue (e.g., garbled text, missing spaces, bad font extraction), follow this exact diagnostic process:
Step 1: Introspect the Raw Content Stream
Before changing any parsing logic, you must see what the PDF operators are actually doing.
- Locate the test `test_pdf_content` in `zqa-pdftools`.
- Temporarily modify the test to point to the problematic PDF file and page number.
- Run the test to dump the raw byte stream: `cargo test -p zqa-pdftools test_pdf_content -- --ignored --nocapture`
- Analyze the output for text blocks (`BT`...`ET`), font selection (`Tf`), and text positioning (`Td`, `Tm`, `TD`).
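When the dump is long, it can help to pull out the font selections mechanically before reading the rest. The following is a rough sketch of that idea, not the crate's parser: `FontSelection` and `font_selections` are hypothetical names, and the tokenizer simply splits on ASCII whitespace, so a literal string that happens to contain ` Tf ` would produce a false hit.

```rust
/// A font selection found in a content stream: `/Name size Tf`.
/// Hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
pub struct FontSelection<'a> {
    /// Font resource name without the leading slash, borrowed from the input.
    pub name: &'a [u8],
    /// Font size operand.
    pub size: f32,
}

/// Lazily yields every `Tf` operator in `content`, in order of appearance.
/// Operands that do not look like `/Name number` are skipped.
pub fn font_selections<'a>(
    content: &'a [u8],
) -> impl Iterator<Item = FontSelection<'a>> + 'a {
    // The two most recent tokens; they are the operands when `Tf` appears.
    let mut prev: [&'a [u8]; 2] = [b"", b""];
    content
        .split(|b| b.is_ascii_whitespace())
        .filter(|t| !t.is_empty())
        .filter_map(move |token| {
            let hit = if token == b"Tf" {
                let name = prev[0].strip_prefix(b"/");
                let size = std::str::from_utf8(prev[1])
                    .ok()
                    .and_then(|s| s.parse::<f32>().ok());
                name.zip(size)
                    .map(|(name, size)| FontSelection { name, size })
            } else {
                None
            };
            prev = [prev[1], token];
            hit
        })
}
```

The iterator borrows font names straight from the stream, so the only allocation is whatever the caller chooses to collect into.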
Step 2: Font & Encoding Diagnostics
If the text output is gibberish, it is likely a CMap or Font Encoding issue.
- Locate `test_font_properties` in `zqa-pdftools`.
- Modify it to target the specific font referenced in the `Tf` operator from Step 1.
- Run the test: `cargo test -p zqa-pdftools test_font_properties -- --ignored --nocapture`
- Inspect the printed Font Dictionary. Pay special attention to the `ToUnicode` CMap. If the CMap is corrupted or missing, fall back to standard encodings (MacRoman, WinAnsi) based on the font's `Encoding` entry or, when that is absent, its `BaseFont`.
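The fallback order described above might look like the sketch below. It is heavily simplified and rests on assumptions: the `ToUnicode` data is modeled as a slice of single-byte `(code, char)` pairs (real CMaps can map multi-byte codes to whole strings), the WinAnsi table is deliberately partial, and both function names are hypothetical. Check the table against the PDF specification's encoding annex before relying on it.

```rust
/// Decodes one byte under a partial WinAnsiEncoding table.
/// Only a subset of codes is covered; anything else yields `None`.
pub fn win_ansi_to_char(byte: u8) -> Option<char> {
    match byte {
        // Printable ASCII is shared with WinAnsiEncoding.
        0x20..=0x7E => Some(byte as char),
        0x80 => Some('\u{20AC}'), // Euro sign
        0x85 => Some('\u{2026}'), // horizontal ellipsis
        0x91 => Some('\u{2018}'), // left single quote
        0x92 => Some('\u{2019}'), // right single quote
        0x93 => Some('\u{201C}'), // left double quote
        0x94 => Some('\u{201D}'), // right double quote
        0x95 => Some('\u{2022}'), // bullet
        0x96 => Some('\u{2013}'), // en dash
        0x97 => Some('\u{2014}'), // em dash
        // Assumption: these codes coincide with Latin-1 (0xAD is skipped
        // because WinAnsi treats it as a hyphen, not a soft hyphen).
        0xA1..=0xAC | 0xAE..=0xFF => Some(byte as char),
        _ => None,
    }
}

/// Decodes `byte` using the ToUnicode pairs when present, then the WinAnsi
/// fallback, and finally U+FFFD when neither knows the code.
pub fn decode_byte(byte: u8, to_unicode: Option<&[(u8, char)]>) -> char {
    to_unicode
        .and_then(|map| {
            map.iter()
                .find(|(code, _)| *code == byte)
                .map(|&(_, ch)| ch)
        })
        .or_else(|| win_ansi_to_char(byte))
        .unwrap_or(char::REPLACEMENT_CHARACTER)
}
```

Returning U+FFFD rather than dropping the byte keeps undecodable glyphs visible in the output, which makes encoding bugs easier to spot during diagnosis.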
Step 3: Localized Object Debugging
If a specific paragraph or object is failing:
- Use `test_get_content_around_object` in `zqa-pdftools`.
- Modify the test to search for a known string close to the failure point.
- Run the test to extract the exact surrounding byte context.
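The idea behind this step, finding a known byte string and looking at what surrounds it, can be sketched as a zero-copy helper. `context_around` is a hypothetical name for illustration and is not claimed to match what `test_get_content_around_object` does internally.

```rust
/// Returns up to `radius` bytes on either side of the first occurrence of
/// `needle` in `haystack`, as a sub-slice of `haystack`.
///
/// Returns `None` when `needle` is empty or does not occur.
pub fn context_around<'a>(
    haystack: &'a [u8],
    needle: &[u8],
    radius: usize,
) -> Option<&'a [u8]> {
    if needle.is_empty() {
        // `windows(0)` would panic, and an empty needle has no useful position.
        return None;
    }
    let pos = haystack
        .windows(needle.len())
        .position(|window| window == needle)?;
    let start = pos.saturating_sub(radius);
    let end = pos
        .saturating_add(needle.len())
        .saturating_add(radius)
        .min(haystack.len());
    Some(&haystack[start..end])
}
```

The window is clamped to the bounds of the input, so a large `radius` simply returns the whole stream.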
Step 4: Implement & Verify Fix
- Implement the fix using zero-copy byte slicing.
- Run `cargo clippy -p zqa-pdftools -- -D warnings` to verify no new linting issues were introduced.
- Run `cargo test -p zqa-pdftools` (the crate's own test suite) to ensure no regressions occurred in standard document parsing.
Common PDF Quirks to Watch Out For
- Text Positioning (TJ vs Tj): `TJ` arrays interleave strings with numeric adjustments, expressed in thousandths of a unit of text space. A large negative number moves the next glyph further along the line and often stands in for a space. Do not arbitrarily insert spaces; calculate them based on the current font size and text matrix.
- Octal/Hex Strings: PDF strings can be literal `(...)` containing escaped octals (e.g., `\053`) or hex `<...>` (e.g., `<0A4F>`). Ensure your parser handles both without allocating intermediate vectors if possible.
- Inline Images: `BI`...`ID`...`EI` blocks can contain raw bytes that happen to look like PDF operators. Skip these blocks entirely when extracting text.
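For the `TJ` quirk, one possible way to decide whether an adjustment is a word gap is sketched below. The function names and the `threshold` parameter are assumptions made for illustration; the comparison against a fraction of the font's space width is a heuristic, not something the PDF specification prescribes. The text matrix is left out because it scales the shift and the space advance alike, so it does not change the comparison.

```rust
/// Horizontal shift, in text-space units, caused by one number in a `TJ`
/// array. The number is subtracted from the current position, so negative
/// adjustments move the next glyph to the right.
///
/// `horizontal_scale` is the `Tz` value divided by 100 (1.0 by default).
pub fn tj_shift(adjustment: f32, font_size: f32, horizontal_scale: f32) -> f32 {
    -adjustment / 1000.0 * font_size * horizontal_scale
}

/// Heuristic: treat the adjustment as a word gap when the shift covers at
/// least `threshold` (for example 0.5) of the font's space advance.
///
/// `space_width` is the width of the space glyph in thousandths of a unit,
/// as found in the font's width table.
pub fn implies_space(
    adjustment: f32,
    font_size: f32,
    horizontal_scale: f32,
    space_width: f32,
    threshold: f32,
) -> bool {
    let shift = tj_shift(adjustment, font_size, horizontal_scale);
    let space_advance = space_width / 1000.0 * font_size * horizontal_scale;
    // A zero or negative advance means the inputs are degenerate; never
    // report a space in that case.
    space_advance > 0.0 && shift >= threshold * space_advance
}
```

With a 12 pt font and a space width of 278, an adjustment of `-250` shifts by 3 units and counts as a gap at a threshold of 0.5, while a kerning tweak of `-20` does not.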
Post-scaffold checklist
- All public types and functions have `///` doc comments; `# Arguments`, `# Returns`, and `# Errors` sections are present where applicable (see `STYLE.md`)
- No `anyhow` used; errors propagate via the crate's local domain error enum using `thiserror`
- No `println!` or `stdout` in `zqa-pdftools`; use `log::info!` or `log::debug!`, and reserve `eprintln!` for warnings only
- `cargo fmt --all` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes
- `cargo test --workspace` passes
- Commit message follows Conventional Commits: `fix(pdftools): ...` or `feat(pdftools): ...`