name: processing-invoices description: Extracts, validates, and structures data from PDF invoices with automated validation and error correction loops. Use when processing invoice PDFs, extracting billing data, batch processing invoices, or when accuracy is critical.
Invoice Processing
Workflow
Invoice Processing:
- [ ] Step 1: Log start time
- [ ] Step 2: Extract PDF text
- [ ] Step 3: Parse invoice fields
- [ ] Step 4: Validate (run validate_invoice.py)
- [ ] Step 5: Fix errors and re-validate if needed
- [ ] Step 6: Save final output AND eval log
Step 1: Log start time
Record the start time for eval tracking:
from datetime import datetime
start_time = datetime.now().isoformat()
Step 2: Extract text
from pypdf import PdfReader
reader = PdfReader("invoice.pdf")
text = ""
for page in reader.pages:
page_text = page.extract_text()
if page_text:
text += page_text + "\n"
Step 3: Parse fields
Extract from text:
- vendor: Company name (usually at document top, in larger font)
- invoice_number: Look for "Invoice #", "Invoice No.", "INV-", "#"
- date: Invoice/billing date -> convert to YYYY-MM-DD
- total: Final amount ("Total:", "Amount Due:", "Balance:")
- currency: Default USD if not specified
Step 4: Validate
Run: python scripts/validate_invoice.py output.json
Step 5: Fix and re-validate
If validation fails:
- Read the specific error message
- Re-examine PDF text for that field
- Update JSON with corrected value
- Run validation again
- Repeat until validation passes
Common issues: See TROUBLESHOOTING.md
Step 6: Save results
Save two files:
- Output file (requested by user):
{
"vendor": "...",
"invoice_number": "...",
"date": "YYYY-MM-DD",
"total": 0.00
}
- Eval log (always append to
eval_results/all_evals.jsonl):
python scripts/collect_eval.py "<task_id>" "<original_task_prompt>" "<output_file>" "<notes>"
Example:
python scripts/collect_eval.py "invoice-basic" "Extract invoice data from invoice.pdf" "output.json" "validation passed on first attempt"
Always append to eval_results/all_evals.jsonl (one JSON per line) if it exists.
Output format
{
"vendor": "Company Name",
"invoice_number": "INV-2025-001",
"date": "2025-01-15",
"total": 1250.00,
"currency": "USD",
"line_items": []
}
Validation rules
See VALIDATION.md for complete rules.