name: "path-metadata-heuristics"
description: "Heuristics for extracting book metadata from aithena library paths"
domain: "backend, indexing"
confidence: "high"
source: "earned — Parker Phase 1 Solr indexer rewrite, validated during lister/indexer bugfixes and 169-file real library indexing (Sessions 2–3)"
Context
Use this when deriving title, author, year, and category from book paths before indexing into Solr.
Patterns
Honor explicit filename structure first
- `Author - Title (Year).pdf` → author/title/year from the filename
- `Category/Author - Title (Year).pdf` → category from folder, author/title/year from filename
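
The two filename shapes above could be matched with a single regular expression on the filename stem. This is a minimal sketch, not the indexer's actual code: the names `FILENAME_PATTERN` and `parse_explicit_filename` are hypothetical, and it assumes POSIX-style relative paths and a four-digit year.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical pattern for a stem shaped like "Author - Title (Year)".
FILENAME_PATTERN = re.compile(
    r"^(?P<author>.+?)\s+-\s+(?P<title>.+?)\s*\((?P<year>\d{4})\)$"
)


def parse_explicit_filename(path: str) -> Optional[dict]:
    """Return author/title/year/category, or None if the stem does not match."""
    p = PurePosixPath(path)
    match = FILENAME_PATTERN.match(p.stem)
    if match is None:
        return None
    folders = p.parent.parts  # empty tuple when the file has no parent folder
    return {
        "author": match["author"].strip(),
        "title": match["title"].strip(),
        "year": int(match["year"]),
        # Assumption: the first folder, when present, is the category.
        "category": folders[0] if folders else None,
    }
```

Returning `None` on a non-match lets the caller fall through to the folder-based heuristics instead of guessing.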
Use folder depth to separate category vs author
- `Category/Author/Title.pdf` → first folder is category, second folder is author
- `Author/Title.pdf` → parent folder is author when the filename does not look like a series/journal issue
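
A rough sketch of the folder-depth rule follows. The function name `metadata_from_folders` and the `ISSUE_LIKE` test are assumptions made for illustration; in particular, "looks like a series/journal issue" is approximated here as "contains a four-digit year or ends in a number", which may be too crude for a real library.

```python
import re
from pathlib import PurePosixPath

# Assumption: an issue-like stem contains a 4-digit year or ends in "_NN" / " NN".
ISSUE_LIKE = re.compile(r"\d{4}|[_\s]\d+$")


def metadata_from_folders(path: str) -> dict:
    """Assign category/author from folder depth; the title is the raw stem."""
    p = PurePosixPath(path)
    folders = p.parent.parts
    result = {"category": None, "author": None, "title": p.stem}
    if len(folders) >= 2:
        # Category/Author/Title.pdf
        result["category"], result["author"] = folders[0], folders[1]
    elif len(folders) == 1:
        if ISSUE_LIKE.search(p.stem):
            # A single folder holding issue-like files is treated as a category.
            result["category"] = folders[0]
        else:
            # Author/Title.pdf
            result["author"] = folders[0]
    return result
```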
Handle real aithena library cases
- `amades/Auca ... amades.pdf` → treat `amades` as author and strip the repeated author suffix from the title
- `balearics/ESTUDIS_BALEARICS_01.pdf` → treat `balearics` as category, keep the filename as title text, and use `author="Unknown"`
- `bsal/Bolletí ... 1885 - 1886.pdf` → treat `bsal` as category; year ranges are metadata, not `Author - Title` separators
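
For the `amades` case, stripping the repeated author suffix might look like the sketch below. The helper name `strip_author_suffix` is hypothetical, and it assumes the author name appears at the very end of the title, preceded by whitespace, an underscore, or a hyphen.

```python
import re


def strip_author_suffix(title: str, author: str) -> str:
    """Remove a trailing repeat of the author name from a title."""
    # Assumption: the suffix is separated from the title by space, "_" or "-".
    pattern = re.compile(r"[\s_\-]+" + re.escape(author) + r"$", re.IGNORECASE)
    stripped = pattern.sub("", title).strip()
    # Never return an empty title; keep the original if nothing would remain.
    return stripped or title
```

Requiring a separator before the author name keeps a title that consists only of the author name intact.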
Always provide fallbacks
- Default `title` to the filename stem with underscores normalized to spaces
- Default `author` to `Unknown`
- Return `file_path`, `folder_path`, and `file_size` alongside parsed metadata
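
The fallbacks could be applied in one final step, sketched below. `apply_fallbacks` is a hypothetical name; the sketch assumes the caller already knows the file size (for example from a directory listing) and that `folder_path` is the parent directory of a relative path.

```python
from pathlib import PurePosixPath


def apply_fallbacks(file_path: str, parsed: dict, file_size: int) -> dict:
    """Fill missing fields and attach path information to parsed metadata."""
    p = PurePosixPath(file_path)
    folder = str(p.parent)
    return {
        # Fall back to the stem with underscores normalized to spaces.
        "title": parsed.get("title") or p.stem.replace("_", " "),
        "author": parsed.get("author") or "Unknown",
        "year": parsed.get("year"),
        "category": parsed.get("category"),
        "file_path": file_path,
        # Assumption: files at the library root get an empty folder_path.
        "folder_path": "" if folder == "." else folder,
        "file_size": file_size,
    }
```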
Anti-Patterns
- Do not split on every ` - ` blindly; periodicals with year ranges will be misparsed as `Author - Title`
- Do not assume a single top-level folder is always an author; some library folders are categories or journal series
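
One way to avoid the first anti-pattern is to check for a year range before splitting. The sketch below is illustrative only: `YEAR_RANGE` and `split_author_title` are hypothetical names, and refusing to split whenever any year range is present is a deliberately conservative assumption.

```python
import re
from typing import Optional, Tuple

# Two 4-digit years joined by a hyphen, e.g. "1885 - 1886".
YEAR_RANGE = re.compile(r"\b\d{4}\s*-\s*\d{4}\b")


def split_author_title(stem: str) -> Optional[Tuple[str, str]]:
    """Split "Author - Title" once, unless the stem carries a year range."""
    if YEAR_RANGE.search(stem):
        # The hyphen belongs to a year range, so it is not a separator.
        return None
    author, sep, title = stem.partition(" - ")
    if not sep or not author.strip() or not title.strip():
        return None
    return author.strip(), title.strip()
```

A `None` result signals that the caller should rely on folder heuristics and fallbacks rather than on the filename split.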