pdf-to-dita

name: pdf-to-dita
description: Converts a PDF document into a complete DITA 1.3 guide by running a 5-stage automated pipeline: PDF text extraction, topic chunking, validation, best-practices optimization, and full-guide review. Use this skill when you need to migrate legacy PDF documentation into structured DITA XML, bootstrap a DITA project from an existing PDF manual, or verify the quality of the generated DITA output. Integrates with validate-dita-topic, validate-ditamap, dita-best-practices, and review-dita-guide.
compatibility: Python 3.9+. Stage 1 requires pdfplumber (pip install pdfplumber). Stages 2–5 use stdlib only and call the existing dita-skills validator scripts.

Overview

pdf-to-dita orchestrates a five-stage pipeline that transforms a PDF with selectable text into a well-structured DITA 1.3 guide, validates every generated file, applies best-practices fixes, and produces a consolidated quality report.

PDF  →  [1] Extract  →  [2] Chunk  →  [3] Validate  →  [4] Optimize  →  [5] Review
         extracted.json   topics/*.dita  validation_       optimization_    review_
                          root.ditamap   report.json       report.json      report.json

Pipeline stages

Stage 1 — Extract (`extract_pdf.py`)

Reads the PDF with pdfplumber and produces a structural JSON document.

Detects heading levels from font-size ratios (body text = most common size; ratio ≥ 1.15 → heading).
Preserves paragraphs, ordered/unordered lists, and tables.
Outputs extracted.json with a nested sections tree.
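
The font-size heuristic above can be sketched as follows. This is a minimal illustration, not the code in extract_pdf.py: the function name `detect_headings`, the `(text, font_size)` input shape, and the rule that larger heading sizes map to shallower levels are all assumptions.

```python
from collections import Counter


def detect_headings(lines, heading_ratio=1.15):
    """Tag each (text, font_size) line with a heading level or None.

    Hypothetical helper: body size is taken as the most common font size,
    and any size whose ratio to it is >= heading_ratio counts as a heading.
    """
    if not lines:
        return []
    # Round sizes so that tiny extraction differences do not split groups.
    sized = [(text, round(size, 1)) for text, size in lines]
    body_size = Counter(size for _, size in sized).most_common(1)[0][0]
    # Assumption: the largest heading size is level 1, the next is level 2, ...
    heading_sizes = sorted(
        {size for _, size in sized if size / body_size >= heading_ratio},
        reverse=True,
    )
    level_of = {size: index + 1 for index, size in enumerate(heading_sizes)}
    return [{"text": text, "level": level_of.get(size)} for text, size in sized]
```

With pdfplumber, per-character font sizes are available as the `size` key of each entry in `page.chars`; how extract_pdf.py groups characters into lines is not shown here.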

python pdf-to-dita/scripts/extract_pdf.py input.pdf > extracted.json

Output excerpt:

{
  "source": "/abs/path/input.pdf",
  "title":  "Getting Started Guide",
  "pages":  42,
  "sections": [
    {
      "level": 1,
      "title": "Installation",
      "content": [
        {"type": "paragraph", "text": "This guide covers..."},
        {"type": "list", "ordered": true, "items": ["Download the package", "Run setup.sh"]}
      ],
      "subsections": []
    }
  ]
}
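
A short sketch of how a consumer might walk the nested sections tree shown above. It relies only on the `level`, `title`, and `subsections` keys from the excerpt; the helper name `iter_sections` is hypothetical.

```python
def iter_sections(sections, path=()):
    """Yield (level, title_path) for every section, depth-first.

    title_path is a tuple of titles from the top-level section down to
    the current one, which is handy for building breadcrumbs or file names.
    """
    for section in sections:
        here = path + (section["title"],)
        yield section["level"], here
        # Recurse into children; a missing key is treated as "no children".
        yield from iter_sections(section.get("subsections", []), here)
```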

Stage 2 — Chunk (`chunk_to_dita.py`)

Converts the extracted JSON into DITA 1.3 XML topics and a root ditamap.

Topic-type heuristics:

Signal	Assigned type
Title matches action verbs (configure, install, create, …)	`task`
Title matches (reference, parameters, api, settings, …)	`reference`
Content has ≥ 3 ordered-list items	`task`
Table headers include `name`, `parameter`, `property`	`reference`
Default	`concept`
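
The heuristics in the table could be approximated like this. It is a sketch under several assumptions, not the logic of chunk_to_dita.py: signals are checked in table order with the first match winning, title words are matched as case-insensitive word prefixes, ordered-list items are summed across all lists in the section, and tables are assumed to appear as `{"type": "table", "headers": [...]}` blocks (that shape is not shown in the excerpt). The word lists contain only the terms named above; the source lists are longer.

```python
import re

# Only the terms named in the table; the real lists are elided in the source.
ACTION_VERBS = ("configure", "install", "create")
REFERENCE_WORDS = ("reference", "parameters", "api", "settings")
REFERENCE_HEADERS = {"name", "parameter", "property"}


def _title_matches(title, words):
    # Assumption: prefix match at a word boundary, so "Installation" hits "install".
    pattern = r"\b(?:%s)" % "|".join(map(re.escape, words))
    return re.search(pattern, title, re.IGNORECASE) is not None


def classify_topic(section):
    """Return 'task', 'reference', or 'concept' for one extracted section."""
    title = section.get("title", "")
    content = section.get("content", [])
    if _title_matches(title, ACTION_VERBS):
        return "task"
    if _title_matches(title, REFERENCE_WORDS):
        return "reference"
    ordered_items = sum(
        len(block.get("items", []))
        for block in content
        if block.get("type") == "list" and block.get("ordered")
    )
    if ordered_items >= 3:
        return "task"
    for block in content:
        if block.get("type") == "table":  # hypothetical block shape
            headers = {str(h).strip().lower() for h in block.get("headers", [])}
            if headers & REFERENCE_HEADERS:
                return "reference"
    return "concept"
```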

Output structure:

output_dir/
├── root.ditamap
└── topics/
    ├── installation.dita        (task)
    ├── configuration-options.dita (reference)
    └── overview.dita             (concept)

python pdf-to-dita/scripts/chunk_to_dita.py extracted.json ./output --map-title "My Guide"

Stage 3 — Validate (`validate_output.py`)

Calls validate_dita_topic.py and validate_ditamap.py on every generated file.

python pdf-to-dita/scripts/validate_output.py ./output

Output excerpt:

{
  "summary": {
    "total_files": 8, "valid": 7, "invalid": 1,
    "total_errors": 2, "total_warnings": 5
  },
  "files": [
    { "file": "topics/installation.dita", "is_valid": true, "errors": [], "warnings": [] }
  ]
}

Stage 4 — Optimize (`optimize_dita.py`)

Calls best_practices.py on each topic, auto-fixes common violations, then re-validates.

Auto-fixes applied:

Violation	Fix
`missing-shortdesc`	Inserts `<shortdesc>` from first sentence of first paragraph

Errors and warnings that remain after the auto-fix pass are reported but not corrected automatically; they require human review.
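
As an illustration of the `missing-shortdesc` fix, the sketch below inserts a `<shortdesc>` built from the first sentence of the first `<p>`. It is not the code in optimize_dita.py: the function name `add_shortdesc` and the sentence-splitting rule (split after `.`, `!`, or `?` followed by whitespace) are assumptions, and ElementTree does not preserve the DOCTYPE declaration, so a real implementation would have to re-emit it.

```python
import re
import xml.etree.ElementTree as ET


def add_shortdesc(topic_xml):
    """Return topic XML with a <shortdesc> added after <title> if missing."""
    root = ET.fromstring(topic_xml)
    if root.find("shortdesc") is not None:
        return topic_xml  # already present, nothing to fix
    title = root.find("title")
    first_p = next(root.iter("p"), None)
    if title is None or first_p is None:
        return topic_xml  # nothing to anchor on or nothing to derive from
    # Flatten inline markup and normalise whitespace.
    text = " ".join("".join(first_p.itertext()).split())
    if not text:
        return topic_xml
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    sentence = re.split(r"(?<=[.!?])\s+", text, maxsplit=1)[0]
    shortdesc = ET.Element("shortdesc")
    shortdesc.text = sentence
    shortdesc.tail = title.tail
    root.insert(list(root).index(title) + 1, shortdesc)
    return ET.tostring(root, encoding="unicode")
```

Because the function returns its input unchanged when a `<shortdesc>` already exists, running it twice on the same topic is harmless.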

python pdf-to-dita/scripts/optimize_dita.py ./output

Stage 5 — Review (`review_dita_guide.py`)

Runs the full hierarchical guide review via review-dita-guide, traversing the generated map → topics and checking for:

Missing href targets
Duplicate topic @id values across the guide
Cross-file structural issues
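
The first two checks can be sketched with the standard library alone. This is an illustration of the idea, not the implementation in review_dita_guide.py: the function name `review_map` and the `(kind, detail)` issue tuples are hypothetical, and references with `scope="external"` or non-DITA formats are not handled.

```python
from collections import defaultdict
from pathlib import Path
import xml.etree.ElementTree as ET


def review_map(map_path):
    """Return a list of (kind, detail) issues for one ditamap."""
    map_path = Path(map_path)
    issues = []
    files_by_id = defaultdict(set)
    for ref in ET.parse(map_path).getroot().iter("topicref"):
        href = ref.get("href")
        if not href:
            continue
        # Drop any #fragment and resolve relative to the map's directory.
        target = map_path.parent / href.split("#", 1)[0]
        if not target.is_file():
            issues.append(("missing-href", href))
            continue
        topic_id = ET.parse(target).getroot().get("id")
        if topic_id:
            files_by_id[topic_id].add(str(target))
    for topic_id, files in files_by_id.items():
        # The same file referenced twice is fine; two files sharing an id is not.
        if len(files) > 1:
            issues.append(("duplicate-id", topic_id))
    return issues
```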

python review-dita-guide/scripts/review_dita_guide.py ./output/root.ditamap --best-practices

Running the full pipeline

# Full pipeline (all 5 stages)
python pdf-to-dita/scripts/pipeline.py input.pdf ./output

# With custom map title
python pdf-to-dita/scripts/pipeline.py input.pdf ./output --map-title "My Product Guide"

# Skip Stage 5 for faster iteration
python pdf-to-dita/scripts/pipeline.py input.pdf ./output --skip-review

# Resume from extracted JSON (skip PDF re-extraction)
python pdf-to-dita/scripts/pipeline.py --from-json extracted.json ./output

# Full verbose report
python pdf-to-dita/scripts/pipeline.py input.pdf ./output --format full

Summary report output:

{
  "overall": "ok",
  "output_dir": "./output",
  "stages": [
    { "stage": "extract",  "status": "ok", "elapsed_s": 3.2 },
    { "stage": "chunk",    "status": "ok", "elapsed_s": 0.4 },
    { "stage": "validate", "status": "warnings", "elapsed_s": 1.1 },
    { "stage": "optimize", "status": "ok", "elapsed_s": 1.3 },
    { "stage": "review",   "status": "ok", "elapsed_s": 2.0 }
  ],
  "validation":   { "total_files": 8, "valid": 8, "total_errors": 0, "total_warnings": 3 },
  "optimization": { "total_topics": 7, "fixes_applied": 4, "remaining_errors": 0 },
  "review":       { "maps": 1, "topics": 7, "errors": 0, "warnings": 3 }
}

Exit codes

Code	Meaning
`0`	All stages succeeded, no validation errors
`1`	Validation errors found (DITA invalid)
`2`	Pipeline failure (PDF read error, missing dependency)
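
One way to read the table: a pipeline failure takes precedence, then validation errors, otherwise success. The sketch below derives a code from the summary report shape shown earlier; the function name `exit_code` and the `pipeline_failed` flag are hypothetical, and pipeline.py may compute its exit status differently.

```python
def exit_code(report, pipeline_failed=False):
    """Map a summary report to the documented exit codes (0, 1, 2)."""
    if pipeline_failed:
        return 2  # e.g. PDF read error or missing dependency
    if report.get("validation", {}).get("total_errors", 0) > 0:
        return 1  # generated DITA is invalid
    return 0  # all stages succeeded, no validation errors
```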

Integration with other skills

Skill	Role in pipeline
`validate-dita-topic`	Stage 3 — validates each generated `.dita`
`validate-ditamap`	Stage 3 — validates the generated `root.ditamap`
`dita-best-practices`	Stage 4 — detects and auto-fixes best-practice violations
`review-dita-guide`	Stage 5 — cross-guide structural review
`generate-dita-topic`	Reference — topic templates match Stage 2 output format
`generate-ditamap`	Reference — map template matches Stage 2 output format
`refactor-dita-content`	Post-pipeline — split oversized topics, add conref

Installation

# Required for Stage 1 (PDF extraction)
pip install pdfplumber

# Or install with the project optional dependency group
pip install "dita-skills[pdf]"

Common issues

Problem	Cause	Fix
All topics are `concept`	PDF uses uniform or near-uniform font size (no headings detected)	Lower the `--heading-ratio` threshold if headings are only slightly larger than body text, or pre-process the PDF
Empty `<steps>` in task topics	Steps detected but content was plain text	Manually convert each paragraph into a `<step>` with a `<cmd>` child
`KeyError: 'DITA_OT_DIR'` in review stage	DITA-OT not configured	Set `DITA_OT_DIR` env var or skip review with `--skip-review`
`ImportError: pdfplumber`	pdfplumber not installed	`pip install pdfplumber`
Garbled text from scanned PDF	PDF contains images, not selectable text	Run OCR first (e.g., `ocrmypdf input.pdf ocr_output.pdf`)
Short-desc truncated mid-word	First sentence > 50 words	Manually edit `<shortdesc>` after pipeline run
