name: document-parsing
description: "Parse documents to extract structure, headings, content, and metadata from PDF, DOCX, PPTX, XLSX, images, audio, HTML, and ZIP files. Use Markitdown for format conversion. Activate when user mentions parse, extract, document structure, or provides files for analysis."
allowed-tools: read, write, grep, glob, task
Document Parsing Skill
Purpose
Expert document parsing specialist that extracts, analyzes, and structures information from various document formats, with primary focus on preserving complete heading hierarchy for downstream analysis.
When to Activate This Skill
Claude should activate this skill when the user:
- Mentions "parse", "extract", "analyze document", "get structure"
- Wants to "extract headings", "get sections", "find structure"
- Provides document files for processing
- Needs "document breakdown", "content extraction"
- Wants metadata, tables, or figures extracted
- Mentions any supported file format for analysis
Key Triggers: parse, extract, structure, headings, sections, document breakdown, analyze document
Supported File Formats
Direct Parsing
- PDF - Research papers, reports, technical documentation
- TEXT/MD - Plain text, markdown files
- CSV - Data tables
Via Markitdown MCP Integration
- DOCX/PPTX/XLSX - Office documents (ALWAYS use Markitdown)
- JPG/PNG/GIF - Images with text (OCR via Markitdown)
- MP3/WAV - Audio files (transcription via Markitdown)
- HTML - Web content (clean conversion via Markitdown)
- ZIP - Archives containing multiple files (batch process via Markitdown)
Core Capabilities
1. Document Structure Analysis
- Identify document type (research paper, guideline, spec, etc.)
- Extract hierarchical structure (sections, subsections, headings)
- Recognize key components (abstract, methods, results, conclusions)
- Map relationships between sections
2. Heading Extraction (CRITICAL)
This is the most important responsibility:
- Extract ALL heading levels (H1, H2, H3, H4, H5, H6)
- Preserve exact text and formatting
- Maintain parent-child hierarchy
- Record page numbers and locations
- Create complete Document Structure Map
Why Critical: All downstream agents depend on heading extraction for:
- Referencing source material accurately
- Organizing analyses logically
- Maintaining traceability
- Enabling readers to verify claims
3. Content Extraction
- Extract all text content with formatting preservation
- Identify and extract tables, figures, charts
- Capture citations, references, footnotes
- Extract metadata (authors, dates, DOI, version)
4. Semantic Understanding
- Identify key concepts and terminology
- Recognize domain-specific language (AI/ML, clinical terms)
- Extract definitions, equations, formulas
- Identify acronyms and expansions
How to Use This Skill
Step 1: Pre-Processing & Format Conversion
Detect File Format:
IF file is DOCX/PPTX/XLSX:
→ ALWAYS use Markitdown to convert to markdown
→ Save to `PDFs/[filename].md`
→ Proceed with markdown parsing
ELSE IF file is JPG/PNG/GIF:
→ ALWAYS use Markitdown for OCR
→ Extract text to `PDFs/[filename].md`
→ Proceed with markdown parsing
ELSE IF file is MP3/WAV:
→ Use Markitdown for transcription
→ Save transcript to `PDFs/[filename].md`
→ Proceed with markdown parsing
ELSE IF file is HTML:
→ Use Markitdown to clean and convert
→ Save to `PDFs/[filename].md`
→ Proceed with markdown parsing
ELSE IF file is ZIP:
→ Use Markitdown to process archive
→ Save combined output to `PDFs/[filename].md`
→ Proceed with markdown parsing
ELSE IF file is PDF AND parsing difficult:
→ Try Markitdown as alternative
→ Save to `PDFs/[filename].md`
→ Proceed with markdown parsing
ELSE:
→ Proceed directly to extraction (already parseable)
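The decision tree above can be sketched as a small routing function. This is a minimal illustration, not part of the skill itself: the names `plan_conversion`, `MARKITDOWN_REASONS`, and `pdf_parse_failed` are hypothetical, and the function only decides which path to take; it does not call Markitdown.

```python
from pathlib import Path

# Extensions the decision tree always routes through Markitdown,
# mapped to the reason for the conversion. (Hypothetical table name.)
MARKITDOWN_REASONS = {
    ".docx": "office conversion",
    ".pptx": "office conversion",
    ".xlsx": "office conversion",
    ".jpg": "OCR",
    ".png": "OCR",
    ".gif": "OCR",
    ".mp3": "transcription",
    ".wav": "transcription",
    ".html": "clean conversion",
    ".zip": "archive processing",
}


def plan_conversion(file_path: str, pdf_parse_failed: bool = False) -> dict:
    """Decide whether a file needs Markitdown and where the markdown goes."""
    path = Path(file_path)
    ext = path.suffix.lower()
    reason = MARKITDOWN_REASONS.get(ext)

    # PDFs are parsed directly unless direct parsing proved difficult.
    if reason is None and ext == ".pdf" and pdf_parse_failed:
        reason = "PDF fallback"

    if reason is None:
        # TEXT/MD/CSV and readable PDFs are already parseable.
        return {"method": "Direct", "reason": "already parseable", "output": None}

    return {
        "method": "Markitdown",
        "reason": reason,
        "output": f"PDFs/{path.stem}.md",
    }
```

For example, `plan_conversion("slides/presentation.pptx")` selects Markitdown with the output path `PDFs/presentation.md`, while `plan_conversion("paper.pdf")` stays on the direct path.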
Step 2: Extraction Phase
Extract Complete Content:
- Read document thoroughly
- Identify all headings and their levels
- Extract full text under each heading
- Capture tables and figures with descriptions
- Extract all metadata
- Identify key terminology
Step 3: Create Document Structure Map
Generate Complete Heading Hierarchy:
## Document Structure Map
### Complete Heading Hierarchy
1. [H1] Introduction (Page 1)
- **Content summary**: [1-2 sentences]
1.1 [H2] Background (Page 1)
- **Content summary**: [1-2 sentences]
1.2 [H2] Research Objectives (Page 2)
- **Content summary**: [1-2 sentences]
1.2.1 [H3] Primary Objectives (Page 2)
- **Content summary**: [1-2 sentences]
[Continue for ALL headings...]
### Heading Statistics
- **Total Headings**: [count]
- **H1 Level**: [count]
- **H2 Level**: [count]
- **H3 Level**: [count]
- **Maximum Nesting Depth**: [number]
### Section Navigation Guide
- **Clinical Sections**: [List]
- **Technical Sections**: [List]
- **Mixed Sections**: [List]
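As a rough sketch of how the hierarchy and statistics in this template could be derived, the following assumes the document is already in markdown with `#`-style headings. The function names are hypothetical, and page numbers and content summaries are left out because plain markdown does not carry them.

```python
import re
from collections import Counter

FENCE = "`" * 3  # marker that opens or closes a fenced code block
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def extract_headings(markdown: str) -> list[tuple[int, str]]:
    """Return (level, exact text) for each '#' heading outside code fences."""
    headings = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


def render_structure_map(headings: list[tuple[int, str]]) -> list[str]:
    """Number headings hierarchically (1, 1.1, 1.2.1) and tag their level."""
    counters = [0] * 6
    lines = []
    for level, text in headings:
        counters[level - 1] += 1
        # Reset the counters of every deeper level.
        counters[level:] = [0] * (6 - level)
        # Levels that were skipped in the source contribute no number part.
        number = ".".join(str(c) for c in counters[:level] if c)
        lines.append(f"{number} [H{level}] {text}")
    return lines


def heading_statistics(headings: list[tuple[int, str]]) -> dict:
    """Count headings per level and report the deepest level seen."""
    per_level = Counter(level for level, _ in headings)
    return {
        "total": len(headings),
        "per_level": dict(per_level),
        "max_depth": max(per_level, default=0),
    }
```

The heading text is kept exactly as written, which matches the "no paraphrasing" requirement; only the numbering is generated.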
Step 4: Generate Structured Output
Create comprehensive parsed output with:
- Document metadata block
- Document summary
- Key information section
- Document Structure Map (CRITICAL - see above)
- Extracted content organized by sections
- References and citations
- Extracted terminology
- Extraction notes
Step 5: Save Output
Save to: PDFs/parsed/[filename]_parsed.md
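A minimal sketch of how the output path could be derived from the source filename. The helper names `parsed_output_path` and `save_report` and the `base_dir` parameter are assumptions made for illustration; only the `PDFs/parsed/[filename]_parsed.md` pattern comes from this skill.

```python
from pathlib import Path


def parsed_output_path(source_file: str, base_dir: str = "PDFs") -> Path:
    """Build <base_dir>/parsed/<stem>_parsed.md for a source document."""
    stem = Path(source_file).stem  # filename without directory or extension
    return Path(base_dir) / "parsed" / f"{stem}_parsed.md"


def save_report(report: str, source_file: str, base_dir: str = "PDFs") -> Path:
    """Write the parsing report, creating the parsed/ directory if needed."""
    target = parsed_output_path(source_file, base_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(report, encoding="utf-8")
    return target
```

With these assumptions, a source file named `presentation.pptx` maps to `PDFs/parsed/presentation_parsed.md`, which matches the PowerPoint example later in this document.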
Output Format
# DOCUMENT PARSING REPORT
## Document Metadata
- **Original Filename**: [filename]
- **Document Type**: [type]
- **Pages/Length**: [count]
- **Date Parsed**: [YYYY-MM-DD]
- **Parser Version**: 1.0
- **Conversion Method**: [Markitdown/Direct/None]
## Document Summary
[Brief 2-3 sentence overview]
## Key Information
- **Primary Topic**: [topic]
- **Domain**: [AI/Clinical/Mixed]
- **Target Audience**: [audience]
- **Document Purpose**: [purpose]
## Document Structure Map
[CRITICAL SECTION - Complete heading hierarchy as shown in Step 3]
## Extracted Content
### Section 1: [Title]
[Full text content]
**Key Concepts**: [list]
**Important Terms**: [list]
**Figures/Tables**: [descriptions]
[Continue for all sections...]
## References & Citations
[All references]
## Extracted Terminology
[Alphabetical list of key terms]
## Extraction Notes
[Parsing challenges, recommendations]
Integration with Other Skills
Downstream Consumers (skills that use your output):
- ai-technical-analysis - Uses your heading hierarchy for technical workflows
- clinical-analysis - Uses your heading hierarchy for clinical workflows
- mindmap-creation - Uses your headings as node labels
- documentation-generation - Uses your structure for organizing documentation
- blog-publishing - Uses your headings for citations
Your heading extraction is the foundation for all downstream analysis.
Quality Assurance Checklist
Before finalizing output:
- All visible headings extracted (checked against source)
- Hierarchy preserved correctly (parent-child intact)
- Page numbers accurate for all headings
- Exact text captured (no paraphrasing)
- Content summaries provided for each heading
- No headings skipped or missed
- Heading statistics calculated correctly
- Structure map formatted clearly
- Section navigation guide completed
- Document metadata complete
- Extraction notes include conversion method used
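Two of the checks above (no headings missed or paraphrased, and statistics calculated correctly) lend themselves to a quick automated pass. The sketch below is illustrative only: `check_structure_map` is a hypothetical helper, and it uses a plain substring match, so it can be fooled when one heading's text is contained in another's.

```python
import re


def check_structure_map(source_headings: list[str], structure_map: str) -> list[str]:
    """Return a list of problems found in a rendered Document Structure Map."""
    problems = []

    # Every source heading must appear verbatim in the map.
    for text in source_headings:
        if text not in structure_map:
            problems.append(f"Missing or paraphrased heading: {text!r}")

    # The reported total must match the number of source headings.
    match = re.search(r"\*\*Total Headings\*\*:\s*(\d+)", structure_map)
    if match is None:
        problems.append("Heading statistics lack a Total Headings count")
    elif int(match.group(1)) != len(source_headings):
        problems.append(
            f"Total Headings is {match.group(1)}, expected {len(source_headings)}"
        )
    return problems
```

An empty result means both checks passed; the remaining checklist items, such as page-number accuracy, still need to be verified against the source.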
Special Handling Cases
No Clear Headings
If document lacks formal headings:
- Create logical section divisions based on content
- Note in extraction notes: "No formal headings - structure inferred"
- Use topic sentences or paragraph breaks as markers
- Clearly indicate these are inferred, not original
Inconsistent Hierarchy
If heading levels skip or are inconsistent:
- Preserve as found in original (document reality over ideal)
- Note inconsistency in extraction notes
- Suggest normalized structure for generated outputs
- Maintain original numbering/labeling system
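One way to detect skipped levels so they can be recorded in the notes is sketched below. It assumes headings are available as `(level, text)` pairs; the function name `find_level_skips` is hypothetical. It only reports the inconsistency and leaves the original hierarchy untouched, in line with the guidance above.

```python
def find_level_skips(headings: list[tuple[int, str]]) -> list[str]:
    """Report headings that jump more than one level deeper than the previous one."""
    notes = []
    previous_level = 0  # 0 means no heading has been seen yet
    for level, text in headings:
        if level > previous_level + 1:
            origin = f"H{previous_level}" if previous_level else "document start"
            notes.append(f"{text!r} (H{level}) follows {origin}; level(s) skipped")
        previous_level = level
    return notes
```

Moving back up the hierarchy (for example from H3 to H2) is normal and is not reported; only downward jumps that skip a level are flagged.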
Very Long Documents (>50 pages)
For extensive documents:
- Extract all headings (don't skip for length)
- Consider grouping by major sections (Parts, Chapters)
- May abbreviate content summaries to 1 sentence
- Note document length in extraction notes
Complex Formatting
For unusual structures:
- Note formatting in extraction notes
- Preserve original numbering schemes (Roman numerals, letters, etc.)
- Document special formatting in content summaries
- Maintain any color-coding in text descriptions
Best Practices
- Accuracy First: Preserve original content exactly as written
- Complete Extraction: Don't skip sections or headings
- Context Preservation: Maintain relationships between sections
- Clear Markers: Use consistent section markers
- Metadata Rich: Include all available metadata
- Error Handling: Note any parsing issues clearly
- Terminology Capture: Extract all domain-specific terms
- Reference Tracking: Maintain citation integrity
Limitations
- Parsing quality depends on source document structure
- Scanned/image-based PDFs may require Markitdown OCR
- Some complex layouts may need manual cleanup
- Conversion quality varies by file format
- Very large files (>100 pages) may take longer to process
Example Usage
User: "Parse this research paper and extract all the sections"
Claude (activates this skill):
- Checks file format → PDF
- Reads PDF directly (or converts via Markitdown if needed)
- Extracts complete heading hierarchy (H1-H6)
- Creates Document Structure Map
- Extracts all content, tables, figures
- Generates comprehensive parsing report
- Saves to `PDFs/parsed/[filename]_parsed.md`
- Returns summary of extracted structure
User: "I have this PowerPoint presentation, can you extract the content?"
Claude (activates this skill):
- Detects PPTX format
- Uses Markitdown to convert to markdown
- Saves converted output to `PDFs/presentation.md`
- Parses markdown content
- Extracts heading structure and content
- Generates parsing report with conversion notes
- Saves to `PDFs/parsed/presentation_parsed.md`
See Also
- Agent: `@research-doc-parser`
- Reference: See `reference.md` for detailed parsing workflows
- MCP Server: Markitdown for file format conversion
- Outputs Used By: All downstream analysis skills