converting-epub-to-markdown

star 0

Use when converting EPUB ebooks to markdown format compatible with the books-to-skills workflow.

LeoDPraetorian By LeoDPraetorian schedule Updated 1/30/2026

name: converting-epub-to-markdown description: Use when converting EPUB ebooks to markdown format compatible with the books-to-skills workflow. allowed-tools: Bash, Read, Write, Grep, Glob, TodoWrite, AskUserQuestion

Converting EPUB to Markdown

EPUB to markdown conversion using Pandoc (recommended) or native tools (fallback), with vision-based figure descriptions.

When to Use

Use this skill when:

  • Converting EPUB ebooks to markdown for the books-to-skills workflow
  • Extracting text content from EPUB files
  • Preparing ebooks for chapter splitting and skill creation

Quick Reference

Phase Purpose Output
0. Check Pandoc Verify Pandoc installed, prompt if not Decision on method
1. Convert Pandoc EPUB→markdown (or native fallback) Raw markdown
2. Extract Images Unzip EPUB, count images, estimate tokens Image manifest + stats
3. Prompt for Images Ask user if they want image descriptions Decision on vision
4. Describe Figures Generate visual descriptions of images Enhanced markdown
5. Validate Verify output quality Ready for books-to-skills
6. Move to Final Copy to destination, cleanup temp files Completed conversion

Method Comparison

Aspect Pandoc Native (sed/awk)
HTML entity decoding Complete Partial (misses many)
Chapter ordering Correct (OPF spine) Alphabetical (incorrect)
Cross-references Preserved as links Plain text
Figure references Image links + IDs Plain text
Code formatting Semantic spans Basic backticks
Dependencies Requires Pandoc None
Recommendation Primary Fallback only

Workflow

Step 0: Check for Pandoc

IMPORTANT: Before starting conversion, check if Pandoc is available.

# Check if Pandoc is installed
which pandoc && pandoc --version | head -1

If Pandoc is NOT found, use the AskUserQuestion tool to prompt the user:

AskUserQuestion(
  questions: [{
    question: "Pandoc is not installed. How would you like to proceed?",
    header: "Pandoc",
    options: [
      {
        label: "Install Pandoc (Recommended)",
        description: "Run 'brew install pandoc' - provides best conversion quality with correct chapter ordering, full entity decoding, and preserved cross-references"
      },
      {
        label: "Use native conversion",
        description: "Continue with sed/awk fallback - may have incorrect chapter order, missing entities, and no cross-references"
      }
    ],
    multiSelect: false
  }]
)

If user chooses "Install Pandoc":

brew install pandoc
# Verify installation
pandoc --version | head -1

If user chooses "Use native conversion": Skip to Step 1 (Alternative) below.

Step 1: Convert with Pandoc (Recommended)

# Set paths
EPUB_FILE="path/to/book.epub"
WORK_DIR=".tmp/epub-convert"
mkdir -p "$WORK_DIR"

# Convert EPUB to markdown
pandoc "$EPUB_FILE" -t markdown -o "$WORK_DIR/book.md"

# Check result
wc -l "$WORK_DIR/book.md"
head -100 "$WORK_DIR/book.md"

Pandoc advantages:

  • Correctly follows OPF spine for chapter ordering
  • Decodes all HTML entities (em-dashes, smart quotes, etc.)
  • Preserves internal cross-references as markdown links
  • Maintains figure IDs and image references

Step 1 (Alternative): Native Conversion (Fallback)

Use only if Pandoc is unavailable or user chose native conversion:

EPUB_FILE="path/to/book.epub"
WORK_DIR=".tmp/epub-convert"
rm -rf "$WORK_DIR"
mkdir -p "$WORK_DIR"

# Extract EPUB (it's a ZIP file)
unzip -q "$EPUB_FILE" -d "$WORK_DIR/extracted"

# Combine HTML files (WARNING: alphabetical order may be wrong)
find "$WORK_DIR/extracted" \( -name "*.xhtml" -o -name "*.html" \) | \
  grep -v nav.xhtml | sort | xargs cat > "$WORK_DIR/combined.html"

# Multi-pass conversion
sed -e 's/<h1[^>]*>/\n# /g' -e 's/<\/h1>/\n/g' \
    -e 's/<h2[^>]*>/\n## /g' -e 's/<\/h2>/\n/g' \
    -e 's/<h3[^>]*>/\n### /g' -e 's/<\/h3>/\n/g' \
    -e 's/<p[^>]*>/\n/g' -e 's/<\/p>/\n/g' \
    -e 's/<li[^>]*>/- /g' -e 's/<\/li>/\n/g' \
    -e 's/<strong[^>]*>/**/g' -e 's/<\/strong>/**/g' \
    -e 's/<em[^>]*>/_/g' -e 's/<\/em>/_/g' \
    -e 's/<code[^>]*>/`/g' -e 's/<\/code>/`/g' \
    "$WORK_DIR/combined.html" > "$WORK_DIR/pass1.md"

sed -e 's/<[^>]*>//g' \
    -e 's/&nbsp;/ /g' -e 's/&lt;/</g' -e 's/&gt;/>/g' \
    -e 's/&amp;/\&/g' -e 's/&mdash;/—/g' -e 's/&ndash;/–/g' \
    -e 's/&#160;/ /g' -e 's/&#8212;/—/g' -e 's/&#8211;/–/g' \
    -e "s/&#8217;/'/g" -e "s/&#8216;/'/g" \
    -e 's/&#8220;/"/g' -e 's/&#8221;/"/g' \
    "$WORK_DIR/pass1.md" > "$WORK_DIR/pass2.md"

awk 'BEGIN{blank=0} /^$/{blank++; if(blank<=2) print; next} {blank=0; print}' \
    "$WORK_DIR/pass2.md" > "$WORK_DIR/book.md"

Native limitations:

  • Alphabetical file sorting often produces wrong chapter order
  • Many HTML entities remain unprocessed (—, Ω, etc.)
  • Cross-references become plain text
  • Requires manual cleanup

Step 2: Extract Images and Calculate Stats

# If not already extracted (Pandoc path)
unzip -q "$EPUB_FILE" -d "$WORK_DIR/extracted"

# Create image manifest
find "$WORK_DIR/extracted" -type f \( -name "*.jpg" -o -name "*.png" -o -name "*.gif" -o -name "*.webp" \) | sort > "$WORK_DIR/images.txt"

# Count images
IMAGE_COUNT=$(wc -l < "$WORK_DIR/images.txt")
echo "Found $IMAGE_COUNT images"

# Calculate total image size and estimate tokens
TOTAL_BYTES=$(cat "$WORK_DIR/images.txt" | xargs -I{} stat -f%z "{}" 2>/dev/null | awk '{sum+=$1} END {print sum}')
TOTAL_MB=$(echo "scale=2; $TOTAL_BYTES / 1048576" | bc)
# Token estimate: ~1 token per 4 bytes for vision, plus ~100 tokens overhead per image for description
ESTIMATED_TOKENS=$(echo "scale=0; ($TOTAL_BYTES / 4) + ($IMAGE_COUNT * 100)" | bc)
echo "Total image size: ${TOTAL_MB}MB"
echo "Estimated vision tokens: $ESTIMATED_TOKENS"

Token estimation formula:

  • Vision processing: ~1 token per 4 bytes of image data
  • Description overhead: ~100 tokens per image for output
  • Sub-agent batching: Images processed in batches of 50-80 per agent

Step 3: Prompt User for Image Processing

IMPORTANT: Before processing images, prompt the user to confirm. Image processing is time-intensive and uses significant tokens.

# Display stats for user decision
echo "Image Processing Summary:"
echo "  - Images found: $IMAGE_COUNT"
echo "  - Total size: ${TOTAL_MB}MB"
echo "  - Estimated tokens: $ESTIMATED_TOKENS"
echo "  - Estimated sub-agents needed: $((($IMAGE_COUNT + 59) / 60))"

Use the AskUserQuestion tool with the calculated stats:

AskUserQuestion(
  questions: [{
    question: "Found {IMAGE_COUNT} images ({TOTAL_MB}MB, ~{ESTIMATED_TOKENS} tokens). Would you like to generate inline figure descriptions?",
    header: "Images",
    options: [
      {
        label: "Yes, process all images (Recommended)",
        description: "Generate descriptions for all {IMAGE_COUNT} images using vision. Takes ~{ESTIMATED_AGENTS} sub-agents. Provides complete coverage."
      },
      {
        label: "Yes, referenced images only",
        description: "Only process images referenced in the markdown. Faster but may miss some figures."
      },
      {
        label: "No, skip image processing",
        description: "Convert text only. Image references will remain as links without descriptions."
      }
    ],
    multiSelect: false
  }]
)

If user chooses "Yes, process all images": Continue to Step 4.

If user chooses "Yes, referenced images only":

# Find only images referenced in markdown
grep -oh 'images/[^)]*' "$WORK_DIR/book.md" | sort -u > "$WORK_DIR/referenced-images.txt"
REF_COUNT=$(wc -l < "$WORK_DIR/referenced-images.txt")
echo "Processing $REF_COUNT referenced images (of $IMAGE_COUNT total)"
# Use referenced-images.txt instead of images.txt in Step 4

If user chooses "No, skip image processing": Skip to Step 5 (Validate).

Step 4: Generate Figure Descriptions (Vision via Sub-Agents)

Technical books contain valuable visual information that neither Pandoc nor native conversion can extract: schematics, die photos, memory maps, code listings rendered as images.

IMPORTANT: Reading images fills context quickly. A book with 200+ images will exhaust any single agent's context. Use sub-agents for batch processing.

4.1 Create Image Manifest

# Find all images and create numbered manifest
find "$WORK_DIR/extracted" -type f \( -name "*.jpg" -o -name "*.png" \) | sort > "$WORK_DIR/images.txt"
wc -l "$WORK_DIR/images.txt"

# Create output directory for descriptions
mkdir -p "$WORK_DIR/descriptions"

4.2 Sub-Agent Batch Processing Pattern

Spawn sub-agents to process images in batches. Each agent processes as many as it can before context fills, then reports progress.

Parent agent workflow:

1. Read images.txt to get total count
2. Set START_INDEX=1
3. While START_INDEX <= TOTAL_IMAGES:
   a. Spawn sub-agent with Task tool (see prompt below)
   b. Agent returns: {"last_processed": N, "output_file": "descriptions/batch-N.md"}
   c. Set START_INDEX = N + 1
4. Concatenate all batch files into final descriptions

Sub-agent prompt template:

You are processing images for EPUB figure descriptions.

IMAGE_DIR: {work_dir}/extracted/OEBPS/images/
MANIFEST: {work_dir}/images.txt
START_INDEX: {start_index}
OUTPUT_FILE: {work_dir}/descriptions/batch-{start_index}.md

TASK:
1. Read the manifest file to get the list of images
2. Starting from image #{start_index}, read each image using the Read tool
3. For each image, generate a description (50-100 words) in this format:

   > **Figure {number}: {title}**
   > {description focusing on technical details, components, attack points}

4. Write descriptions to OUTPUT_FILE as you go
5. Process as many images as you can before context limits
6. When stopping, report back:
   - Last image index successfully processed
   - Path to output file
   - Count of images described in this batch

IMAGE TYPE GUIDANCE:
| Type           | Focus on                                       |
| -------------- | ---------------------------------------------- |
| Die photo      | Memory arrays, logic blocks, bond pads, scale  |
| Schematic      | Key components, connections, attack points     |
| Memory map     | Address ranges, blocks, vulnerabilities        |
| Code listing   | Language, purpose, key operations (transcribe) |
| Oscilloscope   | Signals, timing, glitch characteristics        |
| Photo          | Hardware setup, probe points, modifications    |

CRITICAL: If you cannot process more images due to context, STOP and report
your progress. The parent will spawn another agent to continue.

Example Task tool invocation:

Task(
  subagent_type: "general-purpose",
  model: "sonnet",  # Use Sonnet for cost efficiency - image description is straightforward
  description: "Process images 1-50 for descriptions",
  prompt: "[prompt above with variables filled in]"
)

Model selection: Use model: "sonnet" for image processing sub-agents. Sonnet provides excellent image description quality at lower token cost than Opus. Testing showed no quality degradation for this task.

4.3 Graceful Resumption

The parent agent tracks progress and spawns new sub-agents as needed:

# Tracking state
TOTAL_IMAGES=$(wc -l < "$WORK_DIR/images.txt")
PROCESSED=0

# Loop until all processed
while [ $PROCESSED -lt $TOTAL_IMAGES ]; do
  # Spawn sub-agent starting at PROCESSED+1
  # Sub-agent returns last_processed
  # Update PROCESSED = last_processed
done

# Combine all batch files
cat "$WORK_DIR/descriptions/batch-*.md" > "$WORK_DIR/all-descriptions.md"

4.4 Prioritization (Optional)

For very large books (500+ images), prioritize key figures:

# Find which images are referenced in markdown
grep -oh 'images/[^)]*' "$WORK_DIR/book.md" | sort -u > "$WORK_DIR/referenced-images.txt"

# Process only referenced images first

4.5 Image Types and Description Focus

Image Type Description Focus
Die photo Memory arrays, logic blocks, bond pads, scale
Schematic Key components, connections, attack points
Memory map Address ranges, blocks, vulnerabilities
Code listing Language, purpose, key operations (transcribe)
Oscilloscope Signals, timing, glitch characteristics
Photo Hardware setup, probe points, modifications

Key insight: Many technical books render code listings as images. Vision can transcribe these where text extraction fails completely.

Step 5: Validate Output

# Check file size
wc -l "$WORK_DIR/book.md"

# Check for chapter markers
grep -n "^## " "$WORK_DIR/book.md" | head -20

# Check token count (approximate)
chars=$(wc -c < "$WORK_DIR/book.md")
tokens=$((chars / 4))
echo "Approximate tokens: $tokens"

# Check for remaining HTML artifacts
grep -c "<[a-zA-Z]" "$WORK_DIR/book.md"

# Preview content
head -200 "$WORK_DIR/book.md"

Step 6: Move to Final Location

OUTPUT_FILE="path/to/BookName_full.md"
cp "$WORK_DIR/book.md" "$OUTPUT_FILE"
rm -rf "$WORK_DIR"
echo "Conversion complete: $OUTPUT_FILE"

Quality Comparison

Content Type Pandoc Native With Vision Notes
Headers (h1-h6) Excellent Good - Pandoc preserves IDs
Paragraphs Excellent Good - Both work well
Lists (ul/ol) Excellent Fair - Pandoc handles nesting
Bold/Italic Excellent Good - Direct mapping
Code blocks Excellent Fair - Pandoc uses semantic spans
Tables Good Poor - Both may need cleanup
Cross-references Excellent None - Pandoc preserves links
HTML entities Complete Partial - Pandoc decodes all
Chapter order Correct Wrong - Pandoc follows OPF spine
Figures (images) Reference Reference Excellent Vision extracts semantic meaning
Code as images None None Excellent Vision can transcribe

Bottom line: Use Pandoc + Vision for best results.

Integration with books-to-skills

After conversion, use the converting-books-to-skills skill:

Read(".claude/skill-library/claude/skill-management/converting-books-to-skills/SKILL.md")

The markdown file produced by this skill is compatible with that workflow's:

  • Chapter detection (looks for ## headers)
  • TOC analysis (extracts keywords)
  • Chunking (splits into chapter files)

Troubleshooting

Problem: Pandoc not installed

brew install pandoc
# Or on Linux: apt install pandoc

Problem: Chapter order wrong (native conversion)

The native approach sorts files alphabetically. Check OPF spine for correct order:

grep -A 100 "<spine" "$WORK_DIR/extracted/OEBPS/content.opf"

Problem: HTML entities remain (native conversion)

The native approach misses many entities. Either use Pandoc or add more sed patterns:

sed -i '' -e 's/&#937;/Ω/g' -e 's/&#8194;/ /g' "$WORK_DIR/book.md"

Problem: Code blocks are images, not text

Many technical books render code as images. Use vision to transcribe:

Read("$WORK_DIR/extracted/OEBPS/images/f0148-01.jpg")

Integration

Called By

  • Manual invocation before using converting-books-to-skills
  • Workflows that need EPUB → markdown conversion

Pairs With

Skill Purpose
converting-books-to-skills Takes this output and creates skills

Changelog

See .history/CHANGELOG for version history.

Install via CLI
npx skills add https://github.com/LeoDPraetorian/praetorian-coverage-digest --skill converting-epub-to-markdown
Repository Details
star Stars 0
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator
LeoDPraetorian
LeoDPraetorian Explore all skills →