docx-to-md - SKILL.md Agent Skill

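name: docx-to-md
description: Convert Word documents to Markdown. Use when (1) converting .docx to .md, (2) extracting content from Word documents, (3) batch-converting documents, (4) preserving formatting and images, (5) automating document processing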

DOCX to Markdown Converter

Objectives

Convert Word documents (.docx) to Markdown (.md)
Preserve formatting (headings, lists, tables, bold, italic)
Extract and save embedded images
Handle batch conversions
Support template-based conversions

Core Strategy

1. Use Mammoth (Python)

Mammoth is the recommended tool:

Pure Python solution - no external tools such as pandoc to install
Good conversion quality
Easy to customize
Works cross-platform
CRITICAL: Converts format only - the document's wording, titles, and text are not rewritten

2. Single-File Conversion with Mammoth

CRITICAL: Only convert format, never modify content.

import mammoth
from pathlib import Path

def docx_to_md_mammoth(docx_path: Path, md_path: Path):
    """
    Convert DOCX to Markdown using mammoth.
    
    CRITICAL: This function only converts format.
    It does NOT modify any content, titles, or text.
    """
    with open(docx_path, 'rb') as docx_file:
        result = mammoth.convert_to_markdown(docx_file)
        md_path.write_text(result.value, encoding='utf-8')
    
    # Print conversion messages (each has a .type and a .message)
    for message in result.messages:
        print(f"{message.type}: {message.message}")
    
    return md_path
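
The "never modify content" rule can be spot-checked after a conversion. The following is a minimal sketch, not part of the skill's scripts: the helper names `word_counts` and `missing_words` are hypothetical, and the check only compares word tokens, so it detects dropped text but says nothing about formatting.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count word tokens, ignoring punctuation, underscores, and Markdown markup."""
    return Counter(re.findall(r"[^\W_]+", text))


def missing_words(raw_text: str, markdown: str) -> Counter:
    """Return the words that occur more often in raw_text than in markdown.

    Extra tokens in the Markdown (list numbers, image alt text) are tolerated;
    only words that were lost during conversion are reported.
    """
    return word_counts(raw_text) - word_counts(markdown)
```

The raw text could come from `mammoth.extract_raw_text(docx_file).value`, with the Markdown taken from `result.value`; an empty `Counter` means no words went missing.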

3. Installation

# Install mammoth
uv add mammoth

4. Batch Conversion

from pathlib import Path
import mammoth

def batch_convert(input_dir: Path, output_dir: Path):
    """Convert all DOCX files in directory."""
    output_dir.mkdir(parents=True, exist_ok=True)
    
    for docx_file in input_dir.glob('*.docx'):
        if docx_file.name.startswith('~$'):  # Skip temp files
            continue
        
        md_file = output_dir / f"{docx_file.stem}.md"
        
        with open(docx_file, 'rb') as f:
            result = mammoth.convert_to_markdown(f)
            md_file.write_text(result.value, encoding='utf-8')
        
        print(f"✓ Converted: {docx_file.name} -> {md_file.name}")
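
In the loop above, a single corrupt file raises and stops the whole batch, and subfolders are not visited. A possible variant is sketched below under stated assumptions: `batch_convert_safe` and `mammoth_convert` are hypothetical names, the converter is passed in as a callable, and the input folder structure is mirrored in the output folder.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def batch_convert_safe(
    input_dir: Path,
    output_dir: Path,
    convert: Callable[[Path], str],
) -> Tuple[List[Path], Dict[Path, str]]:
    """Convert every .docx under input_dir, continuing past failures.

    Returns the written Markdown paths and a mapping of failed inputs
    to their error messages.
    """
    converted: List[Path] = []
    failed: Dict[Path, str] = {}

    for docx_file in sorted(input_dir.rglob('*.docx')):
        if docx_file.name.startswith('~$'):  # Skip Word lock/temp files
            continue

        # Mirror the input folder structure below output_dir
        md_file = output_dir / docx_file.relative_to(input_dir).with_suffix('.md')
        try:
            markdown = convert(docx_file)
        except Exception as exc:  # Record the failure and keep going
            failed[docx_file] = str(exc)
            continue

        md_file.parent.mkdir(parents=True, exist_ok=True)
        md_file.write_text(markdown, encoding='utf-8')
        converted.append(md_file)

    return converted, failed


def mammoth_convert(docx_file: Path) -> str:
    """Converter callable backed by mammoth (imported lazily)."""
    import mammoth

    with open(docx_file, 'rb') as f:
        return mammoth.convert_to_markdown(f).value
```

A call would look like `converted, failed = batch_convert_safe(Path('in'), Path('out'), mammoth_convert)`. Passing the converter as an argument keeps the loop easy to exercise with a stub.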

Common Patterns

Pattern 1: Convert with Image Extraction (pandoc)

import subprocess
from pathlib import Path

def convert_with_images(docx_path: Path, output_dir: Path):
    """Convert DOCX with the pandoc CLI and extract its images."""
    output_dir.mkdir(parents=True, exist_ok=True)
    md_path = output_dir / f"{docx_path.stem}.md"
    
    # Run pandoc from output_dir so the image links it writes are
    # relative to the Markdown file. Pandoc stores DOCX images in a
    # media/ subfolder of the --extract-media directory (images/media/).
    # check=True raises CalledProcessError if pandoc fails.
    subprocess.run([
        'pandoc',
        str(docx_path.resolve()),
        '-o', md_path.name,
        '--extract-media', 'images'
    ], cwd=output_dir, check=True)
    
    return md_path
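
The pandoc-based patterns depend on the `pandoc` executable being on PATH; when it is missing, `subprocess.run` fails with a bare `FileNotFoundError`. A small guard along these lines could give a clearer message. The helper name `require_tool` is hypothetical.

```python
import shutil


def require_tool(name: str = 'pandoc') -> str:
    """Return the full path of an executable, or raise a readable error."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"'{name}' was not found on PATH; install it or use the mammoth-based conversion"
        )
    return path
```

Calling `require_tool()` before the first `subprocess.run` call is enough; the returned path can also be used as the first element of the command list.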

Pattern 2: Template-Based Conversion

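import subprocess
from pathlib import Path

def convert_lab_template(docx_path: Path, output_path: Path):
    """Convert lab answer document with specific formatting."""
    # Convert; check=True raises CalledProcessError if pandoc fails
    subprocess.run([
        'pandoc',
        str(docx_path),
        '-o', str(output_path),
        '--wrap=none',
        '--markdown-headings=atx'  # replaces the deprecated --atx-headers
    ], check=True)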
    
    # Post-process: Add front matter
    content = output_path.read_text(encoding='utf-8')
    
    front_matter = """---
title: Lab Assignment
author: Your Name
date: 2026-01-22
---

"""
    
    output_path.write_text(front_matter + content, encoding='utf-8')
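
The front matter above is hard-coded, and running the function twice on the same output would prepend it twice. One way to parameterize it is sketched here; `build_front_matter` and `add_front_matter` are hypothetical helpers, and values are quoted with `json.dumps` on the assumption that JSON-style double-quoted strings are acceptable YAML scalars.

```python
import json
from datetime import date
from pathlib import Path
from typing import Optional


def build_front_matter(title: str, author: str, on: date) -> str:
    """Build a YAML front matter block with safely quoted values."""
    return (
        '---\n'
        f'title: {json.dumps(title, ensure_ascii=False)}\n'
        f'author: {json.dumps(author, ensure_ascii=False)}\n'
        f'date: {on.isoformat()}\n'
        '---\n\n'
    )


def add_front_matter(md_path: Path, title: str, author: str,
                     on: Optional[date] = None) -> bool:
    """Prepend front matter unless the file already starts with one.

    Returns True if the file was changed.
    """
    content = md_path.read_text(encoding='utf-8')
    if content.startswith('---\n'):
        return False
    block = build_front_matter(title, author, on or date.today())
    md_path.write_text(block + content, encoding='utf-8')
    return True
```

Quoting matters for titles that contain a colon, such as `Lab 3: Networking`, which would otherwise be misread as a nested YAML mapping.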

Pattern 3: Auto-Insert Screenshots

import os
from pathlib import Path

def insert_screenshots(md_path: Path, images_dir: Path):
    """Replace <!-- INSERT_IMAGE_n --> markers with image links."""
    content = md_path.read_text(encoding='utf-8')
    
    # Find all image files
    images = sorted(images_dir.glob('*.png'))
    
    # Replace the i-th marker with a link to the i-th image
    for i, img in enumerate(images, 1):
        marker = f"<!-- INSERT_IMAGE_{i} -->"
        if marker in content:
            # os.path.relpath also works when images_dir lies outside
            # md_path.parent, where Path.relative_to raises ValueError
            rel = Path(os.path.relpath(img, md_path.parent)).as_posix()
            img_markdown = f"![Step {i}]({rel})\n"
            content = content.replace(marker, img_markdown)
    
    md_path.write_text(content, encoding='utf-8')
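
Because `sorted()` compares file names as plain strings, `step10.png` sorts before `step2.png`, which would pair screenshots with the wrong markers once there are ten or more. A natural-order sort key is one possible fix; `natural_key` is a hypothetical helper name.

```python
import re
from pathlib import Path


def natural_key(path: Path) -> list:
    """Sort key that compares digit runs in a file name numerically."""
    parts = re.split(r'(\d+)', path.name)
    # re.split with a capturing group alternates text, digits, text, ...
    # so odd positions are always digit runs.
    return [int(part) if i % 2 else part.lower() for i, part in enumerate(parts)]
```

It would be used as `images = sorted(images_dir.glob('*.png'), key=natural_key)`. Zero-padding the file names (`step01.png`, `step02.png`) avoids the problem without any code.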

Common Issues

Images not extracted → By default Mammoth inlines images as base64 data URIs in the Markdown; they are not written to separate files

Chinese characters garbled → Pass encoding='utf-8' when writing the output file

Tables broken → Mammoth handles simple tables; complex tables may need manual adjustment

Formatting lost → Mammoth preserves only basic formatting (bold, italic, headings, lists); other styling is dropped

Temp files (~$*.docx) → Skip files starting with ~$ in batch processing
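
For the first issue, the inlined images can be moved into files in a post-processing step. The sketch below makes two assumptions: the images appear in the Markdown as `![alt](data:image/<subtype>;base64,<payload>)`, and files named `image1`, `image2`, ... in an `images/` folder next to the Markdown file are acceptable. The name `externalize_images` is hypothetical.

```python
import base64
import re
from pathlib import Path

# Assumed link shape: ![alt](data:image/<subtype>;base64,<payload>)
DATA_URI_IMAGE = re.compile(
    r'!\[(?P<alt>[^\]]*)\]'
    r'\(data:image/(?P<subtype>[\w.+-]+);base64,(?P<payload>[A-Za-z0-9+/=]+)\)'
)

EXTENSIONS = {'jpeg': 'jpg', 'svg+xml': 'svg'}  # other subtypes are used as-is


def externalize_images(markdown: str, md_dir: Path, images_subdir: str = 'images') -> str:
    """Write inline base64 images to md_dir/images_subdir and link to the files."""
    images_dir = md_dir / images_subdir
    count = 0

    def save(match: re.Match) -> str:
        nonlocal count
        count += 1
        subtype = match['subtype'].lower()
        name = f"image{count}.{EXTENSIONS.get(subtype, subtype)}"
        images_dir.mkdir(parents=True, exist_ok=True)
        (images_dir / name).write_bytes(base64.b64decode(match['payload']))
        # Keep the original alt text; point the link at the new file
        return f"![{match['alt']}]({images_subdir}/{name})"

    return DATA_URI_IMAGE.sub(save, markdown)
```

The function would be applied to `result.value` before it is written, for example `md_path.write_text(externalize_images(result.value, md_path.parent), encoding='utf-8')`. Only the image links change; the surrounding text is left as converted.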

Helper Scripts

Use provided scripts for common tasks:

# Convert single file
uv run python .skills/dev-docx_to_md/scripts/convert_docx_mammoth.py input.docx output.md

# Batch convert
uv run python .skills/dev-docx_to_md/scripts/batch_convert.py input_dir/ output_dir/

References

For detailed examples: See references/examples.md

For pandoc filters: See references/filters.md

For troubleshooting: See references/troubleshooting.md

Quick Reference

# Install
uv add mammoth

# Convert single file (Python)
uv run python -c "import mammoth; print(mammoth.convert_to_markdown(open('input.docx', 'rb')).value)" > output.md

# Or use helper script
uv run python .skills/dev-docx_to_md/scripts/convert_docx_mammoth.py input.docx output.md