pdf-smart-split - SKILL.md Agent Skill

name: pdf-smart-split description: Intelligent PDF splitting skill. Split multi-page scanned PDF documents into separate files based on content analysis. Features: (1) Convert PDF to high-quality images (2) Analyze and intelligently name images by content (3) Auto-detect and remove duplicate pages (4) Identify and reorder scrambled pages (5) Merge related images into separate PDFs by content (6) Generate summary reports. Use for: bond filing documents, application materials, contracts and agreements that need intelligent content-based splitting.

PDF Smart Split Skill

Split multi-page scanned PDF documents into separate files based on content analysis. Supports duplicate detection, page reordering, and intelligent naming.

Workflow

1. PDF to Images

from pdf2image import convert_from_path
images = convert_from_path(pdf_path, dpi=200, thread_count=4)

DPI=200 for quality
Multi-threading for speed

2. Duplicate Detection

import hashlib
def get_image_hash(path):
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

Use MD5 hash comparison
Record duplicates for summary

3. Content Analysis and Naming

Use view_file tool to analyze each image:

File type (cover, content, signature page)
Document category (application, agreement, statement)
Page sequence number

Naming format: sequence_filetype_description.jpg

4. Page Reordering

Analyze content continuity:

Check table sequence numbers
Check chapter numbering progression
Ensure signature pages follow content

Reorder by logical content, not physical order.

5. Merge Images to PDF

from PIL import Image
def images_to_pdf(image_paths, output_pdf):
    images = [Image.open(p).convert('RGB') for p in image_paths]
    images[0].save(output_pdf, "PDF", save_all=True, append_images=images[1:])

Merge signature pages with their content
Combine continuous content into single files

6. Generate Summary

Summary should include:

Source file info (name, original pages)
Duplicate removal records (if any)
Page reorder records (if any)
Split file list (name, original pages, page count, description)
Output directory paths

Script Usage

Full Processing

python scripts/convert_dedup.py <pdf_path> [output_dir]
python scripts/merge_images.py <image1> <image2> ... <output.pdf>

Output Structure

filename_images/     # Intelligently named images
filename_split_PDF/  # Content-grouped PDFs

Requirements

Install: pdf2image, Pillow
Windows needs poppler: conda install -c conda-forge poppler