pdf-smart-split

star 45

Intelligent PDF splitting skill. Split multi-page scanned PDF documents into separate files based on content analysis. Features: (1) Convert PDF to high-quality images (2) Analyze and intelligently name images by content (3) Auto-detect and remove duplicate pages (4) Identify and reorder scrambled pages (5) Merge related images into separate PDFs by content (6) Generate summary reports. Use for: bond filing documents, application materials, contracts and agreements that need intelligent content-based splitting.

diegosouzapw By diegosouzapw schedule Updated 2/28/2026

name: pdf-smart-split description: Intelligent PDF splitting skill. Split multi-page scanned PDF documents into separate files based on content analysis. Features: (1) Convert PDF to high-quality images (2) Analyze and intelligently name images by content (3) Auto-detect and remove duplicate pages (4) Identify and reorder scrambled pages (5) Merge related images into separate PDFs by content (6) Generate summary reports. Use for: bond filing documents, application materials, contracts and agreements that need intelligent content-based splitting.

PDF Smart Split Skill

Split multi-page scanned PDF documents into separate files based on content analysis. Supports duplicate detection, page reordering, and intelligent naming.

Workflow

1. PDF to Images

from pdf2image import convert_from_path
images = convert_from_path(pdf_path, dpi=200, thread_count=4)
  • DPI=200 for quality
  • Multi-threading for speed

2. Duplicate Detection

import hashlib
def get_image_hash(path):
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()
  • Use MD5 hash comparison
  • Record duplicates for summary

3. Content Analysis and Naming

Use view_file tool to analyze each image:

  • File type (cover, content, signature page)
  • Document category (application, agreement, statement)
  • Page sequence number

Naming format: sequence_filetype_description.jpg

4. Page Reordering

Analyze content continuity:

  • Check table sequence numbers
  • Check chapter numbering progression
  • Ensure signature pages follow content

Reorder by logical content, not physical order.

5. Merge Images to PDF

from PIL import Image
def images_to_pdf(image_paths, output_pdf):
    images = [Image.open(p).convert('RGB') for p in image_paths]
    images[0].save(output_pdf, "PDF", save_all=True, append_images=images[1:])
  • Merge signature pages with their content
  • Combine continuous content into single files

6. Generate Summary

Summary should include:

  • Source file info (name, original pages)
  • Duplicate removal records (if any)
  • Page reorder records (if any)
  • Split file list (name, original pages, page count, description)
  • Output directory paths

Script Usage

Full Processing

python scripts/convert_dedup.py <pdf_path> [output_dir]
python scripts/merge_images.py <image1> <image2> ... <output.pdf>

Output Structure

filename_images/     # Intelligently named images
filename_split_PDF/  # Content-grouped PDFs

Requirements

  • Install: pdf2image, Pillow
  • Windows needs poppler: conda install -c conda-forge poppler
Install via CLI
npx skills add https://github.com/diegosouzapw/awesome-omni-skill --skill pdf-smart-split
Repository Details
star Stars 45
call_split Forks 14
navigation Branch main
article Path SKILL.md
More from Creator
diegosouzapw
diegosouzapw Explore all skills →