pdf-ocr-layout

name: pdf-ocr-layout description: Multimodal document deep analysis tool based on Zhipu GLM-OCR, GLM-4.7, and GLM-4.6V.

Use when:

Need to extract tables from documents (PDF/images) with high precision and convert to Markdown format
Need to automatically crop and extract illustrations and charts from document pages as independent files
Need to perform deep semantic understanding on extracted charts (based on GLM-4.6V visual analysis)
Need to perform logical analysis on extracted table data (based on GLM-4.7 text analysis)

Core Architecture: 1. Visual Extraction: GLM-OCR 2. Semantic Understanding: GLM-4.7 (text/tables) + GLM-4.6V (multimodal/images)

GLM-OCR多模态深度分析工具

该工具构建了一个高精度的文档解析流程：使用GLM-OCR提取文档的布局元素，调用GLM-4.7对表格数据进行逻辑解析，再使用GLM-4.6V对图像和图表进行多模态视觉分析。

流程实现架构

该工具包含两个核心脚本阶段，这些阶段通过glm_ocr_pipeline.py进行协调：

1. 提取阶段 (`scripts/glm_ocr_extract.py`)

核心模型：GLM-OCR
功能：负责对文档的物理布局进行分析
输出：提取表格的HTML内容并将其转换为Markdown格式；根据Bbox坐标自动裁剪独立的图表图像文件；生成包含整页阅读顺序的中间JSON文件

2. 理解阶段 (`scripts/glm_understanding.py`)

核心模型：GLM-4.7（文本）/ GLM-4.6V（视觉）
功能：负责对内容进行深度语义理解
逻辑：
- 表格：结合完整的文本上下文，使用GLM-4.7分析Markdown表格数据的业务含义
- 图表：结合完整的文本上下文和裁剪后的图像，使用GLM-4.6V进行多模态视觉分析

调用方法

命令行调用

# Run complete pipeline: extraction -> cropping -> understanding analysis, supports input in .pdf, .jpg, .png and other formats
python scripts/glm_ocr_pipeline.py \
  --file_path "/data/report_page.jpg" \
  --output_dir "/data/output"

API参数说明

参数	类型	是否必填	说明
file_path	字符串	✅	输入文件的绝对路径（支持.pdf、.png、.jpg格式）
output_dir	字符串	✅	结果输出目录（用于保存裁剪后的图像和JSON报告）

返回结果结构（JSON）

该工具返回一个列表，其中包含布局元素及其深度分析结果：

[
  {
    "type": "table",
    "bbox": [100, 200, 500, 600],
    "content_info": "| Revenue | Q1 |\n|---|---|\n| 100M | ... |",
    "deep_understanding": "(Generated by GLM-4.7) This table shows Q1 2024 revenue data. Combined with the 'market expansion strategy' mentioned in paragraph 3 of the body text, it can be seen that..."
  },
  {
    "type": "image",
    "bbox": [100, 700, 500, 900],
    "content_info": "/data/output/images/report_page_img_2.png",
    "deep_understanding": "(Generated by GLM-4.6V) This is a system architecture diagram. Visually, it shows the flow of clients connecting to servers through a Load Balancer. Combined with the title 'Fig 3' and context, this diagram is mainly used to illustrate..."
  }
]

环境要求

必须配置环境变量ZHIPU_API_KEY
Python 3.8及以上版本
所需依赖库：zhipuai、pillow、beautifulsoup4

注意事项

1. 模型处理策略

表格：将内容传递给GLM-4.7，并结合完整的Markdown文本上下文进行逻辑分析
图像：将图像进行Base64编码后传递给GLM-4.6V，并结合OCR提取的标题及完整的文本上下文进行多模态分析

2. 上下文关联

所有的理解过程都基于文档的完整布局逻辑（Markdown上下文），而不是孤立片段的分析

3. PDF处理

对于多页PDF文件，系统默认处理第一页。如需批量处理，请在脚本层面扩展循环逻辑。