name: doc_processor
description: A comprehensive tool for parsing, cleaning, generating content for, and reconstructing MS Word (.docx) documents.
Doc Processor Skill
This skill allows you to "re-architect" a Word document. It can extract the deep structure, wipe content to create a template, generate new content based on rules or AI, and refill the document.
Capabilities
- Parse Structure: Extract a hierarchical JSON representation including Sections, Paragraphs, Tables, and "Slots".
- Clean Template: Create a "Clean" blank version of the document.
- Generate Content: Produce a content map based on the parsed structure and a user topic.
- Local Repository Integration: Automatically queries the local question bank for authentic exam materials.
- Source Citation: All borrowed content is properly annotated with exam source information.
- Build Document: Inject content back into the Clean Template.
Usage Workflow
Task: "Rewrite this lesson plan for the topic 'Past Tense'."
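⚠️ Important: check the template file path
Background: Files under /tmp/ are cleaned up when a session ends. If the user supplies a template path such as /tmp/xxx.docx, the file may no longer exist in a new session.
Solution:
- Check before generating: use os.path.exists() to confirm that the template file exists.
- If the file does not exist: ask the user for the correct template path. Do not assume the file exists.
- Advise the user: keep template files outside /tmp/ (for example, ~/Documents/).
Standard workflow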
Parse Original:
python skills/doc_processor/scripts/parser.py input.docx > structure.json
(Redirecting the output to a file is optional.)
Create Template (Clean):
python skills/doc_processor/scripts/cleaner.py input.docx template_clean.docx
⚠️ Where to save the cleaned template:
- Save the cleaned template outside /tmp/, for example: ~/Documents/templates/lesson_template_clean.docx
- Or save it in the working directory: /Users/xielk/webdata/english/lesson/templates/
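The save-location advice above can be sketched as a small helper. This is a minimal sketch, not part of the skill's scripts: the function names `is_under_tmp` and `persist_template` are hypothetical, and the destination directory is whatever the user chooses.

```python
import shutil
import tempfile
from pathlib import Path


def is_under_tmp(path):
    """Return True if the path lives under /tmp or the system temp directory."""
    resolved = Path(path).expanduser().resolve()
    roots = {Path("/tmp").resolve(), Path(tempfile.gettempdir()).resolve()}
    for root in roots:
        try:
            resolved.relative_to(root)
            return True
        except ValueError:
            continue
    return False


def persist_template(src, dest_dir):
    """Copy a template into a persistent directory and return the new path."""
    src = Path(src).expanduser()
    if not src.exists():
        raise FileNotFoundError(src)
    dest_dir = Path(dest_dir).expanduser()
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)  # copy2 also preserves the modification time
    return dest
```

A caller might run `is_under_tmp(template_path)` right after cleaning and, when it returns True, suggest copying the file with `persist_template(template_path, "~/Documents/templates")`.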
Generate Content (The "Brain"):
- Goal: Create a content.json file that maps structure.json IDs to new content.
- Process:
- Read structure.json to find the Slot IDs (p_X, t_X) and their types.
- MANDATORY: Query Local Question Bank via Index System (CRITICAL CONSTRAINT)
- MUST use the Index + On-Demand Loading system to access exam questions. NEVER directly load all docx files (65MB+).
- Workflow:
- Load the index file (/Users/xielk/webdata/english/lesson/resource/index.json)
- Search the index for matching files (search filename and preview text)
- Load only the most relevant 3-5 docx files on demand
- Extract questions with proper citations
- Implementation:
from skills.doc_processor.scripts.searcher import search_question_bank
# Search for questions matching topic and student profile
results, questions = search_question_bank(
    topic="非谓语动词",  # Grammar topic
    district="嘉定",  # Student's district (priority)
    year="2025"  # Most recent year (priority)
)
# questions contains content with source annotations
for q in questions:
    print(q['content'])  # Question text
    print(q['source'])  # Source: (2025 嘉定一模)
- NEVER fabricate or hallucinate exam questions. All content MUST be sourced from the local repository.
- Citation Requirement: EVERY piece of content MUST be annotated: (YYYY 区域 考试类型), that is, (year district exam-type)
- Examples: (2025 徐汇一模), (2024 浦东二模), (2023 嘉定一模)
- Priority Rules:
- Most recent year (2025 > 2024 > 2023)
- Student's district (if specified)
- Load max 3-5 files, max 5 questions per file (control token usage)
- STRICTLY ADHERE to Rules from .agent/rules/lesson.md:
- Length Constraint: Resulting doc MUST be > 14 pages. You must generate EXTENSIVE examples, detailed logic explanations, and sufficient practice questions to meet this. Do not compress content.
- Time Duration: Content must cover a full 2-hour lesson.
- Topic Focus: Single core topic (e.g., "Prepositions") only. All examples must align.
- Structure Mapping:
- Row 1-3: Teaching Objectives & Difficulties.
- Row 6: Icebreaker/Review.
- Row 7-10: Knowledge Points (Deep Dive). This is the bulk. Use "Methodology + Logic" style (When/Why/Trap/How).
- Row 15: Variant Practice (Part A: Drill, Part B: Application).
- Row 17: Class Quiz (Part A: Real Exams, Part B: Extension).
- Row 18: Reflection.
- Exam Alignment: Use tags like (2023 Shanghai Zhongkao) or (2024 Pudong Model).
- Formatting: No Markdown symbols (**, |), use ____ for blanks.
- Synthesize Content:
- Write a JSON file where Keys = IDs, Values = Strings (or Arrays for Tables).
- Ensure all exam questions, reading passages, and reference materials include proper source citations as specified above.
- Action: Save the result to content.json.
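Before saving content.json, the content map can be checked against the rules listed in this step. The following is a rough sketch under stated assumptions: the helper name `validate_content_map` is hypothetical, the slot IDs are passed in as a plain list (the exact schema of structure.json is not shown in this document), and table values are assumed to be possibly nested lists of strings.

```python
# Markdown symbols that the formatting rule forbids in generated content.
MARKDOWN_SYMBOLS = ("**", "|")


def iter_leaves(value):
    """Yield every non-list item inside a possibly nested list."""
    if isinstance(value, list):
        for item in value:
            yield from iter_leaves(item)
    else:
        yield value


def validate_content_map(slot_ids, content):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    known = set(slot_ids)
    # Every slot found in the structure should receive content.
    for slot_id in slot_ids:
        if slot_id not in content:
            problems.append(f"{slot_id}: missing")
    for key, value in content.items():
        # Keys must be slot IDs; values must be strings or lists of strings.
        if key not in known:
            problems.append(f"{key}: not a slot in structure.json")
        for leaf in iter_leaves(value):
            if not isinstance(leaf, str):
                problems.append(f"{key}: non-string value {leaf!r}")
            elif any(symbol in leaf for symbol in MARKDOWN_SYMBOLS):
                problems.append(f"{key}: contains a Markdown symbol")
    return problems
```

If the returned list is empty, the map can be written with `json.dump(content, fh, ensure_ascii=False)` so that Chinese citations stay readable in the file.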
Build Final Doc: Run the builder script to inject your generated content into the clean template.
python skills/doc_processor/scripts/builder.py <path_to_clean_template_docx> <path_to_content_json> <path_to_final_docx>
⚠️ Exception handling:
If the template file does not exist (FileNotFoundError), follow this procedure:
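import os
template_path = "/tmp/xxx.docx"  # path supplied by the user
if not os.path.exists(template_path):
    # 1. Report the error
    print(f"❌ Template file not found: {template_path}")
    # 2. Explain the likely causes
    print("Possible causes:")
    print("  • Files under /tmp/ are cleaned up when the session ends")
    print("  • The file path is wrong")
    print("  • The file was moved or deleted")
    # 3. Ask the user
    print("\n💡 Please provide the correct template file path:")
    print("  Consider copying the template outside /tmp/, e.g. ~/Documents/templates/")
    # 4. Wait for the user to supply a new path (in the conversation)
    # Do not continue generating; this avoids producing a badly formatted document!
Handling in a new session: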
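User: Generate a lesson plan for me; the template is /tmp/template.docx
Assistant: Checking whether the file exists...
If the file is missing:
"⚠️ The template file /tmp/template.docx does not exist! Files under /tmp/ are cleaned up when a session ends. Please provide the correct template path or upload the template again. We recommend keeping templates under ~/Documents/."
User: (provides a new path or uploads the file again)
Assistant: (continues generating with the correct path)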
Scripts Reference
- scripts/parser.py: Analyzes structure. Returns valid JSON.
- scripts/cleaner.py: Wipes content cells/paragraphs.
- scripts/generator.py: Optional mock script. In real usage, the Agent generates the content.json.
- scripts/builder.py: Fills blocks by ID. Matches iteration order of parser.py.
Local Question Bank Integration (Mandatory Constraint)
Repository Path Configuration
Default Path: /Users/xielk/webdata/english/lesson/resource
This directory contains authentic exam materials organized by:
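- District (区): 徐汇/ (Xuhui), 浦东/ (Pudong), 嘉定/ (Jiading), etc.
- Year: 2025/, 2024/, 2023/, etc.
- Type: 一模/ (first mock exam), 二模/ (second mock exam), 中考/ (Zhongkao), etc.
- Category: 语法/ (grammar), 阅读/ (reading), 作文/ (composition), etc.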
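As an illustration of how these directory names could be turned into metadata, here is a minimal sketch. It is not the skill's indexer: the function name `classify_path` is hypothetical, the vocabularies contain only the example names listed above, and because the nesting order of the directories is not stated, each path component is classified on its own.

```python
import re
from pathlib import PurePosixPath

# Example vocabularies taken from the directory names listed above;
# a real repository would likely contain more entries.
DISTRICTS = {"徐汇", "浦东", "嘉定"}
EXAM_TYPES = {"一模", "二模", "中考"}
CATEGORIES = {"语法", "阅读", "作文"}


def classify_path(relative_path):
    """Map each recognised path component to a metadata field."""
    meta = {"district": None, "year": None, "exam_type": None, "category": None}
    for part in PurePosixPath(relative_path).parts:
        if part in DISTRICTS:
            meta["district"] = part
        elif re.fullmatch(r"20\d{2}", part):
            meta["year"] = part
        elif part in EXAM_TYPES:
            meta["exam_type"] = part
        elif part in CATEGORIES:
            meta["category"] = part
    return meta
```

Unrecognised components (such as the file name) are ignored, so fields simply stay `None` when a path lacks that level.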
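Index System (Index + On-Demand Loading)
Solving the large-file problem: The question bank totals about 65MB, so loading every docx directly would incur a huge token cost. Use the index + on-demand loading mechanism instead:
1. Generate the index (run on first use or whenever the question bank is updated)
# Create the index (only needs to run once; takes about 10 seconds)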
python skills/doc_processor/scripts/indexer.py
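Index file location: /Users/xielk/webdata/english/lesson/resource/index.json
The index contains:
- File path and filename
- Year, district, exam type, and question type (parsed automatically)
- Preview content (first 500 characters)
- File size and modification time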
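To make the shape of an index entry concrete, here is a hedged sketch of how one entry might be assembled. It does not reproduce indexer.py: the names `build_index_entry` and `write_index` and the dictionary keys are assumptions, only the year is parsed here, and the docx text is assumed to have been extracted already.

```python
import json
import os
import re

PREVIEW_CHARS = 500  # the index keeps only the first 500 characters


def build_index_entry(path, text):
    """Build one index record from a file path and its extracted text."""
    name = os.path.basename(path)
    stat = os.stat(path)
    year = re.search(r"20\d{2}", name)  # assumes the year appears in the filename
    return {
        "path": path,
        "filename": name,
        "year": year.group(0) if year else None,
        "preview": text[:PREVIEW_CHARS],
        "size": stat.st_size,
        "mtime": stat.st_mtime,
    }


def write_index(entries, index_path):
    """Write all entries to a JSON file, keeping Chinese text readable."""
    with open(index_path, "w", encoding="utf-8") as fh:
        json.dump(entries, fh, ensure_ascii=False, indent=2)
```

Because each entry stores only a short preview plus metadata, the index stays small even when the underlying docx files are large.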
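2. How to search
Option A: Use the Searcher class (recommended)
from skills.doc_processor.scripts.searcher import QuestionBankSearcher
# Initialize (loads the index; very few tokens)
searcher = QuestionBankSearcher()
# Search the index (index lookup only; no docx is loaded)
results = searcher.search(
    keyword="非谓语",  # keyword
    district="徐汇",  # optional: filter by district
    year="2025",  # optional: filter by year
    limit=10  # number of results to return
)
# Smart search (index lookup + on-demand docx loading)
idx_results, questions = searcher.smart_search(
    topic="非谓语",
    district="嘉定",  # prefer the student's district
    year="2025",
    max_docs=3,  # load at most 3 files
    max_questions_per_doc=5  # at most 5 questions per file
)
# questions contains the question text and source annotations
for q in questions:
    print(q['content'])  # question text
    print(q['source'])  # source: (2025 嘉定一模)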
Option B: Convenience function
from skills.doc_processor.scripts.searcher import search_question_bank
# One-call search
results, questions = search_question_bank(
topic="定语从句",
district="浦东",
year="2024"
)
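3. Token cost comparison
| Approach | Token usage | Notes |
|---|---|---|
| Load all docx files directly (65MB) | Huge | ❌ Not recommended |
| Pre-convert to txt, then full-text search | Large | ⚠️ Slightly better but still expensive |
| Index + on-demand loading | Minimal | ✅ Loads only the 3-5 files needed |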
Search Strategy (MUST FOLLOW)
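Search through the index system:
- Load the index (very few tokens, one-time)
- Search the index (match filenames and preview content)
- Load on demand (only the 3-5 most relevant docx files)
- Extract questions (with source annotations)
Concrete steps: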
# Step 1: Make sure the index has been created
python skills/doc_processor/scripts/indexer.py
# Step 2: Search with the Searcher from Python
python << 'PYEOF'
from skills.doc_processor.scripts.searcher import search_question_bank
# Search grammar questions (prefer Jiading district, 2025)
results, questions = search_question_bank("非谓语", "嘉定", "2025")
# Search reading materials
results, passages = search_question_bank("阅读B篇", "徐汇", "2024")
# Search model compositions
results, compositions = search_question_bank("中考作文", None, "2023")
PYEOF
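The index-search step of this strategy can be sketched in a few lines. This is an illustrative sketch only, not the code in searcher.py: `search_index` is a hypothetical name, and the entry keys (`filename`, `preview`, `district`, `year`) are assumed from the index description.

```python
def search_index(entries, keyword, district=None, year=None, limit=10):
    """Filter index entries without opening any docx file."""
    hits = []
    for entry in entries:
        # Match the keyword against the filename and the preview text.
        haystack = entry.get("filename", "") + " " + entry.get("preview", "")
        if keyword not in haystack:
            continue
        # Optional filters; None means "do not filter on this field".
        if district and entry.get("district") != district:
            continue
        if year and entry.get("year") != year:
            continue
        hits.append(entry)
    return hits[:limit]
```

Only the paths returned here would then be opened, which is what keeps the number of loaded documents within the 3-5 file budget.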
Source Citation Format (Mandatory Annotation)
Every piece of content extracted from the repository MUST include source annotation:
Format: (YYYY 区域 考试类型 [题型]), that is, (year district exam-type [question-type])
Examples:
- (2025 徐汇一模 语法单选): 2025 Xuhui District First Mock Exam, Grammar MCQ
- (2024 浦东二模 阅读B篇): 2024 Pudong District Second Mock Exam, Reading Passage B
- (2023 Shanghai Zhongkao 作文): 2023 Shanghai High School Entrance Exam, Composition
- (2024 Jiading Model 完形填空): 2024 Jiading District Mock Exam, Cloze Test
Placement:
- Place citation immediately after the question title or passage title
- Example:
[Example 1] Choose the best answer (2025 徐汇一模 语法单选)
The problem ______ at the meeting tomorrow is important.
A. to be discussed B. being discussed C. discussed D. to discuss
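The citation format can be produced and checked mechanically. The sketch below is an assumption-laden illustration: `format_citation` and `is_citation` are hypothetical helpers, and district and exam type are joined without a space because that is how the Chinese examples above are written.

```python
import re

# A citation is a parenthesised 4-digit year followed by one or more
# space-separated tokens, e.g. "(2025 徐汇一模 语法单选)".
CITATION_RE = re.compile(r"^\(\d{4} \S+(?: \S+)*\)$")


def format_citation(year, district, exam_type, question_type=None):
    """Build a citation string such as '(2025 徐汇一模 语法单选)'."""
    core = f"{year} {district}{exam_type}"
    if question_type:
        core += f" {question_type}"
    return f"({core})"


def is_citation(text):
    """Return True if the text looks like a well-formed citation."""
    return CITATION_RE.match(text) is not None
```

For example, `format_citation(2025, "徐汇", "一模", "语法单选")` yields `(2025 徐汇一模 语法单选)`, which `is_citation` accepts.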
Priority Rules
When multiple sources are available, select in this order:
- Recency: Prioritize 2025 over 2024 over 2023
- Student's District: If student is from Jiading, use Jiading papers first
- Difficulty Match: Select materials matching the student's current level (e.g., a score of 98 → medium difficulty; avoid items that are too basic)
- Topic Relevance: Exact topic match > Related topic > General review
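One way to apply these four rules in order is a tuple sort key. This is only a sketch: the function name `rank_candidates` and the fields `difficulty_gap` (distance from the student's level, 0 is best) and `topic_match` (0 exact, 1 related, 2 general review) are hypothetical and would have to be computed elsewhere.

```python
def rank_candidates(candidates, student_district=None):
    """Sort candidate sources by the priority rules, best first."""
    def priority(candidate):
        return (
            -int(candidate["year"]),                                  # 1. recency
            0 if candidate.get("district") == student_district else 1,  # 2. district
            candidate.get("difficulty_gap", 0),                       # 3. difficulty
            candidate.get("topic_match", 0),                          # 4. topic
        )
    return sorted(candidates, key=priority)
```

Because Python compares tuples element by element, a later rule only breaks ties left by the earlier ones, which mirrors the "select in this order" wording.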
Error Handling
If required content is NOT found in the repository:
- Expand search to adjacent years (e.g., if 2025 not found, try 2024)
- Expand search to other districts (e.g., if 徐汇 not found, try 浦东)
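- If still not found, inform the user: "No matching questions for [specific year/district] were found in the question bank; similar questions from [alternative source] were used instead"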
- NEVER fabricate exam questions or pretend they exist in the repository
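The fallback order above can be expressed as a small loop. This is a sketch under assumptions: `find_with_fallback` is a hypothetical helper, `search` stands for any callable taking (topic, district, year) and returning a list, and "adjacent years" is interpreted here as earlier years only.

```python
def find_with_fallback(search, topic, district, year, other_districts=(), years_back=2):
    """Try the requested source first, then earlier years, then other districts.

    Returns (questions, (district, year) actually used, substituted flag).
    An empty result means nothing was found; the caller must then tell the
    user rather than invent questions.
    """
    years = [str(int(year) - offset) for offset in range(years_back + 1)]
    for candidate_district in [district, *other_districts]:
        for candidate_year in years:
            found = search(topic, candidate_district, candidate_year)
            if found:
                used = (candidate_district, candidate_year)
                substituted = used != (district, str(year))
                return found, used, substituted
    return [], None, False
```

When `substituted` is True, the returned district and year can be inserted into the user-facing message so the replacement source is stated explicitly.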
Content Types to Search
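- Grammar Questions: 单选题 (multiple choice), 填空题 (fill in the blank), 改错题 (error correction), 完成句子 (sentence completion)
- Reading Materials: A篇应用文 (Passage A, practical texts), B篇记叙文 (Passage B, narratives), C篇首字母填空 (Passage C, first-letter cloze), D篇回答问题 (Passage D, answering questions)
- Compositions: 中考作文范文 (Zhongkao model essays), 满分作文 (full-score essays), 常见话题模板 (common topic templates)
- Vocabulary: 考纲词汇 (syllabus vocabulary), 高频短语 (high-frequency phrases), 固定搭配 (fixed collocations)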
Shanghai Zhongkao Question Type Structure
You must understand the structure of the Shanghai Zhongkao English paper, which differs from that of other regions:
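| Section | Content | Points | Notes |
|---|---|---|---|
| Part 1 | Listening | 30 | Short dialogues, long dialogues, short passages |
| Part 2 | Phonetics / grammar / vocabulary | 40 | Phonetics, word forms, grammar multiple choice |
| Part 3 | Reading comprehension | 50 | Four passages: A/B/C/D |
| - Passage A (A篇) | Practical-text reading | about 12 | Ads, notices, guides; 3-4 multiple-choice questions |
| - Passage B (B篇) | Narrative reading | about 12 | Stories; 3-4 multiple-choice questions |
| - Passage C (C篇) | First-letter cloze | 14 | ⚠️ Not multiple choice! Fill in each blank from its given first letter (7 blanks × 2 points) |
| - Passage D (D篇) | Answering questions | 12 | Answer questions after reading (6 questions) |
| Part 4 | Writing | 20 | Assigned-topic essay (80-100 words) |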
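⚠️ Common mistake:
❌ Wrong: Passage C is multiple-choice reading comprehension (that is the national-paper format)
✅ Right: In the Shanghai Zhongkao, Passage C is a first-letter cloze (cloze with initial letters)
Passage C characteristics:
- A short passage of 150-200 words
- 7 blanks, each with its first letter given
- The correct word must be filled in based on the context and the first letter
- Tested skills: spelling, grammatical collocation, contextual logic
Search keyword mapping:
- Use: C篇 / 首字母填空 / 首字母 (Passage C / first-letter cloze / first letter)
- Do not use: 阅读理解 / 阅读C篇 / 选择题 (reading comprehension / reading Passage C / multiple choice)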
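The Passage C characteristics lend themselves to a quick sanity check on extracted items. The following is a minimal sketch: `validate_passage_c` is a hypothetical helper, and it assumes the first letters and the answer key have already been extracted as two parallel lists.

```python
def validate_passage_c(first_letters, answers, expected_blanks=7):
    """Return a list of problems with a first-letter cloze item (empty = OK)."""
    problems = []
    if len(first_letters) != expected_blanks:
        problems.append(f"expected {expected_blanks} blanks, found {len(first_letters)}")
    if len(answers) != len(first_letters):
        problems.append("number of answers does not match number of blanks")
    # Each answer must begin with the letter printed in the blank.
    for number, (letter, answer) in enumerate(zip(first_letters, answers), start=1):
        if not answer.lower().startswith(letter.lower()):
            problems.append(f"blank {number}: answer {answer!r} does not start with {letter!r}")
    return problems
```

An item that fails this check was probably mis-extracted, or is not a Passage C item at all, and is worth reviewing before it is used.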