name: doc-convert description: "Convert local PDF/DOCX documents to text for analysis" metadata: {"aeloon": {"emoji": "๐", "requires": {"bins": ["python3"]}}}
Document Conversion
When a user references local documents or directories (e.g., "refer to ~/papers/report.pdf", "according to these files", "use the documents in ~/data/"), convert the referenced documents to text format, then read and analyze them.
Document Detection
Recognize when the user mentions:
- File paths (e.g.,
~/papers/report.pdf,/home/user/docs/notes.docx) - Directories (e.g.,
~/papers/,./docs/) - "these documents" / "these files" / "use these papers" / "refer to" / "according to"
Conversion Workflow
Step 1: Convert PDF to text
exec(command="python3 -c \"
import fitz, sys, os, pathlib
src = '{source_path}'
out_dir = os.path.expanduser('~/.aeloon/converted_docs')
os.makedirs(out_dir, exist_ok=True)
stem = pathlib.Path(src).stem
out_path = os.path.join(out_dir, stem + '.txt')
doc = fitz.open(src)
text = ''
for page in doc:
text += page.get_text() + '\n\n'
doc.close()
with open(out_path, 'w', encoding='utf-8') as f:
f.write(text.strip())
print(f'Converted: {out_path}')
print(f'Pages: {doc.page_count}, Characters: {len(text)}')
\"")
Step 2: Convert DOCX to text
exec(command="python3 -c \"
import docx, os, pathlib
src = '{source_path}'
out_dir = os.path.expanduser('~/.aeloon/converted_docs')
os.makedirs(out_dir, exist_ok=True)
stem = pathlib.Path(src).stem
out_path = os.path.join(out_dir, stem + '.txt')
doc = docx.Document(src)
lines = []
for para in doc.paragraphs:
if para.style.name.startswith('Heading'):
level = int(para.style.name.split()[-1]) if para.style.name[-1].isdigit() else 1
lines.append('#' * level + ' ' + para.text)
else:
lines.append(para.text)
text = '\n\n'.join(lines)
with open(out_path, 'w', encoding='utf-8') as f:
f.write(text.strip())
print(f'Converted: {out_path}')
\"")
Step 3: Read converted text
read_file(path="{converted_path}")
Step 4: Process directory of documents
When the user specifies a directory:
exec(command="python3 -c \"
import fitz, docx, os, pathlib, glob
src_dir = '{directory_path}'
out_dir = os.path.expanduser('~/.aeloon/converted_docs')
os.makedirs(out_dir, exist_ok=True)
supported = {'.pdf', '.docx', '.md', '.txt', '.csv'}
converted = 0
skipped = 0
for f in sorted(pathlib.Path(src_dir).rglob('*')):
if f.suffix.lower() not in supported:
continue
out_path = pathlib.Path(out_dir) / (f.stem + '.txt')
if f.suffix.lower() == '.pdf':
doc = fitz.open(str(f))
text = '\\n\\n'.join(p.get_text() for p in doc)
doc.close()
elif f.suffix.lower() == '.docx':
doc = docx.Document(str(f))
text = '\\n\\n'.join(p.text for p in doc.paragraphs)
else:
text = f.read_text(encoding='utf-8', errors='replace')
out_path.write_text(text.strip(), encoding='utf-8')
converted += 1
print(f'OK: {f.name} -> {out_path.name}')
print(f'\\nTotal: {converted} converted, {skipped} skipped')
\"")
Output Directory
- Converted text files are saved to
~/.aeloon/converted_docs/ - Files are named
<original_stem>.txt(e.g.,report.txtfromreport.pdf) - Never auto-delete converted files โ they serve as a local cache for future reference
- If the directory grows too large, the user may manually clean old files
Error Handling
If conversion fails:
- Report which file failed:
"Failed to convert {filename}: {error}" - Suggest installing dependencies:
pip install PyMuPDF pdfminer.six python-docx - Continue with available text-based documents (
.md,.txt,.csv)
Important Notes
- Only convert referenced documents โ do NOT auto-scan entire file systems
- If a file is already converted (exists in
~/.aeloon/converted_docs/), skip conversion and read the cached version - Large PDFs may take time โ inform the user about processing time
- Always use
exectool for file operations, never callread_fileon binary files - For
.md/.txt/.csvfiles, useread_filedirectly โ no conversion needed