name: liteparse
description: Use this skill whenever a task involves a document file (PDF, DOCX, PPTX, XLSX, or image) and you need to read it or get text, tables, or specific values out of it — including to answer a question about its contents, look up a figure, extract data, or convert it to text/JSON. Provides fast, local, model-free extraction with no cloud or API key. Reach for this instead of ad-hoc pdftotext/pypdf/textract whenever a question or task references a document file.
compatibility: Requires Node 18+ and @llamaindex/liteparse installed globally (npm i -g @llamaindex/liteparse). LibreOffice for Office files; ImageMagick for images.
license: MIT
metadata:
  author: LlamaIndex
  version: "0.3.0"
LiteParse
Extract text from documents locally with the lit CLI — a fast, model-free parser (a drop-in,
faster replacement for pdftotext/pypdf).
Answering a question about ONE document: stream and search in a SINGLE shell command
lit parse writes plain text to stdout, so pipe it straight into your normal search tools in
one Bash command — exactly how you would use pdftotext -layout file.pdf - | grep. Do not
write an intermediate file, and do not use the Read or Grep tools on a saved file: each of
those is an extra agent round-trip. Keep parse+search fused in one command:
lit parse ./input.pdf --format text --no-ocr | grep -i -n -A3 -B3 "total assets" | head -40
lit parse ./input.pdf --format text --no-ocr | sed -n '900,945p'
- Born-digital PDF (has a real text layer): add --no-ocr (much faster, identical text).
- Scanned PDF / image: drop --no-ocr (OCR on). If the value is missing from the OCR text or the digits look wrong, read the page visually instead of trusting OCR: render it with lit screenshot ./input.pdf --target-pages "N" -o ./shots/ and view the PNG.
- Multi-column tables: piped --format text keeps most layout; if columns collapse so you can't tell which column a number is in, render that page and read it visually.
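When it is not known in advance whether a PDF is born-digital or scanned, the first two rules above can be combined into a fallback. The following is a minimal sketch, not part of liteparse: parse_text is a hypothetical helper name, and it assumes that a PDF without a text layer yields empty (or whitespace-only) output under --no-ocr.

```shell
# Sketch: try the fast path first, fall back to OCR when no text comes back.
# parse_text is a hypothetical helper; only flags documented above are used.
parse_text() {
  doc=$1
  # Fast path: skip OCR and capture whatever the text layer provides.
  text=$(lit parse "$doc" --format text --no-ocr)
  # Assumption: a scanned PDF produces no non-whitespace text here.
  if [ -z "$(printf '%s' "$text" | tr -d '[:space:]')" ]; then
    # Nothing extracted, so re-run with OCR on (no --no-ocr flag).
    text=$(lit parse "$doc" --format text)
  fi
  printf '%s\n' "$text"
}

# Usage (still a single shell command):
# parse_text ./input.pdf | grep -i -n -A3 -B3 "total assets" | head -40
```

The fallback costs one extra parse for scanned files only; born-digital files still take the fast path.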
Answering MANY questions about the same document(s): parse once, reuse
Only here is it worth materializing a file (so you don't re-parse per question):
lit parse ./inputs/<doc>.pdf --format text --no-ocr -o ./parsed/<doc>.txt # once per doc
grep -i -n -A3 -B3 "total assets" ./parsed/<doc>.txt # then search the file
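The parse-once pattern above can be wrapped so that repeated questions never trigger a re-parse. This is a rough sketch under stated assumptions: parse_once is a hypothetical helper name, ./parsed is the cache directory from the example above, and lit's stdout is discarded when -o is given in case it prints status messages there.

```shell
# Sketch: parse a document only when no up-to-date cached text exists.
# parse_once is a hypothetical helper, not a lit subcommand.
parse_once() {
  src=$1
  name=$(basename "$src")
  out=./parsed/${name%.*}.txt
  mkdir -p ./parsed
  # Re-parse if the cache is missing, empty, or older than the source file.
  if [ ! -s "$out" ] || [ "$src" -nt "$out" ]; then
    # Assumption: stdout is not needed when -o is used, so it is discarded.
    lit parse "$src" --format text --no-ocr -o "$out" >/dev/null || return 1
  fi
  # Print the cache path so callers can hand it to grep or sed.
  printf '%s\n' "$out"
}

# Usage: search the cached text, parsing at most once per document.
# grep -i -n -A3 -B3 "total assets" "$(parse_once ./inputs/report.pdf)"
```

Here ./inputs/report.pdf is a placeholder path. For scanned inputs, drop --no-ocr inside the helper.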
Core flags
--format text|json · --no-ocr · --dpi <n> (default 150) · --target-pages "1-5,10" ·
--ocr-language <iso> · lit batch-parse ./in ./out. Use --format json only when you need
bounding boxes / layout (it is much larger — still search it, don't load it whole).
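For long documents where the relevant pages are already known, the flags above can narrow the parse before searching. A small sketch, assuming --target-pages can be combined with --format text and --no-ocr in one invocation (this section does not confirm that); search_pages is a hypothetical helper name.

```shell
# Sketch: parse only the given pages and search them in one pipeline.
# search_pages is a hypothetical helper; the flag combination is an assumption.
search_pages() {
  doc=$1
  pages=$2
  pattern=$3
  lit parse "$doc" --format text --no-ocr --target-pages "$pages" \
    | grep -i -n -- "$pattern"
}

# Usage:
# search_pages ./input.pdf "1-5,10" "total assets" | head -40
```

Line numbers reported by grep are relative to the parsed page range, not to the whole document.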
Setup
PDF works out of the box. If lit is missing: npm i -g @llamaindex/liteparse (verify
lit --version). Office docs need LibreOffice; images need ImageMagick (auto-converted to PDF).
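A quick preflight can report which of these dependencies are on PATH before a parse is attempted. The sketch below is illustrative only: have_any and check_setup are hypothetical helper names, and the binary names (soffice or libreoffice, magick or convert) are the usual ones for those tools, assumed here rather than confirmed as what lit looks for.

```shell
# Sketch: report missing dependencies. Helper names are hypothetical.
# Succeeds if at least one of the named commands is on PATH.
have_any() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 && return 0
  done
  return 1
}

check_setup() {
  have_any lit || echo "lit missing: npm i -g @llamaindex/liteparse"
  # Assumed binary names for LibreOffice and ImageMagick.
  have_any soffice libreoffice || echo "LibreOffice not found (needed for Office files)"
  have_any magick convert || echo "ImageMagick not found (needed for images)"
  return 0
}

# Usage:
# check_setup
```

No output from check_setup means all three were found under the assumed names.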