feishu-doc-scraper - SKILL.md Agent Skill

name: feishu-doc-scraper description: Extract Feishu (Lark) Docs, Wiki pages/collections, spreadsheets, and Minutes (妙记) transcripts into faithful local Markdown via the lark-cli API (no LLM rewriting of the body; browser-DOM fallback when lark-cli can't reach the content). Use whenever the source is a Feishu/Lark URL and fidelity matters — 导出飞书文档/合集/妙记转写, 把飞书 wiki/知识库转 markdown, archiving a Feishu collection, exporting a 妙记 transcript, or saving a Feishu page — even if the user only says clipping, archiving, converting, or "save this". Also covers the owner-exported .docx → faithful Markdown path. compatibility: Primary path needs the `lark-cli` binary (npm `@larksuite/cli`, verified 1.0.32, 2026-05) authenticated to the target tenant. Fallback path needs a browser automation surface with an authenticated session (Chrome DevTools MCP / Browser Use / Computer Use). docx path needs `python-docx` and a docx→md converter (the bundled doc-to-markdown skill or pandoc). argument-hint: [feishu-url-or-output-path]

Feishu Doc Scraper

Extract a Feishu/Lark source into faithful local Markdown. Prefer the lark-cli API — it extracts the body programmatically (no model paraphrasing), follows a collection's reference graph, and reads permission boundaries from error codes instead of guessing. Treat the rendered browser page as a fallback, not the source of truth: in real collection-scraping work the API path consistently does the whole job while the browser path is never needed.

Scope (read this first)

This skill's contract is faithful per-source Markdown + a record of what was extracted. It does not decide how the resulting files are named, indexed, deduplicated against existing notes, or organized into a knowledge base — that belongs to the host PKM / the user's own conventions. Stopping at faithful extraction keeps this skill orthogonal and reusable. When the user wants the output filed into a vault, extract first, then hand the clean Markdown to their organizing workflow.

Choose the path

Is the source a Feishu/Lark URL (wiki / docx / sheets / minutes / base)?
├── YES → is lark-cli installed and authenticated to that tenant?
│        ├── YES → PATH A: lark-cli API extraction  (primary — start here)
│        │         └── hit code 131006 / 99991679 (permission denied)?
│        │              └── PATH B: owner-exported .docx → faithful Markdown
│        └── NO  → install/auth lark-cli first (it is worth it); only if
│                  truly impossible → PATH D: browser DOM fallback
├── the URL is a Minutes / 妙记 link, or a doc references one → PATH C: Minutes transcript
└── you were handed an exported .docx (not a URL) → PATH B

A collection/hub is just a docx whose body references other docs — Path A handles it by recursively following the reference graph, not by visiting pages in a browser.

Path A — lark-cli API extraction (primary)

Full command catalog, recursion engine, cross-tenant and personal-space nuances: references/lark-cli-api-extraction.md. The essentials for the common case:

1. Disable the proxy for Feishu domestic domains. Feishu's *.feishu.cn endpoints are direct-connect in mainland China; routing them through a local proxy leaks credentials through the proxy and gets DNS-hijacked. lark-cli itself warns about this. Always:

export LARK_CLI_NO_PROXY=1

This does not conflict with any "Claude/Anthropic domains must use the proxy" rule — Feishu is a different host and is direct.

2. Classify the URL, then resolve to a fetchable doc token.

…/wiki/<node_token> — a wiki node token is not a doc token. Resolve it first:

lark-cli wiki spaces get_node --params '{"token":"<node_token>"}'
# → .data.node.obj_token  and  .data.node.obj_type  (e.g. "docx")

…/docx/<doc_token> — already a doc token, fetch directly.
…/sheets/<token> — spreadsheet, use the sheets commands (see reference).
…/minutes/<token> — Minutes, go to Path C.

3. Fetch the body programmatically — never via the model. The body field moved across lark-cli versions, so probe both rather than hard-coding one (this keeps working whichever version is installed):

lark-cli docs +fetch --doc <obj_token> --format json > /tmp/fetch.json 2> /tmp/fetch.err
# ≤1.0.32: clean Markdown in .data.markdown.
# 1.0.55: body moved to .data.document.content as HTML (.data.markdown is null).
if jq -e -r '.data.markdown // empty' /tmp/fetch.json > source.md && [ -s source.md ]; then
  : # got clean Markdown directly
else
  jq -r '.data.document.content' /tmp/fetch.json | pandoc -f html -t gfm > source.md
fi

--format markdown is not a valid value (lark-cli warns and falls back to json). Keep stdout and stderr separate — a harmless [deprecated] line goes to stderr, and piping 2>/dev/null and jq together produced a false Exit code 5 in practice. The body must reach disk via jq/pandoc, never retyped or summarized by the model — paraphrasing silently corrupts source text, the single most important fidelity rule. (pandoc only re-renders HTML structure to Markdown; it does not rewrite prose, so fidelity holds.)

4. If it's a collection/hub, follow the reference graph (BFS). The hub body contains <mention-doc>, <sheet>, <image> tags and cross-tenant / Minutes / Tencent-Meeting URLs. Extract every reference, dispatch by type, fetch, and repeat on each newly fetched doc until no new references remain (leaf nodes). Use the bundled extractor so nothing is silently missed (a missed reference = a missing document, the #1 hub-scraping failure):

python3 scripts/feishu_extract_refs.py source.md   # → JSON list of {type, token, title}

Recursion loop, dispatch table, and the cross-tenant/my.feishu.cn personal-space rules are in the reference.

5. Final residual-tag check (acceptance gate for collections). Every rich-media reference must have been resolved and rendered:

grep -rlE '<(lark-table|lark-tr|sheet token=|mention-doc|view type=)' . && echo "UNRESOLVED — keep recursing" || echo "clean"

Must be empty before you stop.

Path B — permission denied → owner-exported .docx

lark-cli wiki spaces get_node returning code 131006 … node permission denied, user needs read permission (or fetch returning it) is a hard Feishu-side boundary. lark-cli, anonymous curl, and the browser all fail it — this has been verified exhaustively; do not spend cycles trying to bypass it. The only correct move: ask the permission holder to export the doc as .docx and send it back out-of-band, then convert with fidelity (font-size→heading and w:shd→highlight restoration, then visual verification). Full procedure: references/docx-export-to-markdown.md.

Path C — Feishu Minutes (妙记) transcript

lark-cli minutes only returns metadata and can download audio/video — it cannot export the text transcript. The transcript comes from a native endpoint called through lark-cli api, and needs an extra scope granted via a device-flow login. Native AI transcription is far better than downloading the media and re-running ASR — never do the latter. Endpoint, scope name, the device-flow timeout trap, and per-minute (not per-tenant) permission behavior: references/feishu-minutes-transcript.md.

Path D — browser DOM fallback (last resort)

Only when lark-cli genuinely cannot reach the content (no install possible, and the doc is not permission-walled). This is the old virtual-scroll / TOC-driven DOM capture workflow. It is slower, depends on a connected browser surface (the in-browser extension frequently fails to connect), and an anonymous debugging Chrome can only tell you whether a page is publicly reachable — it cannot read login-walled content. Workflow: references/browser-dom-fallback.md. Battle-tested DOM rules (virtual scroll, data-block-id ordering, table/bullet extraction, image streams): references/browser-failure-rules.md.

Hard rules

These are the rules whose violation silently ruins the output. Each has a reason — follow the reason, not just the letter.

Never let the document body pass through the model. Extract with jq/cat/scripts straight to disk. The model paraphrasing source text is undetectable later and destroys fidelity. This is why Path A beats the browser path structurally.
export LARK_CLI_NO_PROXY=1 for *.feishu.cn. Otherwise credentials transit a local proxy and DNS is hijacked.
Transcripts come from the platform's native transcription, never re-ASR. Downloading media and transcribing again loses speaker labels, timestamps, and accuracy.
A generated docx Markdown is not done until it has been visually verified against the source (render to image, read it). Feishu-exported docx uses font-size+bold for headings rather than Word heading styles, so a "no errors, word count matches" check passes while the entire heading hierarchy is silently flat. Text-level checks cannot catch this.
Do not 死磕 (grind) on docx embedded-image download. lark-cli (through 1.0.32) cannot download <image> tokens from a docx — exhaustively verified. Register the image tokens and note "needs document owner to right-click → save"; the text is the value, images are a tracked gap.
HTTP 200 from anonymous curl ≠ accessible. A Feishu login wall returns 200 with a body containing accounts.feishu.cn / login / passport / an empty <title>. Check the body, never infer "public" from the status code.
A file "not found" by a search agent is not authoritative. Verify against authoritative sources before concluding (this is general Inference Discipline; relevant when locating where ingested content already lives).
U+FFFD final check on every produced file: LC_ALL=C grep -rl $'\xef\xbf\xbd' . must be empty. A replacement character means an encoding step corrupted the text.

Acceptance contract

Stop only when all that apply are true:

Every fetched body reached disk via jq/script, not retyped by the model.
Collections: the residual rich-media-tag grep (Path A step 5) is empty — every mention-doc/sheet/cross-tenant reference was followed to a leaf.
LC_ALL=C grep -rl $'\xef\xbf\xbd' . is empty.
docx path: rendered to an image and visually compared to the source; heading hierarchy and highlights match (see docx reference's checklist).
Browser fallback only: TOC coverage + scale check (see browser-failure-rules.md).
Each output file's frontmatter records source (the original URL/token) and, if any post-processing was applied, a post_process provenance line.
Permission gaps (131006 docs not exported yet, undownloadable images) are explicitly listed for the user — a transparent gap beats a silent omission.

Do NOT attempt

Verified dead-ends — retrying them only wastes the session. Full table with failure modes and root causes: references/permission-and-failure-boundaries.md. The top ones:

Bypassing 131006 permission-denied by any means (lark-cli / curl / anonymous browser) — it is a server-side boundary.
Downloading docx embedded images via docs +media-download, api …/drive/v1/medias/<t>/download (with or without extra), or schema drive.medias.download — none work; lark-cli even mis-reports the real HTTP 400 as "empty JSON".
WebFetch against open.feishu.cn/document/server-docs/... for API specs — backend is flaky; use open.feishu.cn/llms-docs/zh-CN/llms-<module>.txt instead (LLM-friendly, stable).
AppleScript/JXA executeJavaScript, Chrome CDP on port 9222 — disabled/empty in this environment (browser path only).
Using minimax-docx to convert docx→md — it is a docx authoring tool; use the doc-to-markdown skill instead.

Bundled resources

scripts/feishu_extract_refs.py — deterministic reference-token extractor; the recursion engine's core. Run it on every fetched body to enumerate <mention-doc>/<sheet>/<image>/cross-tenant/Minutes/Tencent-Meeting references as JSON.
scripts/restore_docx_headings.py — for Path B: reads true font sizes via python-docx, maps them to heading levels, restores w:shd highlights to Obsidian ==…==, without retyping body text.
scripts/feishu_dom_capture.js — Path D: injectable end-to-end browser DOM capture.
scripts/download_feishu_images.py — Path D: SSR image extraction when browser automation is unavailable.
scripts/build_feishu_markdown.py — Path D: render a capture manifest into Markdown.
scripts/check_heading_coverage.py — coverage verification (both paths).
references/lark-cli-api-extraction.md — Path A full reference (commands, recursion, sheets, cross-tenant).
references/feishu-minutes-transcript.md — Path C native transcript API + scope auth.
references/permission-and-failure-boundaries.md — error codes + the full Do-NOT-attempt table.
references/docx-export-to-markdown.md — Path B faithful conversion procedure.
references/browser-dom-fallback.md + references/browser-failure-rules.md — Path D.
references/capture-manifest.md — manifest shape for build_feishu_markdown.py.

Next step

After extraction completes, the clean Markdown typically feeds the user's own knowledge-base ingestion (filing, indexing, dedup) — which is deliberately out of this skill's scope. If the source went through Path B (a docx), the doc-to-markdown skill is already part of that flow. Offer the handoff; do not auto-organize:

Extraction complete: [N] sources → faithful Markdown ([M] permission/image gaps listed).

Options:
A) Hand off to your PKM/organizing workflow — file & index these (Recommended if part of a vault)
B) Run /daymade-docs:docs-cleaner — consolidate redundant content across the extracted files
C) Stop here — the faithful Markdown is the deliverable