gorden-img2pptx - SKILL.md Agent Skill

name: gorden-img2pptx
description: >-
  Reverse-engineer a slide image (image-PPT page / screenshot / any rendered slide) into a fully EDITABLE .pptx using a forced 4-layer pipeline: reproduce background + chroma-key the whole frame + chroma-key element icons/decorations + extract text via vision, then compose (text = real editable text boxes, frame/icons = movable images). Restores image-PPT pages and slide captures to editable PPTX: background reproduction, frame/icon extraction, text extraction, layout restoration. Use when "이미지를 편집 가능한 PPT로" (image to editable PPT), "PPT 캡처 변환" (convert PPT capture), "image to editable pptx", "convert slide image to pptx", "추출/복원 슬라이드" (extract/restore slide). Do NOT use for generating new slides from a topic (use gorden-ppt-gen), full topic→editable run (use gorden-super-ppt), translating slides (use ppt-image-translator), or template-based decks (use ppt-template-engine). Ported from @Gorden Sun (github.com/GordenSun/GordenSuperPPTSkills), adapted for Claude Code.

gorden-img2pptx — Slide Image → Editable PPTX

Process images one at a time, forced into 4 layers (bottom→top), then compose an editable .pptx. First 3 image layers come from an image-gen backend (extraction); the text layer comes from Claude vision as real text boxes.

background(reproduced) + whole frame(chroma-key, no split by default) + icons/decor(chroma-key, sliced) + text(vision)

Read references/image-to-pptx.md for full prompts, coordinate/font math, and the QA feedback loop. Read references/runtime-notes.md for the Claude Code image backend.

Backend (Claude Code adaptation)

Codex imagegen/view_image is replaced by:

- Extraction (B2/B3/B4), primary path: `python3 ../gpt-image-2/scripts/edit_image.py --input <src.png> --prompt "<extraction prompt>" --output <layer_raw.png> --style none`. The source image IS the edit target (equivalent to Codex view_image + edit). Never pass only a file path inside a generate prompt.
- Fallback: `mcp__hf__gr1_z_image_turbo_generate` (main agent only; weaker at extraction, so flag the quality risk in QA).
- Vision (text/icon positioning), Claude native: Read the PNG and emit structured JSON. NEVER run tesseract/EasyOCR/system OCR.
- chroma_key / slice_grid / compose_pptx / layout_guard / placement_qa / visual_compare_qa: Python scripts, runtime-agnostic, used verbatim.
- Requires OPENAI_API_KEY for the primary path. If the key is missing, use the HF fallback or stop and tell the user.
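
The primary extraction call can be wrapped in a small helper so that every B2/B3/B4 call passes the source PNG as the edit target. This is only a sketch: the script path and the four flags are copied from the text above, while `build_edit_command` and `has_api_key` are hypothetical names, and the relative path assumes the command runs from this skill's directory.

```python
import os
import shlex

# Path and flags come from the text above; the helper names are hypothetical.
EDIT_SCRIPT = "../gpt-image-2/scripts/edit_image.py"


def build_edit_command(src_png, prompt, out_png):
    """Build the argv for one extraction call; the source PNG is the edit target."""
    return [
        "python3", EDIT_SCRIPT,
        "--input", src_png,
        "--prompt", prompt,
        "--output", out_png,
        "--style", "none",
    ]


def has_api_key(env=None):
    """True when OPENAI_API_KEY is set and non-empty in the given environment."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY"))


cmd = build_edit_command("src.png", "<extraction prompt>", "layer_raw.png")
print(shlex.join(cmd))
# To execute it for real: subprocess.run(cmd, check=True), but only after
# has_api_key() returns True; otherwise use the HF fallback or stop.
```

Building the command as a list rather than a shell string keeps prompts with quotes or spaces intact.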

🔴 Forced 4 layers — none may be skipped (core law)

background → whole frame → icons → text. Every page must really produce a frame (B3) and icon sheet (B4) and chroma-key them. Default: one transparent frame.png as the frame layer; only split into frame_parts/ if the user explicitly asks for movable frame pieces.

Banned degenerations: ① using the raw image as background with text on top (double text); ② redrawing background/cards/title-bars/charts/timelines/decor with native PPT shapes/fills/lines; ③ cropping regions out of the source as background/frame/icon assets; ④ omitting frame or icons; ⑤ drawing any image layer with code/PIL/SVG/HTML/Canvas/matplotlib.

Frame = everything except background, icons/decor/word-art, and plain text: containers (incl. fill + title bars), divider/connector lines, all chart graphics (line/bar/step/pie/axes/grid/trend), ribbons, decorative lines/blocks → extracted in one pass, 1:1 in shape/size/position with the source.

Word-art = decoration, not plain text: gradient/calligraphic/warped/outlined/badge-style text goes to the B4 icon layer as an image. B7 vision text only handles plain-font text.

Evidence gate: each page must write imagegen-assets-manifest.json recording B2/B3/B4 generated_source, copied_to, layer, prompt_file, backend, key_color. Any layer whose backend is not the image-gen path (i.e. programmatic/PIL/SVG/HTML/Canvas/matplotlib/screenshot), or any missing generated_source/prompt_file, fails the page — redo, do not deliver.
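
The gate can be checked mechanically before delivery. The sketch below assumes the manifest is a dict keyed by step (`"B2"`, `"B3"`, `"B4"`) and that backends are stored as short lowercase-able strings; both are assumptions, since the real schema lives in the reference files. `check_manifest` is a hypothetical helper, not one of the shipped scripts.

```python
REQUIRED_STEPS = ("B2", "B3", "B4")
# Per the text, a missing generated_source or prompt_file fails the page.
MUST_EXIST = ("generated_source", "prompt_file")
# Non-image-gen backends named in the text; the stored spelling is an assumption.
BANNED_BACKENDS = {"programmatic", "pil", "svg", "html", "canvas", "matplotlib", "screenshot"}


def check_manifest(manifest):
    """Return a list of problems; an empty list means the page passes the gate."""
    problems = []
    for step in REQUIRED_STEPS:
        record = manifest.get(step)
        if record is None:
            problems.append(f"{step}: no record")
            continue
        for field in MUST_EXIST:
            if not record.get(field):
                problems.append(f"{step}: missing {field}")
        backend = str(record.get("backend", "")).lower()
        if backend in BANNED_BACKENDS:
            problems.append(f"{step}: banned backend '{backend}'")
    return problems


# Example records; the backend value "gpt-image-2" is illustrative only.
example = {
    step: {"generated_source": "/run/raw.png", "prompt_file": "/run/p.txt", "backend": "gpt-image-2"}
    for step in REQUIRED_STEPS
}
print(check_manifest(example))
```

Any non-empty result means the page is redone rather than delivered.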

🧱 Task isolation (anti file-cross-contamination)

Create a unique RUN_ROOT per task (e.g. image2pptx_runs/<ts>_<slug>/); write/read ALL intermediates, prompts, manifests, layout, preview, QA, and final PPTX under it. Never write to or read from fixed editable/01, out/, qa/. layout.json image paths resolve inside the same RUN_ROOT (prefer omitting assets_dir); manifest copied_to must be an absolute path inside RUN_ROOT.
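
One way to follow this rule is to derive the run directory from a timestamp plus a sanitized slug and to reject any path that resolves outside it. The sketch below is illustrative: `make_run_root` and `is_inside` are hypothetical helpers, and the timestamp format is an assumption (the text only specifies the `<ts>_<slug>` shape).

```python
import re
import time
from pathlib import Path


def make_run_root(slug, base="image2pptx_runs", now=None):
    """Return a run directory path shaped like image2pptx_runs/<ts>_<slug>/."""
    ts = time.strftime("%Y%m%d_%H%M%S", time.localtime(now))
    safe_slug = re.sub(r"[^A-Za-z0-9_-]+", "-", slug).strip("-") or "task"
    return Path(base) / f"{ts}_{safe_slug}"


def is_inside(run_root, candidate):
    """True if candidate resolves to run_root itself or a path beneath it."""
    root = Path(run_root).resolve()
    target = Path(candidate).resolve()
    return root == target or root in target.parents


run_root = make_run_root("quarterly review")
print(run_root)
print(is_inside(run_root, run_root / "layers" / "frame.png"))
print(is_inside(run_root, run_root / ".." / "out" / "frame.png"))
```

Creating the directory with `run_root.mkdir(parents=True, exist_ok=False)` would surface a collision if two tasks ever produced the same name.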

Per-page workflow (check each box)

- [ ] B0  Isolation: create unique RUN_ROOT; all files only under it
- [ ] B0a Edit target: before each B2/B3/B4 edit call, the source PNG is the --input edit target; prompt says "extract/erase the target layer FROM this image", not "use as style reference"
- [ ] B1  Probe color: probe_palette.py → decide key color (default #00ff00; if page contains green use #ff00ff etc.)
- [ ] B2  Background: edit_image.py reproduce clean background (no text/icon/frame/card/placeholder); log real backend call
- [ ] B3  Frame: edit_image.py extract frame on the key-color background (fixed prompt, see ref §B3); only swap XXXXXX for the B1 key color; log call
- [ ] B4  Icons: vision-diff source vs B3 frame preview → list source icons/decor/word-art and any that leaked into frame; generate ONLY elements missing from frame, laid out as an N×N grid (no grid lines); log call
- [ ] B5  Chroma + slice: chroma_key.py de-key frame (--preset frame-safe --scale 2) and icons (--preset icon-safe --scale 2); keep whole frame.png by default; slice_grid.py --auto --pad 24 --contact-sheet the icon sheet; check icons_contact_sheet.png + frame.png on gray — any cut/merged/edge-touch icon → redo
- [ ] B6  Place icons: layout.json uses whole frame.png as "frame"; measure each icon/word-art x/y/w/h in SOURCE pixel coords → fractions (x=x_px/ref_width …); units:"fraction"; re-calibrate against preview, anchored on source bbox center
- [ ] B7  Text: Claude vision reads all plain-font text (content/pixel pos/size/color/weight/align) → source-pixel bbox → fractions; size_ratio=text_height_px/ref_height; bold default false (only titles/buttons/keywords); line_spacing≈1.2 for multi-line; fix weight/size/spacing BEFORE bbox
- [ ] B7a Contract check: layout_guard.py <src> layout.json --strict (ref_width/ref_height = source pixels; source_bbox ↔ x/y/w/h consistent; catches half-res font sizes & all-bold)
- [ ] B7b Source-box review: placement_qa.py → open *_source_boxes.png, confirm each box wraps the real source object (guard can't catch mis-measured objects)
- [ ] B7c Frame-anchor recalibrate: re-check text glued to frame (title bars, buttons, bottom summary) against the generated frame.png anchors; record frame_anchor + dy_px
- [ ] B8  Compose: compose_pptx.py → .pptx + --preview-dir
- [ ] B9  QA: verify manifest backends; visual_compare_qa.py side_by_side/blend/diff_heatmap + Claude vision compare source↔preview; ≥1 layout recalibration round; no missing/duplicate/cut icons, no overflow text
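
The coordinate math in B6 and B7 comes down to dividing source-pixel measurements by the source image size. The sketch below shows that conversion; `bbox_to_fraction` and `text_size_ratio` are hypothetical helper names, and dividing y and h by `ref_height` is an assumption that extends the `x=x_px/ref_width` pattern given in B6 (references/image-to-pptx.md holds the full math).

```python
def bbox_to_fraction(x_px, y_px, w_px, h_px, ref_width, ref_height):
    """Convert a source-pixel bbox to fractions of the source image size."""
    return {
        "x": x_px / ref_width,
        "y": y_px / ref_height,
        "w": w_px / ref_width,
        "h": h_px / ref_height,
        "units": "fraction",
    }


def text_size_ratio(text_height_px, ref_height):
    """size_ratio = text_height_px / ref_height, as stated in B7."""
    return text_height_px / ref_height


# A 1920x1080 source with an icon measured at x=192, y=108, w=384, h=216
# and a title whose glyphs are 54 px tall.
icon = bbox_to_fraction(192, 108, 384, 216, ref_width=1920, ref_height=1080)
ratio = text_size_ratio(54, ref_height=1080)
print(icon, ratio)
```

Because both values are relative to the source size, measuring on a half-resolution preview while keeping full-resolution `ref_width`/`ref_height` halves every fraction, which is the kind of mismatch B7a is meant to catch.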

Stack order (bottom→top): background → frame → icons → texts. If the user requests frame split: background → frame_parts (icons[] role:"frame_part") → icons → texts. shapes carry no visual content by default.
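
The two orderings can be expressed as one function over the layout. This is a sketch under assumptions: the `"frame"` key and the `role:"frame_part"` marker come from the text, but the `"background"`, `"icons"`, and `"texts"` key names and the `stack_order` helper are hypothetical, and the actual ordering is done by compose_pptx.py.

```python
def stack_order(layout, split_frame=False):
    """Return (kind, item) pairs from bottom to top in the stated order."""
    icons = layout.get("icons", [])
    frame_parts = [i for i in icons if i.get("role") == "frame_part"]
    plain_icons = [i for i in icons if i.get("role") != "frame_part"]

    ordered = [("background", layout["background"])]
    if split_frame:
        # User asked for movable frame pieces: parts replace the whole frame.
        ordered += [("frame_part", part) for part in frame_parts]
    else:
        ordered.append(("frame", layout["frame"]))
    ordered += [("icon", icon) for icon in plain_icons]
    ordered += [("text", text) for text in layout.get("texts", [])]
    return ordered


layout = {
    "background": "background.png",
    "frame": "frame.png",
    "icons": [{"name": "rocket"}, {"name": "title_bar", "role": "frame_part"}],
    "texts": [{"text": "Roadmap"}],
}
print([kind for kind, _ in stack_order(layout)])
print([kind for kind, _ in stack_order(layout, split_frame=True)])
```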

Scripts & deps

| Script | Role |
| --- | --- |
| `probe_palette.py` | detect green, recommend safe key color |
| `chroma_key.py` | color/line-preserving de-key (`frame-safe`/`icon-safe`, `--scale 2`) |
| `slice_grid.py` | `--auto`/`--grid` slice icon sheet; `--components` split frame into parts |
| `frame_parts_to_icons.py` | (opt) map frame-part bboxes → layout.icons[] role:"frame_part" |
| `layout_guard.py` | coordinate/font contract validation (`--strict`) |
| `placement_qa.py` | draw bboxes on source + preview |
| `visual_compare_qa.py` | side-by-side / blend / diff-heatmap for vision QA |
| `compose_pptx.py` | compose background+frame+icons+texts → editable .pptx (`--preview-dir`) |
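
To make the key-color decision in B1 concrete, the sketch below picks the first candidate that stays far enough from every color on the page. It is not the algorithm inside `probe_palette.py`, which is not shown here: the Euclidean RGB distance, the threshold of 120, and the helper names are all assumptions, and only the two candidate colors come from the text.

```python
# Candidates in preference order; both values are named in step B1.
CANDIDATE_KEYS = ["#00ff00", "#ff00ff"]


def hex_to_rgb(hex_color):
    """'#00ff00' -> (0, 255, 0)."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def min_distance(key_rgb, palette):
    """Smallest Euclidean RGB distance between the key and any page color."""
    return min(
        sum((a - b) ** 2 for a, b in zip(key_rgb, color)) ** 0.5
        for color in palette
    )


def pick_key_color(palette, candidates=CANDIDATE_KEYS, threshold=120.0):
    """First candidate at least `threshold` from all page colors, else the farthest."""
    scored = [(min_distance(hex_to_rgb(c), palette), c) for c in candidates]
    for distance, color in scored:
        if distance >= threshold:
            return color
    return max(scored)[1]


# palette: dominant page colors as RGB tuples (must be non-empty).
print(pick_key_color([(255, 255, 255), (20, 30, 80)]))
print(pick_key_color([(255, 255, 255), (30, 200, 40)]))
```

A page with white and dark blue keeps the default green key, while a page containing a green accent falls through to magenta.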

Install dependencies with `pip3 install python-pptx pillow numpy`. Input slide images usually come from gorden-ppt-gen; for a one-shot topic→editable run, use gorden-super-ppt.