jlpt-n5-listening-variation-creator - SKILL.md Agent Skill

name: jlpt-n5-listening-variation-creator
description: "Create variations of real JLPT N5 listening questions from YouTube clips. Reads data.json (and optional PNG) from tobeprocessed/, applies surgical entity swap, rewrites Japanese dialogue, generates Imagen 3 image prompts, writes TTS script, outputs derived-data.json, and leaves the folder in tobeprocessed/ for manual approval and the tester skill. Use when the user asks to 'process a clip', 'create a variation', or 'generate listening question from clip'."

JLPT N5 Listening Variation Creator

You are an expert JLPT N5 content engineer. Your goal is to create original variation listening questions by swapping entities in real clips while strictly preserving the underlying logic and structure.

Core Directives

CRITICAL: Concurrency Control (Locking): Before ANY other action, you MUST acquire an atomic lock.
- List folders in tobeprocessed/.
- Select a folder that contains NEITHER processing.lock/ NOR derived-data.json.
- Immediately run: mkdir "path/to/clip_folder/processing.lock".
- If the command fails, the folder is already taken. Start over and select a different folder.
- Never read data.json or start analysis without a successful lock.
CRITICAL: STOP FOR MANUAL APPROVAL: After generating image.png and performing self-validation (Step 10), you MUST STOP. Show the image and the dialogue to the user and WAIT for explicit approval. Never run the tester skill, never generate audio, and never move the folder without the user saying "Onaylıyorum" or "Approved".
Preserve Logic Pattern: If the source is a "Reconsideration" question, the variation must remain a "Reconsideration" question.
Surgical Entity Swap: Change entities (objects, colors, quantities, times, locations) using backend/vocabulary/data/n5_vocabulary.json as the primary N5 vocabulary source.
Read PNG When Available: If a Screenshot PNG is present in the clip folder, read it to understand the original question's visual composition and spatial layout.
No Audio Analysis: Do NOT attempt to read or analyze audio.mp3.
Output to derived-data.json: Save the variation as derived-data.json inside the clip folder.
Handover to Tester: After finishing Step 10 (JSON and Image ready), do NOT move the folder. Leave it in tobeprocessed/ for the jlpt-n5-listening-variation-tester skill to finalize and move.
Self-Validate: After writing derived-data.json, run the validator script (Step 8). After generating image.png, visually check the image (Step 10). Fix any errors before stopping for approval.
Clean Logic Formatting: When referencing panels in logic.tr or logic.en, use ONLY the panel number in parentheses, e.g., (3). NEVER use the word "Panel" or labels like "Distractor_A".
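
The Clean Logic Formatting rule can be checked mechanically before saving. The following is a minimal sketch, not part of the skill's scripts: the function name `check_logic_text` is hypothetical, and it only flags the word "Panel" and labels shaped like "Distractor_A".

```python
import re

# Flags the word "Panel"/"Panels" and labels such as "Distractor_A" (case-insensitive).
FORBIDDEN = re.compile(r"\bpanels?\b|\bdistractor_\w+", re.IGNORECASE)


def check_logic_text(text: str) -> list:
    """Return one message per forbidden label found in a logic.tr / logic.en string."""
    return [f"forbidden label: {m.group(0)!r}" for m in FORBIDDEN.finditer(text)]


if __name__ == "__main__":
    # Allowed style: bare panel numbers in parentheses.
    print(check_logic_text("The red bag is (3); (1) is too big."))
    # Disallowed style: the word "Panel".
    print(check_logic_text("Panel 3 is correct"))
```

An empty list means the string follows the rule; anything else should be rewritten to the `(N)` form.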

Input Structure

backend/listening/data/selectImage/listening-youtube-data/tobeprocessed/
  clip_XX_XXmXXs_XXmXXs/
    audio.mp3          ← IGNORE — do not read
    data.json          ← PRIMARY INPUT: dialogue, logic, analysis
    Screenshot *.png   ← READ if present: understand visual composition
    .done_slice        ← IGNORE
    processing.lock/   ← LOCK: If present, another agent is working here

Output Structure

backend/listening/data/selectImage/listening-youtube-data/tobeprocessed/
  clip_XX_XXmXXs_XXmXXs/
    ...
    derived-data.json        ← NEW: your output
    image.png                ← NEW: your output
    processing.lock/         ← REMOVE only if you abort; keep if finished

Workflow

Step 1 — INGEST & LOCK

If the user provides a clip folder name, verify it doesn't have processing.lock/ or derived-data.json.
If no folder is provided, list tobeprocessed/ and select the first folder that:
- Does NOT contain processing.lock/ (directory).
- Does NOT contain derived-data.json.

CRITICAL (Atomic Lock): Immediately attempt to create a directory named processing.lock inside the folder:

mkdir "backend/listening/data/selectImage/listening-youtube-data/tobeprocessed/<clip_folder>/processing.lock"

If the mkdir command fails, it means another agent beat you to it. Start Step 1 again and pick the next available folder.
Once the lock is acquired, read data.json to understand: dialogue, logic pattern, key entities.
If a Screenshot PNG is present, read it to understand the original visual scene composition.
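
The select-then-lock procedure in this step can be sketched in Python as follows. This is an illustration under assumptions, not the skill's own code: `try_lock` and `claim_next_clip` are hypothetical helper names. The sketch relies on `os.mkdir` failing with `FileExistsError` when the directory already exists, which is what makes the lock atomic.

```python
import os
from pathlib import Path
from typing import Optional


def try_lock(clip_dir: Path) -> bool:
    """Try to claim clip_dir by creating processing.lock/; return True on success."""
    try:
        os.mkdir(clip_dir / "processing.lock")
        return True
    except FileExistsError:
        # Another agent created the lock first.
        return False


def claim_next_clip(root: Path) -> Optional[Path]:
    """Return the first clip folder that could be locked, or None if none is free."""
    for clip_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if (clip_dir / "derived-data.json").exists():
            continue  # already processed
        if try_lock(clip_dir):
            return clip_dir
    return None
```

No separate "does processing.lock exist?" check is needed before `mkdir`: the failed `mkdir` is itself the check, so there is no gap between testing and claiming.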

Step 2 — ANALYZE

Identify which of the 6 logic patterns this question uses (see references/n5-listening-patterns.md).
Identify the critical entities (the objects/attributes being swapped in the traps).
If PNG was read, determine the image_type using the detection criteria in references/imagen3-prompting-guide.md:
- four_panel_grid — 4 separate equal panels in a 2×2 grid, small numbers 1–4 in the top-left corner of each panel
- numbered_scene — single scene with small position numbers 1–4 inside it
- map_diagram — top-down street/area map with position numbers 1–4 on buildings
- Default to four_panel_grid if no PNG is present.

Step 3 — SUBSTITUTE (Surgical Swap)

Select new N5-level entities from backend/vocabulary/data/n5_vocabulary.json.
Swap entities while keeping the same logic pattern and trap structure.
Colors are valid attributes — Imagen 3 generates full-color images.
Verify the trap logic still works with the new entities: the correct option must satisfy every criterion, and each distractor should fail one or two criteria.
The variation must be meaningfully different from the source clip.
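
The trap check above can be expressed as a small test. This is a sketch under assumptions: the attribute dictionaries and the names `failed_criteria` and `check_traps` are hypothetical and are not part of the derived-data.json schema.

```python
def failed_criteria(option: dict, criteria: dict) -> list:
    """List the criteria keys whose value in option differs from the wanted value."""
    return [key for key, wanted in criteria.items() if option.get(key) != wanted]


def check_traps(options: dict, criteria: dict, answer: int) -> list:
    """Return problems: the answer must fail nothing, each distractor 1 or 2 criteria."""
    problems = []
    for panel, option in sorted(options.items()):
        misses = failed_criteria(option, criteria)
        if panel == answer:
            if misses:
                problems.append(f"({panel}) is the answer but fails {misses}")
        elif not 1 <= len(misses) <= 2:
            problems.append(f"({panel}) fails {len(misses)} criteria, expected 1 or 2")
    return problems


if __name__ == "__main__":
    criteria = {"object": "かばん", "color": "あかい", "size": "おおきい"}
    options = {
        1: {"object": "かばん", "color": "くろい", "size": "おおきい"},
        2: {"object": "かばん", "color": "あかい", "size": "ちいさい"},
        3: {"object": "かばん", "color": "あかい", "size": "おおきい"},
        4: {"object": "ぼうし", "color": "あかい", "size": "ちいさい"},
    }
    print(check_traps(options, criteria, answer=3))
```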

Step 4 — REWRITE DIALOGUE

Write a new Japanese dialogue using only N5 grammar and vocabulary.
Preserve the same grammar structures from the original (only entities change).
Provide a Turkish translation for the intro, dialogue lines, and the question for the transcription_tr section.

Step 5 — GENERATE IMAGE PROMPT (Imagen 3 / Nano Banana)

Follow references/imagen3-prompting-guide.md rules strictly.
Use the image_type determined in Step 2 to select the correct prompt template:
- four_panel_grid → composite 2×2 grid prompt, small numbers 1–4 in the top-left corner of each panel
- numbered_scene → single scene prompt with position numbers 1–4
- map_diagram → top-down map prompt with position numbers 1–4
Record image_type in visual_prompts.image_type.
Record the panel-to-role mapping in panel_map (decide correct panel position — vary it across questions).

Step 6 — WRITE TTS SCRIPT

Follow references/tts-guidelines.md rules strictly.
Voices: Intro_Voice, Male_1, Female_1.
Required sequence: intro → 1s break → dialogue (0.5s between turns) → 1s break → question repeat.
No mixed objects: each entry has EITHER voice+text OR break, never both.
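
The required sequence can be assembled as in the sketch below. Several details are assumptions and should be checked against references/tts-guidelines.md: the key names `voice`, `text`, and `break`, the break value format (`"1s"`, `"0.5s"`), and the choice of Intro_Voice for the question repeat. The function names are hypothetical.

```python
def build_tts_script(intro: str, turns: list, question: str) -> list:
    """Build: intro, 1s break, dialogue with 0.5s between turns, 1s break, question."""
    script = [{"voice": "Intro_Voice", "text": intro}, {"break": "1s"}]
    for index, (voice, text) in enumerate(turns):
        if index > 0:
            script.append({"break": "0.5s"})  # pause between dialogue turns
        script.append({"voice": voice, "text": text})
    script.append({"break": "1s"})
    # Assumption: the question repeat is read by the intro voice.
    script.append({"voice": "Intro_Voice", "text": question})
    return script


def is_mixed(entry: dict) -> bool:
    """True if an entry carries both speech (voice/text) and a break."""
    has_speech = "voice" in entry or "text" in entry
    return has_speech and "break" in entry
```

Running `is_mixed` over every entry before saving catches violations of the "no mixed objects" rule.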

Step 7 — OUTPUT derived-data.json

Write derived-data.json to the clip folder inside tobeprocessed/.
Include both transcription (Japanese) and transcription_tr (Turkish) sections.
Follow the schema in references/derived-data-schema.md exactly.

Step 8 — SELF-VALIDATE JSON

Run the mechanical validator script using the skill's specific environment:

skills/jlpt-n5-listening-variation-creator/venv/bin/python3 skills/jlpt-n5-listening-variation-tester/scripts/validate_derived_data.py <clip_folder>/derived-data.json

If PASS: proceed to Step 9.
If FAIL: read the error output, identify which check(s) failed, fix the JSON accordingly, re-save, and re-run the validator.
Maximum 3 attempts. If still failing after 3 attempts, stop and report the remaining errors to the user.
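
The validate, fix, re-run loop with its three-attempt cap can be sketched as below. This is illustrative only: `validate_with_retries` and `run_validator` are hypothetical names, and `run_validator` assumes the validator script signals PASS with exit code 0, which the source does not state.

```python
import subprocess
from typing import Callable, Tuple

MAX_ATTEMPTS = 3


def run_validator(python: str, script: str, json_path: str) -> Tuple[bool, str]:
    """Run the validator; assumes (unverified) that exit code 0 means PASS."""
    result = subprocess.run([python, script, json_path],
                            capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def validate_with_retries(validate: Callable[[], Tuple[bool, str]],
                          fix: Callable[[str], None]) -> Tuple[bool, str]:
    """Validate up to MAX_ATTEMPTS times, calling fix(output) after each failure."""
    output = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        ok, output = validate()
        if ok:
            return True, output
        if attempt < MAX_ATTEMPTS:
            fix(output)  # repair derived-data.json based on the error output
    return False, output  # report the remaining errors to the user
```

Passing `validate` and `fix` as callables keeps the attempt cap separate from how the JSON is actually repaired.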

Step 9 — GENERATE IMAGE

Run the image generation script using the skill's specific environment to create image.png in the clip folder:

skills/jlpt-n5-listening-variation-creator/venv/bin/python3 skills/jlpt-n5-listening-variation-creator/scripts/generate_image.py <clip_folder_name>

The script reads visual_prompts.image_prompt from derived-data.json, calls gemini-2.5-flash-image, and saves image.png inside the clip folder.
Requires JLPT_IMAGE_GEMINI_API_KEY to be set in the environment.
Wait for confirmation that image.png was saved successfully.
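
Because each generation call costs API credits, it can help to check the prerequisites before running the script and to confirm the result afterwards. The sketch below is an assumption-laden illustration: `preflight` and `image_saved` are hypothetical helpers, not part of generate_image.py.

```python
import os
from pathlib import Path
from typing import Mapping


def preflight(clip_dir: Path, env: Mapping = os.environ) -> list:
    """Return the problems that would prevent image generation from running."""
    problems = []
    if not env.get("JLPT_IMAGE_GEMINI_API_KEY"):
        problems.append("JLPT_IMAGE_GEMINI_API_KEY is not set")
    if not (clip_dir / "derived-data.json").is_file():
        problems.append("derived-data.json is missing")
    return problems


def image_saved(clip_dir: Path) -> bool:
    """True if image.png exists in the clip folder and is not empty."""
    image = clip_dir / "image.png"
    return image.is_file() and image.stat().st_size > 0
```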

Step 10 — SELF-VALIDATE IMAGE & STOP

Read the generated image.png from the clip folder.
Verify:
- image_type match: layout matches declared type (four_panel_grid / numbered_scene / map_diagram)
- Panel content: correct panel shows the answer, distractors show wrong alternatives
- Style: colorful minimalist illustration (or monochrome if requested), flat colors, thick clean outlines, no shading, white background, no borders
If any check fails during self-validation: delete image.png, re-run generate_image.py, and re-check.
Maximum 2 attempts (image generation costs API credits). If the image still fails after 2 attempts, stop and report the issue to the user.
CRITICAL: STOP HERE once the checks pass.
Display the generated image.png, the Japanese dialogue, and the Turkish translation to the user.
WAIT for the user to provide explicit approval (e.g., "Onaylıyorum" or "Approved").
DO NOT run Pass 3.5, Pass 4, Pass 5, or Pass 6 of the tester skill until approval is received.
If the user requests changes to the image or JSON, apply them and repeat Steps 9 and 10 as needed.
Leave the folder in tobeprocessed/. Notify the user that the variation (JSON and image) is ready for manual review. Once the user approves, they will use the jlpt-n5-listening-variation-tester skill to finalize it.

Resources

| Resource | Purpose |
|---|---|
| `backend/vocabulary/data/n5_vocabulary.json` | N5 vocabulary source for entity selection |
| `references/imagen3-prompting-guide.md` | Imagen 3 prompt engineering rules |
| `references/derived-data-schema.md` | Output JSON schema |
| `references/n5-listening-patterns.md` | 6 logic patterns with trap design rules |
| `references/tts-guidelines.md` | TTS voice/break formatting rules |
| `references/n5-grammar-points.md` | N5 grammar reference |
| `scripts/generate_image.py` | Calls Gemini API to generate `image.png` from `image_prompt` |
| `../jlpt-n5-listening-variation-tester/scripts/validate_derived_data.py` | Mechanical JSON validator (used in self-validation loop) |