collect-jd

name: collect-jd
description: Use when collecting, curating, or organizing job descriptions (JDs). Triggers include "JD 모으고 있어", "JD 수집", "JD 큐레이션", "JD 정리하고 있어", "오늘 수집 정리해줘", "오늘 본 JD", "관리 중인 JD", "쌓아둔 JD", "내 프로필에 맞는 JD 쌓아줘", "내 이력에 맞는 JD 큐레이션", and "싹 돌려" (in JD rescan context). Do NOT trigger on discovery phrases claimed by resume-apply ("JD 찾아줘", "JD 골라줘", "공고 뭐 있지", "지원할 곳", "어디 넣을까"). Skill maintains project-scoped state at `$OMT_DIR/collect-jd/` (never global).

Canonical principle: Verification stage MUST be on the canonical path. No bypass routes. Every discovered URL goes through the same flow; "fast paths" that skip verification are forbidden.

Dedicated skill for JD collection, curation, and organization. Specific rules are added through Phase B pressure scenario cycles (TDD RED-GREEN-REFACTOR).

Scope Boundary

collect-jd: JD discovery · collection · curation · organization (this skill)
resume-apply: skill that consumes already-recorded JDs (this skill does not participate)
review-resume: resume review (this skill does not participate)
resume-forge: resume material mining (this skill does not participate)

MANDATORY: Gate Task Creation

At skill invocation start, pre-create 8 named gate tasks via TaskCreate. Each task is the source of truth for gate completion; the per-source ledger is the source of truth for per-item progress.

→ Details: reference/bootstrap.md

State Location

All state under $OMT_DIR/collect-jd/ only. $OMT_DIR is read from the environment; this skill must not compute it directly. If $OMT_DIR is unset, abort + recovery guidance — global fallback forbidden. Forbidden Paths: ~/.omt/global/**, ~/.omt/<other-project>/collect-jd/**, /tmp/**, and any absolute path outside $OMT_DIR.

→ Details (rejection protocol, rationalization loopholes): reference/bootstrap.md#state-location--forbidden-paths
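The state-location rule above could be enforced with a small guard like the following. This is a minimal sketch, not the skill's actual implementation: `resolve_state_dir`, `ensure_inside_state`, and `StateLocationError` are hypothetical names, and the guard only checks that a path stays under the state directory (which also rejects the forbidden locations listed above).

```python
import os
from pathlib import Path


class StateLocationError(RuntimeError):
    """Hypothetical error raised when state would land outside $OMT_DIR."""


def resolve_state_dir(env=None) -> Path:
    """Return $OMT_DIR/collect-jd, aborting when $OMT_DIR is unset."""
    env = os.environ if env is None else env
    omt_dir = env.get("OMT_DIR")
    if not omt_dir:
        # No global fallback: fail loudly so the caller can show recovery guidance.
        raise StateLocationError("OMT_DIR is unset; set it for this project and retry")
    return Path(omt_dir) / "collect-jd"


def ensure_inside_state(path, state_dir: Path) -> Path:
    """Reject any path that resolves outside the state directory."""
    resolved = Path(path).resolve()
    root = state_dir.resolve()
    if resolved != root and root not in resolved.parents:
        raise StateLocationError(f"forbidden path outside state dir: {resolved}")
    return resolved
```

Calling `ensure_inside_state` before every write keeps the check in one place instead of relying on each caller to remember the forbidden list.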

Session Lock (MANDATORY)

Atomic .lock file with PID + liveness check (kill -0); single-writer per session. Acquire at Gate 1, release at Gate 8 (after Coverage Gate passes).

→ Details: reference/bootstrap.md
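As an illustration of the lock described above, here is a minimal POSIX-only sketch. The function names are hypothetical, and treating an unreadable lock file as stale is an assumption of this sketch (the canonical rules live in reference/bootstrap.md).

```python
import os


def pid_alive(pid: int) -> bool:
    """Liveness probe, the Python equivalent of `kill -0 <pid>`."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def acquire_lock(lock_path: str) -> bool:
    """Take the session lock; return False when a live process holds it."""
    for _ in range(2):  # the second pass runs only after a stale lock is cleared
        try:
            # O_EXCL makes creation atomic: exactly one caller can succeed.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                with open(lock_path, encoding="utf-8") as f:
                    holder = int(f.read().strip())
            except (FileNotFoundError, ValueError):
                holder = 0  # assumption: an unreadable lock is treated as stale
            if holder > 0 and pid_alive(holder):
                return False
            try:
                os.remove(lock_path)
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        return True
    return False


def release_lock(lock_path: str) -> None:
    """Remove the lock, but only if this process owns it."""
    try:
        with open(lock_path, encoding="utf-8") as f:
            owner = f.read().strip()
    except FileNotFoundError:
        return
    if owner == str(os.getpid()):
        os.remove(lock_path)
```

In this sketch `acquire_lock` would be called at Gate 1 and `release_lock` at Gate 8, after the Coverage Gate passes.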

Storage Backend Interview (MANDATORY)

On first run, AskUserQuestion is mandatory to collect platform + how. Silent default to filesystem is forbidden.

→ Details: reference/bootstrap.md

Atomic Write Pattern (MANDATORY)

Write to .tmp → fsync → rename for all single-file writes; never partial overwrites. Mandatory at JD file persist, sources.yaml updates, and seen.jsonl appends.

→ Details: reference/bootstrap.md
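The .tmp → fsync → rename sequence above can be sketched as follows. `atomic_write` is a hypothetical helper name, and the `.tmp` suffix placement (appended to the full target path) is an assumption.

```python
import os


def atomic_write(path: str, data: str) -> None:
    """Write data so readers see either the old file or the new one, never a partial."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush to disk before the rename
    os.replace(tmp, path)      # atomic rename over the destination on POSIX
```

Because the temporary file sits next to the target, the rename stays on one filesystem, which is what makes the final step atomic.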

Ingest Paths (5)

Direct URL input
Text paste
File or folder path
Company name (only within sites registered in sources.yaml)
Batch rescan ("싹 돌려")

Before each Ingest Path execution, Phase 0 profile interview + Dedup L1/L2 must be performed without exception.

Sources Registration (MANDATORY)

At session start, load $OMT_DIR/collect-jd/sources.yaml. If empty or absent, propose via a single AskUserQuestion: "Do you have JD source sites to register?" (skippable — not as mandatory as Profile Interview). When user provides a URL, atomic append with {slug, name, careers_url, added_at, pagination, crawl_state, ingest} structure.

Source-level Ingest Config (ingest): schema = {detail_required_before_persist: bool}. Default false. When true, the source's Full Coverage Ingest Protocol MUST run Tier 2 (detail body fetch) for every JD before persist — Tier 1 immediate persist is FORBIDDEN. See Full Coverage Ingest Protocol.

Reusable Crawl: When user utterance contains trigger phrases "오늘 돌려" / "싹 돌려" / "전체 재크롤" / "sources 돌려" etc. → iterate all registered sources → perform Listing Pagination per source → per-JD L1 evaluation (Algorithm B) + Dedup Gate + Classify + Persist. No automatic scheduling.

CRITICAL: Open-web free crawl when sources.yaml is empty is forbidden. Even on user "싹 돌려" utterance, if source count is 0, report "등록된 소스가 없어요" and prompt registration.

→ Details: reference/dedup-and-discovery.md#sources-registration

Listing Pagination (MANDATORY, single-path)

Single source of truth: pagination.how. All listing discovery — first-time auto-detect, user-interview fallback, cached re-execution, invalidation re-interview — collapses into one algorithm discover_listing(source).

Algorithm:

If pagination.how absent → try Tier A 9-pattern catalog → success: serialize as how={origin: auto, pattern, params}. fail: AskUserQuestion → how={origin: interview, pattern, params, prose}.
Execute pagination.how.
On execution failure → invalidate per trigger table (transient: 1-retry; structural: immediate). Push current how to previous_how 3-slot ring → AskUserQuestion 3-option (new method / Tier A retry / skip). On retry-fail → raise (silent empty forbidden).

Schema: pagination.how = { origin: auto|interview, pattern: <13-enum>, params: {}, prose: <free-form> }. previous_how: [] inline ring (LRU, max 3). invalidated_at: null ISO timestamp.

CRITICAL:

First-run discovery skipping Tier A 9-pattern catalog → AskUserQuestion shortcut forbidden.
discover_listing returning empty [] on execution failure forbidden — must raise; caller skip-with-audit, no Per-Site Memory false-clean.
Coverage Verification fires after was_invalidated: false (success path) only. raise → skip Coverage + Per-Site Memory update.

→ Details (γ schema, 13-pattern catalog, invalidation trigger table, previous_how ring, raise-on-failure contract, 2 new loopholes): reference/dedup-and-discovery.md#listing-pagination
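The control flow of discover_listing might look roughly like this. It is a simplified sketch under stated assumptions: `try_tier_a`, `interview`, and `execute` are hypothetical callables standing in for the Tier A catalog, AskUserQuestion, and the actual scraper, and the transient 1-retry and the 3-option re-interview are omitted. What it does show is the ordering (Tier A before interview), the 3-slot previous_how ring, and raising instead of returning an empty list.

```python
from datetime import datetime, timezone

MAX_PREVIOUS_HOW = 3  # size of the previous_how ring


class ListingDiscoveryError(RuntimeError):
    """Raised instead of returning [] when executing pagination.how fails."""


def discover_listing(source, try_tier_a, interview, execute):
    """Single path for listing discovery; mutates source['pagination'] in place."""
    pagination = source.setdefault("pagination", {})
    if not pagination.get("how"):
        how = try_tier_a(source)        # Tier A catalog must be tried first
        if how is None:
            how = interview(source)     # fall back to the user interview
        pagination["how"] = how
    try:
        return execute(pagination["how"])
    except Exception as exc:
        # Invalidate: newest failed method goes to the front, ring keeps max 3.
        ring = pagination.setdefault("previous_how", [])
        ring.insert(0, pagination.pop("how"))
        del ring[MAX_PREVIOUS_HOW:]
        pagination["invalidated_at"] = datetime.now(timezone.utc).isoformat()
        raise ListingDiscoveryError(f"listing discovery failed: {exc}") from exc
```

A caller that catches `ListingDiscoveryError` would skip the source with an audit entry and leave Coverage Verification and Per-Site Memory untouched.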

Listing Pagination Coverage Verification (MANDATORY, 3-check)

Three checks: declared-total match, scroll stability, infinite-scroll absence. Without coverage_proof, batch_run_completed=true is forbidden.

→ Details: reference/dedup-and-discovery.md

Per-Site Crawl Memory (MANDATORY)

Maintain per-source crawl memory in sources.yaml.<source>.crawl_state (3 sub-groups) and $OMT_DIR/collect-jd/crawl_state/<source>/seen.jsonl (append-only file).

Storage layout:

$OMT_DIR/collect-jd/crawl_state/<source>/seen.jsonl — one JSON object per line, each < 1 KB. Append via POSIX open(path, 'a'). Session-lock guarantees single-writer.
Line schema: {"id": "...", "url": "...", "processed_at": "<ISO8601>", "verdict": "included|excluded|ambiguous", "role_title": "..."}
The id field is a deterministic key derived from the per-site identifier_kind strategy — NOT an auto-generated UUID.

Per-Site Crawl Memory schema (sources.yaml <source>.crawl_state sub-keys + seen.jsonl line schema): see reference/dedup-and-discovery.md — do not restate inline.

Each source records its id extraction strategy in sources.yaml.<source>.crawl_state.seen via two fields: identifier_kind (strategy enum) + identifier_extractor (param name for id_query, null for url, hash spec for fingerprint).
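To make the two fields concrete, here is a hedged sketch of deriving the deterministic id. `derive_id` is a hypothetical helper; returning the URL unchanged for the `url` kind is an assumption (the real rule may apply URL normalization first), and `fingerprint` is left unimplemented because its hash spec is defined per source.

```python
from urllib.parse import parse_qs, urlparse


def derive_id(url: str, identifier_kind: str, identifier_extractor=None) -> str:
    """Derive a deterministic seen.jsonl id from a JD URL."""
    if identifier_kind == "id_query":
        # identifier_extractor names the query parameter that carries the id.
        values = parse_qs(urlparse(url).query).get(identifier_extractor)
        if not values:
            raise ValueError(f"query parameter {identifier_extractor!r} not found in {url}")
        return values[0]
    if identifier_kind == "url":
        return url  # assumption: the URL itself serves as the key
    if identifier_kind == "fingerprint":
        raise NotImplementedError("fingerprint ids depend on the per-source hash spec")
    raise ValueError(f"unknown identifier_kind: {identifier_kind}")
```

The same URL always yields the same id, which is what lets seen.jsonl act as a lookup index across sessions.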

Re-crawl algorithm (Algorithm B canonical): every discovered URL → L1 → 1 of 4 terminal states (new_ingest|touch_only|ttl_recheck|manual_skip). seen.jsonl is audit/lookup index, NOT a pre-L1 exclusion gate. Drift detection (seen_hit + L1_miss / L1_hit + seen_miss) is mandatory.

→ Details: reference/dedup-and-discovery.md
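The drift check named in Algorithm B compares two lookups for the same URL. A minimal sketch, with a hypothetical function name and hypothetical label strings:

```python
from typing import Optional


def detect_drift(seen_hit: bool, l1_hit: bool) -> Optional[str]:
    """Return a drift label when seen.jsonl and the L1 gate disagree, else None."""
    if seen_hit and not l1_hit:
        return "seen_hit+L1_miss"   # seen.jsonl knows the JD but no JD file matches
    if l1_hit and not seen_hit:
        return "L1_hit+seen_miss"   # a JD file exists but seen.jsonl has no record
    return None
```

Note that the result is only reported; in line with the rule above, a seen.jsonl hit never short-circuits the L1 evaluation.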

Per-Source Ledger (MANDATORY)

One ledger file per source per session, used by Coverage Gate (Gate 8). Without it, Gate 8 cannot pass. The row schema, lifecycle (Gates 4-7), and Coverage Gate algorithm are defined canonically in dedup-and-discovery.md.

→ Details: reference/dedup-and-discovery.md

Detail Split Auto Fan-out (MANDATORY)

When a posting page advertises N positions in a single anchor, split into N separate JD files with parent_url + sub_position (presence-coupled).

Team-level granularity is also mandatory. When the body lists team labels under a subsidiary, fan out per subsidiary × team, using the format sub_position = '<subsidiary> / <team>'. A single-word stack tag (Kotlin, etc.) is not a team label.

→ Details: reference/ingest-and-curation.md
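The fan-out rule could be sketched as below. `fan_out` and `team_sub_position` are hypothetical names, and the sketch assumes that a posting with fewer than two positions is saved as a single record with neither field set (its reading of "presence-coupled").

```python
def team_sub_position(subsidiary: str, team: str) -> str:
    """Format a team-level sub_position as '<subsidiary> / <team>'."""
    return f"{subsidiary} / {team}"


def fan_out(base: dict, parent_url: str, sub_positions: list) -> list:
    """Split one posting into one record per advertised position."""
    if len(sub_positions) < 2:
        return [dict(base)]  # no split: parent_url and sub_position both absent
    # Every split record carries both fields together.
    return [
        {**base, "parent_url": parent_url, "sub_position": sub}
        for sub in sub_positions
    ]
```

For example, a body listing two teams under one subsidiary would produce two records whose sub_position values come from `team_sub_position`.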

Identifier Kind Heuristic (MANDATORY)

Mandatory on first source registration: choose id_query / url / fingerprint based on URL pattern; silent default forbidden.

→ Details: reference/dedup-and-discovery.md

Phase 0: Profile Interview Required (MANDATORY)

When $OMT_DIR/collect-jd/profile/profile.yaml is absent, a minimum 3-round profile interview (AskUserQuestion) is required before JD collection. Round 1: career history, years of experience, preferred domains. Round 2: tech stack, strengths. Round 3: company, salary, location, remote work, exclude preferences. After the interview, atomically write profile.yaml (including a version: 1 field). If the profile exists, proceed to normal collection. 5 rationalization patterns are blocked: urgency, being in a hurry, or having already received a URL are not valid reasons to skip the interview.

→ Details (rationalization loopholes, purpose explanation): reference/bootstrap.md#phase-0-profile-interview-required

Dedup (L1 URL-only + L2 LLM similarity)

Run dedup in L1 → L2 order before writing a new JD file (MANDATORY).

CRITICAL — Dedup Check Gate rules:

Even if jobs/ is empty, the L1 gate must be recorded as executed. "Skip because jobs is empty" is forbidden — trivial-pass must not be silently processed; must be explicitly logged as "L1 gate executed: 0 candidates".
Even when L2 conditions are not met (0 JDs for the same company_slug), record "L2 gate evaluated: not applicable" in audit.
Saving without running the dedup gate is forbidden. If fingerprint_check field is empty, reject the save.

→ Dedup Gate Enforcement details: reference/dedup-and-discovery.md#dedup-check-gate-enforcement

L1 = URL-only normalize match (single-key gate). L2 = LLM similarity check on L1 miss with same company_slug OR L1 hit + TTL exceeded. L2 outcomes (same:true|false) and persist actions are spec'd canonically in dedup-and-discovery.md.

→ Details: reference/dedup-and-discovery.md
→ Flow diagram (L1→L2 decision tree): reference/dedup-and-discovery.md#decision-flow
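The gate ordering and the always-log rule can be illustrated with a small planner. This is a sketch under assumptions: `plan_dedup` is a hypothetical name, URLs are assumed to be already normalized, each existing JD is a dict with `url`, `company_slug`, and a precomputed `ttl_exceeded` flag, and the "L2 gate evaluated: required" wording is invented for illustration (only the "not applicable" and "0 candidates" strings come from the rules above).

```python
def plan_dedup(url: str, company_slug: str, existing: list):
    """Decide whether L2 must run and produce the audit lines for both gates."""
    # L1 is always recorded, even when there are zero candidates.
    audit = [f"L1 gate executed: {len(existing)} candidates"]
    l1_hit = next((jd for jd in existing if jd["url"] == url), None)
    if l1_hit is not None:
        run_l2 = bool(l1_hit.get("ttl_exceeded"))   # L1 hit + TTL exceeded
    else:
        # L1 miss: L2 applies only when the same company already has JDs.
        run_l2 = any(jd["company_slug"] == company_slug for jd in existing)
    audit.append(
        "L2 gate evaluated: required" if run_l2 else "L2 gate evaluated: not applicable"
    )
    return run_l2, audit
```

Both audit lines are emitted on every call, so an empty jobs/ directory still leaves a record that the gate ran.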

Matching Loop (history → rules → filter) (MANDATORY)

3-phase verdict against profile/rules.yaml before saving each JD.

Phase 1: If same URL/slug pair exists in jobs/**/*.md → inherit status. Otherwise proceed to Phase 2.
Phase 2: Pinned prompt reference/ambiguity-prompt.md, temperature 0. match → status: included (auto). mismatch → status: excluded (auto, Exclude Flow rules apply). ambiguous → auto-verdict forbidden; must proceed to Phase 3.
Phase 3: AskUserQuestion — Korean question based on missing_signals. Options: include / exclude / defer. Call immediately even in Batch mode, no queuing.
Auto-decision audit trail: on auto-save, record auto:<verdict>:<rules.yaml sha256 short 8> in reason_note.

Note: Matching Loop is the verdict algorithm invoked inside each Full Coverage tier.

→ Details (rationalization loopholes, counterexample): reference/ingest-and-curation.md#matching-loop
→ Flow diagram (Phase 1→2→3 decision tree): reference/dedup-and-discovery.md#decision-flow
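The auto-decision audit string could be built like this. `auto_reason_note` is a hypothetical helper, and using the status values (included / excluded) as the `<verdict>` part is an assumption; the source only fixes the overall shape auto:<verdict>:<sha256 short 8>.

```python
import hashlib


def auto_reason_note(verdict: str, rules_yaml_bytes: bytes) -> str:
    """Build the audit-trail entry for an automatic Phase 2 verdict."""
    if verdict == "ambiguous":
        # Ambiguous verdicts are never auto-saved; they go to Phase 3.
        raise ValueError("ambiguous verdicts must be resolved by the user")
    short = hashlib.sha256(rules_yaml_bytes).hexdigest()[:8]
    return f"auto:{verdict}:{short}"
```

Hashing the exact bytes of rules.yaml ties each automatic verdict to the rule set that produced it, which is what a later rules re-evaluation needs to compare against.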

Full Coverage Ingest Protocol (MANDATORY, 3-tier)

Note: Full Coverage is the input-depth escalation ladder; Matching Loop runs inside each tier.

Process all JDs discovered from listing scrape without omission. Escalate in order from information exposed on the discovery screen.

Tier 1 — Listing Metadata Resolution: Extract role_tags from anchor.innerText in full (title + stack label + subsidiary badge, etc.) → immediately persist when a single rules.yaml rule triggers. Reading only the title is forbidden.

Tier 1 Eligibility (MANDATORY): Tier 1 immediate persist is allowed ONLY when sources.yaml.<source>.ingest.detail_required_before_persist: false (or absent — default false). When true, Tier 1 is FORBIDDEN: every JD MUST escalate to Tier 2 detail fetch before persist, and Detail Split Auto Fan-out check MUST run on the body. This eliminates the operational gap where multi-subsidiary or multi-position JDs would be silently saved as a single record without fan-out detection.

Why source-level (not per-JD heuristic): Listing-level signals (e.g., "외 N개 계열사" suffix) cannot reliably detect body-only fan-out signals. The per-source declarative config is the canonical decision point — uniform within a source, no runtime branching per JD.

Tier 2 — Detail Fetch Verification: MANDATORY escalation when Tier 1 is ambiguous. Playwright browser_navigate → extract body → re-judge. Persist when judgment is clear.
Tier 3 — User Interview: MANDATORY AskUserQuestion when ambiguity persists after Tier 2 (Korean question based on missing_signals, options: include/exclude/defer).

CRITICAL:

Obligation to obtain full anchor.innerText. Parsing only the title and ignoring stack labels is forbidden.
Silent skip at tier boundaries is forbidden: pending dump when Tier 1 is ambiguous is forbidden; pending dump when Tier 2 is ambiguous is forbidden.
Declaring batch_run_completed=true when processed_count < discovered_count is forbidden. Record batch_run_completed=false + pending_count=<N>.
T11 real violation: Toss Server Developer #197. The listing innerText contained "Kotlin · Java · Spring · Backend", which should have immediately triggered rules.yaml match rule #1, but it was missed due to title-only parsing.

→ Details (Tier 1/2/3 spec, decision flow chart, rationalization loopholes, counterexample): reference/ingest-and-curation.md#full-coverage-ingest-protocol
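Two of the checks above reduce to simple predicates. The following is a sketch with hypothetical function names; it models only the source-level Tier 1 eligibility flag and the processed/discovered count rule, not the separate coverage_proof requirement.

```python
def tier1_persist_allowed(source: dict) -> bool:
    """Tier 1 immediate persist is allowed only when the flag is false or absent."""
    ingest = source.get("ingest") or {}
    return not ingest.get("detail_required_before_persist", False)


def batch_status(discovered_count: int, processed_count: int) -> dict:
    """Compute the completion flag; unfinished batches report a pending count."""
    if processed_count > discovered_count:
        raise ValueError("processed_count cannot exceed discovered_count")
    pending = discovered_count - processed_count
    return {"batch_run_completed": pending == 0, "pending_count": pending}
```

Because eligibility is read from the source entry rather than inferred per JD, every JD from one source takes the same path.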

Exclude Flow (tags + reason_note MANDATORY)

When saving with status: excluded, simultaneously required: tags: [...] (minimum 1, tags.yaml emergent slug) + reason_note (verbatim user utterance, empty string forbidden). If missing, trigger Emergent tag interview before save: (1) collect reason (2) derive tag (top-3 candidates or new slug) (3) update tags.yaml (4) atomic write. This flow does NOT apply to included / ambiguous / pending.

→ Details (emergent tag interview, tags.yaml schema, loopholes, counterexample): reference/ingest-and-curation.md#exclude-flow
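A pre-save check for the rule above might look like this. It is a minimal sketch: `missing_exclude_fields` is a hypothetical name, and the frontmatter is assumed to be available as a plain dict.

```python
def missing_exclude_fields(frontmatter: dict) -> list:
    """Return the fields that must be collected before an excluded JD is saved."""
    if frontmatter.get("status") != "excluded":
        return []  # the flow does not apply to included / ambiguous / pending
    missing = []
    if not frontmatter.get("tags"):
        missing.append("tags")            # at least one tag is required
    if not (frontmatter.get("reason_note") or "").strip():
        missing.append("reason_note")     # an empty string is forbidden
    return missing
```

A non-empty result would trigger the Emergent tag interview before the save proceeds.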

Reversal (status change record) (MANDATORY)

When changing an existing file's status, prepend prev: <prev_status> @ <ISO8601 date> at the top of reason_note. Atomic write (.tmp → rename). Multiple reversals accumulate (prepend repeatedly; topmost = most recent). On rules re-evaluation: append (rules_reeval:<sha short 8>) suffix. No exceptions: first save · L1 last_checked_at update · L2 fingerprint_check update.

→ Details (rationalization loopholes): reference/ingest-and-curation.md#reversal
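The prepend rule can be sketched as a pure string function. `record_reversal` is a hypothetical name; separating entries with a newline and attaching the rules_reeval suffix to the new prev line are assumptions of this sketch.

```python
from datetime import date


def record_reversal(reason_note: str, prev_status: str, on: date, rules_sha8=None) -> str:
    """Prepend 'prev: <prev_status> @ <date>' so the topmost entry is the most recent."""
    line = f"prev: {prev_status} @ {on.isoformat()}"
    if rules_sha8:
        line += f" (rules_reeval:{rules_sha8})"  # marks a rules re-evaluation
    return f"{line}\n{reason_note}" if reason_note else line
```

Applying it twice stacks the entries, so the history of status changes reads top-down from newest to oldest.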

Manual Edit Safety

Batch rescan will never overwrite files whose frontmatter the user has manually edited. If any of the detection signals match (future last_checked_at · canonical contract violation [non-standard field OR value outside enum]), skip that file + add 수동 편집 감지: N건 line to the report.

→ Details: reference/ingest-and-curation.md#manual-edit-safety

Ingest Validation

Before WebFetch · file · text ingest, check only two things: body length (fail when under 200 chars) and stop signals (login/captcha/403, etc.). On failure: save forbidden + report "유효 JD 아닌 것으로 보임" error + record to $OMT_DIR/collect-jd/ingest-failures.log.

Use the insane-search skill for WebFetch.

→ Details: reference/ingest-and-curation.md#ingest-validation

Batch Mode Report Schema (MANDATORY)

On batch rescan completion, the last line of the response must exactly match this regex:

^신규: \d+건, 기존: \d+건, 업데이트: \d+건$

Zero counts must not be omitted. Format variations are forbidden. Record only actual aggregate results.

→ Details (definitions, examples, forbidden patterns, loopholes): reference/ingest-and-curation.md#batch-mode-report-schema
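A formatter and validator for the report line could be sketched as follows. The function names are hypothetical; the regex is the one given above, applied with `fullmatch` so that a trailing newline cannot slip through.

```python
import re

REPORT_RE = re.compile(r"^신규: \d+건, 기존: \d+건, 업데이트: \d+건$")


def format_report(new: int, existing: int, updated: int) -> str:
    """Render the final report line; zero counts are always written out."""
    return f"신규: {new}건, 기존: {existing}건, 업데이트: {updated}건"


def last_line_is_valid(response: str) -> bool:
    """Check that the last line of a batch response matches the schema exactly."""
    last_line = response.rstrip("\n").split("\n")[-1]
    return REPORT_RE.fullmatch(last_line) is not None
```

Generating the line through one formatter, rather than composing it ad hoc, is one way to rule out format variations.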

Role Tagging (MANDATORY)

Two fields required when saving a JD: role_title_verbatim (verbatim original title, no modification) + role_tags: [...] (LLM call, subset of taxonomy.yaml enum, temperature 0). Korean synonyms (백엔드/서버개발자/서버사이드) must include backend. On JSON parse failure: retry once; on 2nd failure, report error (saving empty array is forbidden).

→ Details (taxonomy baseline, LLM invocation contract, pinned prompt, loopholes, counterexample): reference/ingest-and-curation.md#role-tagging
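The retry contract for role_tags might be sketched like this. `tag_roles` is a hypothetical name and `call_llm` is a hypothetical zero-argument callable returning the raw model reply; failing immediately on tags outside the taxonomy (rather than retrying) is an assumption, since the rule above only specifies the retry for JSON parse failures.

```python
import json


def tag_roles(call_llm, taxonomy: set) -> list:
    """Parse role_tags from an LLM reply, retrying once on a JSON parse failure."""
    last_error = None
    for _ in range(2):  # first attempt plus exactly one retry
        raw = call_llm()
        try:
            tags = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue
        if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
            raise ValueError("role_tags must be a JSON array of strings")
        unknown = sorted(set(tags) - taxonomy)
        if unknown:
            raise ValueError(f"tags outside taxonomy.yaml enum: {unknown}")
        return tags
    # Report the error; never fall back to saving an empty array.
    raise RuntimeError("role tagging failed after one retry") from last_error
```

The caller decides how to surface the `RuntimeError`; the point of the sketch is that no code path returns a silent empty list.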

YAML Robustness

On parse failure for any state YAML (profile/taxonomy/rules/tags/sources/config): no crash. Copy original to <file>.bak.<ISO8601> once → AskUserQuestion with 2 options (edit manually [default] / reset to default [data loss warning]). Automatic deletion or cleanup of user data is forbidden.

→ Details: reference/ingest-and-curation.md#yaml-robustness

Company-Name Ingest

Ingest path #4 (company name only) operates only within sites registered in sources.yaml. For unregistered companies → WebFetch/open-web search is absolutely forbidden; trigger AskUserQuestion with "공식 채용 페이지 URL 을 알려주세요". When user provides a URL, append to sources.yaml then proceed with standard flow. Blacklist supported.

→ Details: reference/dedup-and-discovery.md#company-name-ingest

Rules Re-evaluation

Re-derive rules.yaml based on today's collection results. Trigger phrases: "오늘 수집 정리해줘" / "오늘 본 JD로 규칙 업데이트" / "규칙 재평가" / "rules 다시 뽑아줘" / auto-propose when 1 or more include·exclude occur within a session. Scope: only JD files where the date portion of last_checked_at is today (excluding manual-edited files). If 0 JDs today, stop immediately. Overwriting rules.yaml directly without approve is forbidden.

Workflow:

(1) Load scope + store rules.yaml.sha256.before in memory.
(2) LLM call (temperature 0) → generate proposed rules.
(3) Atomic write rules.yaml.proposed (.tmp → rename, includes version:1 + _proposed_at + _based_on).
(4) Display diff + AskUserQuestion (approve / reject / edit manually).
(5) On approve, race check required: recompute sha256 of rules.yaml → if mismatch with before, abort.
(6) Race OK → overwrite rules.yaml (atomic write, excluding _proposed_at/_based_on) + remove .proposed.

→ Details: reference/ingest-and-curation.md#rules-re-evaluation
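The race check and final overwrite on approve could be sketched as below. The names are hypothetical, and the sketch assumes the caller passes `approved_text` with _proposed_at and _based_on already stripped, since removing them requires a YAML parser that this example does not depend on.

```python
import hashlib
import os


class RulesRaceError(RuntimeError):
    """Hypothetical error: rules.yaml changed between proposal and approval."""


def file_sha256(path: str) -> str:
    """Hash the file's exact bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def apply_approved_rules(rules_path, proposed_path, approved_text, sha_before):
    """Overwrite rules.yaml only if it is unchanged since sha_before was taken."""
    if file_sha256(rules_path) != sha_before:
        raise RulesRaceError("rules.yaml changed since the proposal; aborting")
    tmp = rules_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(approved_text)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, rules_path)   # atomic overwrite
    os.remove(proposed_path)      # the proposal is consumed on success
```

On a mismatch nothing is written and the .proposed file is left in place, so the user can re-run the evaluation against the current rules.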

Reference Index

reference/bootstrap.md — Session-init rules (Gate tasks, Session Lock, Storage Backend, Atomic Write, State Location, Profile Interview)
reference/dedup-and-discovery.md — Source/listing rules (Sources Registration, Pagination, Coverage Verification, Per-Site Crawl Memory, Per-Source Ledger, Identifier Kind, Dedup L1/L2, Company-Name Ingest, Decision Flow)
reference/ingest-and-curation.md — Per-JD rules (Detail Split, Matching Loop, Full Coverage Ingest, Exclude Flow, Reversal, Manual Edit Safety, Ingest Validation, Batch Report, Role Tagging, YAML Robustness, Rules Re-evaluation)
reference/frontmatter-schema.md — JD file YAML frontmatter contract
reference/slugify.md — Slug normalization algorithm spec
reference/url-normalize.md — URL normalization spec
reference/dedup-l2-prompt.md — L2 LLM similarity pinned prompt
reference/ambiguity-prompt.md — Matching ambiguity pinned prompt

Tests

skills/collect-jd/tests/pressure-scenarios.md — 13 pressure scenarios (Phase B TDD evidence stubs)
skills/collect-jd/evals/trigger-eval.json — trigger eval spec (flat shape)