buaa-classroom-summarizer - SKILL.md Agent Skill

name: buaa-classroom-summarizer description: Extract BUAA classroom replay artifacts from `livingroom` or `coursedetail` URLs by reusing a local Chromium login session. Use when Codex needs replay metadata, course-transcript files, optional PPT auxiliary artifacts, replay-ready lesson lists, or a standalone semantic rebuild packet / final lesson note.

BUAA Classroom Summarizer

Use this skill for:

one classroom.msa.buaa.edu.cn/livingroom replay URL
one classroom.msa.buaa.edu.cn/coursedetail course page

Assume commands run from this skill root. Otherwise use the absolute path to scripts/.

Core Boundary

Let scripts handle authentication, extraction, replay diagnosis, caching, and artifact writes.
Let the agent handle course alignment, concept confirmation, terminology correction, and final prose reconstruction.
Treat deterministic note output as a seed unless semantic rebuild is explicitly completed.

Main Commands

Single replay extraction:

python scripts\extract_buaa_classroom.py "<livingroom-url>" --output-dir "<output-dir>"

Whole-course replay enumeration or extraction:

python scripts\collect_buaa_course_replays.py "<coursedetail-url>" --output-dir "<output-dir>"
python scripts\collect_buaa_course_replays.py "<coursedetail-url>" --output-dir "<output-dir>" --extract-existing --skip-existing

Whole-course commands are extraction and inventory commands, not permission to generate final notes for every available lesson. For a coursedetail URL, enumerate or extract artifacts first, then ask the user which lesson to semantically rebuild next unless the user explicitly requests a batch finalization workflow.

Course Identity Rule

For a coursedetail URL, resolve the course identity before choosing a course folder or reusing vault context:

First extract the strongest available course title from classroom/SPoC metadata, saved course detail files, replay metadata, or stable replay titles.
Treat the normalized course title as the canonical course identity. If the title matches an existing course, it is the same course even when course_id, lecturer, class time, classroom, or sub_id ranges differ.
If no reliable title is available, use course_id only as a provisional identity such as course-136278; do not merge into an existing titled course by teacher, schedule, lesson dates, old extraction directories, or nearby vault content.
Preserve course_id and source URLs as source metadata, but do not use them to split courses that have the same confirmed title.
If title extraction is ambiguous and the target vault already has plausible course folders, pause for user confirmation before writing formal notes or updating trackers.

Runtime browser auth when local cookie reuse is unreliable:

python scripts\extract_buaa_classroom.py "<livingroom-url>" --output-dir "<output-dir>" --browser-runtime-auth --browser-channel "auto"

Required Replay Diagnosis

Before building any note, the script must produce one replay_diagnosis and route the replay into exactly one of:

waiting_transcript
partial_transcript
transcript_only

Downstream note logic must consume this diagnosis instead of recomputing route decisions elsewhere.

Standalone Markdown Note Workflow

Prepare a semantic rebuild packet:

python scripts\extract_buaa_classroom.py "<livingroom-url>" --output-dir "<output-dir>" --export-markdown-note

Preferred user-facing modes:

python scripts\extract_buaa_classroom.py "<livingroom-url>" --output-dir "<output-dir>" --export-markdown-note --markdown-note-mode "final-lite"
python scripts\extract_buaa_classroom.py "<livingroom-url>" --output-dir "<output-dir>" --export-markdown-note --markdown-note-mode "final-explained"

These modes must write only:

semantic_rebuild/semantic_rebuild_input.json
semantic_rebuild/semantic_rebuild_prompt.md

All final-oriented modes, including legacy final, must write only the semantic packet. Do not emit lesson_note.md by default. Treat the packet as the only intermediate artifact, then let the agent produce the final note.

Before accepting an agent-written note as final, run:

python scripts\validate_final_note.py "<final-note.md>"

If the validator fails, keep the extraction artifacts and semantic packet only. Do not rename or present the failed note as a final note.

Then create a reviewer packet:

python scripts\review_final_note.py --note "<final-note.md>" --semantic-input "<semantic_rebuild_input.json>" --output-dir "<review-dir>"

Use final_note_review/final_note_review_prompt.md with an independent reviewer agent only when the active system/developer instructions allow spawning one. If subagents are unavailable or not allowed, run a separate reviewer pass yourself with the same prompt, write the result as final_note_review/final_note_review_result.json, and do not edit the note during review.

When the agent writes the final standalone Markdown note:

use a readable lesson filename, preferably the lesson title such as 2026-04-13 贝叶斯统计第7周星期1第3,4,5节.md
do not name the final note lesson_note.md
place final lesson Markdown notes for the same course in one course folder named by course title, for example 贝叶斯统计/2026-04-13 贝叶斯统计第7周星期1第3,4,5节.md
if the chosen output directory is already named exactly as the course title, write final lesson Markdown notes directly in that directory; do not create 课程名/课程名/...
keep extraction artifacts such as metadata.json, transcript.json, and semantic rebuild packets in their original replay output directories; only user-facing final Markdown notes need the course-folder layout
start directly with the lesson title and content
do not show production metadata such as 状态, 来源, transcript coverage, replay diagnosis, or PPT extraction status in the user-facing note
pass scripts\validate_final_note.py before the note is called final
pass the reviewer gate for the current file hash before the note is called final

Batch Finalization Rule

Whole-course extraction is still not permission to blindly finalize every replay. A whole-course run may produce:

replay inventory
extraction artifacts
semantic rebuild packets
a course-level todo list

If the user explicitly asks to process all currently pending lessons, batch finalization is allowed, but it must remain lesson-by-lesson inside the batch:

Skip lessons already finalized unless the user asks for a revision.
For each candidate lesson, verify the transcript exists and is non-empty before authoring.
Read the full transcript plus the semantic packet before writing that lesson.
Run validate_final_note.py, create the review packet, and record a pass result for the current note hash before calling that lesson final.
If a lesson fails any gate, leave only artifacts/packet or a review-gated draft and continue with other eligible lessons.
Prefer running tracker/overview maintenance once after the batch, not after every lesson, unless an intermediate checkpoint is needed.
Reuse existing extraction artifacts and semantic packets when their inputs have not changed; do not rerun browser extraction just to rebuild prose.

Final Note Quality Gate

A user-facing final note must be a semantic reconstruction, not a decorated transcript segment list. Before writing or accepting a final note, reject it if it contains any of these patterns:

raw ASR/OCR snippets presented as "representative expressions" or "代表性表达"
headings such as 课堂讲解与主题推进 1
boilerplate like 整理时建议不要把这一段只当作...
repeated generic advice across sections instead of course-specific mathematical content
transcript noise such as misrecognized symbols copied into the note without correction
a course overview that marks low-quality diagnostics as "正式笔记"

If a note fails this gate, keep only the extraction artifacts and semantic packet, then mark the lesson as needing semantic rebuild. Do not call it final.

The semantic packet must not contain user-facing seed prose such as seed_bullets, raw sample_lines, or transcript_excerpt. It may contain time windows and paths to the transcript; the agent must read the transcript itself and reconstruct the note semantically.

Reviewer Gate

Finalization requires both gates on the current Markdown bytes:

scripts\validate_final_note.py passes.
The independent reviewer returns decision=pass, finalization_allowed=true, and reviewed_note_sha256 equal to final_note_review_input.json note.sha256.

If the note changes after either gate, both gate results are invalid and must be rerun.

Reviewer implementation detail:

If subagents are permitted, use an independent reviewer agent.
If subagents are not permitted by active instructions, run a separate reviewer pass in the main agent, write final_note_review_result.json, and ensure reviewed_note_sha256 matches final_note_review_input.json.
Do not rerun review for an unchanged note when an existing final_note_review_result.json already passes for the same hash.

Reviewer decisions:

pass: the note faithfully covers the transcript, handles course-domain substance, preserves supported affairs/emphasis, and is safe to present as final.
needs_revision: the transcript can support a final note, but the current note misses supported content, is too generic, or needs correction. Revise, rerun hard gate, then rerun reviewer.
reject: the current source material or note is not fit for finalization. Keep extraction artifacts and semantic packet; do not present a final note.

Absence is not failure. Missing homework, exam, grading, or deadline information is only a problem when the transcript contains evidence for it and the note omits, distorts, or invents it. If the transcript shows early dismissal, in-class exercise, student presentation, discussion, or a logistics-only class, the note may be short but must faithfully describe what happened.

Semantic Rebuild Rules

Perform a course-alignment check before accepting the rewrite as final.
Correct obvious ASR/OCR term errors when the course context makes the intended term clear.
Keep the lesson time axis visible. Each final section should keep a packet time range or a coarse MM:SS-MM:SS marker.
Keep math as $...$ or $$...$$ only. Do not wrap formulas in backticks.
Treat the course transcript as the only primary source for section boundaries, lesson mainline, and completion checks.
Only mark a lesson final when course-transcript coverage and summary coverage both pass.
Reconstruct course-specific substance. Do not substitute generic learning advice for missing semantic understanding.
If transcript.txt is missing, empty, or near-empty, treat the replay as waiting for transcript material even if a tracker lists it under backlog. Do not create a formal note from metadata, schedule, title, or PPT alone.

Authoring Contract

When writing the final student-facing Markdown note from a semantic packet:

You are writing the finished note, not a seed note, diagnostic note, or instruction to a future organizer.
Read the full transcript.txt before writing. Use semantic_rebuild_input.json only as metadata, time anchors, and artifact index.
Do not expose evidence snippets, candidate phrases, OCR fragments, raw ASR lines, or internal workflow notes.
Every major time block should explain what teaching move happened: definition, model, argument, proof, example, comparison, case discussion, policy explanation, teacher comment, assignment, exam arrangement, or class logistics.
Capture high-value classroom signals: exams, homework, deadlines, submission format, grading weight, reading requirements, teacher-emphasized key points, repeatedly stressed phrases, formulas, theorems, definitions, examples, and common mistakes.
If the teacher explicitly says something is important, likely to be tested, easy to confuse, often wrong, or needs review after class, preserve it in the note.
If transcript evidence is weak, write the item under 待核对 instead of turning it into a confident conclusion.
The final note must face the student reader directly. Avoid phrases such as “整理时应...”, “后续重写...”, “这一段主要在...”, or other process commentary.

Course-domain reconstruction guidance:

Math and statistics: reconstruct objects, definitions, assumptions, equations, theorems, proof ideas, examples, counterexamples, symbol meanings, and links between results.
Engineering and computer science: reconstruct system components, algorithms, design constraints, implementation steps, experiment setup, failure cases, trade-offs, and how formulas or code relate to the design.
Humanities and social sciences: reconstruct concepts, arguments, historical or institutional background, author positions, evidence, comparisons, cases, and the teacher's evaluative emphasis.
Ideological and political courses: reconstruct policy concepts, theoretical claims, historical context, named documents or events, value judgments, exam-oriented formulations, and examples used to explain abstract claims.
Language, writing, and communication courses: reconstruct vocabulary, rhetorical patterns, text structure, examples, correction points, practice requirements, and teacher feedback.
Lab, design, or project courses: reconstruct task goals, deliverables, tools, operation steps, data requirements, safety or format constraints, grading criteria, and troubleshooting advice.

Transcript-Only Rule

When replay_diagnosis=transcript_only:

do not emit fake content templates such as “课程定位 / 基础概念 / 方法流程”
let scripts provide only time segments from the course transcript, representative transcript lines, and transcript_overview
let the agent infer the real lesson structure from the course transcript plus course context
do not ask scripts to pre-confirm concepts from transcript-only material

PPT Rule

Prefer teacher stream by default.
Treat PPT as auxiliary only, even when a PPT stream exists.
PPT may help with term spelling, page or book titles, formula symbols, and logistics screenshots.
PPT must not decide section boundaries, lesson mainline, concept generation, or completion state.

Logistics-Only Teacher Review

If the user only wants follow-up on assignments, exams, notices, or arrangements:

python scripts\extract_buaa_classroom.py "<livingroom-url>" --output-dir "<output-dir>" --export-markdown-note --lightweight-teacher-review

This mode prepares short teacher-stream review clips and teacher_review.json. It should not silently rewrite conclusions into the note until a later confirmation step marks them as confirmed.

Failure Rules

If the course transcript is missing, keep extraction artifacts but do not invent a formal lesson note.
If the course transcript is empty or near-empty, handle it the same as missing transcript: keep it in waiting/backlog and do not write a final note.
If course-transcript coverage is clearly partial, keep only a diagnostic draft rather than a final note.
If the course transcript exists but the current summary only covers an early slice of the lesson or leaves large uncovered gaps, mark the note needs_review instead of final.
If session reuse fails, rerun with --browser-runtime-auth.

On Windows, prefer a UTF-8 shell when validating generated files. If needed, set [Console]::InputEncoding and [Console]::OutputEncoding to UTF-8 before manual Get-Content or other console inspection. For inline Python in PowerShell, use:

@'
print("hello")
'@ | python -

Do not use Bash heredoc syntax such as python - <<'PY' in PowerShell.