name: coptic-ocr-reviewer
description: "Use when: reviewing unreviewed Coptic OCR scan-page markdown exports from the printed Kephalaia paper/edition, inspecting Final Reviewer Output text for suspicious OCR inconsistencies, and applying patterns learned only from manually corrected pages p010-p029 in temp/manual_reviewer_markdown_export/out."
Coptic OCR Reviewer Skill
Use this skill to review unreviewed scan-page OCR markdown exports from the printed Kephalaia paper/edition:
temp/manual_reviewer_markdown_export/out
The skill is trained only on the manually reviewed pages p010.md through p029.md. Do not mine examples or rules from any other repository files. General Coptic reading competence may be used as language intuition, but no other repo content, dictionaries, scripts, OCR artifacts, notes, or source files should be consulted for the review.
Source Model
These exports are not OCR of the original manuscript. They are OCR of a scanned printed/scholarly page analyzing/transcribing the Kephalaia of the Teacher.
- Treat printed Coptic letters, supralinear marks, diaereses, underdots, brackets, lacuna dots, and German status notes as part of the printed transcription system when they are intentionally printed.
- Treat stray specks, smudges, isolated dots, and visual noise as likely scan/print/OCR artifacts unless the printed-page pattern clearly supports them as transcription marks.
- German status notes such as `abgerieben` or `zerstort` describe the editor's assessment of the underlying source, but the OCR review only judges the printed/scanned page export.
- Do not infer physical manuscript damage from the scan itself. Use "damaged" only when referring to the printed transcription's own bracket/dot/status-note conventions.
Scope Discipline
- Use only the page markdown export being reviewed, plus the correction patterns recorded here from `p010.md`-`p029.md`.
- Do not use `p030.md` as training evidence.
- Do not use later unreviewed pages as evidence for general rules.
- Treat `Reviewed: yes` pages as ground truth only inside the training boundary.
- Treat `Reviewed: no` pages as candidate OCR text, even when the section is named `Final Reviewer Output`.
- Do not rewrite uncertain readings just to make them look clean. The reviewed pages preserve uncertainty, brackets, lacuna dots, `?`, and `E` where the uncertainty is still real.
Page Structure
Each page markdown normally contains:
- metadata such as `Lines`, `Reviewed`, `Done lines`, `Special lines`, and `Flagged lines`
- `## Final Reviewer Output`
- `## Pre-Manual-Correction Output`
- `## Final Reviewer X-Center Details`
- `## Pre-Manual-Correction X-Center Details`
For unreviewed pages, start by reading the metadata and the Final Reviewer Output. Use the pre-manual block only as secondary evidence if the page contains both blocks and a comparison helps explain a suspicion. A text-only review must not depend on page images/scans.
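The reading order described here can be sketched in Python. This is a minimal sketch, not the export tool's own parser: it assumes metadata appears as `Key: value` lines before the first `## ` heading, and `parse_page` and `primary_text` are hypothetical helper names introduced only for illustration.

```python
import re


def parse_page(markdown: str) -> dict:
    """Split a page export into metadata and '## ' sections (assumed layout)."""
    metadata, sections = {}, {}
    current = None  # section being collected; None while still in the header
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        else:
            # Assumption: metadata lines look like "Reviewed: no".
            match = re.match(r"^[-*\s]*([A-Za-z ]+?):\s*(.*)$", line)
            if match:
                metadata[match.group(1).strip()] = match.group(2).strip()
    return {"metadata": metadata, "sections": sections}


def primary_text(page: dict) -> list[str]:
    """Return the non-empty Final Reviewer Output lines, the block to read first."""
    lines = page["sections"].get("Final Reviewer Output", [])
    return [line for line in lines if line.strip()]
```

The pre-manual block stays available under `page["sections"]` for the secondary comparison, but nothing in the sketch reads it by default.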
Review Output Format
Report findings line by line. Do not silently edit the page unless the user explicitly asks for corrections to be applied.
Use this format:
Page pNNN.md
- LXX: suspicious reading
Observed: `...`
Proposed: `...`
Confidence: high | medium | low
Reason: concise explanation grounded in training patterns and/or Coptic continuity.
If a page has no strong textual inconsistencies, say so and list any low-confidence residual concerns separately.
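For reviewers who script their reports, the format can be produced with a small helper. This is an illustrative sketch: `Finding` and `render_report` are hypothetical names, and the two-digit zero padding of the line label is an assumption about how `LXX` is meant to be filled in.

```python
from dataclasses import dataclass

ALLOWED_CONFIDENCE = {"high", "medium", "low"}


@dataclass
class Finding:
    line_no: int
    observed: str
    proposed: str
    confidence: str
    reason: str


def render_report(page_name: str, findings: list[Finding]) -> str:
    """Render findings line by line in the report format used by this skill."""
    out = [f"Page {page_name}"]
    if not findings:
        out.append("No strong textual inconsistencies found.")
        return "\n".join(out)
    for finding in sorted(findings, key=lambda item: item.line_no):
        if finding.confidence not in ALLOWED_CONFIDENCE:
            raise ValueError(f"unknown confidence: {finding.confidence!r}")
        out += [
            f"- L{finding.line_no:02d}: suspicious reading",  # padding is assumed
            f"Observed: `{finding.observed}`",
            f"Proposed: `{finding.proposed}`",
            f"Confidence: {finding.confidence}",
            f"Reason: {finding.reason}",
        ]
    return "\n".join(out)
```

The helper only formats text. It never writes back to the page file, in keeping with the rule against silent edits.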
What The Training Set Shows
Across `p010.md`-`p029.md`, 138 of 672 final lines (about 21%) differ from the pre-manual OCR. Most changes are small, not wholesale rewrites. The reviewer should therefore prefer targeted correction proposals over broad reconstruction.
Common correction types:
- Raw dot punctuation inside words often becomes an underdot on the neighboring letter.
- Missing diaeresis on `ⲓ` is common, especially in `ⲁⲓ̈`, `ⲁϩⲣⲏⲓ̈`, `ⲟⲩⲁⲓ̈`, and `ⲡϫⲁ̣ⲓ̈`.
- Whole-line OCR noise made from `E`, `?`, stray brackets, and a few Coptic letters often becomes a German status note.
- Left-margin starts are fragile. The reviewer often inserted missing initial letters or corrected opening brackets.
- Supralinear abbreviations and overlines are load-bearing and should be preserved exactly.
- Coptic-looking words were corrected by local language continuity, but lacuna brackets and dots were retained where the text was not recoverable.
High-Value OCR Warning Signs
Flag a line when it contains these patterns and the surrounding line is otherwise Coptic:
- Dense ASCII noise: `E`, `EE`, `EEE`, `?`, `??`, `?EEE`, `?EE?`, or mixed clusters like `E??E`.
- Bracket noise at the beginning or end: `[[[`, `][`, `]]`, or a dangling `]` where a restoration bracket or German status note is more likely.
- Dot-separated Coptic inside an otherwise continuous word: `ⲁ.ⲃⲁ.ⲗ`, `ⲡⲥ.ⲁ.ⲡ`, `ⲧ.ⲟⲛ.ⲉ`, `ⲡϫ.ⲁⲓ`, `ⲅ.ⲣⲁ`, `ⲡ.ⲣ̣ⲱⲙⲉ`.
- Question marks embedded where the target word is otherwise clear: `ⲥⲁⲣ?`, `ⲥⲕⲁⲛ?ⲓⲍⲉ`, `ⲛ̄ⲧ.ϣⲏ̣ⲣ`.
- A single isolated Coptic letter on a line where the surrounding page likely continues a printed damaged-text note, heading, or section.
- A left-edge missing prefix in a line that otherwise begins mid-word.
Do not automatically delete all `E`, `?`, or dots. The reviewed pages retain them when they remain unresolved.
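Several of these warning signs are mechanical enough to pre-screen with regular expressions. The sketch below is a hedged illustration, not part of the training evidence: the pattern names and `warning_signs` are hypothetical, and it assumes the exports encode Coptic letters in the Unicode Coptic block (U+2C80 to U+2CFF) plus the Coptic letters in the Greek block (U+03E2 to U+03EF).

```python
import re

COPTIC = r"\u2C80-\u2CFF\u03E2-\u03EF"  # assumed letter ranges
MARKS = r"\u0300-\u036F\uFE20-\uFE2F"   # combining marks that may follow a letter

PATTERNS = {
    # Any Latin E, or a run of question marks, inside a Coptic line.
    "ascii_noise": re.compile(r"E|\?\?"),
    # Stacked or inverted brackets, or a dangling ] at the line start.
    # Heuristic: a legitimate "][" between two restorations also matches.
    "bracket_noise": re.compile(r"\[\[\[|\]\[|\]\]|^\]"),
    # A single raw dot between two Coptic letters.
    "dot_inside_word": re.compile(rf"[{COPTIC}][{MARKS}]*\.[{COPTIC}]"),
    # A question mark directly after a Coptic letter.
    "embedded_question_mark": re.compile(rf"[{COPTIC}][{MARKS}]*\?"),
}


def warning_signs(line: str) -> list[str]:
    """Return the names of the warning patterns found in one OCR line."""
    text = line.strip()
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]
```

A hit is only a prompt to look at the line. It is never a reason to delete the characters that triggered it.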
Dot And Underdot Behavior
The most frequent short replacement in the training set was a raw `.` becoming a combining underdot (U+0323). Use this only when the dot is adjacent to a letter and the result produces a plausible word or damaged-letter reading.
Observed examples:
- `ⲁ.ⲃⲁ.ⲗ` or `ⲁⲃ.ⲁ.ⲗ` can become `ⲁⲃ̣ⲁⲗ̣` or `ⲁⲃⲁ̣ⲗ̣` depending on the letter positions.
- `ⲡⲥ.ⲁ.ⲡ` became `ⲡⲥⲁ̣ⲡ̣`.
- `ⲧ.ⲟⲛ.ⲉ` became `ⲧⲟ̣ⲛ̣ⲉ`.
- `ⲡϫ.ⲁⲓ` became `ⲡϫⲁ̣ⲓ̈`.
- `ⲅ.ⲣⲁⲫⲁⲩ̣ⲉ` became `ⲅ̣ⲣⲁⲫⲁⲩ̣ⲉ`.
- `ⲛ̄ⲧ̣.ϣⲏ̣ⲣ` became `ⲛ̄ⲧ̣ϣ̣ⲏ̣ⲣ`.
Keep true lacuna dots as dots, especially long runs such as ........ and dots inside brackets.
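Because the underdot can land on the letter before or after the raw dot, it can help to enumerate the possible readings and let the reviewer choose. The following is a minimal sketch under stated assumptions: `underdot_candidates` is a hypothetical helper, the underdot is taken to be U+0323, and only a single dot standing between two Coptic letters is treated as movable. Runs of two or more dots are left alone as lacuna dots.

```python
import itertools
import re

UNDERDOT = "\u0323"  # combining dot below (assumed encoding of the underdot)
# One Coptic letter plus any combining marks already attached to it.
LETTER = r"[\u2C80-\u2CFF\u03E2-\u03EF][\u0300-\u036F\uFE20-\uFE2F]*"
TOKEN = re.compile(rf"{LETTER}|\.+|.", re.S)


def underdot_candidates(word: str) -> list[str]:
    """List readings where each inner raw dot becomes an underdot on a neighbor."""
    tokens = TOKEN.findall(word)
    slots = [
        i for i, tok in enumerate(tokens)
        if tok == "." and 0 < i < len(tokens) - 1
        and re.fullmatch(LETTER, tokens[i - 1])
        and re.fullmatch(LETTER, tokens[i + 1])
    ]
    candidates = []
    # -1 attaches the underdot to the previous letter, +1 to the next one.
    for choice in itertools.product((-1, 1), repeat=len(slots)):
        parts = list(tokens)
        for slot, direction in zip(slots, choice):
            target = slot + direction
            if UNDERDOT not in parts[target]:
                # Place the underdot right after the base letter.
                parts[target] = parts[target][0] + UNDERDOT + parts[target][1:]
            parts[slot] = ""
        candidate = "".join(parts)
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates
```

For `ⲡⲥ.ⲁ.ⲡ` the sketch yields four candidates, one of which is the reviewed reading `ⲡⲥⲁ̣ⲡ̣`. Choosing among them still requires the plausibility judgment described above.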
Diaeresis On Iota
The training set repeatedly adds diaeresis to iota in words where the OCR saw plain ⲓ.
Common checked forms:
- `ⲁⲓ̈`
- `ⲁϩⲣⲏⲓ̈`
- `ⲟⲩⲁⲓ̈`
- `ⲡϫⲁ̣ⲓ̈`
- `ⲡⲉⲓ̈`
- `ⲉⲓ̈`
- `ⲃⲉⲗⲓ̈ⲁⲣ`
Flag missing diaeresis when the surrounding form strongly matches one of these. Do not add diaeresis to every iota.
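A conservative way to apply this rule is to compare whole tokens against the checked forms with all combining marks removed. The sketch below is illustrative only: `missing_diaeresis` is a hypothetical helper, it uses just four of the checked forms (written as escapes so the combining marks are explicit), and it assumes words are separated by whitespace and that the diaeresis is encoded as U+0308.

```python
import unicodedata

DIAERESIS = "\u0308"  # combining diaeresis (assumed encoding)

KNOWN_FORMS = [
    "\u2C81\u2C93\u0308",                    # ⲁⲓ̈
    "\u2C81\u03E9\u2CA3\u2C8F\u2C93\u0308",  # ⲁϩⲣⲏⲓ̈
    "\u2C9F\u2CA9\u2C81\u2C93\u0308",        # ⲟⲩⲁⲓ̈
    "\u2CA1\u03EB\u2C81\u0323\u2C93\u0308",  # ⲡϫⲁ̣ⲓ̈
]


def strip_marks(text: str) -> str:
    """Drop combining marks (underdots, diaereses, overlines) from a string."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


BARE_FORMS = {strip_marks(form) for form in KNOWN_FORMS}


def missing_diaeresis(line: str) -> list[str]:
    """Return tokens that match a checked form but carry no diaeresis."""
    flagged = []
    for token in line.split():
        if DIAERESIS not in token and strip_marks(token) in BARE_FORMS:
            flagged.append(token)
    return flagged
```

Matching whole tokens keeps the check from firing on every iota. Words embedded in longer unspaced strings are deliberately left to the reviewer.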
Supralinear And Abbreviated Forms
These marks are not decoration. Preserve them, and flag forms where OCR drops or corrupts the mark structure.
Observed recurring forms include:
- `ⲙ︤ⲛ︥`
- `ϩ︤ⲛ︥`
- `ⲛ︤ϥ︥`
- `ⲓ︤ⲏ︦ⲥ︥`
- `ⲛⲓ︤ⲏ︦ⲥ︥`
- `ⲡⲭ︤ⲣ︦ⲥ︥`
- `ⲡ︤ⲛ︦ⲁ︥`
- `ⲡ︤ⲛ︦ⲓ︦ⲕ︥`
- `ⲡⲡⲣ︤ⲕ︦ⲗ︦ⲥ︥`
Common review action: preserve the abbreviation, repair the surrounding Coptic, and avoid expanding it unless the page already uses an expanded bracketed restoration.
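A dropped or corrupted overline segment can often be spotted structurally, without knowing the word. The sketch below assumes the exports encode multi-letter overlines with the combining half marks U+FE24 (left half), U+FE26 (conjoining) and U+FE25 (right half); that encoding is an assumption, and `overline_problems` is a hypothetical helper. Single-letter strokes such as U+0304 are ignored.

```python
import unicodedata

LEFT, MID, RIGHT = "\uFE24", "\uFE26", "\uFE25"  # assumed overline encoding


def overline_problems(text: str) -> list[str]:
    """Report places where a left/middle/right overline sequence is broken."""
    # Group each base character with the combining marks that follow it.
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) == "Mn":
            clusters[-1] += ch
        else:
            clusters.append(ch)

    problems = []
    is_open = False
    for index, cluster in enumerate(clusters):
        kind = next((m for m in (LEFT, MID, RIGHT) if m in cluster), None)
        if not is_open:
            if kind == LEFT:
                is_open = True
            elif kind in (MID, RIGHT):
                problems.append(f"cluster {index}: overline continues without a start")
        else:
            if kind == RIGHT:
                is_open = False
            elif kind == LEFT:
                problems.append(f"cluster {index}: overline restarts before closing")
            elif kind is None:
                problems.append(f"cluster {index}: unmarked character inside an overline")
                is_open = False
    if is_open:
        problems.append("overline is still open at the end of the text")
    return problems
```

The check only reports structural breaks. It does not propose an expansion, consistent with the rule to preserve abbreviations as printed.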
German Status Notes
The reviewed pages use compact German status notes for text the printed edition labels as damaged, unreadable, rubbed, blank, or partly visible. These notes are intentional reviewer statuses, not Coptic text.
The Coptic OCR engine does not reliably recognize Latin characters. Raw OCR strings that look partly Latin or German are therefore not trustworthy spellings. Treat them as noisy evidence that a German status note may be needed, not as exact text to preserve. Use the corrected reviewed-note patterns below as the guide.
Established note vocabulary from the training pages:
- `nichtgelesen`
- `leer`
- `abgerieben`
- `Restabgerieben`
- `geringeSpuren`
- `ganzgeringeSpuren`
- `unlesbar`
- `zerstort`
- `fastvolligzerstort`
- `fastganzzerstort`
- `zerstortundabgerieben`
- `verwischt`
- `verwischtundabgerieben`
- `einstweilenunlesbar`

Variant note-like strings also occur in the corrected pages, including `}.Sgeringepuren`, `}beinstweilenunlesar`, and `}kstarabgerieben`. Treat these as evidence that Latin status-note regions can be noisy at printed margins because the Coptic OCR is not reading Latin properly. When reviewing an unreviewed page, flag such variants as possible German status-note lines rather than confidently normalizing them.
Strong German-status-note candidates:
- A whole line or line segment is mostly `E`, `?`, brackets, and a few Coptic letters.
- The line appears in a printed damage/status-note band where adjacent lines have long dot runs.
- OCR gives sequences like `[[[E???E?EⲟE?`, `][?.ⲟⲓ.ⲥ̈?]???EEⲟⲓE]]`, or `[EE?ⲟⲧⲥ̈ⲓⲧ??ⲟⲥ??E??E?E??`.
Training examples:
- `[[[E???E?EⲟE?` -> `nichtgelesen`
- `][?.ⲟⲓ.ⲥ̈?]???EEⲟⲓE]]` -> `}fastvolligzerstort`
- `[EE?ⲟⲧⲥ̈ⲓⲧ??ⲟⲥ??E??E?E??...............` -> `zerstortundabgerieben...............`
- `[ⲥ??EE?E?EE` -> `}abgerieben`
- `][E??ⲟⲓ?E??E?E??Eⲟ?ⲥE]]` -> `}einstweilenunlesbar`
- `Eⲓⲥ???E?EⲟEE` -> `nichtgelesen`
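The first candidate criterion, a line that is mostly `E`, `?`, and brackets, can be approximated with a simple character ratio. This is a rough sketch: `noise_ratio` and `is_status_note_candidate` are hypothetical names, and the 0.6 threshold is an illustrative assumption, not a value derived from the training pages.

```python
import unicodedata

NOISE_CHARS = set("E?[]")


def noise_ratio(line: str) -> float:
    """Share of E, ?, and square brackets among the visible characters."""
    body = line.strip().rstrip(".")  # trailing lacuna dots are not counted
    chars = [
        ch for ch in body
        if not ch.isspace() and unicodedata.combining(ch) == 0
    ]
    if not chars:
        return 0.0
    return sum(ch in NOISE_CHARS for ch in chars) / len(chars)


def is_status_note_candidate(line: str, threshold: float = 0.6) -> bool:
    """Flag a line as a possible German status note (threshold is assumed)."""
    return noise_ratio(line) >= threshold
```

A positive result only says that a status note may belong here. Which note applies cannot be read from the noise itself and should be reported as a flagged possibility.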
Brackets, Braces, And Parentheses
Square brackets mark restorations or damaged/lost text and must be preserved when they are meaningful. The reviewed pages also use } at the start of some damage-note lines.
Review rules:
- A dangling `]` at the beginning of a Coptic line is suspicious; it may need `[` or may be part of a German-status-note conversion.
- Do not remove bracketed restorations merely because a word would be smoother without them.
- Empty `[]` segments occur in corrected pages and should not be rejected automatically.
- Parentheses can be corrected inside bracketed restorations when the OCR mangles them.
Observed examples:
- `]ⲕⲁⲣ]ⲡⲟ...` -> `[ⲕⲁⲣ]ⲡⲟ...`
- `ⲧϣ... [ϥⲁⲧ[ⲣ)ⲉϥϩⲉⲣ]` -> `...[ϥⲁⲧ(ⲣ)ⲉϥϩⲉⲣ]`
- `ⲙ̣...ⲡⲉⲧ[ⲓⲛ)ⲁ]` -> `...ⲡⲉⲧ[(ⲛ)ⲁ]`
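Unbalanced or mismatched brackets like those in the examples can be located with a standard stack check. This is a minimal sketch with a hypothetical helper name, `bracket_issues`. It covers square brackets and parentheses only and deliberately ignores the `}` that opens some damage-note lines.

```python
PAIRS = {"]": "[", ")": "("}


def bracket_issues(line: str) -> list[str]:
    """Report closers without a matching opener and openers left unclosed."""
    issues = []
    stack = []  # (opening character, position) pairs still waiting for a closer
    for pos, ch in enumerate(line):
        if ch in "[(":
            stack.append((ch, pos))
        elif ch in PAIRS:
            if stack and stack[-1][0] == PAIRS[ch]:
                stack.pop()
            else:
                issues.append(f"unexpected {ch!r} at {pos}")
    issues += [f"unclosed {ch!r} at {pos}" for ch, pos in stack]
    return issues
```

Empty `[]` segments pass this check, which matches the rule that they should not be rejected automatically. An issue at the start of a line may also indicate a status-note line rather than a bracket to repair.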
Recurrent Coptic Continuity Checks
Use these as local plausibility anchors. Flag only when the surrounding string strongly supports the form.
Common recurring forms in the training pages:
- `ⲉⲕⲕⲗⲏⲥⲓⲁ`
- `ⲡⲕⲟⲥⲙⲟⲥ`
- `ⲁⲡⲟⲥⲧⲟⲗⲟⲥ`
- `ⲕⲁⲣⲡⲟⲥ`
- `ⲡⲥⲧⲩⲗⲟⲥ`
- `ⲡϣⲏⲛ`
- `ⲡⲁⲣⲙⲟⲩⲧⲉ`
- `ⲁⲃⲁⲗ`
- `ⲁϩⲣⲏⲓ̈`
- `ⲙⲙⲁϥ`
- `ⲛ̄ⲥⲱⲥ`
- `ⲡϫⲁ̣ⲓ̈`
- `ⲥⲁⲣⲝ`
- `ⲥⲕⲁⲛⲓⲍⲉ`
- `ⲇⲟⲅ̣ⲙ̣ⲁ̣`
Observed corrections:
- `ⲡ̣ⲁⲣⲛⲧⲟⲩⲧⲉ` -> `ⲡ̣ⲁⲣⲙⲟⲩⲧⲉ`
- `ⲥⲕⲁⲛ?ⲓⲍⲉ` -> `ⲥⲕⲁⲛⲓⲍⲉ`
- `ⲥⲁⲣ?` -> `ⲥⲁⲣⲝ`
- `ⲧⲏⲣⲟⲩ̣?EEE` may keep `?EEE` if unresolved but can still take a local correction before it.
- `ⲇⲟⲅⲙⲁ` sometimes receives underdots as `ⲇⲟⲅ̣ⲙ̣ⲁ̣`.
Left And Right Margin Corrections
Margins are the most unstable zone. The training pages often add, remove, or repair one to three characters at the start or end of a line.
Common additions or repairs:
- missing initial `ⲧ`, `ⲙ`, `ⲡ`, `ϣ̣`, `ⲁⲝ`, `ϫⲙ`, `ⲧⲯⲩ`
- initial `?` removed when the following Coptic line is otherwise clear
- initial `]` corrected to `[` or converted to `}` before a damage note
- final stray `?]` removed when it is not part of a readable restoration
Examples:
- initial `ϥ...` -> `ⲧϥ...`
- initial `ⲧⲁϩ...` -> `ⲙ̄ⲙⲧⲁϩ...`
- initial `ϣⲁⲣⲡ...` -> `ⲡϣⲁⲣⲡ...`
- `?ⲉⲧⲃⲉ... ?` -> `ⲉⲧⲃⲉ...`
- `]ⲉⲓ̈ⲡ̣ⲉ...` -> `[ⲡ]ⲉⲓ̈ⲡ̣ⲉ...`
These are usually medium-confidence unless the surrounding Coptic is very clear.
Confidence Rules
High confidence:
- A raw noisy line matches the established damage-note pattern and contains little recoverable Coptic.
- A repeated form is obvious, such as `ⲥⲁⲣ?` -> `ⲥⲁⲣⲝ`, `ⲁϩⲣⲏⲓ` -> `ⲁϩⲣⲏⲓ̈`, or a dot-separated word matching a repeated token.
- A supralinear abbreviation is visibly present but one mark/letter is missing.
Medium confidence:
- A left-margin letter appears missing but the line can still be read without it.
- Dot-to-underdot repair improves a plausible word but is not forced by a repeated phrase.
- A bracket orientation looks wrong but the line still has mixed damage.
Low confidence:
- The line is heavily damaged and several reconstructions are possible.
- The candidate correction depends mainly on a semantic guess.
- The change would remove uncertainty marks that the reviewed pages commonly preserve.
What Not To Do
- Do not use other repo files to validate vocabulary or readings.
- Do not normalize all damage notes to standard German spelling.
- Do not replace every dot with an underdot.
- Do not remove all `E`, `?`, or bracket clutter automatically.
- Do not expand supralinear abbreviations.
- Do not turn bracketed restorations into unbracketed text unless the reviewed-page pattern clearly supports it.
- Do not treat a clean-looking Coptic line as wrong just because it differs from pre-manual OCR.
Final Review Checklist
Before returning a review:
- Confirm the target page path and `Reviewed` metadata.
- State that the review used only the page export and this `p010-p029` pattern skill.
- List line-level findings with confidence.
- Separate likely corrections from low-confidence questions.
- Preserve uncertainty when the evidence is not strong.