name: manuscript-review
description: Comprehensive checklist for reviewing academic medical AI manuscripts, covering objective identification, Background, Methods, Results, Tables, Discussion, and Abstract sections with success criteria for each. Use when reviewing or evaluating a manuscript draft.
Claude Code Skill: AI Manuscript Section Reviewer
This skill iterates through each section of a medical AI manuscript and outputs a success checklist based on predefined criteria.
Usage
Review my manuscript for [SECTION_NAME] compliance
or
Run full manuscript checklist
Reusable assets in this skill:
- `scripts/typography_qc.py`: regex-based typography and abbreviation audit for pre-submission drafts
- `templates/label_noise_sensitivity_template.md`: reviewer-ready label-noise sensitivity analysis template
- `templates/failure_mode_panel_template.md`: supplementary failure-mode panel template
- `templates/failure_mode_panel_template.ipynb`: executable failure-mode panel notebook skeleton
- `templates/consort_flow_table1_template.md`: CONSORT-style flow and Table 1 denominator template
Step 0: Objective Identification
Before reviewing any section, identify and classify the study objectives.
Iteration Logic:
EXTRACT objectives from manuscript:
IDENTIFY primary objective (exactly one)
IDENTIFY secondary objective(s)
FOR each objective:
CHECK that data source is specified
CHECK that model development approach is described (if applicable)
CHECK that evaluation criteria are defined
CHECK that sensitivity analysis is planned
BUILD objective-to-methods mapping
VERIFY that Methods and Results trace back to this mapping
Success Checklist:
- Single Primary Objective: Exactly one primary objective stated
- Secondary Objectives Listed: Each secondary objective is distinct and measurable
- Data Source Mapped per Objective: Each objective specifies which dataset(s), cohort(s), or registry(ies) it uses
- Model Development Mapped per Objective: Each objective that involves a model specifies architecture, training strategy, and validation approach
- Evaluation Criteria Mapped per Objective: Each objective has discrimination, calibration, and threshold metrics defined
- Sensitivity Analysis Mapped per Objective: Each objective specifies which assumptions are varied and how robustness is assessed
- Objectives Traceable Through Manuscript: Every Methods subsection and Results subsection maps to an objective
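The objective-to-methods mapping can be held in a small data structure and checked mechanically. The following is a minimal sketch, not part of this skill's bundled scripts; the `Objective` class, its field names, and both helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    name: str
    primary: bool
    data_source: str = ""
    model_approach: str = ""
    evaluation_criteria: str = ""
    sensitivity_analysis: str = ""
    involves_model: bool = True  # set False when no model is developed


def check_objectives(objectives):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    n_primary = sum(1 for o in objectives if o.primary)
    if n_primary != 1:
        failures.append(f"expected exactly one primary objective, found {n_primary}")
    for o in objectives:
        if not o.data_source:
            failures.append(f"{o.name}: data source not specified")
        if o.involves_model and not o.model_approach:
            failures.append(f"{o.name}: model development approach not described")
        if not o.evaluation_criteria:
            failures.append(f"{o.name}: evaluation criteria not defined")
        if not o.sensitivity_analysis:
            failures.append(f"{o.name}: sensitivity analysis not planned")
    return failures


def untraced_subsections(subsection_to_objective, objectives):
    """List Methods/Results subsections that do not map to a known objective."""
    known = {o.name for o in objectives}
    return sorted(s for s, obj in subsection_to_objective.items() if obj not in known)
```

Any subsection returned by `untraced_subsections` fails the "Objectives Traceable Through Manuscript" criterion.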
Section Definitions & Success Checklists
1. Background Section
Purpose: Establish clinical importance, trace AI evolution, identify gaps, and state contributions.
Iteration Logic:
FOR each paragraph in Background:
IDENTIFY which subsection it belongs to:
- Clinical importance
- AI evolution
- Current gaps
- Contribution statement
EVALUATE against criteria below
OUTPUT pass/fail for each criterion
Success Checklist:
- Clinical Importance Present: Core problem is defined with quantified scale (e.g., annual volume, error rates, time delays)
- Unmet Need Articulated: Clear statement of what current practice lacks
- AI Evolution Traced: Historical progression from traditional ML to deep learning with key citations
- State-of-the-Art Benchmarked: Current best performance cited with metrics
- Gaps Are Specific: Each gap is actionable, not vague ("more work needed" = FAIL)
- Gaps Map to Contributions: Every gap directly motivates a contribution
- Paper Roadmap Included: Reader knows what to expect in subsequent sections
- No Overclaiming: Novelty stated relative to prior work without exaggeration
2. Methods Section
Purpose: Describe ethics, data, modeling, and statistical analysis in reproducible detail.
Iteration Logic:
FOR each subsection in Methods:
CHECK presence and order:
1. Ethics & LLM disclosure (MUST be first)
2. Data Source for Model Development
3. Data Preprocessing
4. Model Development
5. External Validation (optional)
6. Model Evaluation and Statistical Methods
VERIFY each subsection maps to primary or secondary objective
CHECK that all other methodological detail is routed to Supplement
EVALUATE against criteria below
OUTPUT pass/fail for each criterion
Success Checklist:
- Ethics Approval First: Institutional Review Board (IRB)/ethics statement appears before any data description
- LLM Use Disclosed: If Large Language Models (LLMs) used in writing/analysis, disclosed upfront
- Data Source Described: Dataset(s) named with full expansion at first mention (e.g., "The Canadian Longitudinal Study on Aging (CLSA)"), years spanned, geographic scope, sample size, inclusion/exclusion criteria
- Data Source Mapped to Objectives: Clear which dataset serves the primary vs. secondary objectives
- Flowchart Present: CONSORT-style diagram showing patient/sample flow (Figure 1)
- Data Preprocessing Described: Signal/image processing, feature extraction, missing data handling, train/validation/test split
- Model Development Complete: Architecture, training details, hyperparameters documented; each model mapped to its objective
- External Validation Described (if applicable): External cohort source, size, and any domain adaptation; if absent, justified and acknowledged as limitation
- Evaluation Criteria Defined per Objective: Discrimination, calibration, threshold selection, and clinical utility metrics specified for each objective
- Sensitivity Analysis Specified: Assumptions to be varied are listed for both primary and secondary objectives
- Statistical Methods Specified: Tests, Confidence Intervals (CI), significance thresholds defined
- Supplement Routing Correct: Extended preprocessing, hyperparameter search, ablation experiments, heterogeneity assessment, model updating, and extended sensitivity analyses are in Supplement, not main text
- TRIPOD+AI Compliant: Cross-reference with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD)+AI checklist items
3. Results Section
Purpose: Present findings factually without interpretation.
Iteration Logic:
FOR each paragraph in Results:
CHECK that:
- No interpretation or "why" statements present
- Order matches Methods subsections exactly
- Primary objective results come first
- Secondary objective results follow in separate subsections
- Sensitivity analysis results have a dedicated subsection
FOR each table/figure reference:
- Verify sequential order
OUTPUT pass/fail for each criterion
Success Checklist:
- Leads with cohort description
- Table 1 = Baseline Characteristics: Demographics, splits, clinical variables appear first
- Results Mirror Methods Order: 1:1 subsection correspondence maintained
- Primary Objective Results First: Model performance on the primary objective presented before secondary
- Secondary Objective Results Separated: Each secondary objective in its own subsection with its data source and model referenced
- Sensitivity Analysis Results Present: Dedicated subsection showing robustness of primary and secondary findings
- No Interpretation Present: Zero sentences explaining "why" or comparing to literature
- Tables/Figures in Order: Referenced sequentially as they appear
- No Em Dashes: Use commas, parentheses, or separate sentences instead
- Statistics Complete: Point estimates with 95% CI or p-values included
4. Tables Section
Purpose: Present data in standardized, journal-compliant format.
Iteration Logic:
FOR each table in manuscript:
EXTRACT metrics and formatting
VALIDATE against numerical rules
CHECK structural requirements
CHECK abbreviation format
OUTPUT pass/fail for each criterion
Numerical Formatting Rules:
| Value Type | Decimals | Example |
|---|---|---|
| AUROC, AUPRC, Sensitivity, Specificity | 2 | 0.85 |
| Values < 0.1 | 2 or 3 | 0.024 |
| Percentages | 1 | 7.6% |
| 95% CI | Match point estimate | 0.85 (0.79 to 0.91) |
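The formatting rules in this table can be applied with a few string helpers. This is a minimal sketch; the function names are hypothetical, and it assumes three decimals for every value below 0.1, which is one of the two options the table allows. `format_count` follows the events/total (X.X%) format required for sample sizes.

```python
def _decimals(value: float) -> int:
    # Assumption: values below 0.1 always get three decimals.
    return 3 if abs(value) < 0.1 else 2


def format_metric(value: float) -> str:
    """Format a single metric such as AUROC or sensitivity."""
    return f"{value:.{_decimals(value)}f}"


def format_ci(point: float, lower: float, upper: float) -> str:
    """Format a point estimate with its 95% CI, matching the point estimate's decimals."""
    d = _decimals(point)
    return f"{point:.{d}f} ({lower:.{d}f} to {upper:.{d}f})"


def format_count(events: int, total: int) -> str:
    """Format a sample size as events/total (X.X%)."""
    return f"{events}/{total} ({100 * events / total:.1f}%)"


print(format_ci(0.85, 0.79, 0.91))  # 0.85 (0.79 to 0.91)
print(format_metric(0.024))         # 0.024
print(format_count(38, 500))        # 38/500 (7.6%)
```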
Success Checklist:
- Correct Metrics Used: AUROC, AUPRC, Sensitivity, Specificity, PPV, NPV only (NO F1, recall, accuracy)
- Classification Threshold Reported: For binary classification models, the decision threshold used to derive Sensitivity/Specificity/PPV/NPV is explicitly stated in the table footnote (e.g., "threshold selected at Youden Index" or "threshold = 0.XX"). FAIL if threshold is missing or ambiguous.
- Threshold Selection Method Stated: The method for selecting the threshold is named: Youden Index, prespecified clinical sensitivity target, or cost-sensitive criterion. FAIL if Sensitivity/Specificity are reported without specifying how the operating point was chosen.
- Cited Before Appearance: Table referenced in text before it appears
- Title Present: Descriptive title above table, no period at end
- Abbreviations Defined: Listed below table in a single paragraph, using colon after abbreviation and semicolons between entries. Format:
Abbreviations: LVEF: Left Ventricular Ejection Fraction; AUROC: Area Under the Receiver Operating Characteristic Curve; PPV: Positive Predictive Value
- Abbreviations Expanded at First Use in Text: Every abbreviation expanded with full name and abbreviation in parentheses at first occurrence in the manuscript body
- 95% CI in Parentheses: Uses "to" for ranges unless the journal explicitly requires another style: (0.79 to 0.91)
- Two Decimals Standard: 0.85, not 0.850 or 0.8
- Three Decimals for Small Values: Values < 0.1 use three decimals when two would lose meaningful precision (for example 0.024, not 0.02)
- Sample Sizes Complete: Format: events/total (X.X%)
- Consistent CI Spacing: 0.85 (0.79 to 0.91) with space before parenthesis
- No Vertical Lines: Horizontal rules only
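For the threshold criteria above, the Youden Index selects the operating point that maximizes sensitivity + specificity - 1. The sketch below is one way to derive it in plain Python; `youden_threshold` and `threshold_footnote` are hypothetical names, and the rule "predict positive when score >= threshold" is an assumption that should match how the manuscript's model was evaluated.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J, sensitivity, specificity) maximizing Youden's J.

    y_true holds 0/1 labels; a case is predicted positive when score >= threshold.
    Ties keep the lowest threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are required")
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        if best is None or j > best[1]:
            best = (t, j, sens, spec)
    return best


def threshold_footnote(threshold: float) -> str:
    """Build a table footnote that states the threshold and how it was chosen."""
    return f"Threshold = {threshold:.2f}, selected at the Youden Index"


t, j, sens, spec = youden_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(threshold_footnote(t))  # Threshold = 0.35, selected at the Youden Index
```

A prespecified clinical sensitivity target or a cost-sensitive criterion would replace the maximization step; the footnote still has to name whichever method was used.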
5. Discussion Section
Purpose: Interpret findings, compare to literature, acknowledge limitations, and conclude. A good discussion section should answer 6 key questions.
The 6 Questions Framework:
- What is different in your findings compared to previous research?
- What is similar in your findings compared to previous research?
- How do different sections of your results correlate?
- What are the implications of your findings for practitioners?
- What are the implications of your findings for researchers?
- What are the limitations or threats to the validity of your findings?
Iteration Logic:
FOR each paragraph in Discussion:
IDENTIFY paragraph type:
- Main finding interpretation (MUST be paragraph 1)
- Literature comparison: differences (Question 1)
- Literature comparison: similarities (Question 2)
- Results correlation/synthesis (Question 3)
- Clinical implications (Question 4)
- Research implications (Question 5)
- Limitations (Question 6, MUST be present)
- Conclusion (MUST be final paragraph)
EVALUATE against criteria below
OUTPUT pass/fail for each criterion
Success Checklist:
- First Line = Main Finding: Opening sentence states the key interpretation
- Leads with Accomplishment: First sentence states what was achieved
- Differences from Prior Work Addressed: Explicitly states what is novel or contradictory vs. existing literature
- Similarities to Prior Work Addressed: Acknowledges concordant findings that reinforce validity
- Results Sections Correlated: Synthesizes how different analyses (e.g., subgroups, sensitivity analyses) relate to each other
- Practitioner Implications Stated: Clear guidance on how findings affect clinical practice or workflow
- Researcher Implications Stated: Identifies future research directions or methodological contributions
- Limitations Clearly Present: Dedicated section acknowledging study weaknesses and threats to validity
- Final Paragraph = Conclusion: Last paragraph summarizes implications
- No New Results Introduced: All data presented in Results section only
6. Abstract Section
Purpose: Summarize entire manuscript in structured format.
Iteration Logic:
VERIFY that abstract is written LAST
FOR each abstract subsection:
CHECK alignment with corresponding full section
VERIFY word count compliance
OUTPUT pass/fail for each criterion
Success Checklist:
- Written Last: Abstract drafted after all other sections complete
- Background Present: 1-2 sentences on clinical problem and gap
- Methods Summarized: Study design, cohort, key methods in brief
- Results Highlighted: Primary outcome with key metric and CI
- Conclusion Stated: Clinical implication in 1-2 sentences
- Word Count Compliant: Within journal's abstract limit
- No Citations in Abstract: References belong in main text only
Reviewer-Pattern Pre-Submission Checklist for Medical-AI Papers
Use this checklist after the section-level review. Keep it generic and apply it to any clinical AI manuscript, regardless of modality or disease area. The goal is to remove predictable reviewer objections before submission.
1. Foundation-model and novelty framing
- If the manuscript uses "foundation model", "general-purpose", or "universal", require operational evidence.
- Acceptable evidence includes zero-shot or few-shot generalization on at least one held-out task without supervised fine-tuning, a self-supervised retrieval probe, or pre-training across more than one institution or domain.
- If this evidence is absent, reframe the claim as a domain-specific multi-task model, contrastive model, or task-specific model.
- Require a glossary or Discussion paragraph defining the term exactly as used by the authors.
2. Reference-standard symmetry between internal and external cohorts
- Verify that internal and external performance are compared against similar label quality.
- If internal validation uses a gold standard and external validation uses report-derived, billing-code, single-reader, or NLP-derived labels, require a like-for-like internal metric using the noisier label.
- Require an expert audit of a random sample from the noisier label source.
- Propagate estimated label error as a sensitivity bound around the external metric.
3. Prognostic and survival sub-studies
- Flag prognostic claims as fragile when patient count is under 500, event count is under 100, median follow-up is under 12 months, or patients without follow-up are excluded rather than censored.
- Require patient count, event count, median follow-up, maximum follow-up, and censoring approach in the main manuscript.
- Decompose composite endpoints. Report hard endpoints such as death and myocardial infarction separately from softer endpoints such as revascularization.
- Require inverse-probability-of-follow-up sensitivity analysis or worst-case imputation when follow-up is incomplete.
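The fragility thresholds in this item translate directly into a flagging function. This is a minimal sketch with a hypothetical function name and argument names; the cutoffs are the ones stated above.

```python
def prognostic_fragility_flags(n_patients, n_events, median_followup_months,
                               excluded_without_followup):
    """Return the reasons a prognostic claim should be treated as fragile."""
    flags = []
    if n_patients < 500:
        flags.append("patient count under 500")
    if n_events < 100:
        flags.append("event count under 100")
    if median_followup_months < 12:
        flags.append("median follow-up under 12 months")
    if excluded_without_followup:
        flags.append("patients without follow-up excluded rather than censored")
    return flags


print(prognostic_fragility_flags(420, 130, 9.5, False))
# ['patient count under 500', 'median follow-up under 12 months']
```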
4. Clinical positioning of automated interpretation
- Avoid framing the AI as real-time operator assistance when a trained operator already interprets the input at the same speed.
- Require at least three non-real-time use cases, such as structured reporting, variance reduction, retrospective research at scale, trainee education, audit or quality control, or downstream multimodal integration.
- Frame clinical use as workflow support unless prospective deployment data proves decision support value.
5. Label noise and report-derived ground truth
- For labels derived from clinical reports, LLM extraction, or NLP pipelines, require a human-adjudicated audit on a random sample.
- Report per-class precision and recall for the extraction process.
- Add a confusion matrix between report-derived and adjudicated labels when both exist.
- Require a retrain-on-clean-labels or evaluate-on-clean-labels sensitivity analysis when feasible.
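The audit described above reduces to a confusion matrix between report-derived and adjudicated labels, plus per-class precision and recall for the extraction step. The sketch below assumes the two label lists are aligned case by case on the audited random sample; `extraction_audit` is a hypothetical name.

```python
from collections import Counter


def extraction_audit(report_labels, adjudicated_labels):
    """Compare report-derived labels with human-adjudicated labels.

    Returns (confusion, per_class), where confusion counts
    (adjudicated, report_derived) pairs and per_class holds the
    precision and recall of the extraction process for each class.
    """
    if len(report_labels) != len(adjudicated_labels):
        raise ValueError("label lists must be aligned case by case")
    confusion = Counter(zip(adjudicated_labels, report_labels))
    classes = sorted(set(report_labels) | set(adjudicated_labels))
    per_class = {}
    for c in classes:
        tp = confusion[(c, c)]
        extracted = sum(n for (_, e), n in confusion.items() if e == c)
        truth = sum(n for (t, _), n in confusion.items() if t == c)
        per_class[c] = {
            "precision": tp / extracted if extracted else None,
            "recall": tp / truth if truth else None,
        }
    return confusion, per_class
```

Here recall refers to the label-extraction pipeline, not to the model's reported performance, which still uses Sensitivity in tables.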
6. Hardware, vendor, and acquisition variation
- Require counts by scanner, vendor, acquisition protocol, or equivalent modality-specific metadata.
- For DICOM imaging, extract Manufacturer and ManufacturerModelName when available.
- Report per-vendor or per-protocol performance in Extended Material when sample size permits.
- Discuss any single-vendor or single-protocol limitation directly.
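Vendor and model counts can be tallied from per-study metadata. The sketch below is an illustration under stated assumptions: `vendor_counts`, `is_single_vendor`, and `read_vendor_fields` are hypothetical helpers, and `read_vendor_fields` assumes the third-party `pydicom` package is installed (it is imported lazily so the counting logic runs without it).

```python
from collections import Counter


def vendor_counts(metadata_rows):
    """Count studies per (Manufacturer, ManufacturerModelName) pair.

    metadata_rows is an iterable of dicts; missing or empty fields
    are counted under "unknown".
    """
    counts = Counter()
    for row in metadata_rows:
        key = (row.get("Manufacturer") or "unknown",
               row.get("ManufacturerModelName") or "unknown")
        counts[key] += 1
    return counts


def is_single_vendor(counts):
    """True when every study comes from one manufacturer (a limitation to discuss)."""
    return len({manufacturer for manufacturer, _ in counts}) <= 1


def read_vendor_fields(path):
    """Read the two header fields from one DICOM file without loading pixel data."""
    import pydicom  # third-party dependency, assumed to be available

    ds = pydicom.dcmread(path, stop_before_pixels=True)
    return {
        "Manufacturer": ds.get("Manufacturer"),
        "ManufacturerModelName": ds.get("ManufacturerModelName"),
    }
```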
7. Architecture and hyperparameter sensitivity
- Identify one hyperparameter most likely to affect the headline result and most likely to be challenged by reviewers.
- Require a short-schedule ablation on that parameter when feasible, for example patch size, frame count, stride, learning-rate schedule, augmentation intensity, or encoder freeze ratio.
- Report the ablation as a robustness check, not as a full model search.
8. Failure-mode analysis
- Require a failure-mode supplementary panel before submission.
- Include a confusion matrix stratified by clinically meaningful category, such as anatomic region, territory, cohort, scanner, or disease severity.
- Include a calibration plot and 6 to 12 representative false-positive and false-negative cases with image panels when permitted and one-line clinical narratives.
- Add an operator-vs-model concordance heatmap when the manuscript compares model output to clinician decisions.
9. Unit of analysis clarity
- Require a Methods paragraph titled "Unit of analysis".
- State the unit for training, validation, evaluation, and reporting, such as pixel, frame, video, lesion, vessel, study, encounter, or patient.
- Tag every Results subsection with the denominator used for the metric.
- Flag any metric where the unit changes between training, evaluation, and reporting without explanation.
10. Subgroup performance and fairness
- Require sex-stratified and age-stratified model performance for every classification paper.
- Add race, ethnicity, socioeconomic status, and disease-severity strata when available and appropriate.
- If women represent less than 35% of the cohort, require a limitations sentence addressing underrepresentation and possible implications for non-obstructive disease, microvascular disease, and atypical presentations when clinically relevant.
- Report interaction testing or explicitly state when the study is underpowered for subgroup interaction tests.
11. Statistical reporting and typography
- Report AUROC and similar headline metrics to two decimals in the abstract and main text.
- Do not use three-decimal precision in the abstract.
- Use "to" for numeric ranges unless journal style requires otherwise.
- Use lowercase italic p for p values.
- Avoid `vs.`, `approx`, `i.e.`, and `e.g.` in polished submission text.
- Use `mean (SD)`, not mean plus/minus SD notation.
- Use Oxford commas, "compared with", and `n =` with spaces.
- Run `python manuscript-review/scripts/typography_qc.py <draft.txt>` before submission.
12. Abbreviation discipline
- Define every abbreviation at first mention within each major section: Abstract, Introduction, Methods, Results, and Discussion.
- Treat the abbreviation list as a backstop, not a substitute for in-section definition.
- Regex-audit all all-caps tokens of length 2 to 5 and require the expansion within 80 characters of first occurrence in each section.
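The regex audit can be sketched as follows. This is a heuristic illustration and not the bundled `scripts/typography_qc.py`: it treats an abbreviation as defined when its letters appear, in order, among the initials of the words in the 80 characters before its first occurrence. It only looks backwards, so expansions that follow the abbreviation are reported as missing, and the initials match can produce false passes. `undefined_abbreviations` is a hypothetical name.

```python
import re

# All-caps tokens of length 2 to 5.
TOKEN = re.compile(r"\b[A-Z]{2,5}\b")


def _is_subsequence(letters, initials):
    """True when letters occur in order within initials."""
    remaining = iter(initials)
    return all(ch in remaining for ch in letters)


def undefined_abbreviations(section_text, window=80):
    """Return abbreviations whose first occurrence lacks a nearby expansion."""
    seen = set()
    missing = []
    for match in TOKEN.finditer(section_text):
        abbr = match.group()
        if abbr in seen:
            continue  # only the first occurrence in the section is audited
        seen.add(abbr)
        before = section_text[max(0, match.start() - window):match.start()]
        initials = [word[0].lower() for word in re.findall(r"[A-Za-z]+", before)]
        if not _is_subsequence(abbr.lower(), initials):
            missing.append(abbr)
    return missing


text = ("Left Ventricular Ejection Fraction (LVEF) was measured. "
        "AUROC was 0.85 and LVEF was stable.")
print(undefined_abbreviations(text))  # ['AUROC']
```

Run the function once per major section (Abstract, Introduction, Methods, Results, Discussion) so that each section is audited independently.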
13. Figure clarity and CONSORT cohort flow
- Require a CONSORT-style flow diagram when more than 20% of available data is excluded.
- Every exclusion branch must include reason and count.
- Confirm all figure labels remain legible at print resolution and 100% scale.
- Flag overlapping labels, missing denominators, and unclear panel references.
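The 20% rule and the reason-and-count requirement can be checked from the exclusion list alone. A minimal sketch follows, with a hypothetical `consort_check` function that takes the number of available records and a list of `(reason, count)` pairs.

```python
def consort_check(available, exclusions):
    """Summarize cohort flow and flag incomplete exclusion branches.

    exclusions is a list of (reason, count) pairs; a branch with an
    empty reason or a missing count is reported as a problem.
    """
    problems = []
    for index, (reason, count) in enumerate(exclusions, start=1):
        if not reason:
            problems.append(f"exclusion branch {index}: missing reason")
        if count is None:
            problems.append(f"exclusion branch {index}: missing count")
    excluded = sum(count or 0 for _, count in exclusions)
    fraction = excluded / available
    return {
        "excluded": excluded,
        "included": available - excluded,
        "fraction_excluded": fraction,
        "flow_diagram_required": fraction > 0.20,
        "problems": problems,
    }


summary = consort_check(1000, [("missing outcome", 150), ("poor image quality", 100)])
print(summary["included"], summary["flow_diagram_required"])  # 750 True
```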
14. Public code, data, and reproducibility infrastructure
- If code is public, verify the repository is reviewer-readable.
- Require a `dataset_creation/README.md` or equivalent file describing input schema, expected metadata, preprocessing steps, LLM prompts used for label extraction, and validation procedures.
- Include the repository URL in the abstract or data/code availability statement when allowed by journal policy.
15. Prospective evaluation and journal targeting
- At submission time, state whether prospective deployment, prospective reader study, or silent-mode evaluation data exist.
- If none exist, treat absence of prospective evidence as a substantive limitation for top-tier general medical and digital-health journals.
- Align target journal choice with the evidence level: retrospective external validation is baseline, not a differentiator.
16. Discordance versus decisions reporting
- When model output is compared with operator or clinician decisions, report discordance rate and characterize discordant cases.
- Use humble framing. Discordance is a signal for review or hypothesis generation, not proof that the model is right.
- Include a table stratifying discordance by decision type, risk group, anatomic region or modality-specific category, and available outcome.
17. Agreement and continuous-measurement validation
- For models that estimate a continuous clinical quantity or substitute for a reference device, require Bland-Altman analysis with limits of agreement in the intended clinical population. Pearson r, coefficient of determination, or intraclass correlation alone quantify association, not agreement, and do not establish clinical accuracy.
- Require the limits of agreement to be interpreted against the minimum clinically important difference, the effect size that matters in the domain. A tool with limits of agreement of plus or minus 10 mL/kg/min cannot support a use case built on a 1.5 mL/kg/min treatment effect.
- Flag systematic directional bias in the intended population. Errors that do not average out, shown as a slope or offset in the Bland-Altman residuals within the target disease group, are disqualifying even when overall performance looks acceptable.
- Flag demographic confounding. When model inputs overlap substantially with known clinical determinants such as age, sex, and body mass index, require isolation of the novel signal. Otherwise the reported correlation may be demographic rather than physiological.
- Treat regulatory clearance as context, not validation. United States Food and Drug Administration (FDA) 510(k) clearance means substantial equivalence to a predicate device, not demonstrated clinical utility. Do not accept 510(k), CE mark, or commercial deployment as evidence of clinical readiness.
- Require methodology that is internally consistent and independently verifiable. Flag pipelines described differently between the abstract and Methods, and proprietary or opaque methods that cannot be reproduced or audited.
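Bland-Altman bias and 95% limits of agreement are computed from the paired differences as mean difference plus or minus 1.96 times the standard deviation of the differences. The sketch below uses only the standard library; `supports_use_case` encodes one possible decision rule (both limits must fall inside the minimum clinically important difference), which is an assumption for illustration, not a rule stated by any guideline cited here.

```python
from statistics import mean, stdev


def bland_altman(model_values, reference_values):
    """Return (bias, lower_limit, upper_limit) for paired measurements."""
    if len(model_values) != len(reference_values):
        raise ValueError("measurements must be paired")
    diffs = [m - r for m, r in zip(model_values, reference_values)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def supports_use_case(lower_limit, upper_limit, mcid):
    """Assumed rule: limits of agreement must lie within +/- MCID."""
    return max(abs(lower_limit), abs(upper_limit)) < mcid


# Limits of agreement of plus or minus 10 cannot support a 1.5-unit treatment effect.
print(supports_use_case(-10.0, 10.0, 1.5))  # False
```

The bias and limits should be computed within the intended clinical population, and the residuals inspected for slope or offset, since a single overall bias value can hide directional error in the target disease group.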
Cross-cutting submission discipline
- The manuscript must be readable end-to-end without requiring the supplement for sample sizes, denominators, or headline metric definitions.
- Abstract headline metrics must match the corresponding table values exactly to the second decimal.
- Build the reusable artifacts before submission: typography QC output, label-noise sensitivity table, failure-mode panel, CONSORT flow, and Table 1 denominators.
Full Manuscript Review Command
RUN full_review():
sections = [Objectives, Background, Methods, Results, Tables, Discussion, Abstract, ReviewerPatterns]
FOR section in sections:
PRINT "=== Reviewing {section} ==="
RUN section_checklist(section)
PRINT summary(passed, failed, total)
PRINT "=== MANUSCRIPT REVIEW COMPLETE ==="
RETURN overall_compliance_score
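The command above can be expressed as a short Python driver. This is a sketch under assumptions: `section_checklist` is supplied by the caller and returns a dict mapping each criterion to pass (`True`) or fail (`False`), and the overall compliance score is taken to be the fraction of criteria passed, which the pseudocode does not define.

```python
SECTIONS = [
    "Objectives", "Background", "Methods", "Results",
    "Tables", "Discussion", "Abstract", "ReviewerPatterns",
]


def full_review(section_checklist):
    """Run every section checklist and return the fraction of criteria passed."""
    passed = failed = 0
    for section in SECTIONS:
        print(f"=== Reviewing {section} ===")
        results = section_checklist(section)  # {criterion: bool}
        section_passed = sum(1 for ok in results.values() if ok)
        section_failed = len(results) - section_passed
        print(f"passed={section_passed} failed={section_failed} total={len(results)}")
        passed += section_passed
        failed += section_failed
    print("=== MANUSCRIPT REVIEW COMPLETE ===")
    total = passed + failed
    return passed / total if total else 0.0
```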
Quick Reference: Forbidden Patterns
| Do Not Use | Use Instead |
|---|---|
| F1-score | AUROC, AUPRC |
| Recall | Sensitivity |
| Accuracy | Sensitivity + Specificity |
| Em dash characters | Commas, parentheses |
| Hyphens or dashes in CI ranges | to, for example 0.79 to 0.91 |
| Interpretation in Results | Move to Discussion |
| Vague gaps | Specific, actionable gaps |
| Abbreviations with "=" | Use colon: LVEF: Left Ventricular Ejection Fraction |
| Abbreviations separated by commas | Use semicolons between entries |
| Unexpanded abbreviations | Expand at first use: Full Name (ABBREVIATION) |
| vs. | versus or compared with |
| i.e. or e.g. | that is, for example, or direct wording |
| mean +/- SD | mean (SD) |
| Pearson r or ICC alone for agreement | Bland-Altman with limits of agreement |
| 510(k) framed as clinical validation | State as substantial equivalence to a predicate device only |
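Several rows of this table are mechanical enough to scan for with regular expressions. The sketch below covers a subset and is not the bundled `scripts/typography_qc.py`; the pattern list and the `scan` function are hypothetical. Rows that need judgment, such as interpretation in Results or vague gaps, are left to manual review.

```python
import re

# (pattern, advice) pairs for a subset of the forbidden patterns.
FORBIDDEN = [
    (re.compile(r"\bF1(?:-score| score)?\b", re.I), "use AUROC or AUPRC"),
    (re.compile(r"\brecall\b", re.I), "use Sensitivity"),
    (re.compile("\u2014"), "use commas or parentheses"),
    (re.compile(r"\bvs\.", re.I), "use versus or compared with"),
    (re.compile(r"\b(?:i\.e\.|e\.g\.)", re.I), "use that is, for example, or direct wording"),
    (re.compile(r"\+/-|\u00b1"), "use mean (SD)"),
    (re.compile(r"\(\d+\.\d+\s*[-\u2013]\s*\d+\.\d+\)"), "use 'to' in CI ranges"),
]


def scan(text):
    """Return (line_number, matched_text, advice) for every forbidden pattern hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, advice in FORBIDDEN:
            for match in pattern.finditer(line):
                hits.append((lineno, match.group(), advice))
    return hits


draft = ("AUROC was 0.85 (0.79-0.91) vs. baseline.\n"
         "Recall improved, i.e. sensitivity rose.")
for lineno, found, advice in scan(draft):
    print(f"line {lineno}: {found!r}: {advice}")
```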