name: manuscript-writing
description: Full guide for writing a medical AI manuscript from scratch, covering all sections (Introduction, Methods, Results, Discussion, Abstract) with structure, style, and reporting standards. Use when drafting a new AI manuscript or when you need section-by-section writing instructions beyond TRIPOD-AI results reporting.
AI Manuscript Writing Guide
A section-by-section guide for writing medical AI manuscripts, from blank page to submission-ready draft.
General Style Rules
- No em dashes: use commas, parentheses, or separate sentences instead
- Write Results in past tense, Discussion in present tense
- Report all metrics with 95% confidence intervals
- No interpretation in Results — keep interpretation in Discussion only
- Introduce at most one acronym per paragraph; spell each one out on first use
- Never use F1-score, recall, or accuracy as primary performance metrics
1. Title & Authors
Title formula: [Model/Approach] for [Task] using [Data Source]: [Design]
Example: "Deep Learning Detection of Left Ventricular Dysfunction from 12-Lead ECG: A Multicenter Validation Study"
- Keep to 15 words or fewer
- Avoid starting with "A" or "The"
- Include study design (Prospective, Multicenter, Randomized)
Authorship order:
- First author: primary contributor to experiments + writing
- Middle authors: data, methods, clinical input
- Penultimate: senior statistician or domain expert
- Last: senior/corresponding author (lab PI)
2. Abstract
Write this last, after all sections are complete.
Structured format (250–300 words):
| Subsection | Content | Sentences |
|---|---|---|
| Background | Clinical problem + gap | 2 |
| Objective | What this study does | 1 |
| Methods | Design, cohort, model, metrics | 3–4 |
| Results | Primary outcome with metric + CI | 2–3 |
| Conclusion | Key clinical implication | 1–2 |
Rules:
- No citations in abstract
- Include primary metric with 95% CI (e.g., AUROC 0.91, 95% CI 0.88–0.94)
- For classification models: include sensitivity and specificity at reported threshold
- State the threshold used (e.g., Youden Index, prespecified operating point)
3. Introduction / Background
Structure (4–5 paragraphs):
Paragraph 1 — Clinical Problem
- Open with the disease burden (incidence, mortality, cost)
- Quantify the diagnostic challenge (missed diagnoses, diagnostic delay, resource burden)
- No AI mention yet
Paragraph 2 — Current Standard of Care & Limitations
- Describe existing diagnostic approach and its limitations
- Cite reproducibility, access, cost, or inter-observer variability data
Paragraph 3 — AI in This Space (Literature)
- Trace AI approaches from traditional ML → deep learning
- Cite key prior works with their best-reported metrics
- Identify specific gaps: external validation, prospective design, subgroup fairness, real-world deployment
Paragraph 4 — What This Study Does
- State your contribution directly: "In this study, we developed and externally validated..."
- Each gap from paragraph 3 maps to one contribution here
Paragraph 5 — Roadmap (optional but recommended)
- "This paper is organized as follows..."
- Only include if journal allows or paper is complex
Forbidden in Introduction:
- Presenting results
- Vague gaps ("more work is needed")
- Overclaiming novelty without citation support
4. Methods
Required subsections, in this order:
4.1 Ethics Statement (FIRST, always)
This study was approved by [Institution] Institutional Review Board (IRB #XXXX).
Informed consent was [obtained / waived] given the [prospective/retrospective] design.
If large language models were used in manuscript preparation, disclose here.
4.2 Study Design & Data Sources
- Study type: retrospective / prospective / randomized
- Institutions involved, years of data collection
- Internal vs. external validation datasets
Flowchart: Include a CONSORT/PRISMA-style flowchart as Figure 1 showing:
- Total records screened
- Exclusions with reasons
- Final training / validation / test set sizes
- Event counts in each split
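Before drawing Figure 1, it can help to confirm that the counts reconcile. The sketch below is a minimal consistency check, not a required tool: the function name `check_flow_counts`, the exclusion reasons, and every number are hypothetical placeholders.

```python
def check_flow_counts(screened, exclusions, splits):
    """Return (included, problems) for a patient/data flow diagram.

    exclusions: {reason: count}
    splits: {split name: {"n": records, "events": outcome events}}
    """
    included = screened - sum(exclusions.values())
    problems = []

    # Split sizes must add up to the records left after exclusions.
    split_total = sum(s["n"] for s in splits.values())
    if split_total != included:
        problems.append(f"splits sum to {split_total}, expected {included}")

    # Event counts cannot be negative or exceed the split size.
    for name, s in splits.items():
        if not 0 <= s["events"] <= s["n"]:
            problems.append(f"{name}: events {s['events']} outside 0..{s['n']}")
    return included, problems


# Placeholder numbers for illustration only.
exclusions = {
    "age < 18 years": 450,
    "missing outcome label": 1250,
    "uninterpretable recording": 300,
}
splits = {
    "training": {"n": 7000, "events": 700},
    "validation": {"n": 1500, "events": 150},
    "test": {"n": 1500, "events": 150},
}

included, problems = check_flow_counts(12000, exclusions, splits)
print(included, problems)
```

An empty `problems` list means the flowchart boxes are arithmetically consistent; it says nothing about whether the exclusion criteria themselves are appropriate.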
4.3 Inclusion / Exclusion Criteria
- List all criteria as bullet points
- Include age ranges, diagnostic codes, date ranges
- State who made eligibility decisions (automated vs. adjudicated)
4.4 Outcome Definition
- Primary outcome: exact definition, ICD codes, or measurement method
- Secondary outcomes if applicable
- Adjudication process
4.5 Model Development
For deep learning models:
Architecture: [name, version, input dimensions]
Pre-training: [ImageNet / ECG foundation model / from scratch]
Training: [optimizer, learning rate, batch size, epochs, early stopping criterion]
Augmentation: [list augmentations used]
Loss function: [cross-entropy, focal loss, etc.]
Hardware: [GPU model, memory, training time]
For traditional ML:
Features: [list or reference feature table]
Preprocessing: [normalization, imputation, scaling]
Model: [name, hyperparameter search strategy, CV folds]
Feature selection: [method used]
Threshold selection — REQUIRED for binary classification:
State explicitly how the operating threshold is chosen:
- Youden Index (maximizes sensitivity + specificity): recommended default
- Prespecified clinical threshold (e.g., sensitivity ≥ 90%): when clinical constraints drive the operating point
- Cost-sensitive threshold: when FP/FN costs differ clinically
Example: "The operating threshold was selected using the Youden Index (J = sensitivity + specificity − 1) on the internal validation set and applied without modification to all external validation cohorts."
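The Youden Index selection described above can be sketched in plain Python. This is an illustrative implementation under simple assumptions (binary labels coded 0/1, a case is positive when its score is at or above the threshold, ties between thresholds resolved in favor of the lowest one); the function name and the toy scores are hypothetical. In practice a statistics library would usually do this work.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is called positive when its score is >= threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")

    best_threshold, best_j = None, -1.0
    # Every distinct score is a candidate cut point.
    for threshold in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy internal validation set (illustrative values only).
val_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
val_scores = [0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.70, 0.80, 0.90]

threshold, j = youden_threshold(val_labels, val_scores)

# The same fixed threshold is then applied, unchanged, to external scores.
external_scores = [0.15, 0.42, 0.88]
external_calls = [int(s >= threshold) for s in external_scores]
print(threshold, round(j, 3), external_calls)
```

The key design point is that `threshold` is computed once on the internal validation set and reused as a constant; recomputing it on each external cohort would contradict the example statement above.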
4.6 Statistical Analysis
- Primary metric: AUROC with 95% CI (DeLong method)
- Secondary metrics: AUPRC, Sensitivity, Specificity, PPV, NPV at reported threshold
- Confidence intervals: bootstrap (1000 iterations) or DeLong
- Subgroup analyses: prespecified groups (age, sex, race, disease severity)
- Calibration: Brier score, calibration curves
- Comparison to baseline: DeLong test for paired AUROCs, likelihood ratio test for nested models, or net reclassification improvement (NRI)
- Software: Python [version] / R [version] / SAS [version]
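As an illustration of the bootstrap option for confidence intervals, the sketch below computes a percentile bootstrap 95% CI for AUROC with 1000 resamples using only the standard library. It is a minimal sketch, not the DeLong method: the function names, the seed, and the toy data are assumptions, and the pairwise AUROC calculation is quadratic in cohort size, so it suits small examples rather than production cohorts.

```python
import random


def auroc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(y_true)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        labels = [y_true[i] for i in idx]
        if len(set(labels)) < 2:
            continue  # AUROC is undefined without both classes; draw again
        estimates.append(auroc(labels, [y_score[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return auroc(y_true, y_score), lower, upper


# Toy data for illustration only.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.70, 0.80, 0.90]

point, lower, upper = bootstrap_auroc_ci(labels, scores)
print(f"AUROC {point:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Whichever method is used, the Methods text should name it, the number of resamples, and the software version so that the interval can be reproduced.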
5. Results
Paragraph order must mirror Methods subsections exactly.
5.1 Cohort Description
- Total included, excluded (with reasons)
- Demographic breakdown (Table 1)
- Outcome prevalence in each split
5.2 Model Performance (Table 2)
Required metrics for classification models:
- AUROC (95% CI)
- AUPRC (95% CI)
- Sensitivity, Specificity, PPV, NPV — all at the reported threshold
- State the threshold explicitly in the table footnote
Table 2 footnote format: "Sensitivity, Specificity, PPV, and NPV calculated at the threshold maximizing the Youden Index (threshold = 0.XX on internal validation set)."
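The threshold-dependent rows of Table 2 all derive from one confusion matrix at the stated threshold. The sketch below shows that calculation under the assumption that a case is positive when its score is at or above the threshold; the function name and the toy values are hypothetical, and confidence intervals are omitted for brevity.

```python
def threshold_metrics(y_true, y_score, threshold):
    """Sensitivity, specificity, PPV, and NPV at a fixed threshold."""
    y_pred = [int(s >= threshold) for s in y_score]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)

    def ratio(num, den):
        # NaN rather than an exception when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "threshold": threshold,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy data for illustration only.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.70, 0.80, 0.90]

metrics = threshold_metrics(labels, scores, threshold=0.40)
print(metrics)
```

Returning the threshold alongside the metrics makes it harder to report a sensitivity or PPV without the operating point that produced it.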
For regression models:
- Mean Absolute Error (MAE)
- Root Mean Square Error (RMSE)
- R² / concordance correlation coefficient
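For regression models, the three error metrics can be computed as in the sketch below. The function name and the toy values are hypothetical; R² here is the coefficient of determination (1 minus residual sum of squares over total sum of squares), and the concordance correlation coefficient is not shown.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for paired observed and predicted values."""
    n = len(y_true)
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)

    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Toy values for illustration only.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 9.5]

reg = regression_metrics(observed, predicted)
print({k: round(v, 3) for k, v in reg.items()})
```

MAE and RMSE carry the units of the outcome, so the table header should state those units.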
5.3 External Validation
- Repeat Table 2 metrics for each external cohort
- Note if threshold was re-derived or fixed from development set
5.4 Subgroup Analysis (Table 3)
- Performance stratified by key subgroups
- Forest plot recommended for visual summary
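Stratified performance can be sketched as a group-by followed by the same AUROC calculation in each stratum. The record layout (`label`, `score`, `sex` keys), the function names, and the toy values below are assumptions for illustration; a real Table 3 would also report the sample size and a 95% CI per subgroup.

```python
from collections import defaultdict


def auroc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subgroup_auroc(records, key):
    """Return {subgroup value: AUROC, or None if a class is missing}."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    results = {}
    for name, rows in groups.items():
        labels = [r["label"] for r in rows]
        scores = [r["score"] for r in rows]
        # AUROC is undefined when a subgroup has only one outcome class.
        results[name] = auroc(labels, scores) if len(set(labels)) == 2 else None
    return results


# Toy records for illustration only.
records = [
    {"sex": "F", "label": 0, "score": 0.2},
    {"sex": "F", "label": 0, "score": 0.6},
    {"sex": "F", "label": 1, "score": 0.5},
    {"sex": "F", "label": 1, "score": 0.9},
    {"sex": "M", "label": 0, "score": 0.1},
    {"sex": "M", "label": 0, "score": 0.3},
    {"sex": "M", "label": 1, "score": 0.7},
    {"sex": "M", "label": 1, "score": 0.8},
]

by_sex = subgroup_auroc(records, key="sex")
print(by_sex)
```

Returning `None` for a subgroup with a single outcome class keeps the gap visible in the table instead of silently dropping the row.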
5.5 Sensitivity / Supplementary Analyses
- If applicable: performance under data augmentation, different thresholds, model ablation
Results writing rules:
- Past tense throughout
- No "why" statements — no interpretation
- Every table/figure cited in order
- Point estimate + 95% CI for every metric
6. Discussion
Structure: Answer 6 questions in order (Paragraphs 1–6), then close with Limitations and Conclusion
Paragraph 1 — Main Finding (MANDATORY opening)
State the primary result and its clinical meaning in 2–3 sentences.
"We developed and externally validated a deep learning model for [task] achieving AUROC 0.91 (95% CI 0.88–0.94), demonstrating..."
Paragraph 2 — What Is Different from Prior Work
- Compare your key metrics to the best-cited prior results
- Explain why your approach differs: more data, prospective design, external validation, fairness
Paragraph 3 — What Is Similar to Prior Work
- Acknowledge concordant findings to reinforce validity
- Build credibility rather than dismissing prior literature
Paragraph 4 — How Results Sections Correlate
- Synthesize subgroup, calibration, and primary findings
- Draw conclusions about internal consistency
Paragraph 5 — Clinical Implications for Practitioners
- Workflow integration: where does this model fit?
- Which patients benefit most?
- What human oversight is still needed?
Paragraph 6 — Research Implications
- What should the field study next?
- What methodological gaps remain?
Paragraph 7 — Limitations (MANDATORY)
Standard limitations to address:
- Retrospective design (if applicable)
- Single-center development (if applicable)
- Missing subgroup data
- Lack of prospective outcome data
- Threshold generalizability
Paragraph 8 — Conclusion (MANDATORY final paragraph)
- 2–4 sentences
- Restate main finding + clinical implication
- No new data or citations
7. Tables Reference
Table 1. Baseline Characteristics
| Variable | Development (n=X) | External Validation (n=X) |
|---|---|---|
| Age, years, mean ± SD | | |
| Female sex, n (%) | | |
| [Primary outcome], n (%) | | |
Abbreviations: SD = standard deviation.
Table 2. Model Performance
| Metric | Internal Validation | External Validation |
|---|---|---|
| AUROC (95% CI) | 0.91 (0.88–0.94) | 0.89 (0.85–0.93) |
| AUPRC (95% CI) | | |
| Sensitivity | | |
| Specificity | | |
| PPV | | |
| NPV | | |
Abbreviations: AUROC = Area under the receiver operating characteristic curve; AUPRC = Area under the precision-recall curve; PPV = Positive predictive value; NPV = Negative predictive value. Threshold: Sensitivity, Specificity, PPV, and NPV calculated at the Youden Index threshold (threshold = 0.XX on internal validation set).
Table 3. Subgroup Analysis
Report AUROC by: Age group, Sex, Disease severity (at minimum).
8. Figures Reference
| Figure | Content | Required? |
|---|---|---|
| Figure 1 | Patient/data flowchart | Always |
| Figure 2 | ROC curve (with CI band) | Classification models |
| Figure 3 | Calibration curve | Always |
| Figure 4 | Subgroup forest plot | Recommended |
| Figure 5 | Decision curve analysis | Recommended |
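The calibration curve listed as Figure 3 is usually built from binned predictions, with the Brier score reported alongside it. The sketch below shows one simple way to obtain both, assuming equal-width probability bins; the function names, the bin count, and the toy values are hypothetical, and plotting is left out.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_bins(y_true, y_prob, n_bins=10):
    """Return (mean predicted, observed event rate, n) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))

    rows = []
    for members in bins:
        if members:
            n = len(members)
            mean_pred = sum(p for _, p in members) / n
            observed = sum(y for y, _ in members) / n
            rows.append((mean_pred, observed, n))
    return rows


# Toy data for illustration only.
labels = [0, 0, 1, 0, 1, 1]
probs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]

brier = brier_score(labels, probs)
curve = calibration_bins(labels, probs, n_bins=2)
print(round(brier, 4), curve)
```

Plotting mean predicted probability against observed event rate for each row of `curve`, with the diagonal as reference, gives the calibration figure; bins with very small `n` should be interpreted cautiously.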
9. Submission Checklist
- Title ≤ 15 words, includes study design
- Abstract written last, includes threshold statement for classification
- Ethics statement is the first Methods subsection
- LLM use disclosed if applicable
- Figure 1 is a flowchart with exact exclusion numbers
- Table 2 footnote explicitly states threshold and how it was selected
- All metrics reported with 95% CI
- No F1, recall, or accuracy as primary metrics
- Results contain no interpretation
- Discussion addresses all 6 questions
- Limitations section present
- Final paragraph is Conclusion only
- Code/model available on GitHub or HuggingFace
- TRIPOD+AI checklist completed (attach as supplement)
- No em dashes anywhere in manuscript
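A few checklist items are mechanical enough to verify with a script. The sketch below covers only the title length, the title's first word, and the em dash rule; the function name `lint_manuscript` and its parameters are hypothetical, and the remaining items still need a human read.

```python
EM_DASH = "\u2014"


def lint_manuscript(title, body, max_title_words=15):
    """Return a list of checklist violations found in the title and body text."""
    problems = []
    words = title.split()

    if len(words) > max_title_words:
        problems.append(f"title has {len(words)} words (limit {max_title_words})")
    if words and words[0].lower() in {"a", "the"}:
        problems.append(f"title starts with '{words[0]}'")
    if EM_DASH in title or EM_DASH in body:
        problems.append("em dash found")
    return problems


# The example title from Section 1 passes all three checks.
title = (
    "Deep Learning Detection of Left Ventricular Dysfunction "
    "from 12-Lead ECG: A Multicenter Validation Study"
)
clean = lint_manuscript(title, "Results were reported in past tense.")

# A deliberately bad draft triggers the first-word and em dash checks.
flagged = lint_manuscript("The Model", "Performance was high \u2014 as expected.")
print(clean, flagged)
```

Whitespace splitting counts "12-Lead" and "ECG:" as one word each, which matches how most journals count title words but is an assumption worth checking against the target journal's rules.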