scoring-calibration - SKILL.md Agent Skill

name: scoring-calibration
description: >
  Skill for venue-calibrated scoring, score weighting formulas, decision rules, anti-bias mechanisms, and score interpretation across different venue tiers.

Scoring Calibration

Use this skill when computing review scores, applying decision rules, or calibrating review standards to a specific venue.

Score Dimensions

Every review scores these 6 dimensions plus confidence:

Dimension	Range	Description
Overall	1-10	Holistic assessment
Soundness	1-10	Technical correctness
Novelty	1-10	Originality of contribution
Clarity	1-10	Writing and presentation quality
Significance	1-10	Impact and importance
Reproducibility	1-10	Can results be reproduced?
Confidence	1-5	Reviewer's self-assessed expertise

Venue-Calibrated Interpretation

Top-Tier (NeurIPS, Nature, Science, ICML)

Score	Meaning
8-10	Strong accept — top 10% of submissions
6-7	Weak accept — above threshold, some issues
5	Borderline — could go either way
3-4	Weak reject — below threshold, significant issues
1-2	Strong reject — fundamental flaws

Acceptance threshold: Mean ≥ 7, no critical issues

Mid-Tier (AAAI, ECML, PLOS ONE)

Score	Meaning
7-10	Strong accept
5-6	Accept with revisions
4	Borderline
2-3	Reject
1	Strong reject

Acceptance threshold: Mean ≥ 6, critical issues addressed

Workshop / Preprint

Score	Meaning
6-10	Accept
4-5	Accept with minor revisions
3	Borderline
1-2	Reject

Acceptance threshold: Mean ≥ 5, no fatal flaws
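The three interpretation tables above can be treated as lookup data. The following is a minimal sketch, not a definitive implementation: the tier keys (`top-tier`, `mid-tier`, `workshop`), the `BANDS` table, and the `interpret` function are hypothetical names chosen for illustration, and scores are assumed to be integers.

```python
# Each tier lists (lower bound, label) pairs from highest band to lowest,
# transcribed from the interpretation tables. All names are illustrative.
BANDS = {
    "top-tier": [(8, "Strong accept"), (6, "Weak accept"), (5, "Borderline"),
                 (3, "Weak reject"), (1, "Strong reject")],
    "mid-tier": [(7, "Strong accept"), (5, "Accept with revisions"),
                 (4, "Borderline"), (2, "Reject"), (1, "Strong reject")],
    "workshop": [(6, "Accept"), (4, "Accept with minor revisions"),
                 (3, "Borderline"), (1, "Reject")],
}


def interpret(score: int, tier: str) -> str:
    """Return the band label for an integer 1-10 score at the given tier."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    for lower_bound, label in BANDS[tier]:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: every tier has a band starting at 1")
```

The same raw score reads differently by venue: under this sketch a 5 is "Borderline" at a top-tier venue, "Accept with revisions" at a mid-tier venue, and "Accept with minor revisions" at a workshop.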

Score Weighting Formula

The weighted final score combines the per-dimension means of five dimensions (Overall is not part of the formula) using default weights that sum to 1.0:

final_score = (
    0.30 × mean(soundness) +
    0.20 × mean(novelty) +
    0.20 × mean(significance) +
    0.15 × mean(clarity) +
    0.15 × mean(reproducibility)
)
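As a rough illustration, the formula can be written in Python as follows. This sketch assumes each review is a dict mapping dimension name to a 1-10 score; the names `DEFAULT_WEIGHTS` and `weighted_final_score` are hypothetical and not part of the skill.

```python
from statistics import fmean

# Default weights copied from the formula above.
DEFAULT_WEIGHTS = {
    "soundness": 0.30,
    "novelty": 0.20,
    "significance": 0.20,
    "clarity": 0.15,
    "reproducibility": 0.15,
}


def weighted_final_score(reviews, weights=DEFAULT_WEIGHTS):
    """Average each dimension across reviewers, then apply the weights."""
    return sum(
        weight * fmean(review[dimension] for review in reviews)
        for dimension, weight in weights.items()
    )


reviews = [
    {"soundness": 5, "novelty": 6, "significance": 5, "clarity": 6, "reproducibility": 5},
    {"soundness": 7, "novelty": 8, "significance": 7, "clarity": 8, "reproducibility": 7},
]
print(round(weighted_final_score(reviews), 2))  # → 6.35
```

The per-dimension means here are 6, 7, 6, 7, and 6, so the result is 1.8 + 1.4 + 1.2 + 1.05 + 0.9 = 6.35.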

These weights can be overridden in .review-config.yaml:

review:
  score_weights:
    soundness: 0.30
    novelty: 0.20
    significance: 0.20
    clarity: 0.15
    reproducibility: 0.15
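One way to apply such an override is sketched below. It assumes the `score_weights` mapping has already been parsed from `.review-config.yaml` into a plain dict by whatever YAML loader the project uses. The function name `resolve_weights` is hypothetical, and the check that weights sum to 1.0 is an assumption added here to keep the final score on the 1-10 scale; the source does not state it.

```python
import math

# Defaults as listed in the config example above.
DEFAULT_WEIGHTS = {
    "soundness": 0.30,
    "novelty": 0.20,
    "significance": 0.20,
    "clarity": 0.15,
    "reproducibility": 0.15,
}


def resolve_weights(overrides=None):
    """Merge config overrides onto the defaults and validate the result."""
    weights = {**DEFAULT_WEIGHTS, **(overrides or {})}
    unknown = set(weights) - set(DEFAULT_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown score dimensions: {sorted(unknown)}")
    # Assumption (not stated in the source): weights must sum to 1.0.
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("score weights must sum to 1.0")
    return weights


# Example: shift 0.10 of weight from novelty to soundness.
custom = resolve_weights({"soundness": 0.40, "novelty": 0.10})
```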

Decision Rules

Condition	Decision
All reviewers ≥ 7, no critical weaknesses	Accept
All reviewers ≥ 6, only minor weaknesses	Accept with Minor Revision
Mean ≥ 5, no more than 1 reviewer below 5	Major Revision
Mean < 5 or 2+ reviewers below 4	Reject
Strong disagreement (spread ≥ 4 points)	Discussion round before decision

Venue-Adjusted Thresholds

The decision rules above are written with top-tier values. For other venues, substitute the thresholds from the matching column:

Rule Parameter	Top-Tier	Mid-Tier	Workshop
Accept threshold	≥ 7	≥ 6	≥ 5
Accept-minor threshold	≥ 6	≥ 5	≥ 4
Major revision threshold	≥ 5	≥ 4	≥ 3
Reject threshold	< 5	< 4	< 3
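The decision rules and venue thresholds can be combined into one function. The sketch below rests on several assumptions the source leaves open, and these are interpretations rather than the skill's definitive logic:

- "Spread" is taken as max minus min and is checked first.
- The "2+ reviewers below 4" reject condition is scaled to one point under each tier's major-revision threshold.
- Cases that no rule covers fall back to a discussion round.
- The `worst_weakness` labels and all other names are hypothetical.

```python
from statistics import fmean

# Per-tier values from the venue-adjusted threshold table.
THRESHOLDS = {
    "top-tier": {"accept": 7, "accept_minor": 6, "major": 5},
    "mid-tier": {"accept": 6, "accept_minor": 5, "major": 4},
    "workshop": {"accept": 5, "accept_minor": 4, "major": 3},
}


def decide(scores, tier="top-tier", worst_weakness="minor"):
    """Map per-reviewer scores to a decision.

    worst_weakness is one of 'none', 'minor', 'major', 'critical'.
    """
    t = THRESHOLDS[tier]
    # Assumption: strong disagreement takes precedence over every other rule.
    if max(scores) - min(scores) >= 4:
        return "Discussion round before decision"
    if min(scores) >= t["accept"] and worst_weakness != "critical":
        return "Accept"
    if min(scores) >= t["accept_minor"] and worst_weakness in ("none", "minor"):
        return "Accept with Minor Revision"
    # Assumption: "below 4" at top tier generalizes to (major threshold - 1).
    far_below = sum(s < t["major"] - 1 for s in scores)
    if fmean(scores) < t["major"] or far_below >= 2:
        return "Reject"
    if sum(s < t["major"] for s in scores) <= 1:
        return "Major Revision"
    # Assumption: combinations the rules do not cover go to discussion.
    return "Discussion round before decision"
```

For example, scores of 6, 7, and 6 with only minor weaknesses yield "Accept with Minor Revision" at a top-tier venue but "Accept" at a mid-tier one.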

Anti-Bias Mechanisms

Anchoring Prevention

Reviewers assign scores BEFORE writing detailed comments
Score-first protocol prevents narrative from biasing quantitative assessment

Confirmation Bias Mitigation

Reviewer γ (Generalist) has no domain priors, so its assessment is not shaped by field-specific assumptions
If all reviews are uniformly positive (all ≥ 7), flag for confirmation bias check
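The uniform-positivity flag is a one-line check. In this minimal sketch the function name and the `positive_floor` parameter are hypothetical, and an empty score list is assumed not to trigger the flag.

```python
def needs_confirmation_bias_check(overall_scores, positive_floor=7):
    """Return True when every review is at or above the positive floor."""
    return len(overall_scores) > 0 and all(
        score >= positive_floor for score in overall_scores
    )
```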

Authority Bias Prevention

Author identity optionally stripped in double-blind mode
Reviewer profiles focus on expertise, not prestige

Positivity Bias Prevention

EIC prompt emphasizes that rejection is a valid and useful outcome
Decision rules explicitly model rejection conditions

Novelty Bias Prevention

Score weights rank soundness (0.30) above novelty (0.20)
All else being equal, a technically correct but incremental paper scores higher than a novel but unsound one
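A small worked example makes the effect of the weights concrete. The two score profiles below are made-up illustrations, not data from real reviews. They swap a 9 and a 4 between soundness and novelty and hold the other dimensions at 6.

```python
WEIGHTS = {
    "soundness": 0.30,
    "novelty": 0.20,
    "significance": 0.20,
    "clarity": 0.15,
    "reproducibility": 0.15,
}


def weighted(scores):
    """Weighted sum of one set of dimension scores."""
    return sum(WEIGHTS[dim] * value for dim, value in scores.items())


# Hypothetical profiles: identical except for swapped soundness and novelty.
sound_incremental = {"soundness": 9, "novelty": 4, "significance": 6,
                     "clarity": 6, "reproducibility": 6}
novel_unsound = {"soundness": 4, "novelty": 9, "significance": 6,
                 "clarity": 6, "reproducibility": 6}

gap = weighted(sound_incremental) - weighted(novel_unsound)
```

The sound paper scores about 6.5 and the unsound one about 6.0. The gap equals (0.30 - 0.20) × (9 - 4) = 0.5, which is the weight difference times the swapped score difference.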

Score Trajectory Tracking

Track scores across revision rounds to detect convergence or stalling:

score_trajectory:
  round_1:
    alpha: 5
    beta: 7
    gamma: 6
    mean: 6.0
    weighted: 5.95
  round_2:
    alpha: 7
    beta: 8
    gamma: 7
    mean: 7.3
    weighted: 7.25
    delta: +1.3
  convergence_status: "improving"
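The `mean` and `delta` fields in the record above can be derived from the per-reviewer scores, as in the sketch below. It assumes `delta` is the change in the unrounded mean from the previous round, rounded to one decimal. The function name is hypothetical, and the `weighted` field is omitted because it needs per-dimension scores that the record does not show.

```python
from statistics import fmean


def summarize_trajectory(rounds):
    """rounds: chronological list of dicts mapping reviewer name to score."""
    summary = []
    previous_mean = None
    for scores in rounds:
        current_mean = fmean(scores.values())
        entry = {"mean": round(current_mean, 1)}
        if previous_mean is not None:
            # Assumption: delta is computed on unrounded means.
            entry["delta"] = round(current_mean - previous_mean, 1)
        summary.append(entry)
        previous_mean = current_mean
    return summary


rounds = [
    {"alpha": 5, "beta": 7, "gamma": 6},
    {"alpha": 7, "beta": 8, "gamma": 7},
]
print(summarize_trajectory(rounds))
# → [{'mean': 6.0}, {'mean': 7.3, 'delta': 1.3}]
```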

Diminishing Returns

If delta ≤ 0.3 for 2 consecutive rounds:
  → Flag DIMINISHING_RETURNS
  → Consider declaring EXHAUSTED
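The diminishing-returns rule can be sketched as a check over the sequence of round means. The source does not say whether `delta` refers to the mean or the weighted score, so this sketch works on whichever series is passed in. The function and parameter names are hypothetical. Read literally, the rule also fires when scores regress, because a negative delta is still ≤ 0.3.

```python
def diminishing_returns(round_scores, threshold=0.3, window=2):
    """Return True if the last `window` round-to-round deltas are all <= threshold."""
    # Round each delta to two decimals to avoid floating-point noise at the boundary.
    deltas = [round(later - earlier, 2)
              for earlier, later in zip(round_scores, round_scores[1:])]
    if len(deltas) < window:
        return False  # not enough rounds to judge yet
    return all(delta <= threshold for delta in deltas[-window:])
```

With round means of 6.0, 7.3, 7.4, and 7.5, the last two deltas are both 0.1, so the function returns True and the caller would flag DIMINISHING_RETURNS.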