ijcai-experiments - SKILL.md Agent Skill

name: ijcai-experiments description: Use when designing or auditing IJCAI or IJCAI-ECAI experiments, baselines, ablations, statistical evidence, hyperparameter reporting, compute descriptions, dataset handling, ethics risks, and reproducibility evidence for AI papers.

IJCAI Experiments

Use this before submission when the experimental story is not yet locked. IJCAI reviewers can score novelty, correctness, clarity, significance, impact, presentation, ethics, and reproducibility.

Experiment audit

Map each major claim to a table, figure, theorem, ablation, proof, or qualitative analysis.
Include strong, current, and properly tuned baselines; explain any missing baseline before reviewers ask.
Report dataset splits, preprocessing, metrics, search ranges, final hyperparameters, selection criteria, random seeds or repeats, and compute infrastructure.
Add ablations for the core mechanism, not just peripheral architecture choices.
Use uncertainty estimates, paired tests, confidence intervals, or repeated runs when small differences could change the conclusion.
For sensitive data or human-facing systems, document privacy, consent, copyright, safety, fairness, misuse, and deployment limits.
Keep enough details in the main paper for credible reproduction even if reviewers ignore the supplementary material.

What IJCAI reviewers score the evidence on

IJCAI draws reviewers from symbolic AI, search, planning, constraint satisfaction, KR, multi-agent systems, game theory, ML, NLP, and vision, so the experimental section must read across subcommunities. Calibrate evidence to the claim type rather than copying an ML-only template.

Contribution type	Decisive evidence	Common reject trigger
Search / planning	Coverage, anytime quality, expansion counts, time/memory cutoffs, per-domain breakdown	Single suite, no domain table, missing strong planner baseline
Constraint / SAT	Cactus plots, instances within timeout, solver versions	No virtual-best comparison
Multi-agent / game theory	Welfare/equilibrium metrics, agent-count scaling, seeds	Claims hold at one population size only
Learning method	Strong current baselines, core-mechanism ablations, variance	Cherry-picked seeds, weak baselines
Theory-plus-experiment	Experiments confirming the proven bound	Empirics outside the theorem's regime

Worked vignette: a heuristic-search paper

A submission proposes a learned heuristic for cost-optimal classical planning and reports a single aggregate "12% fewer expansions" number. Apply the decision rules:

Evidence and baseline: replace the single mean with a per-domain coverage and expansion table, and add a strong admissible-heuristic baseline that shows optimality is preserved.
Core ablation: isolate the learned component from the search framework so the gain is not credited to engineering, and report multiple training seeds since the heuristic is stochastic.
Compute: state the planner, time and memory limits, and machine, since coverage is meaningless without a stated timeout.

Reviewer pushback and the venue-specific fix

"Only one benchmark family." Add a second problem class or justify the scope; an IJCAI cross-section reviewer distrusts single-suite claims.
"Baseline is outdated." Cite and run a current top method; the broad PC notices stale comparisons.
"Gains are within noise." Provide repeats, paired tests, or confidence intervals before the response, since no new results may be added later.

Output format

[Experiment readiness] strong / adequate / weak
[Claim -> evidence map] <claim: section/table/figure>
[Missing baseline or ablation] <item>
[Reproducibility gaps] <hyperparameters/seeds/compute/data/code>
[Decision-critical next run] <one experiment>