---
name: cheesebench-rodent-neuroscience
description: "CheeseBench benchmark for evaluating LLMs on classical rodent behavioral neuroscience paradigms. Includes 9 tasks covering water maze, T-maze, Morris water maze, and other established behavioral tests. Cross-paradigm evaluation for neuroscience AI systems."
version: 1.0.0
metadata:
  hermes:
    source_paper: "CheeseBench: Evaluating LLMs on Rodent Behavioral Neuroscience (arXiv:2604.13661)"
    tags: [neuroscience, benchmark, llm-evaluation, rodent, behavioral, paradigm]
---
CheeseBench: Rodent Neuroscience LLM Evaluation
Overview
CheeseBench evaluates LLMs on classical rodent behavioral neuroscience paradigms. It contains 9 tasks, each built on an established behavioral test (water maze, T-maze, open field, fear conditioning, and others), so that neuroscience AI systems can be compared systematically across paradigms.
Benchmark Structure
| Task ID | Paradigm | Construct assessed |
|---|---|---|
| 1 | Water Maze | Spatial learning/memory |
| 2 | T-Maze | Working memory |
| 3 | Open Field | Locomotor activity |
| 4 | Fear Conditioning | Associative learning |
| 5 | Morris Water Maze | Spatial reference memory |
| 6 | Elevated Plus Maze | Anxiety-like behavior |
| 7 | Social Interaction | Social behavior |
| 8 | Novel Object Recognition | Recognition memory |
| 9 | Forced Swim | Behavioral despair |
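The table can be carried into code as a small lookup keyed by task number. The sketch below is illustrative only: the `Task` class, the `TASKS` mapping, and `get_task` are hypothetical names and not part of any published CheeseBench API. The rows are copied from the table above, with the third column stored as `construct`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One row of the benchmark table (hypothetical container)."""
    task_id: int
    paradigm: str
    construct: str  # what the paradigm is used to assess (third table column)


# Rows copied verbatim from the table; the mapping name is an assumption.
TASKS = {
    t.task_id: t
    for t in [
        Task(1, "Water Maze", "Spatial learning/memory"),
        Task(2, "T-Maze", "Working memory"),
        Task(3, "Open Field", "Locomotor activity"),
        Task(4, "Fear Conditioning", "Associative learning"),
        Task(5, "Morris Water Maze", "Spatial reference memory"),
        Task(6, "Elevated Plus Maze", "Anxiety-like behavior"),
        Task(7, "Social Interaction", "Social behavior"),
        Task(8, "Novel Object Recognition", "Recognition memory"),
        Task(9, "Forced Swim", "Behavioral despair"),
    ]
}


def get_task(task_id: int) -> Task:
    """Return the task for task_id, or raise ValueError for an unknown ID."""
    try:
        return TASKS[task_id]
    except KeyError:
        raise ValueError(f"Unknown CheeseBench task_id: {task_id!r}") from None
```

Raising `ValueError` for an unknown ID, rather than returning `None`, makes a mistyped task number fail immediately instead of propagating into the evaluation.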
Usage
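
    # evaluate_single_task and evaluate_all_tasks are not defined in this
    # document; they are assumed to be provided by the evaluation harness.
    def run_cheesebench(model, task_id=None):
        """Evaluate model on one CheeseBench task, or on all tasks if task_id is None."""
        # Compare against None explicitly so a falsy ID such as 0 is not
        # silently treated as "run everything".
        if task_id is not None:
            return evaluate_single_task(model, task_id)
        return evaluate_all_tasks(model)

    # Tasks probe understanding of:
    # - Experimental design in neuroscience
    # - Behavioral interpretation
    # - Statistical analysis
    # - Translational relevance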
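The usage snippet relies on `evaluate_single_task` and `evaluate_all_tasks`, which this document does not define. The following is a minimal sketch of what they might look like, under loudly stated assumptions: the model is any callable from prompt string to answer string, the `ITEMS` entries are placeholder strings rather than real benchmark questions, and case-insensitive exact match is a stand-in for whatever scoring the benchmark actually uses.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a model is any callable mapping a prompt string to an answer string.
Model = Callable[[str], str]

# Placeholder items as (prompt, expected answer) pairs. These are NOT
# CheeseBench questions; they only exist to make the sketch executable.
ITEMS: Dict[int, List[Tuple[str, str]]] = {
    1: [("placeholder prompt A", "a")],
    2: [("placeholder prompt B", "b")],
}


def evaluate_single_task(model: Model, task_id: int) -> float:
    """Return the fraction of a task's items answered correctly (exact match)."""
    if task_id not in ITEMS:
        raise ValueError(f"Unknown task_id: {task_id!r}")
    items = ITEMS[task_id]
    correct = sum(
        model(prompt).strip().lower() == answer for prompt, answer in items
    )
    return correct / len(items)


def evaluate_all_tasks(model: Model) -> Dict[int, float]:
    """Return a mapping from task ID to that task's accuracy."""
    return {tid: evaluate_single_task(model, tid) for tid in sorted(ITEMS)}


def run_cheesebench(model: Model, task_id=None):
    """Evaluate one task if task_id is given, otherwise all tasks."""
    if task_id is not None:
        return evaluate_single_task(model, task_id)
    return evaluate_all_tasks(model)


# A trivial stand-in model that always answers "a".
always_a = lambda prompt: "a"
print(run_cheesebench(always_a))             # {1: 1.0, 2: 0.0}
print(run_cheesebench(always_a, task_id=1))  # 1.0
```

Returning a per-task dictionary from the all-tasks path keeps cross-paradigm differences visible; a single aggregate score would hide a model that does well on one paradigm and poorly on another.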
Applications
- Neuroscience AI evaluation: Benchmark domain-specific reasoning
- LLM capability assessment: Test scientific reasoning in neuroscience
- Educational tools: Validate AI teaching assistants for neuroscience
- Research assistance: Evaluate AI support for experimental design
References
- Original paper: arXiv:2604.13661v1
- Published: 2026-04-15