cheesebench-evaluating-large-language-models - SKILL.md Agent Skill

name: cheesebench-evaluating-large-language-models
description: "We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maz. Activation: rodent behavior paradigms, LLM evaluation, ODE complexity"
version: 1.0.0
metadata:
  hermes:
    source_paper: "CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms (arXiv:2604.10825v1)"
    tags: [behavior, behavioral, cognitive, learning, neuroscience, paradigm]

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

Paper Reference

Title: CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
Authors: Zacharie Bugaud
arXiv: 2604.10825v1
Published: 2026-04-12
Categories: cs.AI
Abstract page: https://arxiv.org/abs/2604.10825

Overview

We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement l

Core Concepts

Behavioral Neuroscience Paradigms: Classic rodent behavioral tests as LLM evaluation tasks
Cross-Species Evaluation: Bridging animal behavior research with AI evaluation
Cognitive Task Benchmarking: Systematic assessment of LLM capabilities through behavioral paradigms
Spatial Memory & Navigation: Evaluating spatial reasoning through maze-like tasks
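
The Overview describes an agent that sees only ASCII observations and reward signals and is compared against a random baseline. A minimal sketch of such an interaction loop is shown here. The `AsciiCorridor` environment, its action names, and the reward values are hypothetical stand-ins chosen for illustration; they are not taken from the paper.

```python
import random


class AsciiCorridor:
    """Hypothetical 1-D corridor; the agent 'A' must find an unmarked goal cell."""

    ACTIONS = ("left", "right")

    def __init__(self, length=7, goal=6, start=3, max_steps=20):
        self.length, self.goal = length, goal
        self.start, self.max_steps = start, max_steps
        self.reset()

    def reset(self):
        self.pos = self.start
        self.steps = 0
        return self.render()

    def render(self):
        # The goal is deliberately not drawn: it must be discovered from reward.
        cells = ["A" if i == self.pos else "." for i in range(self.length)]
        return "#" + "".join(cells) + "#"

    def step(self, action):
        delta = -1 if action == "left" else 1
        self.pos = min(max(self.pos + delta, 0), self.length - 1)
        self.steps += 1
        reached = self.pos == self.goal
        reward = 1.0 if reached else 0.0
        done = reached or self.steps >= self.max_steps
        return self.render(), reward, done


def random_policy(observation, last_reward, rng):
    """Random baseline: ignores both the observation and the reward."""
    return rng.choice(AsciiCorridor.ACTIONS)


def run_episode(env, policy, rng):
    """Run one episode and return the summed reward."""
    obs, reward, done, total = env.reset(), 0.0, False, 0.0
    while not done:
        action = policy(obs, reward, rng)
        obs, reward, done = env.step(action)
        total += reward
    return total


rng = random.Random(0)
env = AsciiCorridor()
returns = [run_episode(env, random_policy, rng) for _ in range(100)]
success_rate = sum(returns) / len(returns)
print(f"random baseline success rate: {success_rate:.2f}")
```

An LLM-backed policy would have the same signature as `random_policy`: it would receive the ASCII string and the last reward, and return one of the action names.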

Core Paradigms Covered

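The nine paradigms named in the abstract are listed below. The abstract does not enumerate the paper's six cognitive dimensions, so the domain column reflects how each paradigm is conventionally used in rodent research.

Task	Conventional Domain	What it Tests
Morris Water Maze	Spatial learning	Navigating to a hidden platform using distal cues
Barnes Maze	Spatial learning	Locating an escape hole using distal cues
T-Maze	Working memory	Alternation behavior
Radial Arm Maze	Spatial working and reference memory	Avoiding revisits to already-visited arms
Star Maze	Spatial navigation strategy	Allocentric versus sequence-based route use
Operant Chamber	Instrumental learning	Acquiring action-outcome contingencies
Shuttle Box	Avoidance learning	Crossing compartments to escape or avoid an aversive stimulus
Conditioned Place Preference	Associative learning	Preference for a reward-paired context
Delayed Non-Match to Sample	Working memory	Choosing the non-sample option after a delay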

Implementation Pattern

class CheeseBenchTask:
    """Base class for rodent behavioral paradigm tasks."""
    
    def __init__(self, name, domain, description):
        self.name = name
        self.domain = domain
        self.description = description
    
    def evaluate_llm(self, llm_response, ground_truth):
        """Score a model response against ground truth; subclasses must override."""
        raise NotImplementedError

class MorrisWaterMaze(CheeseBenchTask):
    """Spatial navigation and learning task."""
    
    def __init__(self):
        super().__init__(
            name="Morris Water Maze",
            domain="Spatial Learning",
            description="Navigate to hidden platform using spatial cues"
        )
    
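    def generate_prompt(self, session_num=1):
        """Build an illustrative task prompt.

        Per the abstract, CheeseBench itself uses a unified system prompt with
        no task-specific instructions, so this is not the paper's prompt.
        """
        return ("Imagine you are in a circular pool. "
                "Find a hidden platform using spatial landmarks. "
                f"Session: {session_num}")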

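The base class above leaves `evaluate_llm` abstract. As one possible way to fill it in for a T-maze task, the sketch below scores alternation behavior from a sequence of free-text model responses. The helper names `parse_arm` and `alternation_rate` are hypothetical, and the alternation rate is a standard rodent T-maze measure rather than a metric confirmed from the paper.

```python
def parse_arm(llm_response):
    """Map free-text model output to 'L' or 'R'; return None if ambiguous."""
    text = llm_response.strip().lower()
    has_left, has_right = "left" in text, "right" in text
    if has_left == has_right:  # neither or both directions mentioned
        return None
    return "L" if has_left else "R"


def alternation_rate(choices):
    """Fraction of consecutive trial pairs in which the chosen arm switched."""
    if len(choices) < 2:
        raise ValueError("need at least two trials to score alternation")
    switches = sum(1 for prev, cur in zip(choices, choices[1:]) if prev != cur)
    return switches / (len(choices) - 1)


responses = ["I turn left", "Go RIGHT", "left", "left again", "right"]
# Drop unparseable responses before scoring.
choices = [c for c in map(parse_arm, responses) if c is not None]
rate = alternation_rate(choices)
print(choices, rate)  # ['L', 'R', 'L', 'L', 'R'] 0.75
```

Dropping ambiguous responses is a design choice; a stricter scorer could count them as errors instead.
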
Applications

LLM cognitive capability evaluation
Cross-modal behavioral benchmarking
AI safety assessment through behavioral paradigms
Comparative cognitive science
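
For comparative use, a model's raw score can be placed on a scale anchored by the random baseline and a reference score such as an approximate animal baseline. The sketch below uses a common linear normalization. It is an assumption for illustration, not the paper's stated metric, and the numbers in the example are made-up placeholders rather than reported results.

```python
def normalized_score(model, random_baseline, reference):
    """Return 0.0 at random level and 1.0 at reference level.

    Values outside [0, 1] are possible when the model is worse than random
    or better than the reference.
    """
    span = reference - random_baseline
    if span == 0:
        raise ValueError("reference and random baseline must differ")
    return (model - random_baseline) / span


# Placeholder values only: model 0.50, random 0.25, reference 0.75.
print(normalized_score(0.50, 0.25, 0.75))  # 0.5
```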

Limitations

Based on abstract analysis; full paper may contain additional details
Implementations are illustrative; refer to paper for production code
Domain-specific parameters need empirical tuning

References

Zacharie Bugaud (2026). "CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms." arXiv:2604.10825v1.
Full paper: https://arxiv.org/pdf/2604.10825.pdf

Activation Keywords

behavior, behavioral, cognitive, learning, neuroscience, paradigm, rodent