name: evaluate-koog
description: Sets up evaluation of Koog AI agents using Dokimos. Use this skill when the user wants to evaluate, test, or benchmark a Koog agent, or integrate Koog with Dokimos evaluations. Also use when the user mentions Koog agents, AI agent evaluation, or agent testing with Dokimos.
Evaluate Koog Agent
Set up Dokimos evaluation for a Koog AI agent. The user will describe their agent and evaluation goals via $ARGUMENTS.
Where things live
- Koog support: dokimos-koog/src/main/kotlin/dev/dokimos/koog/KoogSupport.kt
- Trace collector: dokimos-koog/src/main/kotlin/dev/dokimos/koog/KoogTraceCollector.kt
- Koog tests: dokimos-koog/src/test/kotlin/dev/dokimos/koog/
- Maven dependency: dev.dokimos:dokimos-koog
Before writing code, read KoogSupport.kt to understand the available utilities.
Key functions
KoogSupport.kt provides:
- asJudge(agentCall: suspend (String) -> String): wraps any suspend function into a JudgeLM
- asJudge(agent: () -> AIAgent<String, String>): wraps a Koog agent factory into a JudgeLM
- AIAgent.runBlocking(input, context): extension to run a Koog agent synchronously
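To make the suspend-to-blocking idea concrete, the sketch below shows one way a suspend function could be adapted to a synchronous, judge-style interface. It is an illustration only: SyncJudge and toSyncJudge are hypothetical names, not Dokimos types, and the real asJudge in KoogSupport.kt may work differently. It assumes kotlinx-coroutines is on the classpath.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Hypothetical stand-in for a synchronous judge interface; not the Dokimos JudgeLM type.
fun interface SyncJudge {
    fun generate(prompt: String): String
}

// Illustrative adapter: blocks the calling thread until the suspend call completes.
fun toSyncJudge(agentCall: suspend (String) -> String): SyncJudge =
    SyncJudge { prompt -> runBlocking { agentCall(prompt) } }

fun main() {
    val judge = toSyncJudge { prompt ->
        delay(10) // simulates an asynchronous model call
        "echo: $prompt"
    }
    println(judge.generate("Is this helpful?"))
}
```

Blocking the calling thread like this is usually acceptable in evaluation and test code, but it should not be called from inside another coroutine.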
Setting up evaluation
Using a Koog agent as the system under test
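// Factory that creates a fresh agent for each example; createMyAgent() is user-defined.
val agent: () -> AIAgent<String, String> = { createMyAgent() }

// The task reads the example input, runs the agent synchronously, and returns its output.
val task = Task { example ->
    val input = example.inputs()["input"] as String
    val output = agent().runBlocking(input)
    mapOf("output" to output)
}

// dataset is assumed to be defined elsewhere.
val result = Experiment.builder()
    .name("Koog Agent Evaluation")
    .dataset(dataset)
    .task(task)
    .evaluator(ExactMatchEvaluator.builder().build())
    .build()
    .run()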
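The experiment above relies on a simple contract: the task returns a map with an "output" key, and the evaluator compares that value with the expected one. The following self-contained sketch mimics that contract with stand-in types so the flow can be reasoned about without an agent. SketchExample and exactMatchPassRate are hypothetical names, not part of Dokimos, and the real ExactMatchEvaluator may normalize or compare values differently.

```kotlin
// Hypothetical stand-ins for illustration; these are not Dokimos types.
data class SketchExample(
    val inputs: Map<String, Any>,
    val expectedOutputs: Map<String, Any>,
)

// Runs the task on every example and returns the fraction whose "output"
// value equals the expected "output" value.
fun exactMatchPassRate(
    examples: List<SketchExample>,
    task: (SketchExample) -> Map<String, Any>,
): Double {
    if (examples.isEmpty()) return 0.0
    val passed = examples.count { example ->
        task(example)["output"] == example.expectedOutputs["output"]
    }
    return passed.toDouble() / examples.size
}

fun main() {
    val examples = listOf(
        SketchExample(mapOf("input" to "2+2"), mapOf("output" to "4")),
        SketchExample(mapOf("input" to "capital of France"), mapOf("output" to "Paris")),
    )
    // Stub "agent" that only knows arithmetic, so one of the two examples passes.
    val rate = exactMatchPassRate(examples) { example ->
        val input = example.inputs["input"] as String
        val answer = if (input == "2+2") "4" else "unknown"
        mapOf("output" to answer)
    }
    println(rate)
}
```

Exact match is strict: an answer such as "4." or " 4" would fail here, which is one reason to consider a judge-based evaluator for free-form agent output.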
Using a Koog agent as a judge
// Wrap a suspend function that calls the agent:
val judge = asJudge { prompt -> myAgent().run(prompt) }
// or pass an agent factory instead:
// val judge = asJudge { createMyAgent() }

val evaluator = LLMJudgeEvaluator.builder()
    .name("helpfulness")
    .judge(judge)
    .criteria("Is the response helpful and accurate?")
    .evaluationParams(listOf(
        EvalTestCaseParam.INPUT,
        EvalTestCaseParam.ACTUAL_OUTPUT,
        EvalTestCaseParam.EXPECTED_OUTPUT
    ))
    .threshold(0.7)
    .build()
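The threshold(0.7) setting can be read as a cutoff on the judge's score. The sketch below shows that reading under two assumptions that the text above does not confirm: scores lie in the range 0 to 1, and a score at or above the threshold counts as a pass. JudgeVerdict and applyThreshold are hypothetical names, not Dokimos types.

```kotlin
// Hypothetical result type for illustration; not the Dokimos result class.
data class JudgeVerdict(val score: Double, val passed: Boolean)

// Assumption: a score at or above the threshold counts as a pass.
fun applyThreshold(score: Double, threshold: Double): JudgeVerdict {
    require(score in 0.0..1.0) { "score must be in [0, 1], got $score" }
    return JudgeVerdict(score, passed = score >= threshold)
}

fun main() {
    println(applyThreshold(0.8, 0.7).passed) // true
    println(applyThreshold(0.5, 0.7).passed) // false
}
```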
Kotlin DSL (with dokimos-kotlin)
If the user has dokimos-kotlin as a dependency, use the DSL:
val result = experiment {
    name = "Koog Agent Eval"
    dataset = Dataset.fromJson(Path.of("datasets/qa.json"))
    task { example ->
        val output = agent().runBlocking(example.input())
        mapOf("output" to output)
    }
    evaluator(ExactMatchEvaluator.builder().build())
}
Dependencies
The user needs dokimos-koog:
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-koog</artifactId>
    <version>${dokimos.version}</version>
</dependency>
Koog itself is a provided-scope dependency; the user must bring their own version. The collector reads the event context reflectively, so one build works across Koog 0.6.4 through 1.0.0.
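For projects built with Gradle rather than Maven, the same coordinates could be declared in build.gradle.kts roughly as follows. This is a sketch: it assumes a dokimosVersion property is defined in gradle.properties, and the Koog dependency line is left as a placeholder comment because the user chooses that artifact and version.

```kotlin
// build.gradle.kts
// Assumption: dokimosVersion is defined in gradle.properties.
val dokimosVersion: String by project

dependencies {
    implementation("dev.dokimos:dokimos-koog:$dokimosVersion")
    // Koog is provided-scope in dokimos-koog, so declare your own Koog
    // dependency here, at a version between 0.6.4 and 1.0.0.
}
```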
Evaluating tool calls, not just final output
If the Koog agent uses tools, capture its tool calls with KoogTraceCollector and evaluate them with the agent evaluators. Install the collector on the agent's event handler with collectAgentTrace, run the agent, then read the trace.
import dev.dokimos.koog.KoogTraceCollector
import dev.dokimos.koog.collectAgentTrace

val collector = KoogTraceCollector()
val agent = AIAgent(/* ... */) {
    install(EventHandler) { collectAgentTrace(collector) }
}

val response = agent.run(userInput)
val testCase = collector.toAgentTrace(response).toTestCase(userInput, tools)
val validity = toolCallValidity { }.evaluate(testCase)
For the full agent evaluator set, use the evaluate-agent skill.
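As a rough picture of what checking tool calls against a declared tool set can involve, the sketch below flags calls to undeclared tools and calls that omit required arguments. This is an assumed interpretation for illustration: ToolSpec, RecordedToolCall, and findInvalidCalls are hypothetical names, and the actual toolCallValidity evaluator may apply different or additional rules.

```kotlin
// Hypothetical stand-ins; not the Dokimos or Koog trace types.
data class ToolSpec(val name: String, val requiredArgs: Set<String>)
data class RecordedToolCall(val name: String, val args: Map<String, Any?>)

// Returns a human-readable problem for each call that names an undeclared tool
// or omits a required argument. An empty list means every call looks valid.
fun findInvalidCalls(calls: List<RecordedToolCall>, tools: List<ToolSpec>): List<String> {
    val byName = tools.associateBy { it.name }
    return calls.flatMap { call ->
        val spec = byName[call.name]
            ?: return@flatMap listOf("unknown tool: ${call.name}")
        (spec.requiredArgs - call.args.keys).map { missing ->
            "${call.name}: missing argument '$missing'"
        }
    }
}

fun main() {
    val tools = listOf(ToolSpec("search", setOf("query")))
    val calls = listOf(
        RecordedToolCall("search", mapOf("query" to "koog")),
        RecordedToolCall("search", emptyMap()),
        RecordedToolCall("delete_all", emptyMap()),
    )
    findInvalidCalls(calls, tools).forEach { println(it) }
}
```

Evaluating the trace this way catches failures that a final-output comparison would miss, such as an agent that reaches the right answer while calling a tool it was never given.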
Steps
- Understand from $ARGUMENTS what the Koog agent does and how it's constructed
- Determine if the agent is the system under test, the judge, or both
- Create a dataset appropriate for the agent's domain
- Wire up the evaluation using KoogSupport utilities
- Write tests in Kotlin using MockK for mocking
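For the last step, a MockK-based test might look like the sketch below. It mocks a hypothetical QaAgent interface rather than the Koog AIAgent class, so that the example stays self-contained. QaAgent, runTask, and RunTaskTest are illustrative names, and the sketch assumes MockK, kotlinx-coroutines, and kotlin.test are on the test classpath.

```kotlin
import io.mockk.coEvery
import io.mockk.coVerify
import io.mockk.mockk
import kotlinx.coroutines.runBlocking
import kotlin.test.Test
import kotlin.test.assertEquals

// Hypothetical agent abstraction used only for this sketch; a real test would
// mock or wrap the Koog agent instead.
interface QaAgent {
    suspend fun run(input: String): String
}

// Task under test: reads "input" from the example and returns the agent's answer.
fun runTask(agent: QaAgent, inputs: Map<String, Any>): Map<String, Any> {
    val input = inputs["input"] as String
    val output = runBlocking { agent.run(input) }
    return mapOf("output" to output)
}

class RunTaskTest {
    @Test
    fun taskReturnsAgentOutput() {
        // Stub the suspend function so no model call is made.
        val agent = mockk<QaAgent>()
        coEvery { agent.run("2+2") } returns "4"

        val result = runTask(agent, mapOf("input" to "2+2"))

        assertEquals(mapOf<String, Any>("output" to "4"), result)
        coVerify(exactly = 1) { agent.run("2+2") }
    }
}
```

Stubbing the agent keeps the test deterministic and free of model calls, so it verifies the evaluation wiring rather than the agent's quality.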