Evals - selftune

Overview

Evals (evaluation sets) are collections of test queries annotated with expected behavior. They are the ground truth that evolution validates against: no candidate description is deployed unless it passes the eval set.

Generating evals

Generate evals from real usage logs:

selftune eval generate --skill my-skill

Generate synthetic evals from SKILL.md content (useful for new skills):

selftune eval generate --skill my-skill --synthetic --skill-path path/to/SKILL.md

Options

| Flag | Description |
| --- | --- |
| `--max N` | Maximum number of eval entries to generate |
| `--seed N` | Random seed for reproducible generation |
| `--output PATH` | Write eval set to a specific file |
| `--list-skills` | List all skills with available data |
| `--stats` | Show eval generation statistics |
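
When generation runs from a script or CI job, it can help to assemble the command in one place so the seed and output path are always pinned. The following is a minimal sketch that only uses the flags listed above; the helper name `build_generate_command` and the example output path are hypothetical and not part of selftune.

```python
import shlex
from typing import List, Optional


def build_generate_command(
    skill: str,
    max_entries: Optional[int] = None,
    seed: Optional[int] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble argv for `selftune eval generate` from the documented flags.

    Hypothetical helper: it only builds the argument list and does not run it.
    """
    argv = ["selftune", "eval", "generate", "--skill", skill]
    if max_entries is not None:
        argv += ["--max", str(max_entries)]
    if seed is not None:
        argv += ["--seed", str(seed)]
    if output is not None:
        argv += ["--output", output]
    return argv


# Example: a reproducible run capped at 50 entries (the path is illustrative).
cmd = build_generate_command(
    "my-skill", max_entries=50, seed=42, output="evals/my-skill.json"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if selftune is installed and on the `PATH`.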

Eval structure

Each eval entry contains:

Query — the user input to test
Expected skill — which skill should trigger (or none)
Invocation type — explicit, implicit, contextual, or negative
Expected outcome — pass or fail
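
The four fields above can be pictured as a small record type. This is a sketch for illustration only: the class and attribute names (`EvalEntry`, `expected_skill`, and so on) are assumptions, and the actual on-disk format that selftune writes may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class InvocationType(Enum):
    """The four invocation types named in the eval structure."""
    EXPLICIT = "explicit"
    IMPLICIT = "implicit"
    CONTEXTUAL = "contextual"
    NEGATIVE = "negative"


@dataclass(frozen=True)
class EvalEntry:
    """Hypothetical in-memory model of one eval entry."""
    query: str                        # the user input to test
    expected_skill: Optional[str]     # skill that should trigger, or None
    invocation_type: InvocationType
    expected_outcome: str             # "pass" or "fail"

    def __post_init__(self) -> None:
        # Reject anything other than the two documented outcomes.
        if self.expected_outcome not in ("pass", "fail"):
            raise ValueError(f"unexpected outcome: {self.expected_outcome!r}")


# Illustrative entries; the queries are made up for this sketch.
explicit_entry = EvalEntry(
    "use my-skill to summarize this log", "my-skill",
    InvocationType.EXPLICIT, "pass",
)
no_skill_entry = EvalEntry(
    "what's the weather today?", None, InvocationType.NEGATIVE, "pass",
)
```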

Unit tests

Write deterministic unit tests for skill triggers:

selftune eval unit-test --skill my-skill --tests path/to/tests.json

Generate unit tests automatically:

selftune eval unit-test --skill my-skill --generate

Run unit tests with a live agent:

selftune eval unit-test --skill my-skill --tests path/to/tests.json --run-agent

Composability analysis

Check how a skill interacts with other skills in the same agent:

selftune eval composability --skill my-skill

This analyzes a sliding window of sessions to detect:

Skills that compete for the same queries
Skills that block or interfere with each other
Multi-skill workflows that should be documented
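
To make the first kind of finding concrete, here is one way competition for the same queries could be detected over a window of recent sessions. It is a sketch under stated assumptions, not selftune's implementation: the session shape (a list of `(query, triggered_skill)` pairs) and the function name `competing_skills` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical session shape: (query, triggered skill) pairs in order.
Session = List[Tuple[str, str]]


def competing_skills(sessions: List[Session], window: int) -> Dict[str, Set[str]]:
    """Return queries that triggered more than one skill in the last `window` sessions."""
    # Guard against window <= 0, since sessions[-0:] would select everything.
    recent = sessions[-window:] if window > 0 else []
    skills_by_query: Dict[str, Set[str]] = defaultdict(set)
    for session in recent:
        for query, skill in session:
            # Normalize case and whitespace so trivially different queries match.
            skills_by_query[query.strip().lower()].add(skill)
    return {q: s for q, s in skills_by_query.items() if len(s) > 1}


sessions = [
    [("deploy app", "deploy")],
    [("deploy app", "release"), ("run tests", "test")],
    [("Deploy App", "deploy")],
]
conflicts = competing_skills(sessions, window=2)
print(conflicts)
```

In this toy data, the last two sessions route the same query to both `release` and `deploy`, so that query is reported as contested.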

Options

| Flag | Description |
| --- | --- |
| `--window N` | Number of recent sessions to analyze |
| `--telemetry-log PATH` | Use a specific telemetry log file |

Family overlap detection

For skill families that share a common prefix, detect overlap:

selftune eval family-overlap --prefix my-family-

Or specify skills explicitly:

selftune eval family-overlap --skills skill-a,skill-b,skill-c

Options

| Flag | Description |
| --- | --- |
| `--parent-skill NAME` | Specify a parent skill for hierarchy analysis |
| `--min-overlap 0.3` | Minimum overlap threshold to report |
| `--min-shared 2` | Minimum shared queries to report |
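
The two thresholds work together: a pair of skills has to clear both before it is reported. The sketch below shows that filtering logic, assuming overlap is measured as Jaccard similarity between the sets of queries each skill handled. The source does not say which metric selftune uses, so treat the metric, the function name `family_overlap`, and the input shape as assumptions.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def family_overlap(
    queries_by_skill: Dict[str, Set[str]],
    min_overlap: float = 0.3,
    min_shared: int = 2,
) -> List[Tuple[str, str, int, float]]:
    """Report skill pairs whose query sets overlap enough to be worth a look.

    Overlap here is Jaccard similarity (shared / union), an assumed metric.
    """
    report = []
    for a, b in combinations(sorted(queries_by_skill), 2):
        shared = queries_by_skill[a] & queries_by_skill[b]
        union = queries_by_skill[a] | queries_by_skill[b]
        if not union:
            continue  # both skills have no queries; nothing to compare
        overlap = len(shared) / len(union)
        # A pair must pass both thresholds to be reported.
        if overlap >= min_overlap and len(shared) >= min_shared:
            report.append((a, b, len(shared), overlap))
    return report


data = {
    "skill-a": {"q1", "q2", "q3"},
    "skill-b": {"q2", "q3", "q4"},
    "skill-c": {"q9"},
}
pairs = family_overlap(data)
print(pairs)
```

Requiring a minimum number of shared queries alongside the ratio keeps skills with very little data from producing high overlap scores by accident.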

Importing evals

Import evaluation data from external sources:

selftune eval import --dir path/to/data --skill my-skill --output path/to/eval-set.json

| Flag | Description |
| --- | --- |
| `--match-strategy exact\|fuzzy` | How to match queries to skills |
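
The difference between the two strategies can be illustrated with Python's standard library. This is only a sketch of the general idea: how selftune actually implements `exact` and `fuzzy` matching is not documented here, and the function name `match_skill`, the lookup table, and the `cutoff` value are all assumptions.

```python
import difflib
from typing import Dict, Optional


def match_skill(
    query: str,
    known_queries: Dict[str, str],
    strategy: str = "exact",
    cutoff: float = 0.8,
) -> Optional[str]:
    """Map a query to a skill using a lookup of normalized query -> skill.

    Hypothetical helper; keys of `known_queries` are expected in lowercase.
    """
    normalized = query.strip().lower()
    if strategy == "exact":
        # Exact: the normalized query must be a key in the lookup.
        return known_queries.get(normalized)
    if strategy == "fuzzy":
        # Fuzzy: accept the closest known query above the similarity cutoff.
        close = difflib.get_close_matches(
            normalized, list(known_queries), n=1, cutoff=cutoff
        )
        return known_queries[close[0]] if close else None
    raise ValueError(f"unknown strategy: {strategy!r}")


known = {"deploy the app": "deploy", "run the tests": "test"}
print(match_skill("deploy the apps", known, strategy="exact"))
print(match_skill("deploy the apps", known, strategy="fuzzy"))
```

Exact matching avoids false positives but misses near-duplicates such as plurals or typos; fuzzy matching catches those but can attach a query to the wrong skill.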
