Overview

Evals (evaluation sets) are collections of test queries annotated with expected behavior. They’re the ground truth that evolution validates against — no candidate description is deployed unless it passes the eval set.

Generating evals

Generate evals from real usage logs:
selftune eval generate --skill my-skill
Generate synthetic evals from SKILL.md content (useful for new skills):
selftune eval generate --skill my-skill --synthetic --skill-path path/to/SKILL.md

Options

Flag            Description
--max N         Maximum number of eval entries to generate
--seed N        Random seed for reproducible generation
--output PATH   Write eval set to a specific file
--list-skills   List all skills with available data
--stats         Show eval generation statistics
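
One way to picture how --max and --seed interact is seeded sampling: cap the number of entries, and let the seed fix which ones are picked. The sketch below is an illustration under that assumption, not selftune's implementation; the function name `sample_queries` and its parameters are hypothetical.

```python
import random
from typing import List


def sample_queries(queries: List[str], max_entries: int, seed: int) -> List[str]:
    """Pick at most `max_entries` queries; the same seed yields the same pick."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    k = min(max_entries, len(queries))  # never ask for more than is available
    return rng.sample(queries, k)


# Made-up query identifiers for illustration.
queries = [f"q{i}" for i in range(10)]
first = sample_queries(queries, max_entries=3, seed=42)
second = sample_queries(queries, max_entries=3, seed=42)
```

Here `first` and `second` contain the same three queries because both calls use the same seed.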

Eval structure

Each eval entry contains:
  • Query — the user input to test
  • Expected skill — which skill should trigger (or none)
  • Invocation type — explicit, implicit, contextual, or negative
  • Expected outcome — pass or fail
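
To make the four fields concrete, here is a minimal Python sketch of a single entry. The class name `EvalEntry`, the field names, and the lowercase string values are assumptions chosen for illustration; the format selftune actually writes may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names throughout: selftune's real schema may differ.
INVOCATION_TYPES = {"explicit", "implicit", "contextual", "negative"}
OUTCOMES = {"pass", "fail"}


@dataclass(frozen=True)
class EvalEntry:
    query: str                     # the user input to test
    expected_skill: Optional[str]  # skill that should trigger, or None for no skill
    invocation_type: str           # one of INVOCATION_TYPES
    expected_outcome: str          # one of OUTCOMES

    def __post_init__(self) -> None:
        # Reject values outside the documented sets.
        if self.invocation_type not in INVOCATION_TYPES:
            raise ValueError(f"unknown invocation type: {self.invocation_type!r}")
        if self.expected_outcome not in OUTCOMES:
            raise ValueError(f"unknown expected outcome: {self.expected_outcome!r}")


# Made-up example entry.
entry = EvalEntry(
    query="convert this CSV to JSON",
    expected_skill="my-skill",
    invocation_type="implicit",
    expected_outcome="pass",
)
```

A negative entry would presumably carry `expected_skill=None`, meaning no skill should trigger for that query.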

Unit tests

Run deterministic unit tests for skill triggers from a test file:
selftune eval unit-test --skill my-skill --tests path/to/tests.json
Generate unit tests automatically:
selftune eval unit-test --skill my-skill --generate
Run unit tests with a live agent:
selftune eval unit-test --skill my-skill --tests path/to/tests.json --run-agent

Composability analysis

Check how a skill interacts with other skills in the same agent:
selftune eval composability --skill my-skill
This analyzes a sliding window of sessions to detect:
  • Skills that compete for the same queries
  • Skills that block or interfere with each other
  • Multi-skill workflows that should be documented
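
The first item in the list, skills competing for the same queries, can be sketched as follows. This is an illustration of the idea rather than selftune's analysis: the session shape (a list of query and skill pairs), the function name `competing_skills`, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical shape: each session is a list of (query, triggered skill) pairs.
Session = List[Tuple[str, str]]


def competing_skills(sessions: List[Session], skill: str, window: int) -> Dict[str, Set[str]]:
    """Within the last `window` sessions, map each query that triggered `skill`
    to the other skills that the same query also triggered."""
    recent = sessions[-window:] if window > 0 else []
    skills_by_query: Dict[str, Set[str]] = defaultdict(set)
    for session in recent:
        for query, triggered in session:
            skills_by_query[query].add(triggered)
    return {
        query: skills - {skill}
        for query, skills in skills_by_query.items()
        if skill in skills and len(skills) > 1
    }


# Made-up sessions, oldest first.
sessions = [
    [("format my notes", "skill-a")],
    [("format my notes", "my-skill"), ("send report", "my-skill")],
    [("format my notes", "skill-b")],
]
recent_conflicts = competing_skills(sessions, "my-skill", window=2)
```

Widening the window to 3 would also surface `skill-a` as a competitor for the same query, which shows why the window size changes what gets reported.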

Options

Flag                   Description
--window N             Number of recent sessions to analyze
--telemetry-log PATH   Use a specific telemetry log file

Family overlap detection

For skill families that share a common prefix, detect overlap:
selftune eval family-overlap --prefix my-family-
Or specify skills explicitly:
selftune eval family-overlap --skills skill-a,skill-b,skill-c

Options

Flag                  Description
--parent-skill NAME   Specify a parent skill for hierarchy analysis
--min-overlap 0.3     Minimum overlap threshold to report
--min-shared 2        Minimum shared queries to report
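
The two thresholds can be read as filters on pairwise comparisons between skills. The sketch below is one plausible interpretation, not selftune's actual algorithm: it assumes overlap is the Jaccard ratio of the query sets that trigger each skill. The function name `overlapping_pairs` and the sample data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def overlapping_pairs(
    queries_by_skill: Dict[str, Set[str]],
    min_overlap: float = 0.3,
    min_shared: int = 2,
) -> List[Tuple[str, str, int, float]]:
    """Return (skill_a, skill_b, shared_count, overlap) for pairs meeting both thresholds.

    Assumption: overlap = shared queries / all queries of the pair (Jaccard).
    selftune may use a different measure.
    """
    results = []
    for a, b in combinations(sorted(queries_by_skill), 2):
        shared = queries_by_skill[a] & queries_by_skill[b]
        union = queries_by_skill[a] | queries_by_skill[b]
        if not union:
            continue  # neither skill has any queries
        overlap = len(shared) / len(union)
        if len(shared) >= min_shared and overlap >= min_overlap:
            results.append((a, b, len(shared), overlap))
    return results


# Made-up data for illustration.
sample = {
    "skill-a": {"q1", "q2", "q3"},
    "skill-b": {"q2", "q3", "q4"},
    "skill-c": {"q5"},
}
pairs = overlapping_pairs(sample)
```

With this data only `skill-a` and `skill-b` are reported: they share 2 of 4 distinct queries, an overlap of 0.5. Raising `min_shared` to 3 would suppress the pair even though the ratio is above the overlap threshold.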

Importing evals

Import evaluation data from external sources:
selftune eval import --dir path/to/data --skill my-skill --output path/to/eval-set.json

Options

Flag                           Description
--match-strategy exact|fuzzy   How to match queries to skills
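
The difference between the two strategies can be illustrated with a short sketch. It rests on an assumption about what "exact" and "fuzzy" mean here and uses Python's standard `difflib`; the function name `match_skill`, the 0.8 cutoff, and the sample queries are hypothetical, and selftune's own matching may work differently.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def match_skill(
    query: str,
    known: Dict[str, str],
    strategy: str = "exact",
    cutoff: float = 0.8,
) -> Optional[str]:
    """Map a query to a skill using a table of known query -> skill pairs."""
    if strategy == "exact":
        # Exact: the query must equal a known query character for character.
        return known.get(query)
    if strategy == "fuzzy":
        # Fuzzy: take the most similar known query, if it is similar enough.
        best_skill, best_ratio = None, 0.0
        for candidate, skill in known.items():
            ratio = SequenceMatcher(None, query, candidate).ratio()
            if ratio > best_ratio:
                best_skill, best_ratio = skill, ratio
        return best_skill if best_ratio >= cutoff else None
    raise ValueError(f"unknown strategy: {strategy!r}")


# Made-up lookup table.
known = {"deploy the staging app": "my-skill"}
exact_hit = match_skill("deploy the staging app", known)
exact_miss = match_skill("deploy the staging apps", known)
fuzzy_hit = match_skill("deploy the staging apps", known, strategy="fuzzy")
```

In this sketch the near-duplicate query misses under exact matching but is still attributed to `my-skill` under fuzzy matching, which is the trade-off to weigh when the imported data contains paraphrases or typos.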