selftune eval

Usage

selftune eval <subcommand> [options]

Subcommands

Recommended lifecycle flow

Start with the lifecycle entrypoints, not the low-level stage commands:

selftune status
selftune verify --skill-path path/to/SKILL.md
# If verify returns a next_command, run it, then rerun verify.
selftune publish --skill-path path/to/SKILL.md

The `eval` subcommands are the supporting steps that `verify` most often asks you to run when a draft still needs evidence. To drive the advanced draft-package loop manually, the stage-level sequence remains:

selftune eval generate --skill my-skill --skill-path path/to/SKILL.md
selftune eval unit-test --skill my-skill --generate --skill-path path/to/SKILL.md
selftune create replay --skill-path path/to/my-skill --mode package
selftune create baseline --skill-path path/to/my-skill --mode package
selftune verify --skill-path path/to/SKILL.md
selftune publish --skill-path path/to/SKILL.md

The dashboard, `selftune status`, and the per-skill report all read the artifacts from this flow. They show what is still missing before you can trust a live deploy, and whether the skill has already moved into watch mode.
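The stage-level sequence above can be wrapped in a small script so that a failing stage stops the run. This is a minimal sketch, not something selftune ships: the function name `run_draft_loop` and the `SELFTUNE` override variable are hypothetical, and the sketch assumes each stage exits non-zero on failure, which this page does not confirm.

```shell
#!/bin/sh
# Hypothetical wrapper around the stage-level sequence shown above.
# Assumption: every selftune stage exits non-zero when it fails.
# SELFTUNE is a made-up override so the sequence can be dry-run with `echo`.
SELFTUNE="${SELFTUNE:-selftune}"

run_draft_loop() {
  skill="$1"      # skill name, e.g. my-skill
  skill_dir="$2"  # skill directory, e.g. path/to/my-skill
  skill_md="$3"   # path to SKILL.md

  # The && chain stops at the first stage that fails.
  "$SELFTUNE" eval generate --skill "$skill" --skill-path "$skill_md" &&
  "$SELFTUNE" eval unit-test --skill "$skill" --generate --skill-path "$skill_md" &&
  "$SELFTUNE" create replay --skill-path "$skill_dir" --mode package &&
  "$SELFTUNE" create baseline --skill-path "$skill_dir" --mode package &&
  "$SELFTUNE" verify --skill-path "$skill_md" &&
  "$SELFTUNE" publish --skill-path "$skill_md"
}
```

Setting `SELFTUNE=echo` before calling `run_draft_loop my-skill path/to/my-skill path/to/SKILL.md` prints the six commands' arguments instead of running them, which is a cheap way to check the arguments first.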

generate

Generate eval sets from real usage or synthetically:

selftune eval generate --skill my-skill
selftune eval generate --skill my-skill --auto-synthetic --skill-path path/to/SKILL.md
selftune eval generate --skill my-skill --auto-synthetic --skill-path path/to/SKILL.md --agent opencode
selftune eval generate --skill my-skill --blend --skill-path path/to/SKILL.md

| Flag | Description |
| --- | --- |
| `--skill NAME` | Skill to generate evals for |
| `--list-skills` | List all skills with available data |
| `--stats` | Show eval generation statistics |
| `--max N` | Maximum entries per side to generate |
| `--seed N` | Random seed for reproducibility |
| `--output PATH` | Output file path |
| `--no-negatives` | Omit negative eval entries |
| `--no-taxonomy` | Skip `invocation_type` classification |
| `--skill-log PATH` | Override the skill usage log source |
| `--agent NAME` | Runtime agent for synthetic or blended eval generation (`claude`, `codex`, `opencode`, `pi`) |
| `--query-log PATH` | Override the query log source |
| `--telemetry-log PATH` | Override the telemetry log source |
| `--synthetic` | Generate from SKILL.md instead of real data |
| `--auto-synthetic` | Fall back to SKILL.md cold-start generation when trusted triggers do not exist |
| `--blend` | Merge log-based evals with synthetic gap-fillers |
| `--skill-path PATH` | Path to SKILL.md (required with `--synthetic`) |
| `--model MODEL` | Override the synthetic-generation model |
| `--help` | Show command help |

`selftune eval generate --help` prints the full flag set for the `generate` subcommand, including the cold-start and blended eval flags. If Claude Code is rate-limited, or you want to force a different runtime, pass `--agent opencode` (or `codex` / `pi`) on the `--synthetic`, `--auto-synthetic`, and `--blend` paths. Every successful `generate` run also mirrors a canonical copy of the eval set to:

~/.selftune/eval-sets/<skill>.json

That canonical copy is what the local dashboard and `selftune status` use to decide whether a skill already has eval coverage. For new draft packages, the next step after `eval generate` is usually to rerun `verify` or, if you are driving the advanced loop manually, to continue with `create replay` and `create baseline`.
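The runtime fallback and the canonical-copy check described above can be sketched as a few shell helpers. All three function names are hypothetical and none of them ships with selftune. The fallback assumes that a rate-limited run exits non-zero, and the check assumes that the presence of the mirrored file is a usable signal of eval coverage; this page confirms neither.

```shell
#!/bin/sh
# Hypothetical helpers; none of these functions is part of selftune.

# Print the path of the canonical eval set for a skill,
# following the ~/.selftune/eval-sets/<skill>.json layout described above.
eval_set_path() {
  printf '%s\n' "$HOME/.selftune/eval-sets/$1.json"
}

# Succeed when the canonical copy for the given skill exists.
has_canonical_eval_set() {
  [ -f "$(eval_set_path "$1")" ]
}

# Try the default runtime first, then retry once with --agent opencode.
# Assumption: a rate-limited or failed run exits non-zero.
generate_with_fallback() {
  skill="$1"     # skill name, e.g. my-skill
  skill_md="$2"  # path to SKILL.md
  selftune eval generate --skill "$skill" --auto-synthetic --skill-path "$skill_md" ||
  selftune eval generate --skill "$skill" --auto-synthetic --skill-path "$skill_md" --agent opencode
}
```

A typical use would be `generate_with_fallback my-skill path/to/SKILL.md && has_canonical_eval_set my-skill`, so that the next lifecycle step only runs once the mirrored file is in place.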

unit-test

Run or generate deterministic unit tests:

selftune eval unit-test --skill my-skill --tests path/to/tests.json
selftune eval unit-test --skill my-skill --generate
selftune eval unit-test --skill my-skill --tests path/to/tests.json --run-agent

Generated test files live under:

~/.selftune/unit-tests/<skill>.json

After a run, selftune also stores the latest suite summary at:

~/.selftune/unit-tests/<skill>.last-run.json

That stored result feeds the draft-lifecycle readiness surfaces in the dashboard, the skill report, and `selftune status`.
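To inspect the stored summary directly, a helper along these lines may be enough. It is a sketch under stated assumptions: the function name `show_last_unit_test_run` is hypothetical, and because this page does not describe the summary's schema, the helper only prints the file as-is.

```shell
#!/bin/sh
# Hypothetical helper: print the stored summary of the latest unit-test run.
# The path follows the ~/.selftune/unit-tests/<skill>.last-run.json layout above;
# the file's contents are printed unparsed because the schema is not documented here.
show_last_unit_test_run() {
  summary="$HOME/.selftune/unit-tests/$1.last-run.json"
  if [ -f "$summary" ]; then
    cat "$summary"
  else
    echo "no stored unit-test run for $1" >&2
    return 1
  fi
}
```

Calling `show_last_unit_test_run my-skill` returns a non-zero status when no run has been stored yet, which makes it usable as a guard in scripts.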

composability

Analyze cross-skill interactions:

selftune eval composability --skill my-skill [--window N] [--telemetry-log PATH]

family-overlap

Detect overlap within skill families:

selftune eval family-overlap --prefix my-family-
selftune eval family-overlap --skills a,b,c [--parent-skill NAME] [--min-overlap 0.3] [--min-shared 2]

import

Import evaluation data from external sources:

selftune eval import --dir PATH --skill NAME --output PATH [--match-strategy exact|fuzzy]

See Evals concepts for more on how evaluation sets work.
