selftune evolve

Usage

selftune improve --skill <name> --skill-path <path> [options]
selftune evolve --skill <name> --skill-path <path> [options]

Runs the full evolution loop: generates a candidate improvement, validates it against an eval set, and deploys it if it meets the quality bar.

selftune improve is the simplified lifecycle entry point. It maps --scope auto|description|routing|body|package onto evolve, evolve body --target ..., or a bounded package search through search-run. auto still defaults to description-surface evolution unless you explicitly choose a broader scope.

--confidence no longer skips validation on its own. selftune always measures the proposal against replay or judge validation first, then uses the confidence value as review metadata for warnings and adaptive-gate risk escalation.

Options

Flag	Type	Default	Description
`--skill`	string	—	Required. Skill name to evolve
`--skill-path`	string	—	Required. Path to the skill’s `SKILL.md`
`--scope`	`auto` \| `description` \| `routing` \| `body` \| `package`	`auto`	Scope selector; applies only to the `selftune improve` alias
`--eval-set`	string	Auto-generated	Use a pre-built eval set instead of building one from logs
`--agent`	`claude` \| `codex` \| `opencode`	Auto-detected	Agent runtime to target
`--dry-run`	boolean	`false`	Validate proposals without deploying
`--confidence`	number	`0.6`	Low-confidence review threshold
`--validation-mode`	`auto` \| `replay` \| `judge`	`auto`	Validation strategy to use
`--max-iterations`	number	`3`	Maximum evolution iterations
`--pareto`	boolean	`true`	Keep Pareto multi-candidate selection enabled
`--candidates`	number	`3`	Candidate count when Pareto mode is enabled
`--token-efficiency`	boolean	`false`	Score proposals with token-efficiency weighting
`--with-baseline`	boolean	`false`	Gate deploys on a no-skill baseline lift
`--validation-model`	string	`haiku`	Model for trigger-check validation calls
`--cheap-loop`	boolean	`true`	Use cheaper models in the inner loop and a stronger gate
`--full-model`	boolean	`false`	Use one model for all stages
`--gate-model`	string	`sonnet`	Model for the final validation gate
`--gate-effort`	string	—	Override the final-gate thinking effort
`--adaptive-gate`	boolean	`false`	Escalate risky gate checks to `opus` with higher effort
`--proposal-model`	string	Agent default	Override the proposal-generation model
`--sync-first`	boolean	`false`	Sync source-truth telemetry before evolving
`--sync-force`	boolean	`false`	Force a full rescan during `--sync-first`
`--verbose`	boolean	`false`	Print detailed progress output
`--help`	boolean	`false`	Show command help

Validation modes

selftune evolve validates every proposal before deploying it. The mode controls how validation runs:

Mode	Behavior
`auto`	Uses replay validation if available; otherwise falls back to judge
`replay`	Requires replay validation; fails if unavailable
`judge`	Uses an LLM judge to score the proposal

Automatic replay validation

When --validation-mode is auto or replay and the target agent supports runtime replay, selftune automatically constructs a replay fixture from the skill’s SKILL.md. Today that includes Claude Code, Codex, and OpenCode. No --replay-fixture flag is needed.

Replay validation stages the candidate skill content into a temporary local registry and observes the runtime’s actual routing decision for each eval query. Description evolution stages the proposed description; routing evolution stages the proposed ## Workflow Routing section; body evolution stages the full candidate body while preserving the original frontmatter and title.

If real host/runtime replay is unavailable, auto falls back to judge validation and records a validation_fallback_reason in the audit/evidence trail. replay mode exits with REPLAY_UNAVAILABLE instead of silently downgrading to fixture simulation.

# auto mode uses replay automatically on supported hosts
selftune evolve --skill my-skill --skill-path path/to/SKILL.md

# Force judge-only (skip replay)
selftune evolve --skill my-skill --skill-path path/to/SKILL.md --validation-mode judge

How it works

1. Proposal generation: produces a candidate update to the skill description, routing, or body
2. Eval construction: builds an eval set from session history (or synthetic data for cold-start skills)
3. Validation: scores the proposal against the eval set using replay or judge
4. Pareto check: accepts the proposal only if it improves pass rate without regressing other signals
5. Deployment: writes the accepted proposal to the skill and records it in the audit log
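
The Pareto check above can be sketched as a simple comparison. This is a minimal illustration under stated assumptions, not selftune's scoring code: `pareto_accept` is a hypothetical helper, all values are integer percentages where higher is better, and the "other signals" are passed as made-up `old:new` pairs.

```shell
# Illustrative Pareto-style acceptance check; not selftune's implementation.
# Usage: pareto_accept <old_pass> <new_pass> [<old_signal>:<new_signal> ...]
# Assumption: every value is an integer percentage, higher is better.
pareto_accept() {
  old_pass="$1"
  new_pass="$2"
  shift 2
  # The pass rate must strictly improve.
  [ "$new_pass" -gt "$old_pass" ] || return 1
  # No other signal may get worse.
  for pair in "$@"; do
    old="${pair%%:*}"
    new="${pair##*:}"
    [ "$new" -ge "$old" ] || return 1
  done
  return 0
}

# Pass rate improves, both other signals hold or improve.
if pareto_accept 70 82 90:90 55:60; then echo accept; else echo reject; fi   # prints: accept
# Pass rate improves, but one signal regresses from 90 to 85.
if pareto_accept 70 82 90:85; then echo accept; else echo reject; fi         # prints: reject
```

A candidate that trades one signal for another is rejected under this rule even when its pass rate is higher, which is what distinguishes it from a single-score threshold.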

Examples

# Simplified lifecycle alias
selftune improve --skill my-skill --skill-path path/to/SKILL.md --scope description --dry-run --validation-mode replay

# Bounded package search through the primary lifecycle alias
selftune improve --skill my-skill --skill-path path/to/SKILL.md --scope package --eval-set path/to/evals.json

# Keep search review-only while still using package scope
selftune improve --skill my-skill --skill-path path/to/SKILL.md --scope package --dry-run --eval-set path/to/evals.json

# Standard evolution run
selftune evolve --skill my-skill --skill-path path/to/SKILL.md

# Evolve with fresh data
selftune evolve --skill my-skill --skill-path path/to/SKILL.md --sync-first

# Force judge validation
selftune evolve --skill my-skill --skill-path path/to/SKILL.md --validation-mode judge

# Route to routing-surface evolution
selftune improve --skill my-skill --skill-path path/to/SKILL.md --scope routing --dry-run --validation-mode replay

# More iterations for stubborn skills
selftune evolve --skill my-skill --skill-path path/to/SKILL.md --max-iterations 5 --verbose

Troubleshooting

Replay unavailable in replay mode:

Error: Replay validation requested but no replay fixture or runner is available.

Switch to --validation-mode auto to allow judge fallback, or verify that the skill has a valid SKILL.md at the expected path.

No eval data:

Run selftune sync to ingest recent session data before evolving, or use a cold-start eval set.
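
For the replay-unavailable case, the switch to auto can be scripted. The wrapper below is a hedged sketch: `evolve_replay_then_auto` is a hypothetical helper name, and it assumes selftune exits with a non-zero status when replay mode fails, which this page implies but does not state. It uses only flags documented above and keeps --dry-run on both attempts so nothing is deployed.

```shell
# Hypothetical wrapper; assumes a failed replay run exits non-zero.
# Usage: evolve_replay_then_auto <skill-name> <path/to/SKILL.md>
evolve_replay_then_auto() {
  skill="$1"
  skill_path="$2"
  # First attempt: strict replay validation, review-only.
  if selftune evolve --skill "$skill" --skill-path "$skill_path" \
       --validation-mode replay --dry-run; then
    return 0
  fi
  # Any non-zero exit lands here, not only REPLAY_UNAVAILABLE.
  echo "replay run failed; retrying with --validation-mode auto" >&2
  selftune evolve --skill "$skill" --skill-path "$skill_path" \
    --validation-mode auto --dry-run
}
```

Call it as `evolve_replay_then_auto my-skill path/to/SKILL.md`. Because the retry triggers on any failure, check the first run's error output before trusting the fallback result.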
