The core problem

You can’t know how good a skill is until real people use it, and you can’t improve it without data on where it fails. This chicken-and-egg problem is why most skills ship once and never get better. selftune breaks the cycle by continuously observing real sessions, detecting failures, and proposing improvements. This guide assumes you already have a skill in the loop; if you are still in the pre-ship stage, start with Create, Test, and Deploy a Skill.

The three stages of a skill

Most skills evolve through a predictable progression:

Stage 1: Capture the workflow

You complete a task with an AI agent. Along the way, you make corrections, provide context, and steer the agent toward the right approach. The reusable pattern in that interaction is the seed of a skill.
You: "Build me a DCF model for this company"
Agent: [tries, makes mistakes, you correct]
You: "Actually, always start with revenue assumptions"
Agent: [adjusts, produces good result]
You: "Let's capture what we just did as a skill"
At this stage, the skill is a rough draft — it works for you, with your phrasing, in your context.

Stage 2: Test and harden

Run the skill against varied inputs. Use selftune to generate eval sets from your real usage, then run evolution to improve the description:
# See how the skill is doing
selftune status

# Generate evals from real sessions
selftune eval generate --skill my-skill

# Run evolution
selftune evolve --skill my-skill --skill-path path/to/SKILL.md
At this stage, the skill works for a broader set of queries but may still fail on edge cases.

Stage 3: Ship and observe

Once the skill passes your eval set reliably, ship it. Then let selftune observe how others use it:
# Set up continuous monitoring
selftune cron setup

# Or run the full autonomous loop
selftune run --skill my-skill
selftune detects when the skill fails for new query patterns and proposes description updates. If you’re using selftune Cloud, contributors can share anonymized session data back to help you improve the skill for everyone.

The selftune feedback loop

┌─────────────────────────────────────────────┐
│                                             │
│  Use skill ──→ selftune observes sessions   │
│       ↑                    │                │
│       │                    ▼                │
│  Deploy ←── Evolve ←── Detect failures      │
│                                             │
└─────────────────────────────────────────────┘
Each step in detail:

1. Observe

selftune hooks capture every user query and whether each skill triggered. This happens automatically — no manual logging required.
# Check what selftune has captured
selftune sync
selftune status

2. Detect

Grading identifies three types of problems:
  • Missed triggers — the skill should have fired but didn’t
  • Process failures — the skill fired but the agent didn’t follow instructions
  • Quality issues — the skill produced a result, but it wasn’t good
selftune grade --skill my-skill
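
The three problem types above can be read as a simple decision order: did the skill fire, did the agent follow it, and was the result good. The sketch below illustrates that order only. The `SessionRecord` fields and the `classify` function are hypothetical names invented for this example and are not selftune's actual schema or grading logic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Problem(Enum):
    MISSED_TRIGGER = "missed trigger"
    PROCESS_FAILURE = "process failure"
    QUALITY_ISSUE = "quality issue"


@dataclass
class SessionRecord:
    # All fields are hypothetical; selftune's real data model may differ.
    should_trigger: bool         # the query is one the skill is meant to handle
    triggered: bool              # the skill actually fired
    followed_instructions: bool  # the agent followed the skill's steps
    result_ok: bool              # the output was judged good enough


def classify(record: SessionRecord) -> Optional[Problem]:
    """Return the first problem that applies, or None for a clean session."""
    if record.should_trigger and not record.triggered:
        return Problem.MISSED_TRIGGER
    if not record.triggered:
        return None  # the skill stayed silent on a query it was not meant for
    # Unwanted triggers are not one of the three categories in this guide,
    # so this sketch does not label them separately.
    if not record.followed_instructions:
        return Problem.PROCESS_FAILURE
    if not record.result_ok:
        return Problem.QUALITY_ISSUE
    return None
```

Checking in this order means each session gets at most one label, which keeps counts per problem type easy to compare.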

3. Evolve

Evolution proposes improved descriptions based on the detected failures. Multiple candidates are generated and validated against your eval set:
# See what would change (dry run)
selftune evolve --skill my-skill --skill-path path/to/SKILL.md --dry-run

# Apply the best candidate
selftune evolve --skill my-skill --skill-path path/to/SKILL.md --pareto --candidates 5
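
To make "validated against your eval set" concrete, here is a minimal sketch of choosing among candidates by pass rate, keeping the current description unless a candidate strictly beats it. It assumes each candidate has already been scored as a list of pass/fail results, one per eval case. The function names are hypothetical, and this is not a description of how selftune or its `--pareto` option selects candidates.

```python
from typing import Dict, List, Optional


def pass_rate(results: List[bool]) -> float:
    """Fraction of eval cases that passed (0.0 for an empty list)."""
    return sum(results) / len(results) if results else 0.0


def pick_candidate(
    baseline: List[bool],
    candidates: Dict[str, List[bool]],
) -> Optional[str]:
    """Return the best candidate's name, or None to keep the current description."""
    best_name: Optional[str] = None
    best_rate = pass_rate(baseline)
    for name, results in candidates.items():
        rate = pass_rate(results)
        if rate > best_rate:  # strict comparison: ties keep the earlier choice
            best_name, best_rate = name, rate
    return best_name
```

Returning `None` when nothing beats the baseline is the important design choice here: a candidate that only matches the current pass rate is not worth the churn of a new deploy.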

4. Watch

After deploying an evolution, selftune monitors for regressions. If the new description causes problems, it rolls back automatically:
selftune watch --skill my-skill
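
One way to think about regression monitoring is as a rolling comparison between recent outcomes and the pass rate measured before the deploy. The sketch below shows that idea under stated assumptions: the `RegressionWatch` class, the window size, and the tolerance are all hypothetical, and the criteria selftune actually uses for rollback are not described in this guide.

```python
from collections import deque


class RegressionWatch:
    """Tracks recent post-deploy outcomes and flags a likely regression."""

    def __init__(self, baseline_rate: float, window: int = 20, tolerance: float = 0.10):
        self.baseline_rate = baseline_rate  # pass rate before the deploy
        self.tolerance = tolerance          # allowed drop before flagging
        self.recent = deque(maxlen=window)  # oldest outcomes fall off automatically

    def record(self, passed: bool) -> bool:
        """Add one session outcome; return True if a rollback looks warranted."""
        self.recent.append(passed)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough evidence yet
        current = sum(self.recent) / len(self.recent)
        return self.baseline_rate - current > self.tolerance
```

Waiting for a full window before deciding avoids rolling back on the first unlucky session after a deploy.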

When to iterate manually vs. automatically

| Situation | Approach |
| --- | --- |
| New skill, no usage data | Manual: write description, run synthetic evals |
| Skill works for you, shipping to others | Semi-auto: generate evals from your sessions, run evolution |
| Skill is live with users | Automatic: selftune run handles the full loop |
| Major workflow change | Manual: update SKILL.md body, re-baseline, then resume auto |

Moving logic from skills to code

As you iterate, you’ll notice steps in your skill that the agent performs the same way every time. These are candidates for extraction into scripts:
Iteration 1: Skill says "validate the JSON schema by checking..."
Iteration 2: Agent keeps making the same validation mistakes
Iteration 3: Extract validation into scripts/validate.sh
Iteration 4: Skill says "run scripts/validate.sh" — faster, cheaper, reliable
This is a natural progression. Skills handle judgment; code handles mechanics. selftune’s session analysis helps you spot these patterns:
# See workflow patterns across sessions
selftune workflows --skill my-skill
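
As an illustration of the extraction step in the iteration list above, the following sketch moves a mechanical check out of the skill's prose and into code. It is written in Python rather than as the `scripts/validate.sh` named above, checks only for required top-level keys rather than a full JSON Schema, and the `REQUIRED_KEYS` values are placeholders invented for this example.

```python
import json
from typing import List

# Hypothetical required keys; replace with whatever your own schema demands.
REQUIRED_KEYS = {"name", "version"}


def validate(text: str) -> List[str]:
    """Return a list of problems; an empty list means the document passed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    missing = sorted(REQUIRED_KEYS - set(data))
    return [f"missing key: {key}" for key in missing]
```

With a check like this in place, the skill only needs to tell the agent to run the script and act on the reported problems, instead of describing the validation rules in prose.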

Real-world example

A skill author builds a “create presentation” skill. Initial description:
description: Create PowerPoint presentations from structured data
After 50 real sessions, selftune detects that users say “slide deck,” “pitch deck,” “board deck,” and “slides,” none of which appear in the description. Evolution proposes:
description: >
  Create PowerPoint presentations and slide decks from structured data,
  outlines, or descriptions. USE WHEN presentation, slides, deck, pptx,
  pitch deck, board deck, keynote export.
The trigger pass rate goes from 40% to 92%. No manual effort required — selftune observed, detected, and evolved.
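
The vocabulary gap in this example can be seen with a deliberately crude check: how many of the phrases users typed appear verbatim in each description. This is a toy illustration only. Real triggering is decided by the agent, not by substring matching, so the numbers below are not the trigger pass rates quoted above. The `coverage` function is a hypothetical helper.

```python
from typing import List

ORIGINAL = "Create PowerPoint presentations from structured data"
EVOLVED = (
    "Create PowerPoint presentations and slide decks from structured data, "
    "outlines, or descriptions. USE WHEN presentation, slides, deck, pptx, "
    "pitch deck, board deck, keynote export."
)
USER_PHRASES = ["slide deck", "pitch deck", "board deck", "slides"]


def coverage(description: str, phrases: List[str]) -> float:
    """Fraction of phrases that appear (case-insensitively) in the description."""
    text = description.lower()
    hits = sum(1 for phrase in phrases if phrase.lower() in text)
    return hits / len(phrases)


print(coverage(ORIGINAL, USER_PHRASES))  # 0.0
print(coverage(EVOLVED, USER_PHRASES))   # 1.0
```

None of the four phrases appears in the original description, while all four appear in the evolved one, which is the gap that evolution closed.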

Next steps

Evolution Reference

Full evolution pipeline documentation.

Structuring Skills

Organize skills that scale.

Managing Context

Keep skills lean as they grow.

Testing Triggers

Verify skills fire correctly.