Why test triggers?

You wrote a skill. It works when you test it with the exact phrase you had in mind. Then a real user says “can you check if my site is safe” and your web-assessment skill sits there silently because the description says “execute web security assessment.” Trigger testing answers a simple question: given a user query, does the right skill fire?

Quick manual testing

The fastest way to test a trigger is to run your agent with a prompt and check if the skill activated.
claude -p "make me a slide deck about Q3 results" --output-format json 2>/dev/null \
  | jq 'any(.messages[].content[]; .type == "tool_use" and .name == "Skill" and .input.skill == "pptx")'
This returns true if the skill was invoked, false if it wasn’t.
Run several variations:
# Should trigger
claude -p "create a presentation about our product launch"
claude -p "I need slides for the board meeting"
claude -p "make a pptx with these bullet points"

# Should NOT trigger
claude -p "what format should I use for my presentation?"
claude -p "can you review this slide deck for typos?"
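To run a whole list of labeled queries at once, the same pipeline can be wrapped in a small loop. The following is a minimal sketch, not part of selftune or the Claude CLI: the helper names skill_fired and check_triggers and the expected|query input format are assumptions made for illustration, and the jq filter is the one shown above with the skill name passed in as a variable.

```shell
# skill_fired QUERY SKILL (hypothetical helper): prints "true" or "false"
# using the claude + jq pipeline from the quick manual test.
skill_fired() {
  claude -p "$1" --output-format json 2>/dev/null </dev/null \
    | jq --arg skill "$2" \
        'any(.messages[].content[]; .type == "tool_use" and .name == "Skill" and .input.skill == $skill)'
}

# check_triggers SKILL (hypothetical helper): reads "expected|query" lines
# on stdin, prints PASS or FAIL per query, and exits non-zero on any failure.
# Example: check_triggers pptx < queries.txt
check_triggers() (
  skill=$1
  failures=0
  while IFS='|' read -r expected query; do
    [ -n "$query" ] || continue   # skip blank or malformed lines
    actual=$(skill_fired "$query" "$skill") || actual=""
    if [ "$actual" = "$expected" ]; then
      echo "PASS: $query"
    else
      echo "FAIL: $query (expected $expected, got ${actual:-no output})"
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
)
```

Each input line pairs an expected outcome with a query, for example `true|I need slides for the board meeting` or `false|can you review this slide deck for typos?`. Keeping the expectation next to the query makes it harder to forget which prompts were meant to stay silent.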

Designing test queries

The agent skills spec recommends ~20 test queries: half that should trigger, half that shouldn’t. The quality of the negative examples matters more than the quality of the positives, because only a well-chosen negative can show that a description is too broad.

Good negatives are near-misses

Weak negatives prove nothing — they’re obviously irrelevant:
# Weak negative for a CSV analysis skill:
"Write a fibonacci function"
# This tells you nothing. Of course it shouldn't trigger.
Strong negatives share keywords but need a different skill (or no skill):
# Strong negatives for a CSV analysis skill:
"Write a script that reads a CSV and uploads each row to postgres"
"Convert this CSV to JSON format"
"Can you explain what a CSV file is?"

Vary along four axes

The agent skills spec defines four invocation types. Test all four:
| Type | Example | What you’re testing |
| --- | --- | --- |
| Explicit | “use the csv-analyzer skill” | Agent recognizes direct invocation |
| Implicit | “analyze this sales data spreadsheet” | Agent infers from task description |
| Contextual | “the Q3 numbers in data/sales.csv look off, can you check?” | Agent picks up on context clues |
| Negative | “write a script that reads a CSV and uploads to S3” | Agent correctly does NOT trigger |
Contextual queries are the hardest to get right and the most common in real usage.
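To see which invocation type is weakest, it can help to tally results per type. This is a rough sketch under stated assumptions: the function name tally_by_type and the type|expected|actual results format are hypothetical and are not something selftune produces.

```shell
# tally_by_type FILE (hypothetical helper): each line of FILE is
# "type|expected|actual", where expected and actual are "true" or "false".
# Prints "type passes/total" for every type, sorted by type name.
tally_by_type() {
  awk -F'|' '
    { total[$1]++; if ($2 == $3) pass[$1]++ }
    END { for (t in total) printf "%s %d/%d\n", t, pass[t], total[t] }
  ' "$1" | sort
}
```

A low ratio on the contextual line with a high ratio on the explicit line suggests the description matches its own name but not the way people describe the task.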

Automated testing with selftune

Generate eval sets from real usage

If you’ve been using your skill, selftune can generate test queries from your actual session history:
selftune eval generate --skill my-skill
This produces a set of queries annotated with expected outcomes — grounded in how users actually talk, not how you think they talk.

Generate synthetic evals for new skills

For skills without usage history, generate synthetic evals from the draft package:
selftune verify --skill-path path/to/my-skill
selftune eval generate --skill my-skill --auto-synthetic --skill-path path/to/SKILL.md
For draft packages, a typical lifecycle progression after eval generation is:
selftune verify --skill-path path/to/my-skill
selftune eval unit-test --skill my-skill --generate --skill-path path/to/SKILL.md
selftune create replay --skill-path path/to/my-skill --mode package
selftune create baseline --skill-path path/to/my-skill --mode package
selftune verify --skill-path path/to/my-skill
selftune publish --skill-path path/to/my-skill

Run the eval set

selftune eval unit-test --skill my-skill --tests path/to/tests.json --run-agent
This runs each test query through a live agent session and checks whether the skill triggered correctly.

The optimization loop

Testing triggers isn’t a one-time activity. It’s a loop:
1. Write or update your skill description
2. Run eval set
3. Identify failures
4. Revise description — generalize, don't add specific keywords
5. Re-run eval set
6. Repeat (5 iterations is usually enough)
Avoid overfitting. If you keep tweaking the description to match specific test queries, you’ll break other queries. Split your tests into a training set (~60%) and a validation set (~40%). Optimize against the training set, then verify on the held-out set.
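The 60/40 split can be done with standard tools. The sketch below assumes one query per line in a plain text file; the function name split_evals and the output file names train.txt and validation.txt are illustrative choices, not selftune conventions.

```shell
# split_evals FILE (hypothetical helper): shuffles FILE (one query per line)
# and writes about 60% of the lines to train.txt and the rest to
# validation.txt in the current directory.
split_evals() (
  total=$(awk 'END { print NR }' "$1")
  train_count=$(( (total * 60 + 50) / 100 ))   # 60%, rounded to nearest
  : > train.txt
  : > validation.txt
  # Prefix each line with a random key, sort by the key to shuffle,
  # drop the key, then route the first train_count lines to train.txt.
  awk 'BEGIN { srand() } { printf "%f\t%s\n", rand(), $0 }' "$1" \
    | sort -n | cut -f2- \
    | awk -v n="$train_count" '
        NR <= n { print > "train.txt"; next }
        { print > "validation.txt" }'
)
```

This is a plain random split. Splitting the should-trigger and should-not-trigger queries separately would keep both sets balanced, which matters more as the eval set gets smaller.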

Let selftune handle the loop

Instead of manually iterating on descriptions, let selftune’s evolution pipeline do it:
# For new draft packages, a typical verify-driven package proof loop is:
selftune verify --skill-path path/to/my-skill
selftune eval unit-test --skill my-skill --generate --skill-path path/to/SKILL.md
selftune verify --skill-path path/to/my-skill
selftune create replay --skill-path path/to/my-skill --mode package
selftune create baseline --skill-path path/to/my-skill --mode package
selftune verify --skill-path path/to/my-skill
selftune publish --skill-path path/to/my-skill

# For existing non-draft skills, run the classic evolve loop
selftune grade baseline --skill my-skill --skill-path path/to/SKILL.md
selftune evolve --skill my-skill --skill-path path/to/SKILL.md --pareto --candidates 5
Evolution generates multiple candidate descriptions, validates each against your eval set, and deploys the best one — or rejects all candidates if none improve.

Common trigger failures

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Skill never fires for implicit queries | Description uses only technical terms | Add natural language synonyms to description |
| Skill fires for unrelated queries | Description is too broad | Add negative patterns, narrow the trigger scope |
| Skill fires inconsistently | Description lacks “USE WHEN” keywords | Add explicit trigger keyword list |
| Two skills compete for the same query | Overlapping descriptions | Run selftune eval composability to diagnose |

Next steps

Iteration Loop: Use real usage data to continuously improve.

Evals Reference: Full eval system documentation.