Testing Skill Triggers

Why test triggers?

You wrote a skill. It works when you test it with the exact phrase you had in mind. Then a real user says “can you check if my site is safe” and your web-assessment skill sits there silently because the description says “execute web security assessment.” Trigger testing answers a simple question: given a user query, does the right skill fire?

Quick manual testing

The fastest way to test a trigger is to run your agent with a prompt and check if the skill activated.

Claude Code:

claude -p "make me a slide deck about Q3 results" --output-format json 2>/dev/null \
  | jq 'any(.messages[].content[]; .type == "tool_use" and .name == "Skill" and .input.skill == "pptx")'

This returns true if the skill was invoked, false if it wasn’t.

Other agents:

Most agent platforms have a way to run a prompt non-interactively and inspect the output. Check your platform’s CLI documentation for the equivalent of Claude Code’s -p (prompt) and --output-format json flags.

Run several variations, adding --output-format json and the same jq filter to each one to get a true/false result:

# Should trigger
claude -p "create a presentation about our product launch"
claude -p "I need slides for the board meeting"
claude -p "make a pptx with these bullet points"

# Should NOT trigger
claude -p "what format should I use for my presentation?"
claude -p "can you review this slide deck for typos?"
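Checking the variations above by hand gets tedious, so a small wrapper can help. The sketch below is one possible arrangement, not part of any tool: `skill_fired` and `check_case` are hypothetical helper names, and the pipeline inside `skill_fired` simply reuses the claude and jq command shown earlier, with the skill name passed in through jq's `--arg`.

```shell
# skill_fired PROMPT SKILL
# Prints "true" or "false". Wraps the claude + jq pipeline from earlier in
# this guide; redefine this function to target a different agent CLI.
skill_fired() {
  claude -p "$1" --output-format json 2>/dev/null \
    | jq --arg skill "$2" \
        'any(.messages[].content[]; .type == "tool_use" and .name == "Skill" and .input.skill == $skill)'
}

# check_case EXPECTED PROMPT SKILL
# EXPECTED is "true" (should trigger) or "false" (should not trigger).
# Prints one PASS/FAIL line and returns non-zero on a mismatch.
check_case() {
  expected=$1
  prompt=$2
  skill=$3
  actual=$(skill_fired "$prompt" "$skill")
  if [ "$actual" = "$expected" ]; then
    printf 'PASS  expected=%s  %s\n' "$expected" "$prompt"
  else
    printf 'FAIL  expected=%s got=%s  %s\n' "$expected" "$actual" "$prompt"
    return 1
  fi
}
```

With these defined, a case reads as `check_case true "I need slides for the board meeting" pptx` or `check_case false "can you review this slide deck for typos?" pptx`.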

Designing test queries

The agent skills spec recommends ~20 test queries: half that should trigger, half that shouldn’t. The quality of your negative examples matters more than the positives.

Good negatives are near-misses

Weak negatives prove nothing — they’re obviously irrelevant:

# Weak negative for a CSV analysis skill:
"Write a fibonacci function"
# This tells you nothing. Of course it shouldn't trigger.

Strong negatives share keywords but need a different skill (or no skill):

# Strong negatives for a CSV analysis skill:
"Write a script that reads a CSV and uploads each row to postgres"
"Convert this CSV to JSON format"
"Can you explain what a CSV file is?"

Vary along four axes

The agent skills spec defines four invocation types. Test all four:

Type	Example	What you’re testing
Explicit	“use the csv-analyzer skill”	Agent recognizes direct invocation
Implicit	“analyze this sales data spreadsheet”	Agent infers from task description
Contextual	“the Q3 numbers in data/sales.csv look off, can you check?”	Agent picks up on context clues
Negative	“write a script that reads a CSV and uploads to S3”	Agent correctly does NOT trigger

Contextual queries are the hardest to get right and the most common in real usage.
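To check that a query set is actually balanced across these types, it can help to keep the queries in a file and count them. The sketch below assumes a hypothetical tab-separated file with three columns (expected outcome as true/false, invocation type, query text). This layout is an illustration only and is not a format defined by the spec or by selftune.

```shell
# summarize_queries FILE
# FILE is a hypothetical TSV: expected (true/false) <TAB> type <TAB> query.
# Prints the trigger / no-trigger totals and a count per invocation type.
summarize_queries() {
  awk -F '\t' '
    NF >= 3 { by_type[$2]++; by_expected[$1]++; total++ }
    END {
      printf "total=%d trigger=%d no_trigger=%d\n",
             total, by_expected["true"], by_expected["false"]
      for (t in by_type) printf "type=%s count=%d\n", t, by_type[t]
    }' "$1" | sort
}
```

A line in such a file might read `false<TAB>negative<TAB>write a script that reads a CSV and uploads to S3`. A summary where one type has only a single query, or where the trigger and no-trigger counts are far apart, suggests the set needs more examples.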

Automated testing with selftune

Generate eval sets from real usage

If you’ve been using your skill, selftune can generate test queries from your actual session history:

selftune eval generate --skill my-skill

This produces a set of queries annotated with expected outcomes — grounded in how users actually talk, not how you think they talk.

Generate synthetic evals for new skills

For skills without usage history, generate synthetic evals from the draft package:

selftune verify --skill-path path/to/my-skill
selftune eval generate --skill my-skill --auto-synthetic --skill-path path/to/SKILL.md

For draft packages, a typical lifecycle progression after eval generation is:

selftune verify --skill-path path/to/my-skill
selftune eval unit-test --skill my-skill --generate --skill-path path/to/SKILL.md
selftune create replay --skill-path path/to/my-skill --mode package
selftune create baseline --skill-path path/to/my-skill --mode package
selftune verify --skill-path path/to/my-skill
selftune publish --skill-path path/to/my-skill

Run the eval set

selftune eval unit-test --skill my-skill --tests path/to/tests.json --run-agent

This runs each test query through a live agent session and checks whether the skill triggered correctly.

The optimization loop

Testing triggers isn’t a one-time activity. It’s a loop:

1. Write or update your skill description
2. Run the eval set
3. Identify failures
4. Revise the description: generalize, don't add specific keywords
5. Re-run the eval set
6. Repeat (5 iterations is usually enough)
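For the "identify failures" step, it helps to separate missed triggers from false triggers, since they call for different revisions. The sketch below assumes a hypothetical results file with tab-separated columns (expected, actual, query). It is not selftune's output format. It lists each failure and reports precision (how often a trigger was warranted) and recall (how often a warranted trigger happened).

```shell
# score_results FILE
# FILE is a hypothetical TSV: expected <TAB> actual <TAB> query,
# where expected and actual are "true" or "false".
score_results() {
  awk -F '\t' '
    $1 == "true"  && $2 == "true"  { tp++ }
    $1 == "true"  && $2 == "false" { fn++; print "MISSED\t" $3 }
    $1 == "false" && $2 == "true"  { fp++; print "FALSE_TRIGGER\t" $3 }
    $1 == "false" && $2 == "false" { tn++ }
    END {
      printf "tp=%d fn=%d fp=%d tn=%d\n", tp, fn, fp, tn
      # Skip a ratio when its denominator would be zero.
      if (tp + fp > 0) printf "precision=%.2f\n", tp / (tp + fp)
      if (tp + fn > 0) printf "recall=%.2f\n", tp / (tp + fn)
    }' "$1"
}
```

Low recall usually means the description is too narrow. Low precision usually means it is too broad.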

Avoid overfitting. If you keep tweaking the description to match specific test queries, you’ll break other queries. Split your tests into a training set (~60%) and a validation set (~40%). Optimize against the training set, then verify on the held-out set.
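One way to make the 60/40 split is sketched below, assuming the queries live one per line in a plain text file. `split_queries` is a hypothetical helper. It shuffles with a seeded random key so the split is repeatable on a given machine, then sends the first 60% of lines to the training file and the rest to the validation file.

```shell
# split_queries FILE TRAIN_OUT VALID_OUT [SEED]
# Shuffles FILE (one query per line) and writes ~60% of the lines to
# TRAIN_OUT and the remainder to VALID_OUT.
split_queries() {
  file=$1
  train_out=$2
  valid_out=$3
  seed=${4:-42}
  total=$(( $(wc -l < "$file") ))
  n_train=$(( (total * 60 + 50) / 100 ))   # 60%, rounded to nearest
  # Prefix each line with a seeded random key, sort on it, then drop the key.
  awk -v seed="$seed" 'BEGIN { srand(seed) } { printf "%.8f\t%s\n", rand(), $0 }' "$file" \
    | LC_ALL=C sort -n | cut -f 2- > "$file.shuffled"
  head -n "$n_train" "$file.shuffled" > "$train_out"
  tail -n +"$(( n_train + 1 ))" "$file.shuffled" > "$valid_out"
  rm -f "$file.shuffled"
}
```

For example, `split_queries queries.txt train.txt valid.txt` turns 20 queries into 12 training and 8 validation queries. The split is not stratified, so with a small set it is worth checking that both files still contain positives and negatives.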

Let selftune handle the loop

Instead of manually iterating on descriptions, let selftune’s evolution pipeline do it:

# For new draft packages, a typical verify-driven package proof loop is:
selftune verify --skill-path path/to/my-skill
selftune eval unit-test --skill my-skill --generate --skill-path path/to/SKILL.md
selftune verify --skill-path path/to/my-skill
selftune create replay --skill-path path/to/my-skill --mode package
selftune create baseline --skill-path path/to/my-skill --mode package
selftune verify --skill-path path/to/my-skill
selftune publish --skill-path path/to/my-skill

# For existing non-draft skills, run the classic evolve loop
selftune grade baseline --skill my-skill --skill-path path/to/SKILL.md
selftune evolve --skill my-skill --skill-path path/to/SKILL.md --pareto --candidates 5

Evolution generates multiple candidate descriptions, validates each against your eval set, and deploys the best one — or rejects all candidates if none improve.
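The accept-or-reject rule described above can be illustrated with a deliberately simplified sketch. This is not selftune's implementation: it assumes each candidate has already been reduced to a single eval score, whereas the --pareto flag suggests the real pipeline weighs more than one objective. `pick_candidate` and the score file layout are hypothetical.

```shell
# pick_candidate BASELINE_SCORE FILE
# FILE is a hypothetical TSV: score <TAB> candidate-id.
# Prints "deploy <id>" for the best-scoring candidate if it beats the
# baseline, otherwise "reject all".
pick_candidate() {
  awk -F '\t' -v base="$1" '
    NR == 1 || $1 + 0 > best { best = $1 + 0; id = $2 }
    END {
      if (NR > 0 && best > base + 0) print "deploy " id
      else print "reject all"
    }' "$2"
}
```

The point of the rule is that a revision never ships just because it is new: if no candidate beats the current description on the eval set, the current description stays.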

Common trigger failures

Symptom	Likely cause	Fix
Skill never fires for implicit queries	Description uses only technical terms	Add natural language synonyms to description
Skill fires for unrelated queries	Description is too broad	Add negative patterns, narrow the trigger scope
Skill fires inconsistently	Description lacks “USE WHEN” keywords	Add explicit trigger keyword list
Two skills compete for the same query	Overlapping descriptions	Run `selftune eval composability` to diagnose

Next steps

Iteration Loop

Evals Reference
