You wrote a skill. It works when you test it with the exact phrase you had in mind. Then a real user says “can you check if my site is safe” and your web-assessment skill sits there silently because the description says “execute web security assessment.”

Trigger testing answers a simple question: given a user query, does the right skill fire?
The fastest way to test a trigger is to run your agent with a prompt and check if the skill activated.
Claude Code:

    claude -p "make me a slide deck about Q3 results" --output-format json 2>/dev/null \
      | jq 'any(.messages[].content[]; .type == "tool_use" and .name == "Skill" and .input.skill == "pptx")'

This returns true if the skill was invoked, false if it wasn’t.

Other agents:

Most agent platforms have a way to run a prompt non-interactively and inspect the output. Check your platform’s CLI documentation for the equivalent of Claude Code’s -p (prompt) and --output-format json flags.
Run several variations:
    # Should trigger
    claude -p "create a presentation about our product launch"
    claude -p "I need slides for the board meeting"
    claude -p "make a pptx with these bullet points"

    # Should NOT trigger
    claude -p "what format should I use for my presentation?"
    claude -p "can you review this slide deck for typos?"
The agent skills spec recommends ~20 test queries: half that should trigger, half that shouldn’t. The quality of your negative examples matters more than the positives. A weak negative has nothing in common with the skill, so it passes trivially:
    # Weak negative for a CSV analysis skill:
    "Write a fibonacci function"
    # This tells you nothing. Of course it shouldn't trigger.
Strong negatives share keywords but need a different skill (or no skill):
    # Strong negatives for a CSV analysis skill:
    "Write a script that reads a CSV and uploads each row to postgres"
    "Convert this CSV to JSON format"
    "Can you explain what a CSV file is?"
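Once you have a set of positive and negative queries, running them by hand gets tedious. Here is a minimal sketch of a runner, assuming a hypothetical tab-separated eval file in which each line is `yes` or `no` (should the skill trigger?) followed by the query. The function names `skill_fired` and `run_evals` are invented for illustration, and the jq filter is the one from the Claude Code example, so it carries the same assumption about the JSON output shape.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 if skill $2 was invoked for query $1.
# Same jq filter as the Claude Code example; `jq -e` turns true/false
# into an exit status.
skill_fired() {
    claude -p "$1" --output-format json 2>/dev/null \
        | jq -e --arg skill "$2" \
            'any(.messages[].content[]; .type == "tool_use" and .name == "Skill" and .input.skill == $skill)' \
            >/dev/null
}

# Reads "yes|no<TAB>query" lines on stdin (assumed format), prints every
# mismatch, then a summary line.
run_evals() {
    skill=$1
    pass=0
    total=0
    tab=$(printf '\t')
    while IFS="$tab" read -r expected query; do
        [ -n "$query" ] || continue   # skip blank or malformed lines
        # Give the agent an empty stdin so it cannot consume the eval lines.
        if skill_fired "$query" "$skill" </dev/null; then
            actual=yes
        else
            actual=no
        fi
        total=$((total + 1))
        if [ "$actual" = "$expected" ]; then
            pass=$((pass + 1))
        else
            printf 'FAIL expected=%s actual=%s: %s\n' "$expected" "$actual" "$query"
        fi
    done
    printf '%d/%d passed\n' "$pass" "$total"
}
```

With the queries saved in a file such as `evals.tsv` (a hypothetical name), `run_evals pptx < evals.tsv` lists the queries that misfired. A `FAIL expected=no actual=yes` line on a strong negative is usually the most informative result.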
Testing triggers isn’t a one-time activity. It’s a loop:
1. Write or update your skill description
2. Run eval set
3. Identify failures
4. Revise description: generalize, don't add specific keywords
5. Re-run eval set
6. Repeat (5 iterations is usually enough)
Avoid overfitting. If you keep tweaking the description to match specific test queries, you’ll break other queries. Split your tests into a training set (~60%) and a validation set (~40%). Optimize against the training set, then verify on the held-out set.
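One way to make the split, sketched under the assumption that the eval set is a tab-separated file whose first column is the expected label (`yes` or `no`). The function name `split_evals` and the `.train` / `.val` suffixes are hypothetical. The split is deterministic and per label, so both sets keep the same ratio of positives to negatives.

```shell
#!/bin/sh
# Hypothetical helper: splits FILE into FILE.train (~60%) and FILE.val (~40%).
# Within each label, lines 1-3 of every group of 5 go to the training set
# and lines 4-5 go to the validation set.
split_evals() {
    awk -F '\t' -v train="$1.train" -v val="$1.val" '
        {
            n[$1]++                       # running count for this label
            if ((n[$1] - 1) % 5 < 3) {
                print > train
            } else {
                print > val
            }
        }
    ' "$1"
}
```

Because the assignment follows file order, shuffle the file once beforehand if similar queries sit next to each other. Tune the description only against the `.train` file, and run the `.val` file when you think you are done.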
Evolution generates multiple candidate descriptions, validates each against your eval set, and deploys the best one, or rejects all candidates if none improve.
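The text above does not say how candidates are scored or compared, so the following only illustrates the "best one or none" rule and is not how Evolution works internally. It assumes each candidate has already been given a numeric score (for example, the fraction of validation queries it passed) and that the current description's score is the baseline. The function name `pick_best` and the input format are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "score<TAB>description" lines on stdin.
# Prints the highest-scoring description if it beats the baseline score $1;
# otherwise prints nothing and exits 1 (all candidates rejected).
pick_best() {
    awk -F '\t' -v base="$1" '
        NR == 1 || $1 + 0 > best + 0 { best = $1; desc = $2 }
        END {
            if (NR > 0 && best + 0 > base + 0) {
                print desc
            } else {
                exit 1
            }
        }
    '
}
```

Requiring a strict improvement over the baseline means a candidate that merely ties the current description is rejected, which avoids churn when nothing actually got better.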