Eval Suites - selftune

Overview

Eval suites are sets of test cases that verify that your skill triggers (or doesn’t trigger) on specific queries. SelfTune can suggest new cases automatically from telemetry data and recent runs. You can accept those suggestions into the draft editor or write them directly into the selected saved suite.

Suggestion review workflow

When SelfTune detects queries your skill missed, failed on, or regressed against, it surfaces them as eval suggestions on the skill’s detail page. Each suggestion shows:

The query text
Whether the skill should or should not trigger
The source (linked telemetry or hosted run)
Why it was raised (missed query, failed eval, run regression, or run failure)
Observed count and confidence
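
The fields listed above can be pictured as a small record. The following is a minimal sketch, not SelfTune's actual schema: the class and field names are hypothetical, the query is a made-up example, and the confidence scale is an assumption. Only the source and reason values are taken from the provenance list used elsewhere on this page.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    LINKED_TELEMETRY = "linked_telemetry"
    HOSTED_RUN = "hosted_run"


class Provenance(Enum):
    MISSED_QUERY = "missed_query"
    FAILED_EVALUATION = "failed_evaluation"
    RUN_REGRESSION = "run_regression"
    RUN_FAILURE = "run_failure"


@dataclass(frozen=True)
class EvalSuggestion:
    # Hypothetical field names that mirror the items shown on each suggestion.
    query: str
    should_trigger: bool
    source: Source
    provenance: Provenance
    observed_count: int
    confidence: float  # assumed to be a 0..1 score; the page may present it differently


# Made-up example values, for illustration only.
suggestion = EvalSuggestion(
    query="summarize this pull request",
    should_trigger=True,
    source=Source.LINKED_TELEMETRY,
    provenance=Provenance.MISSED_QUERY,
    observed_count=4,
    confidence=0.8,
)
print(suggestion.source.value, suggestion.provenance.value)
```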

You can:

Accept into draft to open the suite editor with the case pre-filled
Save to suite to append the case directly into the selected saved suite
Dismiss to remove it from the pending queue

Saved-suite writes are still explicit operator actions. SelfTune does not silently rewrite the authoritative suite.

Reviewed suggestion history

After you review suggestions, they move into the Reviewed suggestion history section below the pending queue. This section shows your last 8 reviewed cases with their outcomes. Each reviewed suggestion card displays:

Whether it was Accepted or Dismissed
The provenance badge (Failed eval, Missed query, Run regression, Run failure)
The data source badge (Telemetry or Hosted run)
The query text and AI rationale
When it was last seen and when you reviewed it

Actions on reviewed suggestions

Situation	Action	What happens
You accepted a suggestion and want to add the case to the draft again	Add to draft again	Opens the suite editor with the case merged in — no need to re-review
You accepted a suggestion into the draft and now want it in the authoritative suite	Save to suite	Appends the case into the selected saved suite and records the acceptance target as `saved_suite`
You dismissed a suggestion but want to accept it now	Accept into draft	Records a new accepted review and opens the suite editor
You want to undo your review entirely	Restore to pending	Clears the review record and returns the suggestion to the pending queue
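
One way to read the table above is as a small state machine over a review record. This is a sketch under assumptions, not SelfTune's implementation: the `Review` class, the action names, and the `draft` target label are hypothetical, and treating Save to suite from the pending queue as an acceptance is also an assumption. Only the `saved_suite` target value comes from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Review:
    status: str = "pending"                  # "pending" | "accepted" | "dismissed"
    acceptance_target: Optional[str] = None  # "draft" (assumed label) or "saved_suite"


# Which actions are offered in each state (assumed from the table and the pending queue).
ALLOWED = {
    "pending": {"accept_into_draft", "save_to_suite", "dismiss"},
    "accepted": {"add_to_draft_again", "save_to_suite", "restore_to_pending"},
    "dismissed": {"accept_into_draft", "restore_to_pending"},
}


def apply_action(review: Review, action: str) -> Review:
    if action not in ALLOWED[review.status]:
        raise ValueError(f"{action!r} is not available for a {review.status} suggestion")
    if action == "accept_into_draft":
        return Review("accepted", "draft")
    if action == "save_to_suite":
        return Review("accepted", "saved_suite")
    if action == "dismiss":
        return Review("dismissed", None)
    if action == "restore_to_pending":
        return Review("pending", None)  # clears the review record
    return review  # add_to_draft_again leaves the review record unchanged


review = apply_action(Review(), "dismiss")
review = apply_action(review, "accept_into_draft")
review = apply_action(review, "save_to_suite")
print(review)
```

Modeling the record as immutable keeps each action an explicit step, which matches the point that saved-suite writes are operator actions rather than silent rewrites.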

Restore to pending

Restoring a suggestion removes your review decision. The suggestion reappears in the pending queue so you (or a teammate) can review it fresh. This is useful if:

You dismissed a suggestion by mistake
A previously accepted case was removed from the suite and you want to reconsider it
A teammate should re-review the case with fresh context

If you have more than 8 reviewed suggestions, the history shows the 8 most recent. Older records are still stored and accessible via the Sources API.
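
The 8-item window can be sketched as a sort followed by a slice. This is an illustration only: the record shape and the `reviewed_at` field are hypothetical, and ordering by review time (rather than by when the suggestion was last seen) is an assumption.

```python
from datetime import datetime, timedelta

HISTORY_LIMIT = 8  # the history section shows the 8 most recent reviews


def visible_history(reviews, limit=HISTORY_LIMIT):
    # Newest first, then keep only the first `limit` records.
    ordered = sorted(reviews, key=lambda r: r["reviewed_at"], reverse=True)
    return ordered[:limit]


# Eleven made-up review records, one hour apart.
base = datetime(2024, 1, 1)
reviews = [{"id": i, "reviewed_at": base + timedelta(hours=i)} for i in range(11)]

shown = visible_history(reviews)
print([r["id"] for r in shown])  # [10, 9, 8, 7, 6, 5, 4, 3]
```

The three oldest records (ids 0 to 2) drop out of the visible window but remain in `reviews`, mirroring how older records stay stored.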

Latest run pressure

The cloud skill page also shows a Latest run pressure section, sourced from the newest improve run for that skill source. This is different from the pending suggestion queue:

It shows what the latest run actually failed or regressed on
It appears even before you review those cases into the pending queue
It links back to the improve run detail page so you can inspect the result in context

If a latest-run pressure case is still net-new, you can accept it into the draft or save it directly into the selected suite from the source page.

Creating a suite from suggestions

Open the skill detail page for a cloud source.
Review pending suggestions and pick out the cases that represent real usage patterns your skill should handle.
For each of those cases, either accept it into the draft editor or save it directly into the selected suite.
If you opened the draft editor, add more rows, set the verifier, then save.
Run the suite from the improve tab to measure coverage.

Preserved provenance

Cases accepted from suggestions retain their provenance in the saved suite:

source:
- linked_telemetry
- hosted_run
provenance:
- missed_query
- failed_evaluation
- run_regression
- run_failure

That keeps operator-authored suites auditable even as they learn from recent evidence. For programmatic workflows, use the Eval Suites API to create and manage suites directly.
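
As an illustration of why retained provenance helps auditing, the sketch below stores the source and provenance on each case and then counts cases by origin. The suite shape, the function names, the sample queries, and the `hand_written` label for rows without provenance are all hypothetical; only the source and provenance values come from the list above. It does not use the Eval Suites API.

```python
from collections import Counter

SOURCES = {"linked_telemetry", "hosted_run"}
PROVENANCES = {"missed_query", "failed_evaluation", "run_regression", "run_failure"}


def append_case(suite, query, should_trigger, source=None, provenance=None):
    # Reject values outside the documented lists so the audit trail stays clean.
    if source is not None and source not in SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    if provenance is not None and provenance not in PROVENANCES:
        raise ValueError(f"unknown provenance: {provenance!r}")
    case = {"query": query, "should_trigger": should_trigger}
    if source is not None:
        case["source"] = source
    if provenance is not None:
        case["provenance"] = provenance
    suite["cases"].append(case)
    return case


def provenance_summary(suite):
    # Rows added by hand carry no provenance; count them under an assumed label.
    return Counter(case.get("provenance", "hand_written") for case in suite["cases"])


suite = {"cases": []}
append_case(suite, "convert this csv to json", True, "linked_telemetry", "missed_query")
append_case(suite, "what is the weather today", False, "hosted_run", "run_regression")
append_case(suite, "format my table", True)  # operator-authored row

summary = provenance_summary(suite)
print(dict(summary))
```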
