Overview
Eval suites define the cases that a cloud improve run scores against. SelfTune currently supports two runnable verifier families:
- `llm_judge` for trigger-query or rubric-style cases
- `deterministic` for exact-match, boolean, JSON-schema, and `task_package` cases
Endpoints
| Method | Path | Purpose |
|---|---|---|
| GET | `/api/v1/eval-suites` | List suites for your organisation |
| GET | `/api/v1/eval-suites/{id}` | Get one suite |
| POST | `/api/v1/eval-suites` | Create a suite |
| PATCH | `/api/v1/eval-suites/{id}` | Update a suite |
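
As a minimal sketch, the four endpoints above can be expressed as a small route helper. The `evalSuiteRoute` function and its action names are hypothetical and not part of the API; the base URL and authentication are left out because this page does not specify them.

```typescript
// Hypothetical helper: maps an action name to the method and path from the table above.
type EvalSuiteAction = "list" | "get" | "create" | "update";

interface Route {
  method: "GET" | "POST" | "PATCH";
  path: string;
}

// `id` is required for "get" and "update"; it is ignored for "list" and "create".
function evalSuiteRoute(action: EvalSuiteAction, id?: string): Route {
  const base = "/api/v1/eval-suites";
  switch (action) {
    case "list":
      return { method: "GET", path: base };
    case "create":
      return { method: "POST", path: base };
    case "get":
    case "update": {
      if (!id) throw new Error(`"${action}" requires a suite id`);
      const method = action === "get" ? "GET" : "PATCH";
      return { method, path: `${base}/${encodeURIComponent(id)}` };
    }
  }
}
```
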
Create a suite
Request fields
| Field | Type | Required | Notes |
|---|---|---|---|
| `source_id` | UUID string | No | Associate the suite with one cloud source |
| `name` | string | Yes | Human-readable label |
| `source_kind` | `manual` \| `imported` \| `contributor_aggregate` \| `org_exemplars` | Yes | Runnable today: `manual` suites with `llm_judge` or `deterministic`, and `imported` suites when they are deterministic `task_package` suites |
| `verifier_kind` | `llm_judge` \| `deterministic` \| `structured_rubric` | Yes | Only `llm_judge` and `deterministic` are runnable today |
| `supports_no_skill` | boolean | No | Whether the suite can also evaluate a `no_skill` baseline |
| `resource_limits_json` | object | No | Optional execution hints |
| `cases_json` | array | Yes | Between 1 and 500 cases |
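
The request fields above can be sketched as a TypeScript type with a few local checks. This is an illustration under assumptions, not the server's validation: the type names, `checkCreateRequest`, and the example values are hypothetical, and only `case_kind` is assumed for each case because the remaining case fields depend on the case kind.

```typescript
type SourceKind = "manual" | "imported" | "contributor_aggregate" | "org_exemplars";
type VerifierKind = "llm_judge" | "deterministic" | "structured_rubric";

// Only `case_kind` is assumed here; other fields vary by case kind.
interface EvalCase {
  case_kind: string;
  [field: string]: unknown;
}

interface CreateEvalSuiteRequest {
  source_id?: string;
  name: string;
  source_kind: SourceKind;
  verifier_kind: VerifierKind;
  supports_no_skill?: boolean;
  resource_limits_json?: Record<string, unknown>;
  cases_json: EvalCase[];
}

// Returns a list of problems; an empty list means the body passes these local checks.
function checkCreateRequest(body: CreateEvalSuiteRequest): string[] {
  const problems: string[] = [];
  if (body.name.trim() === "") problems.push("name is required");
  if (body.cases_json.length < 1 || body.cases_json.length > 500) {
    problems.push("cases_json must contain between 1 and 500 cases");
  }
  if (body.verifier_kind === "structured_rubric") {
    problems.push("structured_rubric is accepted but not runnable today");
  }
  return problems;
}

// Hypothetical example body: a manual, deterministic suite with one task_package case.
const createBody: CreateEvalSuiteRequest = {
  name: "Example deterministic suite",
  source_kind: "manual",
  verifier_kind: "deterministic",
  supports_no_skill: false,
  cases_json: [{ case_kind: "task_package", environment_ref: "r2://bucket/key" }],
};
```

Serialising `createBody` with `JSON.stringify` gives a body that could be sent to `POST /api/v1/eval-suites`.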
Runnable case kinds
llm_judge
Allowed case kinds:
- `trigger_query`
- `llm_rubric`
deterministic
Allowed case kinds:
- `exact_match`
- `json_schema`
- `boolean_assertion`
- `task_package`
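
The pairing of verifier kinds and case kinds above can be checked locally before a suite is submitted. The sketch below is one possible shape; `ALLOWED_CASE_KINDS` and `disallowedCaseKinds` are hypothetical names, and the server may apply further rules that this page does not list.

```typescript
// Allowed case kinds per runnable verifier kind, as listed above.
const ALLOWED_CASE_KINDS = new Map<string, readonly string[]>([
  ["llm_judge", ["trigger_query", "llm_rubric"]],
  ["deterministic", ["exact_match", "json_schema", "boolean_assertion", "task_package"]],
]);

// Returns the case kinds that are not allowed for the given verifier kind.
// A verifier kind that is not runnable (for example structured_rubric) rejects every case kind.
function disallowedCaseKinds(verifierKind: string, caseKinds: string[]): string[] {
  const allowed = ALLOWED_CASE_KINDS.get(verifierKind);
  if (allowed === undefined) return [...caseKinds];
  return caseKinds.filter((kind) => !allowed.includes(kind));
}
```

An empty result means every case kind in the suite is allowed for that verifier kind.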
Task-package deterministic cases
task_package is the hosted benchmark-style lane. It is designed to stay
portable toward SkillsBench and Harbor-style task packages while using the
current SelfTune runner contract.
Required fields:
| Field | Description |
|---|---|
| `case_kind` | Must be `task_package` |
| `environment_ref` | Archive containing the environment under test |

Optional fields:
| Field | Description |
|---|---|
| `instruction` | Human-readable task instruction passed to the verifier as `TASK_INSTRUCTION` |
| `verifier_ref` | Path inside the environment archive to the verifier entrypoint. Defaults to `tests/test.sh` |
| `skill_bundle_ref` | Path inside the environment archive where SelfTune mounts the skill under test. Defaults to `skill-under-test` |
| `oracle_ref` | Optional script to run before the verifier |
| `resource_hints` | Timeout, working directory, and extra environment hints |

`environment_ref` accepts either of these forms:
- `r2://bucket/key`
- an HTTP(S) URL that the runtime can fetch
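
A `task_package` case built from the fields above might be modelled as follows. This is a sketch under assumptions: `TaskPackageCase`, `isSupportedArchiveRef`, and `withTaskPackageDefaults` are hypothetical names, the shape of `resource_hints` is left open because this page does not define it, and the documented defaults are mirrored locally only for illustration.

```typescript
interface TaskPackageCase {
  case_kind: "task_package";
  environment_ref: string;
  instruction?: string;
  verifier_ref?: string;
  skill_bundle_ref?: string;
  oracle_ref?: string;
  resource_hints?: Record<string, unknown>; // shape not specified on this page
}

// Accepts the two documented archive reference forms: r2://bucket/key or an HTTP(S) URL.
function isSupportedArchiveRef(ref: string): boolean {
  return /^r2:\/\/[^/]+\/.+/.test(ref) || /^https?:\/\/.+/i.test(ref);
}

// Fills in the documented defaults for verifier_ref and skill_bundle_ref.
function withTaskPackageDefaults(c: TaskPackageCase): TaskPackageCase {
  if (!isSupportedArchiveRef(c.environment_ref)) {
    throw new Error("environment_ref must be r2://bucket/key or an HTTP(S) URL");
  }
  return {
    ...c,
    verifier_ref: c.verifier_ref ?? "tests/test.sh",
    skill_bundle_ref: c.skill_bundle_ref ?? "skill-under-test",
  };
}
```
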
Example response
Update a suite
Updates an existing eval suite. All fields are optional; include only the fields you want to change.
Updatable fields
| Field | Type | Notes |
|---|---|---|
| `name` | string | New display name |
| `verifier_kind` | `llm_judge` \| `deterministic` \| `structured_rubric` | Change verifier type |
| `supports_no_skill` | boolean | Toggle no-skill baseline support |
| `resource_limits_json` | object | Execution hints |
| `cases_json` | array | Replace all cases (1–500) |
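
Because an update sends only the fields that change, a client can strip unset fields before issuing the `PATCH`. The sketch below assumes a hypothetical `buildUpdateBody` helper and checks only the documented 1 to 500 case limit; it is not the server's own validation.

```typescript
interface UpdateEvalSuiteRequest {
  name?: string;
  verifier_kind?: "llm_judge" | "deterministic" | "structured_rubric";
  supports_no_skill?: boolean;
  resource_limits_json?: Record<string, unknown>;
  cases_json?: Array<{ case_kind: string; [field: string]: unknown }>;
}

// Drops undefined fields so the PATCH body contains only the intended changes.
function buildUpdateBody(changes: UpdateEvalSuiteRequest): UpdateEvalSuiteRequest {
  const cases = changes.cases_json;
  if (cases !== undefined && (cases.length < 1 || cases.length > 500)) {
    throw new Error("cases_json must contain between 1 and 500 cases");
  }
  const body: Record<string, unknown> = {};
  for (const key of Object.keys(changes) as Array<keyof UpdateEvalSuiteRequest>) {
    const value = changes[key];
    if (value !== undefined) body[key] = value;
  }
  return body as UpdateEvalSuiteRequest;
}
```

Note that `cases_json` replaces the full case list, so a partial list in the update body would remove the cases it omits.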
Notes
- Mixed-case suites are allowed at the schema layer, but the suite kind is only inferred as `task_package` when every case is `task_package`.
- `task_package` is the deterministic execution lane. It does not replace the existing SkillsBench import path, which still adapts imported query corpora into `trigger_query` cases.
- `imported` is the source-kind for benchmark suites imported from external corpora or package manifests. In the current hosted flow, imported suites are runnable only when every case is `task_package`.
- If you are importing a benchmark manifest instead of hand-authoring JSON, start with Imported benchmark suites and the `bun run import:cloud-improve-benchmark-suite` helper.
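
The runnability rules in these notes and in the request-field table can be summarised as a small predicate. This is a reading of the documented rules, not the hosted implementation: `inferredAsTaskPackage` and `isRunnableToday` are hypothetical names, and treating `contributor_aggregate` and `org_exemplars` as not runnable is an assumption drawn from their absence in the "Runnable today" note.

```typescript
// A suite is inferred as task_package only when every case is task_package.
function inferredAsTaskPackage(caseKinds: string[]): boolean {
  return caseKinds.length > 0 && caseKinds.every((kind) => kind === "task_package");
}

function isRunnableToday(
  sourceKind: string,
  verifierKind: string,
  caseKinds: string[],
): boolean {
  // Only llm_judge and deterministic verifiers are runnable today.
  if (verifierKind !== "llm_judge" && verifierKind !== "deterministic") return false;
  if (sourceKind === "manual") return true;
  if (sourceKind === "imported") {
    // Imported suites must be deterministic and consist entirely of task_package cases.
    return verifierKind === "deterministic" && inferredAsTaskPackage(caseKinds);
  }
  // Assumption: other source kinds are not runnable in the current hosted flow.
  return false;
}
```
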