Eval Suites API - selftune

Overview

Eval suites define the cases that a cloud improve run scores against. SelfTune currently supports two runnable verifier families:

llm_judge for trigger-query or rubric-style cases
deterministic for exact-match, boolean, JSON-schema, and task_package cases

Use deterministic suites whenever you can express the check as code or a benchmark task package. They are cheaper to run and easier to audit.

Endpoints

Method	Path	Purpose
`GET`	`/api/v1/eval-suites`	List suites for your organisation
`GET`	`/api/v1/eval-suites/{id}`	Get one suite
`POST`	`/api/v1/eval-suites`	Create a suite
`PATCH`	`/api/v1/eval-suites/{id}`	Update a suite

All endpoints require an API key:

Authorization: Bearer st_live_...
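
As a minimal sketch of how the header above might be attached to a call, the following builds (but does not send) a request with Python's standard library. The host in `BASE_URL` and the helper name `build_request` are assumptions for illustration; this page does not state the real host.

```python
import urllib.request
from typing import Optional

# Assumption: placeholder host. Replace with the host for your SelfTune deployment.
BASE_URL = "https://selftune.example.invalid"


def build_request(method: str, path: str, api_key: str,
                  body: Optional[bytes] = None) -> urllib.request.Request:
    """Build an authenticated request object without sending it."""
    headers = {"Authorization": f"Bearer {api_key}"}
    if body is not None:
        # Create and update calls send a JSON body.
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=body,
                                  headers=headers, method=method)


# List suites for the organisation that owns the key.
req = build_request("GET", "/api/v1/eval-suites", "st_live_...")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be inspected offline.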

Create a suite

POST /api/v1/eval-suites
Content-Type: application/json
Authorization: Bearer {API_KEY}

{
  "source_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "name": "cloud-improve-smoke",
  "source_kind": "manual",
  "verifier_kind": "deterministic",
  "supports_no_skill": false,
  "cases_json": [
    {
      "case_kind": "task_package",
      "instruction": "Verify the skill archive is mounted and contains a skill document.",
      "environment_ref": "r2://selftune-registry/benchmarks/cloud-improve/smoke-basic.tar.gz",
      "verifier_ref": "tests/test.sh",
      "skill_bundle_ref": "skill-under-test",
      "resource_hints": {
        "working_dir": ".",
        "timeout_ms": 120000
      }
    }
  ]
}

Request fields

Field	Type	Required	Notes
`source_id`	UUID string	No	Associate the suite with one cloud source
`name`	string	Yes	Human-readable label
`source_kind`	`manual` \| `imported` \| `contributor_aggregate` \| `org_exemplars`	Yes	Runnable today: `manual` suites with `llm_judge` or `deterministic`, and `imported` suites when they are deterministic `task_package` suites
`verifier_kind`	`llm_judge` \| `deterministic` \| `structured_rubric`	Yes	Only `llm_judge` and `deterministic` are runnable today
`supports_no_skill`	boolean	No	Whether the suite can also evaluate a `no_skill` baseline
`resource_limits_json`	object	No	Optional execution hints
`cases_json`	array	Yes	Between 1 and 500 cases
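
The constraints in the table above can be checked client-side before sending a create request. This is a sketch only: `check_create_payload` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
SOURCE_KINDS = {"manual", "imported", "contributor_aggregate", "org_exemplars"}
VERIFIER_KINDS = {"llm_judge", "deterministic", "structured_rubric"}
REQUIRED_FIELDS = ("name", "source_kind", "verifier_kind", "cases_json")


def check_create_payload(payload: dict) -> list:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in payload:
            problems.append(f"missing required field: {field}")
    if "source_kind" in payload and payload["source_kind"] not in SOURCE_KINDS:
        problems.append("source_kind is not one of the documented values")
    if "verifier_kind" in payload and payload["verifier_kind"] not in VERIFIER_KINDS:
        problems.append("verifier_kind is not one of the documented values")
    cases = payload.get("cases_json")
    if cases is not None:
        if not isinstance(cases, list) or not 1 <= len(cases) <= 500:
            problems.append("cases_json must be an array of 1 to 500 cases")
    return problems


payload = {
    "name": "cloud-improve-smoke",
    "source_kind": "manual",
    "verifier_kind": "deterministic",
    "cases_json": [{"case_kind": "task_package",
                    "environment_ref": "r2://selftune-registry/benchmarks/cloud-improve/smoke-basic.tar.gz"}],
}
print(check_create_payload(payload))
```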

Runnable case kinds

`llm_judge`

Allowed case kinds:

trigger_query
llm_rubric

`deterministic`

Allowed case kinds:

exact_match
json_schema
boolean_assertion
task_package
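
The pairing of verifier kinds and case kinds listed above can be expressed as a lookup. The function name `find_disallowed_cases` is hypothetical, and the check is a local approximation of whatever the server enforces.

```python
ALLOWED_CASE_KINDS = {
    "llm_judge": {"trigger_query", "llm_rubric"},
    "deterministic": {"exact_match", "json_schema", "boolean_assertion", "task_package"},
}


def find_disallowed_cases(verifier_kind: str, cases: list) -> list:
    """Return (index, case_kind) pairs for cases the verifier kind does not allow."""
    allowed = ALLOWED_CASE_KINDS.get(verifier_kind)
    if allowed is None:
        # structured_rubric and unknown values are not runnable today.
        raise ValueError(f"{verifier_kind!r} is not a runnable verifier kind")
    return [(i, case.get("case_kind"))
            for i, case in enumerate(cases)
            if case.get("case_kind") not in allowed]


cases = [{"case_kind": "exact_match"}, {"case_kind": "trigger_query"}]
print(find_disallowed_cases("deterministic", cases))
```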

Task-package deterministic cases

task_package is the hosted benchmark-style lane. It is designed to stay portable toward SkillsBench and Harbor-style task packages while using the current SelfTune runner contract.

Required fields:

Field	Description
`case_kind`	Must be `task_package`
`environment_ref`	Archive containing the environment under test

Optional fields:

Field	Description
`instruction`	Human-readable task instruction passed to the verifier as `TASK_INSTRUCTION`
`verifier_ref`	Path inside the environment archive to the verifier entrypoint. Defaults to `tests/test.sh`
`skill_bundle_ref`	Path inside the environment archive where SelfTune mounts the skill under test. Defaults to `skill-under-test`
`oracle_ref`	Script to run before the verifier
`resource_hints`	Timeout, working directory, and extra environment hints

The environment archive can live in:

r2://bucket/key
an HTTP(S) URL that the runtime can fetch
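
To make the documented defaults and archive locations concrete, the sketch below fills in `verifier_ref` and `skill_bundle_ref` when they are absent and rejects an `environment_ref` that is neither an `r2://` reference nor an HTTP(S) URL. `normalize_task_package` is a hypothetical client-side helper; it mirrors the tables above and is not the runner's own logic.

```python
from urllib.parse import urlparse

# Defaults as documented in the optional-fields table.
TASK_PACKAGE_DEFAULTS = {
    "verifier_ref": "tests/test.sh",
    "skill_bundle_ref": "skill-under-test",
}
ALLOWED_SCHEMES = {"r2", "http", "https"}


def normalize_task_package(case: dict) -> dict:
    """Return a copy of the case with documented defaults applied."""
    if case.get("case_kind") != "task_package":
        raise ValueError("case_kind must be task_package")
    ref = case.get("environment_ref")
    if not ref:
        raise ValueError("environment_ref is required")
    if urlparse(ref).scheme not in ALLOWED_SCHEMES:
        raise ValueError("environment_ref must be r2://bucket/key or an HTTP(S) URL")
    # Values present in the case take precedence over the defaults.
    return {**TASK_PACKAGE_DEFAULTS, **case}


case = normalize_task_package({
    "case_kind": "task_package",
    "environment_ref": "r2://selftune-registry/benchmarks/cloud-improve/smoke-basic.tar.gz",
})
print(case["verifier_ref"], case["skill_bundle_ref"])
```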

Example response

{
  "suite": {
    "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "name": "cloud-improve-smoke",
    "source_kind": "manual",
    "verifier_kind": "deterministic",
    "supports_no_skill": false,
    "cases_json": [
      {
        "case_kind": "task_package",
        "environment_ref": "r2://selftune-registry/benchmarks/cloud-improve/smoke-basic.tar.gz"
      }
    ]
  }
}

Update a suite

Updates an existing eval suite. All fields are optional; include only the fields you want to change.

PATCH /api/v1/eval-suites/{id}
Content-Type: application/json
Authorization: Bearer {API_KEY}

{
  "name": "Updated suite name",
  "cases_json": [
    {
      "case_kind": "trigger_query",
      "query": "how do I deploy?",
      "expectation": "should_trigger"
    }
  ]
}

Updatable fields

Field	Type	Notes
`name`	string	New display name
`verifier_kind`	`llm_judge` \| `deterministic` \| `structured_rubric`	Change verifier type
`supports_no_skill`	boolean	Toggle no-skill baseline support
`resource_limits_json`	object	Execution hints
`cases_json`	array	Replace all cases (1–500)
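
Because a PATCH body carries only the fields being changed, a small helper can guard against sending fields that are not in the table above. `build_patch_body` is a hypothetical name, and the sketch assumes the server ignores or rejects anything outside the documented set.

```python
import json

UPDATABLE_FIELDS = {"name", "verifier_kind", "supports_no_skill",
                    "resource_limits_json", "cases_json"}


def build_patch_body(changes: dict) -> str:
    """Serialize only the changed fields, refusing ones that are not updatable."""
    unknown = set(changes) - UPDATABLE_FIELDS
    if unknown:
        raise ValueError(f"not updatable: {sorted(unknown)}")
    if "cases_json" in changes and not 1 <= len(changes["cases_json"]) <= 500:
        # cases_json replaces every existing case, so it cannot be empty.
        raise ValueError("cases_json must contain 1 to 500 cases")
    return json.dumps(changes)


body = build_patch_body({"name": "Updated suite name"})
print(body)
```

Note that `cases_json` replaces the full case list, so a partial list drops the cases it omits.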

Notes

Suites that mix case kinds are allowed at the schema layer, but the suite kind is inferred as task_package only when every case is task_package.
task_package is the deterministic execution lane. It does not replace the existing SkillsBench import path, which still adapts imported query corpora into trigger_query cases.
imported is the source kind for benchmark suites imported from external corpora or package manifests. In the current hosted flow, imported suites are runnable only when every case is task_package.
If you are importing a benchmark manifest instead of hand-authoring JSON, start with Imported benchmark suites and the bun run import:cloud-improve-benchmark-suite helper.
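
The runnability rules stated on this page can be summarized as a predicate. This is a reading of the documented rules under the assumption that no other conditions apply; `is_runnable` is a hypothetical name, not part of the API.

```python
RUNNABLE_VERIFIERS = {"llm_judge", "deterministic"}


def is_runnable(source_kind: str, verifier_kind: str, cases: list) -> bool:
    """Approximate the 'runnable today' rules described in this page."""
    every_case_is_task_package = bool(cases) and all(
        case.get("case_kind") == "task_package" for case in cases
    )
    if source_kind == "manual":
        return verifier_kind in RUNNABLE_VERIFIERS
    if source_kind == "imported":
        # Imported suites run only as deterministic, all-task_package suites.
        return verifier_kind == "deterministic" and every_case_is_task_package
    # contributor_aggregate and org_exemplars are not listed as runnable.
    return False


mixed = [{"case_kind": "task_package"}, {"case_kind": "exact_match"}]
print(is_runnable("imported", "deterministic", mixed))
```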
