Overview

Eval suites define the cases that a cloud improve run scores against. SelfTune currently supports two runnable verifier families:
  • llm_judge for trigger-query or rubric-style cases
  • deterministic for exact-match, boolean, JSON-schema, and task_package cases
Use deterministic suites whenever you can express the check as code or a benchmark task package. They are cheaper to run and easier to audit.

Endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /api/v1/eval-suites | List suites for your organisation |
| GET | /api/v1/eval-suites/{id} | Get one suite |
| POST | /api/v1/eval-suites | Create a suite |
| PATCH | /api/v1/eval-suites/{id} | Update a suite |

All endpoints require an API key:
Authorization: Bearer st_live_...
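
As a rough illustration, the header can be attached with Python's standard library. The host `https://api.example.com` and the helper name `build_request` are placeholders that do not come from this reference; substitute the base URL of your own SelfTune deployment.

```python
import json
import urllib.request

# Assumption: placeholder host. This page does not state the API base URL.
BASE_URL = "https://api.example.com"


def build_request(method, path, api_key, body=None):
    """Build (but do not send) an authenticated request for an eval-suite endpoint."""
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    if body is not None:
        # JSON bodies are only needed for POST and PATCH.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


# List suites for the organisation. Pass `req` to urllib.request.urlopen to send it.
req = build_request("GET", "/api/v1/eval-suites", "st_live_...")
```

Building the request separately from sending it makes the headers easy to inspect before any network call is made.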

Create a suite

POST /api/v1/eval-suites
Content-Type: application/json
Authorization: Bearer {API_KEY}

{
  "source_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "name": "cloud-improve-smoke",
  "source_kind": "manual",
  "verifier_kind": "deterministic",
  "supports_no_skill": false,
  "cases_json": [
    {
      "case_kind": "task_package",
      "instruction": "Verify the skill archive is mounted and contains a skill document.",
      "environment_ref": "r2://selftune-registry/benchmarks/cloud-improve/smoke-basic.tar.gz",
      "verifier_ref": "tests/test.sh",
      "skill_bundle_ref": "skill-under-test",
      "resource_hints": {
        "working_dir": ".",
        "timeout_ms": 120000
      }
    }
  ]
}

Request fields

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| source_id | UUID string | No | Associate the suite with one cloud source |
| name | string | Yes | Human-readable label |
| source_kind | manual \| imported \| contributor_aggregate \| org_exemplars | Yes | Runnable today: manual suites with llm_judge or deterministic, and imported suites when they are deterministic task_package suites |
| verifier_kind | llm_judge \| deterministic \| structured_rubric | Yes | Only llm_judge and deterministic are runnable today |
| supports_no_skill | boolean | No | Whether the suite can also evaluate a no_skill baseline |
| resource_limits_json | object | No | Optional execution hints |
| cases_json | array | Yes | Between 1 and 500 cases |
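
The constraints in this table can be checked on the client before a request is sent. The following is a minimal sketch, not the server's validation logic; the function name `check_create_payload` and the wording of the messages are illustrative only.

```python
ALLOWED_SOURCE_KINDS = {"manual", "imported", "contributor_aggregate", "org_exemplars"}
ALLOWED_VERIFIER_KINDS = {"llm_judge", "deterministic", "structured_rubric"}


def check_create_payload(payload):
    """Return a list of problems found in a create-suite payload (empty if none)."""
    problems = []
    if not isinstance(payload.get("name"), str):
        problems.append("name is required and must be a string")
    if payload.get("source_kind") not in ALLOWED_SOURCE_KINDS:
        problems.append("source_kind is missing or not a documented value")
    if payload.get("verifier_kind") not in ALLOWED_VERIFIER_KINDS:
        problems.append("verifier_kind is missing or not a documented value")
    cases = payload.get("cases_json")
    if not isinstance(cases, list) or not 1 <= len(cases) <= 500:
        problems.append("cases_json must be an array of 1 to 500 cases")
    # Optional field: only checked when present.
    if "supports_no_skill" in payload and not isinstance(payload["supports_no_skill"], bool):
        problems.append("supports_no_skill must be a boolean")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem in one pass.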

Runnable case kinds

llm_judge

Allowed case kinds:
  • trigger_query
  • llm_rubric

deterministic

Allowed case kinds:
  • exact_match
  • json_schema
  • boolean_assertion
  • task_package
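
The pairing between verifier families and case kinds listed above can be expressed as a lookup. This is a sketch under the assumption that the two lists are exhaustive for runnable suites; `disallowed_case_kinds` is a hypothetical helper name.

```python
# Case kinds listed on this page for each runnable verifier family.
ALLOWED_CASE_KINDS = {
    "llm_judge": {"trigger_query", "llm_rubric"},
    "deterministic": {"exact_match", "json_schema", "boolean_assertion", "task_package"},
}


def disallowed_case_kinds(verifier_kind, cases):
    """Return the sorted case kinds in `cases` that the verifier family does not list."""
    # A verifier kind with no entry (for example structured_rubric) allows nothing here.
    allowed = ALLOWED_CASE_KINDS.get(verifier_kind, set())
    found = {str(case.get("case_kind", "<missing>")) for case in cases}
    return sorted(found - allowed)
```

An empty result means every case kind in the suite is listed for that verifier family.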

Task-package deterministic cases

task_package is the hosted benchmark-style lane. It is designed to stay portable toward SkillsBench and Harbor-style task packages while using the current SelfTune runner contract.

Required fields:

| Field | Description |
| --- | --- |
| case_kind | Must be task_package |
| environment_ref | Archive containing the environment under test |

Optional fields:

| Field | Description |
| --- | --- |
| instruction | Human-readable task instruction passed to the verifier as TASK_INSTRUCTION |
| verifier_ref | Path inside the environment archive to the verifier entrypoint. Defaults to tests/test.sh |
| skill_bundle_ref | Path inside the environment archive where SelfTune mounts the skill under test. Defaults to skill-under-test |
| oracle_ref | Optional script to run before the verifier |
| resource_hints | Timeout, working directory, and extra environment hints |

The environment_ref value can point to the archive in either of two ways:
  • an r2://bucket/key reference
  • an HTTP(S) URL that the runtime can fetch
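
To make the required fields and defaults concrete, here is a small client-side sketch that fills in the documented defaults and rejects an environment_ref that is neither an r2:// reference nor an HTTP(S) URL. The helper name `normalize_task_package` is hypothetical, and the service applies its own defaults regardless of what the client does.

```python
DEFAULT_VERIFIER_REF = "tests/test.sh"
DEFAULT_SKILL_BUNDLE_REF = "skill-under-test"
ARCHIVE_PREFIXES = ("r2://", "http://", "https://")


def normalize_task_package(case):
    """Return a copy of a task_package case with the documented defaults made explicit."""
    if case.get("case_kind") != "task_package":
        raise ValueError("case_kind must be task_package")
    env_ref = case.get("environment_ref")
    if not isinstance(env_ref, str) or not env_ref.startswith(ARCHIVE_PREFIXES):
        raise ValueError("environment_ref must be an r2:// reference or an HTTP(S) URL")
    normalized = dict(case)  # shallow copy, so the caller's dict is left untouched
    normalized.setdefault("verifier_ref", DEFAULT_VERIFIER_REF)
    normalized.setdefault("skill_bundle_ref", DEFAULT_SKILL_BUNDLE_REF)
    return normalized
```

Writing the defaults out explicitly is optional, but it can make a stored suite definition easier to audit.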

Example response

{
  "suite": {
    "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "name": "cloud-improve-smoke",
    "source_kind": "manual",
    "verifier_kind": "deterministic",
    "supports_no_skill": false,
    "cases_json": [
      {
        "case_kind": "task_package",
        "environment_ref": "r2://selftune-registry/benchmarks/cloud-improve/smoke-basic.tar.gz"
      }
    ]
  }
}
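
A response shaped like this example can be used to derive the path for a later update. The sketch below assumes the `suite` envelope shown above; `patch_path_from_response` is an illustrative name, not part of the API.

```python
import json


def patch_path_from_response(response_text):
    """Read suite.id from a create response and return the matching PATCH path."""
    suite = json.loads(response_text)["suite"]
    return f"/api/v1/eval-suites/{suite['id']}"
```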

Update a suite

Updates an existing eval suite. All fields are optional; include only the fields you want to change.
PATCH /api/v1/eval-suites/{id}
Content-Type: application/json
Authorization: Bearer {API_KEY}

{
  "name": "Updated suite name",
  "cases_json": [
    {
      "case_kind": "trigger_query",
      "query": "how do I deploy?",
      "expectation": "should_trigger"
    }
  ]
}

Updatable fields

| Field | Type | Notes |
| --- | --- | --- |
| name | string | New display name |
| verifier_kind | llm_judge \| deterministic \| structured_rubric | Change verifier type |
| supports_no_skill | boolean | Toggle no-skill baseline support |
| resource_limits_json | object | Execution hints |
| cases_json | array | Replace all cases (1–500) |
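
Because only changed fields need to be sent, a client can compute the PATCH body by comparing the current suite with the desired state. This is one possible sketch; `build_patch_body` is a hypothetical helper, and it ignores any field that is not in the updatable list.

```python
UPDATABLE_FIELDS = (
    "name",
    "verifier_kind",
    "supports_no_skill",
    "resource_limits_json",
    "cases_json",
)


def build_patch_body(current, desired):
    """Return only the updatable fields whose desired value differs from the current suite."""
    body = {}
    for field in UPDATABLE_FIELDS:
        if field in desired and desired[field] != current.get(field):
            body[field] = desired[field]
    # cases_json replaces the whole case list, so the 1-500 limit applies to the new list.
    cases = body.get("cases_json")
    if cases is not None and not 1 <= len(cases) <= 500:
        raise ValueError("cases_json replaces all cases and must hold 1 to 500 entries")
    return body
```

Since cases_json replaces every case, send the complete new list rather than only the cases that changed.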

Notes

  • Suites that mix case kinds are allowed at the schema layer, but the suite kind is inferred as task_package only when every case is task_package.
  • task_package is the deterministic execution lane. It does not replace the existing SkillsBench import path, which still adapts imported query corpora into trigger_query cases.
  • imported is the source_kind for benchmark suites imported from external corpora or package manifests. In the current hosted flow, imported suites are runnable only when every case is task_package.
  • If you are importing a benchmark manifest instead of hand-authoring JSON, start with Imported benchmark suites and the bun run import:cloud-improve-benchmark-suite helper.
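
The inference and runnability rules in these notes can be approximated in a few lines. This sketch reflects only what this page states; the hosted service may apply further checks, and both function names are illustrative.

```python
def is_task_package_suite(cases):
    """True only when the suite has cases and every one is a task_package case."""
    return bool(cases) and all(c.get("case_kind") == "task_package" for c in cases)


def is_runnable(source_kind, verifier_kind, cases):
    """Approximate the runnable-today rules described on this page."""
    if source_kind == "manual":
        return verifier_kind in ("llm_judge", "deterministic")
    if source_kind == "imported":
        # Imported suites run only as deterministic, all-task_package suites.
        return verifier_kind == "deterministic" and is_task_package_suite(cases)
    # contributor_aggregate and org_exemplars are not listed as runnable today.
    return False
```

A suite with one task_package case and one trigger_query case passes schema validation under these notes, yet `is_task_package_suite` returns False for it.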