Overview

Imported benchmark suites are the hosted path for bringing external task-package manifests into SelfTune without re-authoring them as manual quick-eval suites. Use this lane when you already have a benchmark package that defines:
  • task instructions
  • environment archives
  • verifier entrypoints
  • optional oracle or setup hooks
The hosted runtime keeps the same execution contract as manual task_package suites. The difference is provenance: the suite is explicitly marked source_kind = imported.

What runs today

Imported suites are runnable today when all of the following are true:
  • source_kind = imported
  • verifier_kind = deterministic
  • every case is case_kind = task_package
This is the first narrow hosted SkillsBench-style import lane. Query-corpus imports are still adapted into trigger_query cases through a separate path.
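The three runnability conditions above can be expressed as a simple guard. This is an illustrative sketch only: the interface shapes below mirror the documented field names, not an actual SelfTune API type.

```typescript
// Hypothetical shapes mirroring the documented suite fields.
interface ImportedCase {
  case_kind: string;
}

interface ImportedSuite {
  source_kind: string;
  verifier_kind: string;
  cases: ImportedCase[];
}

// A suite runs in the hosted imported lane only when all three
// documented conditions hold at once.
function isRunnableImportedSuite(suite: ImportedSuite): boolean {
  return (
    suite.source_kind === "imported" &&
    suite.verifier_kind === "deterministic" &&
    suite.cases.every((c) => c.case_kind === "task_package")
  );
}
```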

Manifest format

An imported benchmark manifest is a JSON file with a suite-level header plus a list of task-package cases.
{
  "name": "cloud-improve-task-package-basic",
  "supports_no_skill": false,
  "cases": [
    {
      "task_id": "smoke-basic",
      "instruction": "Verify the benchmark environment is available.",
      "environment_ref": "r2://selftune-registry/benchmarks/cloud-improve/smoke-basic.tar.gz",
      "verifier_ref": "tests/no_winner.sh",
      "skill_bundle_ref": "skill-under-test",
      "resource_hints": {
        "working_dir": ".",
        "timeout_ms": 120000
      }
    }
  ]
}

Suite fields

Field                 Type     Required  Notes
name                  string   Yes       Human-readable suite name
supports_no_skill     boolean  No        Enables no_skill baseline execution
resource_limits_json  object   No        Suite-level execution hints
cases                 array    Yes       Between 1 and 500 task-package cases

Case fields

Field             Type      Required  Notes
task_id           string    Yes       Stable case id within the suite
instruction       string    Yes       Human-readable task instruction
environment_ref   string    Yes       r2://bucket/key or HTTP(S) archive
verifier_ref      string    Yes       Path inside the environment archive
skill_bundle_ref  string    No        Mount point for the skill under test
oracle_ref        string    No        Optional setup script run before the verifier
resource_hints    object    No        Working directory, timeout, and env hints
source            string    No        Provenance or attribution
tags              string[]  No        Labels for filtering and reporting
weight            number    No        Relative score weight for the case
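The two field tables map naturally onto TypeScript shapes. This is a non-normative sketch for orientation; the shared schema used by the import script is authoritative, and the `ResourceHints` members shown are assumptions based on the table's description.

```typescript
// Non-normative sketch of the manifest shape described above.
interface ResourceHints {
  working_dir?: string;
  timeout_ms?: number;
  env?: Record<string, string>; // "env hints" modeled loosely here
}

interface ImportedBenchmarkCase {
  task_id: string;            // stable case id within the suite
  instruction: string;        // human-readable task instruction
  environment_ref: string;    // r2://bucket/key or HTTP(S) archive
  verifier_ref: string;       // path inside the environment archive
  skill_bundle_ref?: string;  // mount point for the skill under test
  oracle_ref?: string;        // optional setup script run before the verifier
  resource_hints?: ResourceHints;
  source?: string;            // provenance or attribution
  tags?: string[];            // labels for filtering and reporting
  weight?: number;            // relative score weight for the case
}

interface ImportedBenchmarkManifest {
  name: string;
  supports_no_skill?: boolean;
  resource_limits_json?: Record<string, unknown>;
  cases: ImportedBenchmarkCase[]; // 1 to 500 entries
}

// Minimal runtime check of the required fields and case-count bounds.
function manifestErrors(m: ImportedBenchmarkManifest): string[] {
  const errors: string[] = [];
  if (!m.name) errors.push("name is required");
  if (m.cases.length < 1 || m.cases.length > 500)
    errors.push("cases must contain between 1 and 500 entries");
  for (const c of m.cases) {
    if (!c.task_id || !c.instruction || !c.environment_ref || !c.verifier_ref)
      errors.push(`case ${c.task_id || "?"} is missing a required field`);
  }
  return errors;
}
```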

Importing a manifest

The repo ships with a helper script that validates a manifest and creates the corresponding cloud eval suite over the live API.
bun run import:cloud-improve-benchmark-suite -- \
  --manifest scripts/fixtures/cloud-improve-task-package-basic/imported-benchmark.manifest.json \
  --source-id <cloud_source_uuid>
The script:
  • validates the manifest against the shared schema
  • converts each imported case to a hosted task_package case
  • creates the suite through POST /api/v1/eval-suites
Use --dry-run to inspect the generated payload without creating a suite.
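The conversion step in the middle can be pictured as a per-case mapping. The hosted-side field names below are assumptions made for illustration; the script's shared schema defines the real payload sent to POST /api/v1/eval-suites.

```typescript
// Illustrative mapping from an imported manifest case to a hosted
// task_package case. Hosted-side field names are assumptions, not
// the actual API contract.
interface ManifestCase {
  task_id: string;
  instruction: string;
  environment_ref: string;
  verifier_ref: string;
}

interface HostedTaskPackageCase extends ManifestCase {
  case_kind: "task_package";
}

function toHostedCase(c: ManifestCase): HostedTaskPackageCase {
  // Every imported case becomes a hosted case tagged case_kind = task_package.
  return { case_kind: "task_package", ...c };
}
```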

Relationship to manual eval suites

Manual task-package suites and imported benchmark suites run through the same hosted deterministic runtime:
  • Cloudflare Queue
  • Workflow orchestration
  • Sandbox-backed filesystem/process execution
The difference is how the cases are authored and tracked. For hand-authored suites created directly in the dashboard or API, see Eval suites and Eval Suites API.