Overview

Imported benchmark suites are the hosted path for bringing external task-package manifests into SelfTune without re-authoring them as manual quick-eval suites. Use this lane when you already have a benchmark package that defines:
  • task instructions
  • environment archives
  • verifier entrypoints
  • optional oracle or setup hooks
The hosted runtime keeps the same execution contract as manual task_package suites. The difference is provenance: the suite is explicitly marked source_kind = imported.

What runs today

Imported suites are runnable today when all of the following are true:
  • source_kind = imported
  • verifier_kind = deterministic
  • every case is case_kind = task_package
This is the first narrow hosted SkillsBench-style import lane. Query-corpus imports are still adapted into trigger_query cases through a separate path.
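The three runnability conditions above can be expressed as a simple guard. This is an illustrative sketch only: the interface shapes below mirror the documented field names, not an actual SelfTune API type.

```typescript
// Hypothetical shapes mirroring the documented suite fields.
interface ImportedCase {
  case_kind: string;
}

interface ImportedSuite {
  source_kind: string;
  verifier_kind: string;
  cases: ImportedCase[];
}

// A suite runs in the hosted imported lane only when all three
// documented conditions hold at once.
function isRunnableImportedSuite(suite: ImportedSuite): boolean {
  return (
    suite.source_kind === "imported" &&
    suite.verifier_kind === "deterministic" &&
    suite.cases.every((c) => c.case_kind === "task_package")
  );
}
```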

Manifest format

An imported benchmark manifest is a JSON file with a suite-level header plus a list of task-package cases.
{
  "name": "cloud-improve-task-package-basic",
  "supports_no_skill": false,
  "cases": [
    {
      "task_id": "smoke-basic",
      "instruction": "Verify the benchmark environment is available.",
      "environment_ref": "r2://selftune-registry/benchmarks/cloud-improve/smoke-basic.tar.gz",
      "verifier_ref": "tests/no_winner.sh",
      "skill_bundle_ref": "skill-under-test",
      "resource_hints": {
        "working_dir": ".",
        "timeout_ms": 120000
      }
    }
  ]
}

Suite fields

Field                 Type     Required  Notes
name                  string   Yes       Human-readable suite name
supports_no_skill     boolean  No        Enables no_skill baseline execution
resource_limits_json  object   No        Suite-level execution hints
cases                 array    Yes       Between 1 and 500 task-package cases

Case fields

Field             Type      Required  Notes
task_id           string    Yes       Stable case id within the suite
instruction       string    Yes       Human-readable task instruction
environment_ref   string    Yes       r2://bucket/key or HTTP(S) archive
verifier_ref      string    Yes       Path inside the environment archive
skill_bundle_ref  string    No        Mount point for the skill under test
oracle_ref        string    No        Optional setup script run before the verifier
resource_hints    object    No        Working directory, timeout, and env hints
source            string    No        Provenance or attribution
tags              string[]  No        Labels for filtering and reporting
weight            number    No        Relative score weight for the case
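The two field tables map naturally onto TypeScript shapes. This is a non-normative sketch for orientation; the shared schema used by the import script is authoritative, and the `ResourceHints` members shown are assumptions based on the table's description.

```typescript
// Non-normative sketch of the manifest shape described above.
interface ResourceHints {
  working_dir?: string;
  timeout_ms?: number;
  env?: Record<string, string>; // "env hints" modeled loosely here
}

interface ImportedBenchmarkCase {
  task_id: string;            // stable case id within the suite
  instruction: string;        // human-readable task instruction
  environment_ref: string;    // r2://bucket/key or HTTP(S) archive
  verifier_ref: string;       // path inside the environment archive
  skill_bundle_ref?: string;  // mount point for the skill under test
  oracle_ref?: string;        // optional setup script run before the verifier
  resource_hints?: ResourceHints;
  source?: string;            // provenance or attribution
  tags?: string[];            // labels for filtering and reporting
  weight?: number;            // relative score weight for the case
}

interface ImportedBenchmarkManifest {
  name: string;
  supports_no_skill?: boolean;
  resource_limits_json?: Record<string, unknown>;
  cases: ImportedBenchmarkCase[]; // 1 to 500 entries
}

// Minimal runtime check of the required fields and case-count bounds.
function manifestErrors(m: ImportedBenchmarkManifest): string[] {
  const errors: string[] = [];
  if (!m.name) errors.push("name is required");
  if (m.cases.length < 1 || m.cases.length > 500)
    errors.push("cases must contain between 1 and 500 entries");
  for (const c of m.cases) {
    if (!c.task_id || !c.instruction || !c.environment_ref || !c.verifier_ref)
      errors.push(`case ${c.task_id || "?"} is missing a required field`);
  }
  return errors;
}
```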

Importing a manifest

The repo ships with a helper script that validates a manifest and creates the corresponding cloud eval suite over the live API.
bun run import:cloud-improve-benchmark-suite -- \
  --manifest scripts/fixtures/cloud-improve-task-package-basic/imported-benchmark.manifest.json \
  --source-id <cloud_source_uuid>
The script:
  • validates the manifest against the shared schema
  • converts each imported case to a hosted task_package case
  • creates the suite through POST /api/v1/eval-suites
Use --dry-run to inspect the generated payload without creating a suite.
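The conversion step in the middle can be pictured as a per-case mapping. The hosted-side field names below are assumptions made for illustration; the script's shared schema defines the real payload sent to POST /api/v1/eval-suites.

```typescript
// Illustrative mapping from an imported manifest case to a hosted
// task_package case. Hosted-side field names are assumptions, not
// the actual API contract.
interface ManifestCase {
  task_id: string;
  instruction: string;
  environment_ref: string;
  verifier_ref: string;
}

interface HostedTaskPackageCase extends ManifestCase {
  case_kind: "task_package";
}

function toHostedCase(c: ManifestCase): HostedTaskPackageCase {
  // Every imported case becomes a hosted case tagged case_kind = task_package.
  return { case_kind: "task_package", ...c };
}
```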

Relationship to manual eval suites

Manual task-package suites and imported benchmark suites run through the same hosted deterministic runtime:
  • Cloudflare Queue
  • Workflow orchestration
  • Sandbox-backed filesystem/process execution
The difference is how the cases are authored and tracked. For hand-authored suites created directly in the dashboard or API, see Eval suites and Eval Suites API.