Overview
Imported benchmark suites are the hosted path for bringing external task-package manifests into SelfTune without pretending they were authored as manual quick-eval suites. Use this lane when you already have a benchmark package that defines:- task instructions
- environment archives
- verifier entrypoints
- optional oracle or setup hooks
task_package suites. The difference is provenance: the suite is explicitly
marked source_kind = imported.
What runs today
Imported suites are runnable today when all of the following are true:source_kind = importedverifier_kind = deterministic- every case is
case_kind = task_package
trigger_query cases separately.
Manifest format
An imported benchmark manifest is a JSON file with a suite-level header plus a list of task-package cases.Suite fields
| Field | Type | Required | Notes |
|---|---|---|---|
name | string | Yes | Human-readable suite name |
supports_no_skill | boolean | No | Enables no_skill baseline execution |
resource_limits_json | object | No | Suite-level execution hints |
cases | array | Yes | Between 1 and 500 task-package cases |
Case fields
| Field | Type | Required | Notes |
|---|---|---|---|
task_id | string | Yes | Stable case id within the suite |
instruction | string | Yes | Human-readable task instruction |
environment_ref | string | Yes | r2://bucket/key or HTTP(S) archive |
verifier_ref | string | Yes | Path inside the environment archive |
skill_bundle_ref | string | No | Mount point for the skill under test |
oracle_ref | string | No | Optional setup script run before the verifier |
resource_hints | object | No | Working directory, timeout, and env hints |
source | string | No | Provenance or attribution |
tags | string[] | No | Labels for filtering and reporting |
weight | number | No | Relative score weight for the case |
Importing a manifest
The repo ships with a helper script that validates a manifest and creates the corresponding cloud eval suite over the live API.- validates the manifest against the shared schema
- converts each imported case to a hosted
task_packagecase - creates the suite through
POST /api/v1/eval-suites
--dry-run to inspect the generated payload without creating a suite.
Relationship to manual eval suites
Manual task-package suites and imported benchmark suites run through the same hosted deterministic runtime:- Cloudflare Queue
- Workflow orchestration
- Sandbox-backed filesystem/process execution