This page tracks user-facing selftune changes in a format that is easier to scan than raw commit history. Subscribe to the RSS feed at docs.selftune.dev/changelog/rss.xml, or browse packaged artifacts and compare links in GitHub releases. Matching OSS release tags are enriched from the corresponding entry on this page.
Tags on this page use a fixed taxonomy so filters stay stable over time:
Cloud, CLI, Platforms, OSS, Dashboard, Registry, Billing,
Community, and Breaking change.
Skill-level and source-level cloud trust panels now add coarse 30-day
outcome buckets across the last 90 days on top of the short weekly history.
That gives operators a broader trust read before they queue another hosted
run, instead of relying only on the most recent few outcomes.
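As a rough illustration of the bucketing described above, the sketch below groups observed outcomes into 30-day buckets over a 90-day horizon. The `ObservedOutcome` and `OutcomeBucket` shapes and the `bucketOutcomes` helper are assumptions made for this example; the changelog does not document the real payload or implementation.

```typescript
// Hypothetical shapes for illustration; not the documented selftune payload.
type Outcome = "helped" | "regressed" | "inconclusive";

interface ObservedOutcome {
  completedAt: Date;
  outcome: Outcome;
}

interface OutcomeBucket {
  startDaysAgo: number; // inclusive
  endDaysAgo: number; // exclusive
  helped: number;
  regressed: number;
  inconclusive: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Groups outcomes into fixed-size buckets counted back from `now`.
function bucketOutcomes(
  outcomes: ObservedOutcome[],
  now: Date,
  bucketDays = 30,
  horizonDays = 90,
): OutcomeBucket[] {
  const bucketCount = Math.ceil(horizonDays / bucketDays);
  const buckets: OutcomeBucket[] = Array.from({ length: bucketCount }, (_, i) => ({
    startDaysAgo: i * bucketDays,
    endDaysAgo: Math.min((i + 1) * bucketDays, horizonDays),
    helped: 0,
    regressed: 0,
    inconclusive: 0,
  }));

  for (const o of outcomes) {
    const ageDays = (now.getTime() - o.completedAt.getTime()) / DAY_MS;
    // Skip anything in the future or older than the horizon.
    if (ageDays < 0 || ageDays >= horizonDays) continue;
    const bucket = buckets[Math.floor(ageDays / bucketDays)];
    if (bucket) bucket[o.outcome] += 1;
  }
  return buckets;
}
```

Index 0 holds the most recent 30 days, so a panel could render the buckets newest-first without re-sorting.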
2026-04-20
Cloud, Dashboard
Cloud trust now correlates benchmark health with observed proposal outcomes
Skill-level and source-level cloud trust panels now group recent suite-backed
runs by each suite’s latest canonical saved check state.
That makes it easier to see whether suites that currently look healthy are
actually lining up with helped outcomes after apply, or whether regressions
are clustering around failed or missing saved checks.
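One way to picture the grouping above is a tally of proposal outcomes per saved check state. This is a minimal sketch under assumed names: `SuiteBackedRun`, the three `CheckState` values, and `tallyByCheckState` are all hypothetical and chosen only to mirror the wording of this entry.

```typescript
// Hypothetical shapes for illustration; not the documented selftune payload.
type CheckState = "passed" | "failed" | "missing";
type Outcome = "helped" | "regressed" | "inconclusive";

interface SuiteBackedRun {
  suiteId: string;
  latestCheckState: CheckState; // the suite's latest canonical saved check
  outcome: Outcome; // observed post-apply outcome of the run's proposal
}

type OutcomeTally = Record<Outcome, number>;

// Counts observed outcomes for each saved check state.
function tallyByCheckState(runs: SuiteBackedRun[]): Record<CheckState, OutcomeTally> {
  const empty = (): OutcomeTally => ({ helped: 0, regressed: 0, inconclusive: 0 });
  const tally: Record<CheckState, OutcomeTally> = {
    passed: empty(),
    failed: empty(),
    missing: empty(),
  };
  for (const run of runs) {
    tally[run.latestCheckState][run.outcome] += 1;
  }
  return tally;
}
```

If regressions pile up under `failed` or `missing`, that matches the "clustering" case this entry describes.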
Skill-level and source-level cloud trust panels now include a compact
multi-window post-apply history derived from recent completed observation
buckets.
That makes it easier to see whether the last few windows were improving,
regressing, mixed, or steady instead of relying on a single rolling badge.
Skill-level and source-level cloud trust panels now include a compact recent
outcome timeline based on the same post-apply observation summary that powers
the direction badge and latest outcome link.
That keeps a short run of concrete helped, regressed, or inconclusive
proposal outcomes visible without drilling into proposal history.
Skill-level and source-level cloud trust panels now condense recent
post-apply outcomes into a compact direction signal: Improving,
Regressing, Mixed, Steady, or Needs more signal.
That makes it easier to tell whether trust is getting better or worse
without opening each proposal outcome one by one.
The skill-level and source-level cloud trust panels now surface the latest
completed post-apply outcome with a direct proposal link.
That means stale or watch-mode trust warnings now point at concrete outcome
evidence instead of leaving operators to hunt through proposal history by
hand.
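The direction signal described above (Improving, Regressing, Mixed, Steady, or Needs more signal) could be derived from recent outcomes roughly as follows. The rule and the `minSamples` threshold are assumptions for illustration only; the changelog does not state how the real badge is computed.

```typescript
type Outcome = "helped" | "regressed" | "inconclusive";
type Direction = "Improving" | "Regressing" | "Mixed" | "Steady" | "Needs more signal";

// One plausible rule, not the documented one: minSamples is an assumed threshold.
function summarizeDirection(recent: Outcome[], minSamples = 3): Direction {
  if (recent.length < minSamples) return "Needs more signal";
  const helped = recent.filter((o) => o === "helped").length;
  const regressed = recent.filter((o) => o === "regressed").length;
  if (helped > 0 && regressed > 0) return "Mixed";
  if (helped > 0) return "Improving";
  if (regressed > 0) return "Regressing";
  return "Steady"; // only inconclusive outcomes in the window
}
```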
Source trust summaries no longer wait for proposal-detail reads or the
batch observation scorer to reflect ended apply windows.
When an observation window has already ended, the cloud trust summary now
scores it on read and promotes it into the completed outcome counts
immediately.
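A minimal sketch of the score-on-read idea, assuming a hypothetical `Observation` record and a caller-supplied `score` function: open windows are counted as still observing, while ended but unscored windows are scored during the read and counted as completed.

```typescript
// Hypothetical shapes for illustration; the real read model is not documented here.
type Outcome = "helped" | "regressed" | "inconclusive";

interface Observation {
  proposalId: string;
  windowEndsAt: Date;
  outcome: Outcome | null; // null until the window has been scored
}

interface TrustCounts {
  observing: number;
  helped: number;
  regressed: number;
  inconclusive: number;
}

function summarizeOnRead(
  observations: Observation[],
  now: Date,
  score: (o: Observation) => Outcome, // stand-in for the real scorer
): TrustCounts {
  const counts: TrustCounts = { observing: 0, helped: 0, regressed: 0, inconclusive: 0 };
  for (const obs of observations) {
    if (obs.outcome === null && obs.windowEndsAt.getTime() > now.getTime()) {
      counts.observing += 1; // window still open: not a final outcome yet
      continue;
    }
    // Ended window: reuse the stored outcome, or score it during this read.
    const outcome = obs.outcome ?? score(obs);
    counts[outcome] += 1;
  }
  return counts;
}
```

Keeping `observing` separate from the completed counts also lets a summary show when the latest numbers are not final yet.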
2026-04-20
Cloud, Dashboard
Cloud trust summaries now show when recent applies are still being observed
Cloud trust summaries no longer flatten everything into completed outcomes.
The selected source trust card and run preflight now show when recent applies
are still inside the post-apply observation window, so operators can tell
when the latest outcome counts are not final yet.
2026-04-20
Cloud, Dashboard
Observed-skill cloud controls now warn before queueing runs with stale or failing trust signals
The observed-skill Cloud Improve panel now raises explicit run preflight
warnings when the selected source trust is already degraded or when the
selected suite’s last canonical task-package smoke check failed.
That keeps risky benchmark state visible at the actual queue point instead of
only inside the deeper source detail screens.
The Cloud Improve panel on observed skill pages now shows the selected
cloud source’s trust summary, recent post-apply observation counts, and the
latest canonical task-package smoke result for the selected suite.
That means operators can check benchmark freshness and recent real-world
outcomes before they queue another hosted improve run, instead of drilling
into the cloud source page first.
2026-04-20
Cloud, Dashboard
Cloud source evidence can now jump straight into task-package draft authoring
Suggested trigger cases and recent run-pressure cards can now promote
directly into a review-only task-package draft instead of always starting on
the structured path first.
That direct promotion path now follows the first persisted draft/refine step
with initial environment and verifier asset generation, so operators start from
source evidence plus concrete draft files instead of an empty task-package
scaffold.
Suggested trigger cases and recent run-pressure cases no longer create
task/output drafts purely in page state.
Draft creation now goes through a review-first promotion route that builds seed
enrichment on the server and invokes the same Think or fallback refinement path
used by the persisted authoring session.
Cloud source pages now show the recent review-first authoring steps attached
to a persisted task/output draft instead of only the latest draft snapshot.
The authoring session now keeps a bounded activity log for draft saves and
promotions, Think or fallback refinement, task-package asset generation, bundle
materialization, runnable smoke checks, and draft clears.
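A bounded activity log like the one described above can be kept by appending each step and dropping the oldest entries past a cap. The `ActivityKind` names, the `ActivityEntry` shape, and the default cap of 20 are assumptions for this sketch, not the documented session schema.

```typescript
// Hypothetical step names that mirror the list in this entry.
type ActivityKind =
  | "draft_saved"
  | "draft_promoted"
  | "refined"
  | "assets_generated"
  | "bundle_materialized"
  | "smoke_checked"
  | "draft_cleared";

interface ActivityEntry {
  kind: ActivityKind;
  at: string; // ISO timestamp
  detail?: string;
}

// Returns a new log with the entry appended, keeping only the newest maxEntries.
function appendActivity(
  log: readonly ActivityEntry[],
  entry: ActivityEntry,
  maxEntries = 20, // assumed cap
): ActivityEntry[] {
  const next = [...log, entry];
  return next.length > maxEntries ? next.slice(next.length - maxEntries) : next;
}
```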
2026-04-20
Cloud, Dashboard
Cloud advanced run settings now show canonical task-package smoke freshness in suite selection
The advanced run drawer no longer hides whether a saved task-package suite
most recently passed or failed its canonical smoke check.
This reuses the same summary-level saved-suite smoke signal in the advanced run
suite chooser, so operators can compare suite trust while configuring a run,
not just while editing the suite.
Operators can now compare saved task-package suites before opening one in the
editor.
This adds summary-level canonical smoke state to hosted eval-suite list
responses and surfaces it directly in the cloud source page suite picker, so
saved-check freshness is visible during suite selection as well as after a
suite is opened.
2026-04-20
Cloud, Dashboard
Cloud source setup summaries now show the latest canonical task-package smoke result outside the editor
Operators no longer need to open the eval editor to see the last saved
task_package smoke result.
This now surfaces the latest canonical smoke state directly in the cloud source
page setup summary, so the trust signal is visible in the broader read model as
well as in the editor.
2026-04-20
Cloud, Dashboard
Cloud saved task-package suites now remember the last canonical smoke result
Running a saved canonical task_package suite smoke check no longer yields a
result that disappears as soon as the request ends.
This now writes the latest canonical smoke result back into the saved suite
metadata and surfaces it in the cloud source page editor, so operators can see
the last saved-suite check before deciding whether to rerun it.
2026-04-20
Cloud, Dashboard
Cloud source pages can now run saved canonical task-package suites once against the current snapshot
Saved canonical task_package suites no longer require a separate improve
run just to verify the saved case still works against the current source
snapshot.
This adds a narrow saved-suite smoke action on cloud source pages and matching
eval-suite routes, so operators can run a saved canonical task-package case
once before they reuse it in hosted improve runs.
2026-04-20
Cloud, Dashboard
Cloud task-package saves now retain draft provenance and latest smoke results in the canonical case payload
Runnable task_package saves no longer drop the context that produced the
draft.
This now writes a typed task_package_metadata block into matching canonical
cases, preserving the seed evidence, expected-outcome scaffold, optional notes,
and latest smoke result that came from the authoring session.
2026-04-20
Cloud, Dashboard
Cloud task-package saves now require a fresh smoke result before a runnable case can be written into a canonical suite
Runnable task_package drafts on cloud source pages can no longer be saved
into a canonical hosted eval suite if the latest smoke result is missing or
stale.
This now blocks the save both on the source page and in the eval-suite
create/update routes, so runnable task-package cases must be freshly
smoke-checked before they become part of the authoritative suite.
2026-04-20
Cloud, Dashboard
Cloud task-package drafts now keep the last smoke result visible and mark it stale when the scaffold, assets, or bundle change
Runnable task-package drafts on cloud source pages no longer silently lose
their last smoke-check result when the scaffold or generated bundle changes.
This now keeps the latest smoke result visible, marks it stale with an explicit
reason, and tells the operator when the draft needs to be smoke-checked again
before save.
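The staleness marking above, together with the save gate from the preceding entry, can be sketched as a comparison between the draft's current inputs and the inputs that were smoke-checked. Every name here (`DraftInputs`, `SmokeResult`, `smokeStatus`, `canSaveToCanonicalSuite`) is hypothetical, and a real implementation would more likely compare content hashes than raw strings.

```typescript
// Hypothetical shapes for illustration; not the documented authoring-session schema.
interface DraftInputs {
  scaffold: string;
  assets: string[];
  bundleDigest: string | null;
}

interface SmokeResult {
  passed: boolean;
  checkedInputs: DraftInputs; // snapshot of what was actually smoke-checked
}

type SmokeStatus =
  | { state: "missing" }
  | { state: "stale"; reason: string }
  | { state: "fresh"; passed: boolean };

// Returns an explicit reason when the draft changed after the last smoke check.
function staleReason(current: DraftInputs, checked: DraftInputs): string | null {
  if (current.scaffold !== checked.scaffold) return "scaffold changed";
  if (JSON.stringify(current.assets) !== JSON.stringify(checked.assets)) {
    return "generated assets changed";
  }
  if (current.bundleDigest !== checked.bundleDigest) return "bundle changed";
  return null;
}

function smokeStatus(current: DraftInputs, last: SmokeResult | null): SmokeStatus {
  if (last === null) return { state: "missing" };
  const reason = staleReason(current, last.checkedInputs);
  // The last result stays attached to the draft; it is only labeled stale.
  if (reason !== null) return { state: "stale", reason };
  return { state: "fresh", passed: last.passed };
}

// The changelog names "missing or stale" as blockers, so only freshness is checked here.
function canSaveToCanonicalSuite(status: SmokeStatus): boolean {
  return status.state === "fresh";
}
```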
2026-04-20
Cloud, Dashboard
Cloud source pages can now smoke-check runnable task-package drafts before saving them into a canonical suite
Runnable task-package drafts on cloud source pages can now be executed once
against the current snapshot before they are saved into a canonical suite.
This persists the latest pass/fail result back into the authoring session, so
operators can verify the materialized bundle and current snapshot still work
together before promoting the case.
2026-04-20
Cloud, Dashboard
Cloud task-package drafts now switch into an explicit runnable state once their review-only bundle is materialized
Materialized task-package drafts on cloud source pages no longer stay labeled
like scaffolds.
This now promotes them into an explicit runnable draft state, updates the
promotion preview and bundle preview accordingly, and makes it clear that the
existing save flow will write a real canonical task_package case.
2026-04-20
Cloud, Dashboard
Cloud task-package drafts can now materialize review-only bundles into real R2-backed environment archives
Persisted task-package drafts on cloud source pages can now turn generated
bundle files into a real review-only environment archive.
This uploads the bundle to R2, points the draft scaffold at the materialized
archive, and keeps the archive descriptor in the same authoring session until
the operator explicitly promotes the draft further.
2026-04-20
Cloud, Dashboard
Cloud task-package drafts now roll generated asset files into a review-only bundle preview with explicit archive-materialization readiness
Persisted task-package drafts on cloud source pages no longer stop at raw
environment/verifier asset text.
This now rolls the generated files into a review-only bundle preview, marks
when the draft is ready for archive materialization, and keeps the whole flow
inside the persisted authoring session until an operator explicitly promotes it
further.
2026-04-20
Cloud, Dashboard
Cloud task-package drafts can now generate review-only environment and verifier asset drafts through the authoring agent
Persisted task-package drafts on cloud source pages can now generate a
review-only environment manifest draft and verifier script draft through the
authoring agent.
This keeps the new task-package authoring flow behind the same persisted draft
session, surfaces whether the generated assets came from Think or the fallback
template path, and avoids writing anything canonical until the operator decides
the draft is ready.
2026-04-20
Cloud, Dashboard
Cloud source pages now support real task-package draft authoring, including editable scaffold fields and canonical task-package case saves
Persisted task/output drafts on cloud source pages can now move past a
placeholder preview into real task-package scaffold authoring.
This adds editable instruction, environment, verifier, oracle, skill mount, and
resource-hint fields to the review-only draft flow, lets operators save that
scaffold back into the persisted authoring session, and makes the existing save
path emit a real task_package case when that is the chosen promotion target.
2026-04-20
Cloud, Dashboard
Cloud source pages now show canonical promotion previews for task/output drafts and let operators switch a persisted draft between a structured check and a review-only task-package scaffold
Persisted task/output drafts on cloud source pages now show what the current
promotion target will become in the canonical eval pipeline before anything
is saved.
This makes the next step explicit: a draft can stay on the structured
deterministic path, or switch into a review-only task-package scaffold with the
expected environment and verifier placeholders called out up front.
2026-04-20
Cloud, Dashboard
Cloud source pages can now refine persisted task/output drafts through a review-only authoring agent, with an explicit fallback when Workers AI is unavailable
Persisted task/output drafts on cloud source pages can now run through a
bounded authoring-agent refinement step before the operator saves anything
canonical.
This keeps the same draft/session contract, surfaces whether the refinement
came from a Think-backed path or a deterministic fallback, and lets the page
stay review-first even when Workers AI is unavailable. The dashboard exposes
that action through the new public refine route instead of a page-local-only
mutation path.
2026-04-20
Cloud, Dashboard
Cloud source pages now show review-only guidance for persisted task/output drafts and let operators apply that scaffold back into the draft before saving it
Persisted task/output drafts on cloud source pages now show API-derived
review-only guidance from matching suggestions, recent run pressure, and
saved eval-suite overlap.
This means operators can see what failure a draft protects against, what
verifier shape is likely to fit best, whether they should extend an
existing suite, and apply the suggested scaffold back into the deterministic
draft before they save anything canonical.
2026-04-20
Cloud, Dashboard
Cloud task/output drafts now keep seed evidence, promotion target, and expected-outcome scaffolding, so deterministic eval authoring survives refreshes with real provenance instead of thin form state
Cloud source pages now persist richer deterministic draft metadata for eval
authoring, including the originating evidence, the intended promotion target,
and a first expected-outcome scaffold.
This means review-only task/output drafts are no longer just local editor
fields. Operators can refresh and resume the draft while still seeing what
seeded it and what kind of deeper check it is trying to become.
2026-04-19
Cloud, Dashboard
Cloud source pages now keep one resumable task/output draft per source, so operators can refresh and continue deterministic eval authoring without rebuilding the draft from scratch
Cloud improve now stores a review-only task/output draft per source in the
runtime layer and exposes it back through source detail.
This means a deterministic draft started from trigger evidence is no longer
purely local page state: operators can refresh, resume the draft, or clear it
without overwriting the saved trigger-confidence suite.
2026-04-19
Cloud, Dashboard
Improve-run detail now keeps the active timeline step and proposal links in sync after completion, so operators can see live progress and open the winning proposal without a manual refresh
Improve-run detail now seeds the current phase into the timeline immediately,
keeps the active step visibly live, and briefly re-checks proposal links
after a successful run until the winning proposal is available.
This closes two gaps on the same surface: active runs now look active where the
operator is already reading, and completed runs no longer require a manual
refresh just to open the winning proposal.
2026-04-19
Cloud, Dashboard
Improve-run pages now keep refreshing proposal links briefly after a run completes, so new proposal links appear without a manual reload
Cloud improve-run pages now re-sync proposal links after terminal updates and
keep polling briefly when a winning candidate exists but the proposal link has
not settled yet.
This closes the gap where a run could finish successfully and create a proposal
seconds later, while the improve page still looked like proposal creation had
been skipped until a manual refresh.
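The brief re-check described above amounts to bounded polling: retry a lookup a few times with a short delay, then give up. The sketch below assumes a hypothetical `fetchLink` callback and illustrative defaults for attempts and delay; the dashboard's real intervals are not stated in this entry.

```typescript
// Hypothetical shape for illustration.
interface ProposalLink {
  proposalId: string;
  url: string;
}

// Polls fetchLink until it returns a link or the attempt budget is spent.
async function pollForProposalLink(
  fetchLink: () => Promise<ProposalLink | null>,
  options: { attempts?: number; delayMs?: number } = {},
): Promise<ProposalLink | null> {
  const { attempts = 5, delayMs = 2000 } = options; // assumed defaults
  for (let i = 0; i < attempts; i++) {
    const link = await fetchLink();
    if (link !== null) return link;
    // Wait between attempts, but not after the final one.
    if (i < attempts - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return null; // caller can fall back to a manual refresh hint
}
```

Capping the attempts keeps a run that never produces a proposal from polling indefinitely.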
2026-04-19
Cloud, Dashboard
Cloud source pages now show live source-coordination state, including the active reserved run and queued rerun/cancel intent
Cloud source detail now carries a compact coordinator read model from the
runtime, and the source page surfaces that state directly.
This makes it visible when a source already has an active reserved run, whether
cancel was requested, and whether a rerun is queued, without inferring it only
from raw run rows.
2026-04-19
Cloud, Dashboard
Cloud source pages can now promote trigger evidence into a detached task/output draft, so operators can scaffold deterministic checks without overwriting the saved trigger suite
Pending eval suggestions and recent run-pressure cards now offer a
Draft task check action that opens the eval editor in deterministic mode with a new,
unsaved task/output draft scaffold.
This keeps the saved trigger-confidence suite intact while giving operators a
fast path to start deeper task/output coverage from real evidence.
2026-04-19
Cloud, Dashboard
Improve-run timelines now keep the active phase on a colored dot, so the phase progression stays visible while a run is live
The live step on the cloud improve-run timeline now stays on the same
phase-colored dot as the rest of the run history instead of switching to a
generic spinner.
This keeps the setup, evaluation, drafting, and finalization phases visually
distinct even while the run is still active.
2026-04-19
Cloud, Dashboard
Cloud skill pages now separate trigger-confidence evals from task/output checks in the onboarding and eval-editor language
Cloud skill onboarding and eval-editor copy now makes the intended eval
progression explicit: start with trigger confidence, then add task/output
checks once the skill is activating in the right places.
This matches the hosted eval contract more closely and makes it clearer that
discoverability/routing checks and deeper task correctness checks answer
different questions.
2026-04-19
Cloud, Dashboard
Cloud source pages now expose a public rerun-analysis action, so operators can reprocess the current snapshot without creating a new upload or GitHub sync
Cloud source pages now include a
Rerun analysis action that triggers the
existing snapshot analysis pipeline for the current snapshot through a public
dashboard route.
This makes it possible to refresh validation, lint, capability, and structural
reports after cloud-side analysis logic changes, without creating a new upload
or GitHub sync just to force a re-run.
2026-04-19
Cloud, Dashboard
Improve-run pages now preserve run scope on per-candidate proposal links, so review navigation stays inside the run-specific proposal queue
Candidate-level
View proposal links on improve-run pages now keep the
current ?run=... context instead of dropping back to the global proposal
queue.
This keeps the operator inside the run-specific review flow whether they open
the winning proposal from the summary header or from the candidate list.
2026-04-19
Cloud, Dashboard
Proposal cards now render separate skill and review links instead of nesting anchors, fixing a hydration bug on the dashboard proposals page
Cloud proposal cards now show an explicit
Review proposal link inside the
card instead of wrapping the whole card in a detail link.
This removes invalid nested anchor markup when a proposal card also links to its
skill page, which fixes the hydration error on the dashboard proposals route.
2026-04-19
Cloud, Dashboard
Cloud source pages now read structural provenance directly from the proposal queue payload, removing an extra proposal-detail fetch from the latest-run review path
The run-scoped proposal queue now carries candidate provenance for cloud
improve proposals, so cloud source pages can explain the newest structural
proposal directly from the queue payload.
This removes the extra proposal-detail request the source page used to make just
to show the latest proposal’s structural origin.
2026-04-19
Cloud, Dashboard
Cloud skill source pages now show which structural recommendation produced the newest linked proposal, so authors can connect snapshot analysis to the actual reviewable candidate
The structure-candidates panel on each cloud skill source page now surfaces
the newest linked proposal’s exact structural recommendation and deterministic
strategy when that proposal came from a structure run.
This closes the gap between “top recommendations on the snapshot” and “the
candidate you can review right now,” so authors can see why the latest proposal
exists before opening the proposal detail page.
2026-04-19
Cloud, Dashboard
Cloud proposal queues and detail routes now have explicit review-flow coverage, reducing the risk of silent regressions in run-scoped proposal review
Added route-level coverage for the run-scoped proposal queue and proposal
detail lookup used by cloud-improve review flows.
This hardens the path from improve-run pages into proposal review by checking
both the queue filter (cloud_run_id) and proposal-detail validation behavior.
2026-04-19
Cloud, Dashboard
Cloud improve and skills surfaces now use softer, consistent status treatments instead of mixing heavy cyan badges, raw enum labels, and duplicate readiness states
Cloud skill cards, improve-run pages, and setup checklists now use a
standardized status system with softer status chips, text-only status
labels where appropriate, and clearer warning/error semantics.
This removes raw labels like cloud_ready, collapses redundant Ready
states on the skills library cards, and brings advanced run settings into a
drawer with more readable model selection options.
2026-04-19
Cloud, Dashboard
Proposal detail now normalizes structural provenance consistently after refresh and keeps run-scoped back navigation on the same helper path as the proposal queue
Proposal detail now uses shared helpers to normalize structural provenance and
run-scoped back navigation, so refreshed review pages keep the same
candidate-rationale and queue-return path instead of relying on ad hoc inline
logic.
This keeps the proposal review surface more predictable as cloud-improve
candidates add richer provenance metadata and more scoped review flows.
2026-04-19
Cloud, Dashboard
Proposal detail now shows which structural recommendation produced a cloud structure candidate, so operators can see why a package rewrite was drafted before applying it
Cloud proposal detail now surfaces the structural recommendation and
deterministic strategy that produced a structure candidate, such as
extract_references or harden_script_ergonomics.
That makes package-backed proposal review more defensible: operators can see
which structural analysis signal led to the candidate before deciding whether
to apply it to a draft or GitHub PR.
2026-04-19
Cloud, Dashboard
Improve-run detail now links straight into the run-scoped proposal queue when a candidate frontier produced more than one reviewable proposal
Improve-run detail pages now preserve run context when you open the winning
proposal, and they surface a direct
Review run proposals action whenever a
hosted run produced more than one reviewable proposal.
That keeps the run frontier, proposal queue, and individual proposal review
pages tied together, so multi-candidate review flows no longer require manual
navigation between /improve and /proposals.
2026-04-19
Cloud, Dashboard
Run-scoped proposal review now preserves that scope when you open proposal detail, so it is easier to move through a single cloud-improve queue without losing context
When you open proposal detail from a run-scoped proposals list, the detail
page now keeps that run context and offers a
Back to run proposals path
instead of dropping you back into the global proposal backlog.
This keeps multi-candidate cloud review flows tighter: operators can move
between the run-specific proposal queue and individual proposal detail pages
without reapplying filters or losing their place.
2026-04-19
Cloud, Dashboard
Proposal review can now be scoped to a single improve run, so multi-candidate cloud runs no longer dump you back into the full proposal backlog
The proposals page now accepts a run-scoped view for cloud-improve runs,
letting operators review only the proposals created by a single hosted run
instead of filtering mentally through the full backlog.
Cloud skill source pages now use that view when the latest run produced
multiple proposals, so the Structure candidates panel can link directly
into proposal review even when there is more than one candidate to inspect.
2026-04-19
Cloud, Dashboard
Cloud skill source pages now explain when structure proposals are viable and link directly into proposal review when the latest run produced one
Cloud skill source pages now synthesize the typed structural-analysis report
into a dedicated
Structure candidates panel instead of leaving operators to
infer package-shape readiness from raw technical details.
The panel now shows whether the current snapshot is ready for structure
proposals, still review-first because of execution limits, or simply not a
structural change candidate right now. When the latest run already produced a
reviewable proposal, the page links straight into the proposal review flow;
otherwise it routes back to the latest run frontier.
2026-04-19
Cloud, Dashboard
Cloud proposal review now renders the real candidate diff for package-structure changes and respects GitHub PR apply targets
Cloud proposal detail now renders the unified diff from the winning candidate
package instead of only showing placeholder
see candidate archive text.
Structure-focused proposals can be reviewed as real package changes, and the
page links directly to the candidate archive when you want the full package.
Proposal apply now also respects the run’s configured apply target. If a
cloud-improve run was set to write back through GitHub, the proposal page now
offers a GitHub PR action instead of always defaulting to draft promotion.
2026-04-19
Cloud, Dashboard
Cloud proposal detail now keeps the originating run and apply history visible after refresh
Cloud proposal detail now links back to the originating improve run so you
can move from a pending proposal into the full candidate frontier and
evaluation evidence without hunting for the run separately.
The page also now shows recent apply attempts, including whether the attempt
targeted a draft or GitHub PR, when it ran, and the PR URL or error message
when one is available. That keeps proposal review useful even after the page
is refreshed or revisited later.
2026-04-19
Cloud, Dashboard
The proposals index now recognizes archive-backed cloud-improve proposals instead of showing them as fake single-field diffs
The proposals index now correctly tags cloud-improve proposals created by the
hosted runner, even when
proposed_by is the actual runner identity instead
of the older cloud_improve string.
Archive-backed cloud proposals also no longer render (see candidate archive) as
if it were a field-level diff. The card now tells you it is a
package-backed change and shows whether a linked candidate package and source
run are available before you open the full review page.
2026-04-18
Cloud, Dashboard
Cloud skill source pages now surface structural analysis, script strategy, and frontmatter execution guidance instead of hiding them in raw validation blobs
Cloud source detail now renders the typed structural-analysis summary for the
current snapshot, including
SKILL.md line and token budgets, inferred
script strategy, compatibility notes, allowed-tools, and the execution flags
that explain whether cloud writeback is viable or still review-first.
Validation checklists and technical details also now surface structural
recommendations as first-class findings instead of showing 0 issues for
typed reports that were previously stored outside the generic lint array
shape.
2026-04-18
Cloud, Dashboard
Ended post-apply observation windows are now scored against baseline telemetry instead of staying pending forever
Applied cloud-improve proposals now compare a pre-apply telemetry window
against the finished post-apply observation window and classify the result as
helped, inconclusive, or regressed.
Proposal detail now shows the before/after live-signal breakdown for eval
volume, pass rate, missed triggers, false negatives, and false positives, and
source-level Eval suite trust now folds recent observed regressions and
helps into the trust summary so coverage isn’t the only signal. Ops can
batch-score ended windows with bun run score:cloud-improve-observations.
2026-04-18
Cloud, Dashboard
Applied cloud-improve proposals now enter an explicit observation window instead of looking fully done the moment apply succeeds
Applied cloud-improve proposals now show a post-apply observation state on
the proposal detail page. Instead of treating
applied as the end of the
story, the page now distinguishes proposals that are still gathering live
signal from ones that will eventually be evaluated against observed outcomes.
This is the first thin slice of the post-apply observation loop from the
cloud-improve quality hardening plan. It does not score outcomes yet, but it
does create a durable observation record the moment a draft promotion or
GitHub apply succeeds.
2026-04-18
Cloud, Dashboard
Cloud skill pages now show eval-suite trust signals, and the cloud-improve runner has a first judge-calibration benchmark command
Cloud skill source pages now include an
Eval suite trust panel that shows
whether the saved suite still covers recent linked telemetry and the newest
hosted improve-run pressure. Instead of treating every eval win as equally
trustworthy, the page now marks suites as Fresh, Watch, Stale, or
No signal based on how much recent evidence is actually covered by saved
trigger-query cases.
The cloud-improve runner also now ships with a first reusable
judge-calibration command, so the llm_judge trigger evaluator can be
checked against a labeled benchmark fixture instead of remaining an
unmeasured instrument.
2026-04-18
Cloud, Dashboard
Cloud skill setup now auto-fixes common Agent Skills spec issues before the first snapshot is analyzed
Upload-backed and GitHub-backed cloud skill setup now canonicalize common
package issues before the first snapshot is analyzed. The setup flow rewrites
lowercase skill.md to SKILL.md, rebuilds missing or invalid frontmatter,
normalizes the skill name to Agent Skills format, synthesizes a required
description when it is missing, and moves unsupported top-level frontmatter
fields into metadata so the package starts from a spec-compliant baseline.
The setup response now also reports which fixes were applied, and the cloud
library success message calls that out before the first quick eval suite and
hosted improve run are prepared.
2026-04-18
Cloud, Dashboard
Cloud eval suites now learn from telemetry and hosted runs, write accepted suggestions directly into saved suites, and surface eval pressure from the latest run
Cloud skill pages now surface suggested trigger-query cases from the linked
observed skill whenever recent real usage exposes misses or false positives
that are not already covered by the saved suite.
Hosted improve runs also now persist query-level eval evidence, so recent run
failures and regressions can feed the same review queue when the active suite
no longer covers those cases.
These suggestions stay review-first: you can append them into the draft suite
from the cloud page, inspect them in the editor, and then decide whether to
save them before the next improve run. Accepted and dismissed suggestions are
now also persisted, so the same pending cases do not keep resurfacing after
you review them. The cloud source page now also keeps a reviewed history with
restore and re-accept actions, so dismissed cases can be reopened and
accepted cases can be added back into the draft without losing their
provenance. Accepted suggestions can now also be written directly into the
selected saved suite with their telemetry/run provenance preserved, instead
of stopping at the draft-only state. The cloud source page now highlights the
latest run’s eval pressure directly, and the improve run detail page surfaces
the failed/regressed queries from that run instead of forcing operators to
dig through raw artifacts to find them. Source detail reads also now degrade
safely if the new eval-suggestion review tables are one migration behind,
while review actions return a clear migration-needed error instead of a raw
database exception.
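The review-queue behavior described above can be pictured as a small filter. This is a minimal sketch, not the hosted implementation: the `Suggestion` shape, the `normalizeQuery` helper, and the `pendingSuggestions` function are hypothetical names, and it assumes a suggestion is identified by its normalized query text.

```typescript
// Hypothetical shapes for illustration; not the real selftune schema.
type ReviewStatus = "accepted" | "dismissed";

interface Suggestion {
  query: string;
  origin: "telemetry" | "hosted_run";
}

// Compare queries case-insensitively and ignore extra whitespace.
function normalizeQuery(query: string): string {
  return query.trim().toLowerCase().replace(/\s+/g, " ");
}

// Keep only suggestions that are neither covered by the saved suite nor
// already reviewed. `reviewed` is assumed to be keyed by normalized query.
function pendingSuggestions(
  candidates: Suggestion[],
  savedSuiteQueries: string[],
  reviewed: Map<string, ReviewStatus>,
): Suggestion[] {
  const covered = new Set(savedSuiteQueries.map(normalizeQuery));
  const seen = new Set<string>();
  return candidates.filter((candidate) => {
    const key = normalizeQuery(candidate.query);
    if (covered.has(key) || reviewed.has(key) || seen.has(key)) return false;
    seen.add(key); // drop duplicates coming from telemetry and runs
    return true;
  });
}

const queue = pendingSuggestions(
  [
    { query: "Summarize this PDF", origin: "telemetry" },
    { query: "summarize  this pdf", origin: "hosted_run" },
    { query: "Translate the README", origin: "telemetry" },
    { query: "Draft a release note", origin: "hosted_run" },
  ],
  ["translate the readme"],
  new Map<string, ReviewStatus>([["draft a release note", "dismissed"]]),
);
```

In this sketch, restoring a dismissed case would amount to deleting its key from `reviewed`, after which it would show up in the pending queue again.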
2026-04-18
Cloud, Dashboard
Cloud improve now generates true bounded surface candidates and run pages render actual reviewed diffs
Hosted improve generation now treats description, routing, and body as real
bounded mutation surfaces instead of prompt-only hints. A routing candidate
rewrites only routing, a description candidate rewrites only the description,
and body candidates preserve routing while updating the non-routing sections
they are allowed to touch.

Improve run detail pages now also normalize old prose-only diff summaries
back into real unified diffs when the source and candidate archives exist, so
review pages show the actual changed lines instead of a rationale paragraph.
2026-04-18
Cloud, Dashboard
Improve run detail pages now show the skill context, run outcome, and readable evidence instead of raw storage URLs
Hosted improve run pages now load the source skill and eval-suite context,
summarize the winning result or failure in plain language, and show the best
candidate’s score movement and diff preview directly on the page.

Evidence is still available for download, but artifact links are now grouped
and labeled by what they represent instead of exposing a wall of raw R2
URLs.
2026-04-18
Cloud, Dashboard
Fresh cloud skills now auto-start the first hosted improve run, and the legacy in-process dispatcher is rollback-only
Fresh cloud skill sources now move directly from upload or sync into the
first hosted improve run once the quick eval suite is generated, instead of
stopping on the setup page and requiring a separate manual queue action.

The API startup path also now treats the old in-process improve dispatcher as
an explicit legacy rollback path rather than part of the normal runtime.
Cloudflare remains the default hosted execution plane whenever the runtime
URL is configured.

The API-key cloud-source surface also now matches the dashboard route for
creating GitHub-backed sources, which makes the same hosted improve flow
scriptable for smoke runs and automation.
New @selftune/email package with 9 branded React Email templates: welcome,
alert notification, evolution proposal, weekly digest, team invitation, plan
upgrade, usage limit warning, getting started, and first insight. Alert emails
now use HTML templates instead of plain text. Team invitations and billing
checkout flows send branded emails automatically.
2026-04-18
Cloud
Cloud improve now auto-links uploaded and GitHub-backed sources to canonical skills, and imported task-package suites are first-class
Upload-backed and GitHub-backed cloud sources now automatically create or
reuse the canonical skills row they belong to. That closes the proposal
gap where a winning improve run could persist artifacts but skip
proposal_created because the source had no linked skill_id.

The eval-suite control plane also now accepts source_kind = imported for
deterministic task_package suites, which is the first explicit hosted lane
for benchmark-style imports instead of treating every imported suite as a
manual one. The docs now also include a first-class imported benchmark page
and script path for turning package manifests into live cloud eval suites.
2026-04-17
Cloud
Cloud improve now supports deterministic task-package eval suites and benchmark-style runtime docs
Hosted eval suites now accept deterministic task_package cases, which lets
you point an improve run at a benchmark-style environment archive and verifier
script instead of relying only on trigger-query or exact-match checks.

The Cloudflare runtime executes these task packages inside Sandboxes so the
verifier has a real filesystem and process boundary, and the public docs now
cover improve run events, statuses, and eval-suite API usage in the same
terminology the product uses.
2026-04-17
Cloud, Dashboard
Improve run pages now show customer-facing live progress, delay states, and clearer timeline copy
Hosted improve pages now translate runtime activity into customer-facing
progress language instead of exposing queue, worker, or transport details.
Active runs surface clearer status cards, a friendlier timeline, and “taking
longer than expected” messaging when a run stalls.

The improve overview also better distinguishes active versus completed work
without making the page feel like an internal operations console.
2026-04-17
Cloud, Dashboard
Improve run pages now refresh live while queued and running, with terminal refetch on completion
Hosted improve run detail pages now subscribe to the run event stream while a
run is queued or running, updating phase and status live instead of
waiting for a manual refresh. When a terminal event arrives, the page
re-fetches full run detail so candidates, artifacts, and proposal state stay
in sync.

The improve run list also now polls only while active runs are visible, which
keeps the overview current without constantly refetching completed history.
2026-04-17
Cloud
Cloud source uploads and GitHub sync now accept lowercase skill.md packages and preserve folder paths on the API-key surface
Hosted cloud-source ingest now accepts both SKILL.md and lowercase skill.md
when validating uploaded packages and GitHub-backed skill repos.
That keeps upload and sync behavior aligned with the rest of the hosted
analysis pipeline, which already supported both casings.

The API-key Hono upload route also now preserves multipart field keys as
relative paths instead of flattening uploaded files to their basenames, so
folder uploads keep nested references/ and other package structure intact.
2026-04-17
Cloud
Cloud improve runtime: Cloudflare execution plane foundation and live SSE event stream
Added foundation for Cloudflare-backed improve run execution using Queues,
Workflows, and Sandboxes. A new GET /api/v1/improve-runs/:id/events endpoint
streams run lifecycle events via SSE, enabling live updates on the run detail
page without manual refresh. The runtime mode is controlled by
CLOUD_IMPROVE_RUNTIME_MODE and defaults to legacy with no behavior change
until explicitly switched.
2026-04-17
Cloud, Dashboard
Cloud skill validation now uses native spec checks and clearer report detail during setup
Cloud skill setup now persists one validation report per snapshot and shows
those results inline in the guided setup hero, so structural validation,
best-practice lint, and capability classification are easier to inspect
without dropping into raw logs.

The hosted validation step also now runs on a native TypeScript
implementation of the Agent Skills frontmatter rules instead of shelling out
to the demonstration skills-ref toolchain. That keeps cloud validation
deterministic in production while tightening allowed-tools parsing and
preserving clearer per-rule issue messages.

Apply flows also now version and re-upload promoted skill archives more
safely. Draft apply and GitHub PR apply both keep the cloud source pointed at
the newly promoted snapshot, preserve archive manifests across lowercase
skill.md packages, and avoid corrupting frontmatter when YAML values
contain ---.
2026-04-17
Cloud, Dashboard
Cloud source APIs now honor source-type and skill filters consistently across dashboard and API-key surfaces
The hosted cloud-source list API now applies the same type and skill_id
filters on the API-key Hono surface that the dashboard session route already
supported. That keeps browser, CLI, and smoke-test callers on one normalized
contract when listing cloud skills.

This batch also tightens the hosted improve apply/runtime path so root-level
GitHub applies do not infer repo-wide deletions, runner dependencies resolve
snapshots through the correct org-scoped database client, and local Neon CLI
binding metadata is no longer tracked in git.
2026-04-17
Cloud, Dashboard
Cloud improve model selectors now load the live OpenRouter catalog and use a teacher-student default spread
The per-run model selectors on cloud and observed skill detail pages no
longer use a hardcoded GPT-only shortlist. They now load the current
OpenRouter model catalog from the server and expose the broader set of
text-capable models available through the hosted cloud runtime.

Each selector also now carries an explicit recommended default for
generate, judge, and summarize. Leaving a selector empty keeps the
server-side default for that role, and the UI now spells out those defaults
directly so you can test alternatives without losing track of the intended
baseline. The selectors are now searchable comboboxes as well, so longer
model lists stay usable without scrolling through a giant dropdown.

The recommended defaults now follow a clearer teacher-student split
instead of a flat GPT-only stack: google/gemini-2.5-pro for proposal
generation, google/gemini-2.5-flash for judging, and
google/gemini-2.5-flash-lite for summarization. That keeps the strongest
model on the expensive generative step while moving validation and helper
work onto cheaper OpenRouter models.
2026-04-17
Cloud, Dashboard
Cloud skill onboarding now auto-creates a 50-case eval suite and funnels into one happy path
When you create or sync a cloud skill, the detail page now automatically
drafts and saves a 50-case quick eval suite from the current snapshot
instead of making you build one manually first. The skill page now leads with
a single guided decision: edit the generated eval suite or start the
hosted improve run.

The cloud skill detail UI has also been simplified around that progression.
The primary suite is now treated as one editable artifact with save support,
advanced run controls stay collapsed by default, and metadata/report panels
are moved behind a technical-details disclosure so the page feels less like a
control plane and more like a clear product flow.

The Eval summary card now surfaces the per-case origin (auto-generated
versus hand-curated versus a mixed breakdown) so you can tell at a glance
whether the suite is still the synthetic draft or has been edited. Clicking
Edit eval suite also now smooth-scrolls the editor into view, and the
hero action row has been reordered so advanced run options live next to the
edit affordance rather than after Start improve run.
2026-04-16
Cloud, Dashboard
Overview now presents the hosted cloud loop instead of legacy first-run telemetry onboarding
The cloud dashboard overview now introduces SelfTune as a hosted review
loop instead of the older “run selftune and wait for skills to appear”
onboarding. The empty-state banner now points people toward the real cloud
path: create or import a cloud skill, shape a reviewable eval suite, run the
hosted comparison, and review the resulting proposal before draft apply.

The overview also now keys that banner off cloud authoring state rather
than observed telemetry alone, so the first-run guidance stays visible until
you actually have cloud sources in the hosted product.
2026-04-16
Cloud, Dashboard
Quick Eval Suite can now auto-seed synthetic trigger cases and edit them in a table
The cloud skill detail page now includes a real Quick Eval Suite editor
instead of only raw textareas and JSON authoring. Trigger-query cases are now
editable as table rows with expectation, invocation type, provenance, and
row-level remove actions.

For llm_judge suites, the page can also draft a synthetic seed directly
from the current snapshot’s SKILL.md. Those seeded cases are marked as
synthetic so you can review, revise, or delete them before creating the
hosted eval suite.
2026-04-16
Cloud, Dashboard
Cloud folder uploads now preserve a real skill directory and show clearer selection state
Cloud skill uploads now package the selected folder as a real skill directory
instead of flattening only its file contents into the snapshot archive. That
means downstream validation sees the skill in a proper directory layout,
which fixes false failures caused by generic wrapper names during skills-ref
analysis.

The dashboard upload flow is also clearer: the picker now behaves like a
folder intake card, shows the detected folder name and file count, confirms
whether a root SKILL.md was found, and disables upload until the selection
is actually valid.
2026-04-16
Cloud, Dashboard
Cloud improve runs now support separate generate, judge, and summarize model overrides
Cloud improve run setup now exposes three separate model selectors instead of
one shared override. You can independently pick the model used for
candidate generation, LLM judging, and summarization from the
skill detail page before queueing a run.

These overrides are stored with the run itself and passed through the hosted
runner, so they no longer collapse into a single model choice. This makes it
practical to test cheaper summarize settings while keeping a stronger judge,
or to isolate generation changes without touching the server defaults.
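One way to picture per-role overrides is a resolver that treats each role independently. This is a minimal sketch under stated assumptions: the type names and `resolveModelPolicy` are hypothetical, and the default model identifiers are borrowed from the hosted-default entry dated 2026-04-16 on this page purely as example values. It is not the hosted runner's actual code.

```typescript
// Role names come from the changelog entry; everything else is illustrative.
type ModelRole = "generate" | "judge" | "summarize";
type ModelPolicy = Record<ModelRole, string>;
type ModelOverrides = Partial<Record<ModelRole, string | null>>;

// Assumed server defaults, used here only as example values.
const SERVER_DEFAULTS: ModelPolicy = {
  generate: "openai/gpt-4.1-mini",
  judge: "openai/gpt-4.1-mini",
  summarize: "openai/gpt-4.1-nano",
};

// Resolve each role on its own: a non-empty override wins, while an empty or
// missing selector keeps the server default for that role only.
function resolveModelPolicy(
  overrides: ModelOverrides,
  defaults: ModelPolicy = SERVER_DEFAULTS,
): ModelPolicy {
  const pick = (role: ModelRole): string => {
    const value = overrides[role]?.trim();
    return value ? value : defaults[role];
  };
  return {
    generate: pick("generate"),
    judge: pick("judge"),
    summarize: pick("summarize"),
  };
}

// Example: keep the default generator and judge, try a different summarizer.
const policy = resolveModelPolicy({ summarize: "google/gemini-2.5-flash-lite" });
```

Because each role is resolved separately, changing the summarize model in this sketch cannot alter the judge or generator.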
2026-04-16
Cloud, Dashboard
Cloud dashboard now uses skill-first wording instead of exposing raw source terminology
The cloud dashboard now uses skill-first wording across the library,
detail pages, and observed-to-cloud bridge instead of exposing the backend
source model directly in the UI. Cloud library cards, blocked-state
messages, quick eval setup, and linked-skill surfaces now read as normal
product concepts: Cloud skills, Open Skill, and Linked cloud
skills.

This is a terminology cleanup only. The backend cloud_skill_sources model
and API routes are unchanged, but the visible dashboard flow is less
confusing because it no longer asks users to think in storage-layer terms.

Observed skill cards now include an Import to Cloud button that navigates
directly to the cloud library with the skill pre-filled for import. The cloud
source detail page also gains an Eval Suites section where you can view
existing suites scoped to a source and create new ones inline with a name,
verifier kind, and JSON test cases, with no CLI or API calls required.

Proposal detail pages now show full eval comparison data, artifact kinds,
confidence levels, and an Apply to Draft button that closes the loop
from review to draft apply. The jobs page shows Cloud Improve Runs
alongside pipeline jobs with status badges and candidate counts.
2026-04-16
Cloud, Dashboard
Cloud Library now shows cloud authoring records separately from observed telemetry skills
The cloud dashboard’s main Skills surface now reflects the real cloud
authoring model instead of the telemetry-backed skill table. It lists
GitHub-backed sources, imported uploads, and cloud-managed records with their
current snapshot and capability state.

The old telemetry-backed skills library is still available, but it now lives
under Observed so local and cloud concepts do not get mixed together. Cloud
sources without a linked telemetry skill now have their own detail page for
snapshot metadata, validation reports, and hosted improve controls.

The cloud library also now includes first-run onboarding directly in the UI:
you can upload a skill folder into a new cloud source, create and sync a
GitHub-backed source from a bound installation, jump into cloud import from
observed skills, and create a lightweight eval suite from the cloud detail
page before queuing a hosted improve run.

Observed skill detail pages now also expose that bridge directly: if a skill
already has linked cloud sources you can jump straight into them, and if it
does not you can import it into cloud from the report itself instead of
backing out to the library first.
2026-04-16
Cloud, Dashboard
Cloud improve runs can now override the model per run, with cheaper default summarize policy
Cloud improve runs can now override the hosted model policy directly from the
skill page before queueing a run. The selected override applies to the full
run, not just candidate generation, so you can force a cheaper test model or
a stronger one without changing server env defaults.

Hosted defaults are also now more cost-aware for testing: generation and
judging stay on openai/gpt-4.1-mini, while summarize and low-risk helper
work default to openai/gpt-4.1-nano.
2026-04-16
Cloud
Cloud skill improvement integration: runner package, eval backends, control-plane wiring, and hosted eval-suite parity
The cloud skill improvement pipeline is now fully wired end-to-end. The
isolated runner package (@selftune/cloud-improve-runner) connects to the
control-plane orchestrator via concrete dependency adapters. Eval backends
(trigger-query LLM judge + deterministic) dispatch through a registry. Both
draft and GitHub apply paths consume the same candidate archive contract.

Hosted eval-suite creation now validates runnable manual suites for both
llm_judge and deterministic verifiers through the same control-plane
contract the runner consumes, which keeps the dashboard and API-key paths on
one canonical suite definition.

- selftune improve and selftune evolve now fall back to plain stderr progress lines when the terminal is not a TTY, instead of going completely silent while long proposal or validation steps are still running.
- Interactive terminals keep the spinner/TUI behavior, while test runs remain quiet by default.
- Local dashboard action toasts now include a Live run action that opens the exact /live-run entry for the streaming creator-loop event, including the event id, skill, and action selection state.
- The floating Live lifecycle actions feed now uses the same deep link, so clicking a running or finished lifecycle card jumps straight into the matching Live Run entry instead of leaving you to find it manually.
- selftune eval generate now accepts --agent for --synthetic, --auto-synthetic, and --blend, so you can force opencode, codex, or pi instead of relying on auto-detection order.
- Cold-start synthetic eval generation now reuses the same cleaned query filtering as log-derived evals and summarizes oversized SKILL.md content before sending it to the runtime, which reduces prompt bloat for large skills like SelfTuneBlog.
- Bounded package search now writes merged routing/body candidates into a new
temp package snapshot instead of overwriting the already-evaluated body
variant on disk, so candidate artifacts remain consistent for later winner
application and review.
- selftune create publish --watch --ignore-watch-alerts now also bypasses the watch gate when the watch subprocess crashes or fails to emit structured JSON, while still surfacing the warning and remediation command.
- The OSS local dashboard LiveRun test fixture now uses the real DashboardActionResultSummary shape for bounded package-search summaries, so export verification no longer fails when search_run is present on deploy candidate entries.
- The local dashboard now normalizes selftune create replay, selftune create baseline, selftune evolve, selftune evolve body, and selftune search-run into the same lifecycle-facing commands the CLI already shows, so Overview, Skill Report, and Live Run no longer leak stage-level command names for draft-package flows.
2026-04-15
OSS, CLI, Dashboard
Package baseline now reuses fresh replay artifacts and emits phase progress
- selftune create baseline --mode package now reuses the last fresh with-skill replay from the canonical package-evaluation artifact when the draft fingerprint still matches, so measuring baseline no longer pays for two full replay passes after an unchanged verify, report, or search-run.
- Package baseline now emits explicit with_skill_replay and without_skill_replay step progress so the local dashboard live-run surface shows immediate movement instead of looking stuck while the underlying replay work is still running.
- selftune search-run now prefers reflective routing/body proposals from measured runtime failures before targeted or deterministic fallback.
- When routing and body both produce accepted improvements, package search now evaluates a merged candidate before final winner selection instead of forcing the frontier to choose between complementary single-surface edits.
- Plain selftune improve now auto-selects bounded package search for skills that already have package evidence or a draft package manifest, so agents do not need to force --scope package for the main package-shaped lifecycle.
- Added an end-to-end package lifecycle test covering verify auto-fix, bounded package search, winner promotion, and publish --watch.
- selftune search-run now uses the same measured targeted-routing/body mutation path as orchestrate package search, falling back to deterministic variants only when targeted variants do not fill the requested minibatch.
- The public CLI docs and workflow docs now describe search-run as bounded local package search over draft variants instead of registry lookup, and the Evolve workflow now points package-scope users at the measured targeted search path instead of only the older deterministic description.
- Publish/package-search lifecycle docs now describe the real blocking publish-time watch gate instead of the older advisory wording.
- selftune verify now auto-runs the real missing-evidence commands with the required flags and skill context, including --auto-synthetic eval generation and generated unit tests.
- selftune create publish --watch now blocks publish if the watch subprocess fails or returns malformed output instead of treating missing watch JSON as a passing gate.
- Eval-informed targeted mutations now read grading_results.pass_rate, expectations_json, and failure_feedback_json from the real SQLite schema instead of a test-only summary_json shape.
- The shipped lifecycle docs now describe the actual concrete readiness states and the correct --ignore-watch-alerts flag.
- normalizeLifecycleCommand now maps create replay, create baseline, evolve, evolve-body, and search-run to their lifecycle equivalents.
- selftune --help now shows Primary Lifecycle commands first, with Advanced / Stage Commands below.
- selftune verify auto-runs missing evidence steps (up to 4 iterations) when readiness checks fail. Use --no-auto-fix to skip.
- collectPackageSearchEligibleSkills now includes a second eligibility tier: skills with a selftune.create.json draft package and at least 3 grading results in the DB are routed to package search during orchestrate.
- The existing frontier/artifact fast path is unchanged; the new tier is additive and fail-open (skips silently if the grading table is missing).
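The two eligibility tiers described above can be sketched as a pure predicate. This is an illustration only: the `SkillEvidence` shape, its field names, and both function names are hypothetical, and a `null` count is simply this sketch's way of modeling a missing grading table. The real collectPackageSearchEligibleSkills reads from local state and SQLite and may differ.

```typescript
// Hypothetical input shape; field names are illustrative, not the real schema.
interface SkillEvidence {
  name: string;
  hasFrontierOrArtifact: boolean; // existing fast path
  hasDraftPackage: boolean; // a selftune.create.json exists for the skill
  gradingResultCount: number | null; // null models "grading table is missing"
}

const MIN_GRADING_RESULTS = 3; // threshold stated in the entry above

function isPackageSearchEligible(skill: SkillEvidence): boolean {
  // Tier 1: the unchanged frontier/artifact fast path.
  if (skill.hasFrontierOrArtifact) return true;
  // Tier 2 is additive: without grading data the tier is skipped silently
  // rather than raising an error.
  if (skill.gradingResultCount === null) return false;
  return skill.hasDraftPackage && skill.gradingResultCount >= MIN_GRADING_RESULTS;
}

function collectEligible(skills: SkillEvidence[]): string[] {
  return skills.filter(isPackageSearchEligible).map((skill) => skill.name);
}

const eligible = collectEligible([
  { name: "frontier-skill", hasFrontierOrArtifact: true, hasDraftPackage: false, gradingResultCount: null },
  { name: "graded-draft", hasFrontierOrArtifact: false, hasDraftPackage: true, gradingResultCount: 3 },
  { name: "thin-draft", hasFrontierOrArtifact: false, hasDraftPackage: true, gradingResultCount: 2 },
  { name: "no-grading-table", hasFrontierOrArtifact: false, hasDraftPackage: true, gradingResultCount: null },
]);
```

Here only `frontier-skill` and `graded-draft` qualify; the other two fall through to the standard evolution path.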
2026-04-15
OSS
Docs: fix stale orchestrate claim in SearchRun.md and document watch frontier demotion
- SearchRun.md no longer claims orchestrate cannot auto-select package search; it documents the eligibility criteria and plan-phase routing.
- Watch.md adds a “How Watch Evidence Feeds Back to the Frontier” section explaining watch rank levels, SQLite row updates, and dashboard visibility.
- SKILL.md SearchRun routing keywords now include “optimize package”, “improve routing and body together”, and “bounded evolution”.
2026-04-15
OSS, CLI
Publish watch gate now blocks and mutation weakness extraction populates failure patterns
- create publish --watch now blocks publishing when the watch gate detects active alerts (published: false, watch_gate_blocked: true), instead of unconditionally publishing. Use --ignore-watch-alerts to bypass.
- extractMutationWeaknesses now populates gradingFailurePatterns from the expectations array in grading summary JSON, enabling targeted body mutations to focus on specific failed expectations.
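The blocking gate in this entry can be sketched as a small decision function. The `published` and `watch_gate_blocked` field names come from the entry; the input shape, the `warnings` field, and the `applyWatchGate` name are assumptions made for illustration and are not the CLI's actual payload contract.

```typescript
// Hypothetical input: alerts reported by the watch step plus the bypass flag.
interface WatchGateInput {
  activeAlerts: string[];
  ignoreWatchAlerts: boolean; // models the --ignore-watch-alerts flag
}

// Only published and watch_gate_blocked are named in the changelog entry.
interface PublishResult {
  published: boolean;
  watch_gate_blocked: boolean;
  warnings: string[]; // assumed field for surfacing alerts
}

function applyWatchGate(input: WatchGateInput): PublishResult {
  const hasAlerts = input.activeAlerts.length > 0;
  const blocked = hasAlerts && !input.ignoreWatchAlerts;
  return {
    published: !blocked,
    watch_gate_blocked: blocked,
    // Alerts stay visible even when the gate is bypassed.
    warnings: [...input.activeAlerts],
  };
}

const blockedResult = applyWatchGate({
  activeAlerts: ["trigger pass rate regressed"],
  ignoreWatchAlerts: false,
});
const bypassedResult = applyWatchGate({
  activeAlerts: ["trigger pass rate regressed"],
  ignoreWatchAlerts: true,
});
```

In this sketch the bypass changes only the publish decision, so a reviewer still sees the alert that was overridden.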
- Orchestrate now marks skills package-search-eligible from the real accepted frontier and canonical package-evaluation artifacts, so the new package-search branch is reachable in normal runs instead of existing only in isolated tests.
- The orchestrate package-search phase now uses the current mutation and
winner-application contracts, including targeted routing/body variants,
current candidate path fields, and the current applySearchRunWinner response shape.
- create publish --watch now surfaces watch_gate_passed, watch_gate_warnings, and watch_trust_score directly in the publish payload, and --ignore-watch-alerts now intentionally bypasses that advisory gate when needed.
- Skill reports now populate watch_trust_score from the latest stored package-evaluation watch summary, so the dashboard watch trust indicator renders from real watch evidence instead of staying empty.
- Fixed the selftune orchestrate CLI docs page so Mintlify renders it as a normal document instead of a raw fenced code block.
- Dashboard skill report and live run now display routing and body weakness percentages from surface plan data, with a visual bar highlighting the weaker surface. The frontier panel also shows a parent-vs-winner comparison when both members are available.
- Added evidence-driven scope selection to orchestrate so it automatically chooses between description-level evolve and package-level bounded search based on accepted frontier state and canonical package evaluation evidence.
- Added watch trust scoring feedback so post-deploy regressions can demote accepted frontier candidates and influence future scope selection.
- Updated workflow and skill documentation to reflect the new package-search-in-orchestrate truth.
- Added deterministic routing mutations (synonym expansion, granularity split, coverage broadening) and body mutations (instruction emphasis, example enrichment, description expansion) for bounded package evolution.
- Added eval-informed targeted mutations that consume measured weaknesses from replay failures and grading results to focus routing and body changes on specific failure patterns.
- Added weakness extraction from the local SQLite database to surface replay failure samples, routing misses, body quality scores, and grading pass rate deltas for mutation targeting.
- Added computeWatchTrustScore to the watch module, producing a 0-1 trust score from trigger regression, grade regression, and rollback signals.
- Added an advisory publish watch gate that warns when active alerts or low trust scores are detected, with --ignore-watch-alerts bypass for experts.
- Extended the dashboard contract with watch_trust_score on skill reports and watch_gate_passed on action result summaries.
- Updated the live run screen to display watch gate pass/alert badges when watch or deploy actions complete.
- Added a watch trust indicator to the skill report creator loop section.
- Added a package-search candidate action to the orchestrate loop so skills with accepted package frontier candidates are routed through bounded package search instead of standard evolution.
- The new phase generates bounded mutations, fingerprints variants, runs package search evaluation, and applies winning candidates automatically.
- Package search modules are lazy-loaded and gracefully degrade when unavailable, so the existing orchestrate flow is unaffected until the full package search stack is present.
- search-run now treats body.quality_score: null as a neutral weakness signal when the body already passed validation, instead of coercing it to maximum weakness.
- This prevents --surface both from over-allocating routing/body search budget toward body mutations when quality assessment was unavailable but the current body was still valid.
- search-run --surface both now reads the accepted frontier first and falls back to the canonical package evaluation when needed, using that measured package state to bias routing/body candidate counts.
- This replaces the old fixed half-routing half-body split with a weakness planner that sends more of the minibatch budget toward the weaker measured surface while still keeping bounded deterministic search behavior.
- The chosen surface budget is now persisted into search provenance and shown in the live-run and skill-report frontier surfaces, so reviewers can see why a run spent more budget on routing or body.
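A weakness planner of this kind could be sketched as a proportional split. The changelog does not give the real formula, so everything here is an assumption: the 0 to 1 weakness scale, the neutral value of 0.5 for a missing body score, the one-candidate floor per surface, and the `planSurfaceBudget` name are all hypothetical. The sketch also skips the "body already passed validation" check that the real planner applies before treating a null score as neutral.

```typescript
// Hypothetical planner input: 0..1 weakness per surface, higher = weaker.
interface SurfaceWeakness {
  routing: number;
  body: number | null; // null = quality score unavailable
}

interface SurfaceBudget {
  routing: number;
  body: number;
}

const NEUTRAL_WEAKNESS = 0.5; // assumed stand-in for a "neutral" signal

function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

function planSurfaceBudget(weakness: SurfaceWeakness, maxCandidates: number): SurfaceBudget {
  const routing = clamp01(weakness.routing);
  // A missing body score is treated as neutral, not as maximum weakness.
  const body = weakness.body === null ? NEUTRAL_WEAKNESS : clamp01(weakness.body);
  const total = routing + body;
  // With no measured weakness at all, fall back to an even split.
  const routingShare = total === 0 ? 0.5 : routing / total;
  let routingCount = Math.round(maxCandidates * routingShare);
  // Assumed floor: keep at least one candidate per surface when possible.
  if (maxCandidates >= 2) {
    routingCount = Math.min(Math.max(routingCount, 1), maxCandidates - 1);
  }
  return { routing: routingCount, body: maxCandidates - routingCount };
}

const routingHeavy = planSurfaceBudget({ routing: 0.8, body: 0.2 }, 10);
const unknownBody = planSurfaceBudget({ routing: 0.5, body: null }, 10);
```

The budget always sums to the requested candidate count, which keeps the search bounded regardless of how lopsided the measured weaknesses are.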
- Added search-run --apply-winner, which copies the winning candidate back into the draft package and refreshes the canonical package-evaluation artifact from the accepted candidate cache instead of leaving search as read-only provenance.
- selftune improve --scope package now adds winner promotion by default and keeps --dry-run as the review-only escape hatch.
- Search-run dashboard summaries now carry the resulting next command and package-evaluation context when a winning candidate is applied, so live review stays grounded in measured package state instead of raw search provenance.
- Added selftune improve --scope package, which routes the primary improvement alias into selftune search-run instead of keeping bounded package search behind an expert-only command.
- Package scope now preserves --eval-set, strips redundant --dry-run, normalizes compatible replay validation flags, and maps --candidates onto search-run’s --max-candidates knob.
- Updated command help, workflow docs, SKILL routing guidance, and CLI docs so package search is taught as part of the main measured improvement loop.
- Added selftune search-run as a real top-level CLI command that generates bounded routing/body package variants, evaluates them through the shared package evaluator, and persists the selected winner plus provenance.
- Wired search-run through dashboard actions, child-process event instrumentation, live-run summaries, and draft-package action buttons so bounded search is executable from the product surface instead of only existing as stored backend state.
- The skill report backend now returns real package frontier state and the latest search-run provenance, so the frontier panel is driven by measured candidate history rather than a dormant response field.
- Package search evaluations now normalize temp candidate variants back onto the canonical skill name, and winner selection now follows the accepted frontier over the full evaluator contract instead of replay-only gains.
- Updated command help, workflow docs, SKILL routing, and the CLI quick reference so the new search surface is documented consistently.
- Updated selftune status output to label the readiness section “Package pipeline” instead of “Creator loop”.
- Adapted package search runner to the mature evaluator API with frontier-based parent selection.
- Normalized SKILL.md description and body to reference the package evaluation pipeline (replay, baseline, grading, body, unit tests, and post-deploy watch) as the primary improvement mechanism.
- Updated Evolve, EvolveBody, Watch, and CreateTestDeploy workflow docs to use package evaluation pipeline terminology consistently.
- Normalized Baseline, Evals, UnitTest, SignalsDashboard workflow docs and creator-playbook reference to use package evaluation pipeline terminology.
- Added package frontier panel to skill report showing accepted candidates ranked by measured evidence with watch-fed demotion indicators. - Added search run panel to live run screen showing selected parent, candidates evaluated, winner determination, and provenance detail. - Added search-run action result parsing to the dashboard action result contract so search runs surface structured summaries alongside existing replay dry-run results.
- Added
generateRoutingMutations()andgenerateBodyMutations()in the evolution pipeline to produce complete skill file variants that a package search runner can score. Three routing strategies (synonym expansion, granularity split, coverage broadening) and three body strategies (instruction emphasis, example enrichment, description expansion) create bounded variants written to temporary directories.
- Added bounded package search runner that evaluates candidate skill variants against the accepted frontier parent with measured delta acceptance. - Added package candidate state management with frontier reading, parent selection, and fingerprint-based deduplication. - Added package search provenance persistence tracking frontier size, parent selection method, candidate fingerprints, and evaluation summaries.
- `selftune watch` now reads the current package-evaluation artifact when one exists and computes an efficiency regression signal from observed post-deploy sessions, instead of only looking for trigger-pass-rate regressions and optional grade regressions.
- Efficiency watch is grounded in measured package baselines already produced by `create report` and `create publish`, so post-deploy monitoring now compares observed input tokens, output tokens, and assistant turns against the same package-evaluator contract used before publish.
- Efficiency regressions now flow through the structured watch result and the nested package watch summary, so publish/watch consumers can surface the same measured signal without scraping alert text.
- The local dashboard watch parser now preserves those efficiency-regression fields in the package watch summary, keeping the watch contract forward-ready for richer live-run presentation as more post-deploy package signals land.
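As a rough sketch of what an efficiency regression check could look like, the snippet below compares observed averages against a measured baseline for the three metrics named above. The `Efficiency` type, the `efficiency_regressions` helper, and the 20% tolerance are all assumptions made for illustration; the changelog does not state the actual thresholds or field names.

```python
from dataclasses import dataclass


@dataclass
class Efficiency:
    # Hypothetical per-session averages; field names are illustrative.
    input_tokens: float
    output_tokens: float
    assistant_turns: float


def efficiency_regressions(
    baseline: Efficiency, observed: Efficiency, tolerance: float = 0.2
) -> list[str]:
    """Return metric names whose observed value exceeds baseline by more than `tolerance`."""
    regressed = []
    for name in ("input_tokens", "output_tokens", "assistant_turns"):
        base = getattr(baseline, name)
        seen = getattr(observed, name)
        # Skip metrics without a usable baseline instead of dividing by zero.
        if base > 0 and (seen - base) / base > tolerance:
            regressed.append(name)
    return regressed
```

Returning the regressed metric names as data, rather than formatting an alert string, mirrors the stated goal of letting consumers read the signal without scraping alert text.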
- Durable draft package candidates now carry a measured acceptance decision in local state, instead of only lineage metadata, so candidate history can distinguish accepted improvements from measured regressions.
- Acceptance is computed from package-evaluator evidence rather than model confidence, with explicit replay, routing, baseline-lift, body-quality, and unit-test deltas plus a human-readable rationale attached to the candidate summary.
- Re-evaluating the same draft fingerprint preserves the original parent relationship instead of inventing a new comparison target, so repeated review runs update the candidate record without corrupting lineage.
- Fresh candidates now compare their measured acceptance against the latest accepted frontier member instead of blindly inheriting the most recent rejected draft as the comparison baseline, while still keeping chronological lineage in the parent link.
- When the current draft matches an already accepted frontier member, package evaluation can now reuse that candidate-specific artifact by fingerprint even if the canonical latest package report points at some other draft, so re-checking an accepted draft no longer repays the full evaluator cost.
- Accepted-frontier selection is now ranked by measured package outcomes instead of timestamp alone, so newer accepted drafts with weaker grading or weaker observed health no longer automatically become the comparison parent for the next candidate.
- `create publish --watch` now writes structured watch results back into the matching package candidate artifact and registry row, so observed regressions can demote an accepted draft in later frontier selection without fabricating a brand-new evaluation event.
- Cached package-evaluation reuse now also requires acceptance metadata in the stored artifact, so older lineage-only artifacts automatically refresh once before they can participate in candidate-aware reuse.
- Benchmark reports, `create publish` summaries, and the local dashboard live-run screen now surface the candidate acceptance decision and rationale, so measured accept/reject state is visible without opening archived JSON.
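To make the frontier-ranking idea concrete, here is a small sketch of parent selection that prefers measured outcomes over recency. The `Candidate` fields and the ordering (watch health first, then baseline lift, then timestamp as a tie-breaker) are assumptions chosen for illustration; the changelog does not document the real ranking key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    # All field names are hypothetical stand-ins for stored candidate state.
    candidate_id: str
    accepted: bool
    baseline_lift: float
    watch_regressed: bool
    evaluated_at: int  # larger means more recent


def select_parent(candidates: list[Candidate]) -> Optional[Candidate]:
    """Pick the comparison parent from accepted candidates only.

    Rejected drafts never become the parent. Among accepted ones, a draft that
    regressed under watch is demoted, then higher lift wins, and recency only
    breaks ties.
    """
    frontier = [c for c in candidates if c.accepted]
    if not frontier:
        return None
    return max(
        frontier,
        key=lambda c: (not c.watch_regressed, c.baseline_lift, c.evaluated_at),
    )
```

With this ordering, a newer accepted draft that later regressed in watch would lose to an older healthy one, which matches the demotion behavior described above.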
- Fresh draft package evaluations now register a durable package candidate per package fingerprint in local state, instead of only overwriting one latest package report per skill.
- New candidate records carry parent linkage to the previously evaluated draft for the same skill plus a candidate-specific archived evaluation artifact, so later bounded package search can reuse lineage and evaluator evidence instead of rebuilding history from ad hoc files.
- Cached package-evaluation reuse now requires candidate metadata in the saved artifact too, so older artifacts automatically force one fresh measured run before they can participate in candidate-aware reuse.
- Benchmark reports, publish summaries, and the local dashboard live-run view now surface candidate ID, parent linkage, and generation directly, so candidate lineage is inspectable without opening archived JSON artifacts.
- Repositioned the shipped `selftune` skill around a smaller lifecycle: `Create`, `Verify`, `Publish`, `Improve`, and `Run`, instead of leading with the older stage-heavy creator loop.
- Added new primary workflow docs for `Verify`, `Publish`, `Improve`, and `Run`, while keeping the existing lower-level eval, replay, baseline, watch, and body-evolution workflows available as advanced surfaces.
- Updated `SKILL.md`, routing keywords, and lifecycle-state guidance so “can I trust this skill?”, “ship this skill”, and “run the loop” now map to intention-level workflows that still use today’s commands accurately under the hood.
- Reframed `Create` as draft authoring only, marked the older `CreateTestDeploy` workflow as legacy compatibility guidance, and taught `Orchestrate` as the underlying runtime behind the simpler `Run` concept.
- The local dashboard action stream and dashboard-triggered publish/evolve paths now recognize and use the new `verify`, `publish`, `improve`, and `run` aliases where they preserve the same measured behavior, so the live-run UI stays aligned with the simplified lifecycle surface.
- The local dashboard overview, skill report, live action feed, and CLI docs now teach draft-package work as `verify`, `publish`, and live monitoring first, while still exposing the lower-level eval, replay, baseline, and create-check commands when an agent needs to drive the advanced loop manually.
- `selftune status`, dashboard recommended commands, live-run next-command cards, the shipped quick reference, README, and the main skill-authoring guides now normalize old surface aliases like `create check`, `create publish`, and `orchestrate` into `verify`, `publish`, and `run` when the underlying behavior is equivalent, so the product stops teaching mixed lifecycle vocabulary by default.
- Scheduled automation surfaces now teach `selftune run` as the default autonomous loop entrypoint: cron job messages, generated schedule snippets, alpha-enrollment guidance, orchestration reports, and the related docs and skill workflows all use `run` first while keeping `orchestrate` as the underlying advanced runtime name where needed.
- Fixed the `selftune create` CLI page after a broken MDX wrapper landed, and updated the main authoring, troubleshooting, sharing, trigger-testing, and creator-playbook docs so they teach `verify`/`publish` first while still documenting the lower-level `create replay`/`create baseline` package steps when a draft needs explicit measured proof.
- Normalized the secondary advanced workflow docs and README so `eval`, `unit-test`, `baseline`, `evolve`, `evolve body`, dashboard live-run, and legacy create-test-deploy guidance now distinguish draft-package lifecycle work from already-published skill iteration, instead of re-teaching the old creator-loop chain as the default.
- Cleaned up the remaining lifecycle wording in `status`, `eval`, and `create` CLI docs plus the shipped `SKILL.md` reference table, so “creator loop” now mainly survives as a compatibility/search term instead of the default label for the product surface.
- Corrected the package-search docs so `search-run` and `improve --scope package` are documented as explicit bounded-search surfaces, without claiming that `run`/`orchestrate` already auto-select package search before that automation is actually shipped.
- Added a canonical full-evaluation artifact beside the stored package summary, so `create report` and publish-time package gates can reuse one measured replay/baseline/body-validation result instead of scraping or recomputing partial state.
- Package-evaluation reuse is guarded by the bounded package fingerprint and request shape, so edited drafts or changed evaluation requests still trigger a fresh measured run instead of trusting stale evidence.
- Cache hits only apply when the saved package artifact already includes the current routing/body validation dimensions, so older summaries automatically fall back to a fresh measured run instead of silently downgrading the review signal.
- Benchmark reports and publish output now label whether the package evaluation was freshly measured or reused from a matching artifact cache, so creators can audit reuse instead of inferring it from timing or logs.
- The local dashboard live-run summary now surfaces that same fresh-vs-cached evaluation source for package report/publish actions, so cache reuse stays visible in the main review UI too.
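The reuse guard described above can be sketched as a single predicate. This is a hedged illustration: the artifact keys (`fingerprint`, `request_key`, `routing_validation`, `body_validation`) and the helper names are hypothetical, and the real artifact schema may differ.

```python
import json
from typing import Optional

# Hypothetical names for the validation dimensions a reusable artifact must carry.
REQUIRED_DIMENSIONS = {"routing_validation", "body_validation"}


def request_key(request: dict) -> str:
    """Serialize the evaluation request deterministically so equal requests compare equal."""
    return json.dumps(request, sort_keys=True)


def can_reuse(artifact: Optional[dict], fingerprint: str, request: dict) -> bool:
    """Allow a cache hit only when package contents, request shape, and dimensions all match."""
    if artifact is None:
        return False
    return (
        artifact.get("fingerprint") == fingerprint
        and artifact.get("request_key") == request_key(request)
        and REQUIRED_DIMENSIONS.issubset(artifact)
    )
```

Any miss falls through to a fresh measured run, so an older artifact that lacks the newer dimensions is refreshed rather than silently trusted.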
- Extended the shared draft package evaluator so `create report` and `create publish` now attach current routing replay validation and current body validation alongside replay, baseline, grading, unit-test, and watch evidence.
- Updated the benchmark-style package report format so routing replay and body validation show up in the same deterministic artifact as the rest of the measured package evidence.
- Updated the active bounded package-evolution plan to reflect that body/routing validation is now part of the unified evaluator contract, moving the remaining gap toward candidate state, evaluator reuse, and measured search rather than missing evaluator dimensions.
- Added `selftune create report --skill-path <path>` as a no-side-effect package-evaluation command that runs replay plus baseline and renders one benchmark-style report with failure analysis, measured lift, recommendation, and next-step guidance.
- Added the same report shape as a reusable helper in the shared draft package evaluator so future dashboard and PR-summary surfaces can reuse one deterministic evidence format instead of inventing ad hoc summaries.
- Updated the selftune skill workflow docs, quick reference, README, and CLI docs so package creators can explicitly request a measured publish-readiness report before running `create publish`.
- Updated `selftune create publish` so draft-package publishing now re-runs `create replay --mode package` and `create baseline --mode package` as the final measured gate before watch.
- Removed the old direct handoff from `create publish` into description-only `selftune evolve`, keeping the creator loop grounded in package-level validation instead of a description mutation step.
- Added a shared package-evaluation summary that `create publish` can return directly, so draft deploy/watch actions have one measured result shape instead of stitching together replay and baseline outcomes ad hoc.
- Updated the local dashboard action parser so draft-package baseline and publish runs can surface replay mode, before/after pass rates, and lift on the live run screen.
- `selftune watch` now emits a machine-readable `recommended_command`, and `create publish --watch` now carries the nested `watch_result` payload through directly so draft publish/watch flows expose measured post-deploy pass rates, alerts, and rollback recommendations instead of only a coarse “watch started” status.
- Updated creator-loop readiness and `selftune status` guidance so draft packages now recommend `create replay`, `create baseline`, and `create publish` instead of falling back to the older `evolve`/`grade` commands for those milestones.
- Updated the overview, skill report, and `selftune status` creator-loop surfaces so draft packages stay blocked on `create check` or package-resource fixes until those checks actually pass, instead of skipping ahead to replay or publish because later creator-loop artifacts already exist.
- Added dashboard support for `create check` as a runnable draft-package action, so the live-run screen and draft package panel can stream and summarize spec-validation checks instead of showing that step as copy-only guidance.
- Added structured progress events for `create check`, so the live-run screen now shows draft-package load, Agent Skills validation, and selftune readiness computation as explicit steps instead of only the final JSON result.
- Made the overview creator-loop priorities runnable from the dashboard for actionable steps, so top-level draft-package cards can launch `create check`, eval generation, replay, baseline, and publish flows without drilling into the per-skill report first.
- Updated the CLI help, OSS workflow docs, and docs site reference so the publish contract matches the package-first creator loop.
- The live-run summary tiles now relabel watch actions as `Baseline`, `Observed`, `Delta`, and `Signal`, so post-deploy watch evidence no longer appears under the older dry-run `Before`/`After`/`Validation` vocabulary.
- The shared package-evaluation payload now carries runtime efficiency and representative evidence, so package replay / baseline / publish flows can return measured duration and token aggregates together with replay-failure and baseline-win samples instead of only pass-rate summaries.
- The live-run screen now surfaces those measured package-evaluation artifacts directly, including replay-failure samples, baseline-win/regression samples, with-skill versus without-skill efficiency totals, and recommended next commands when publish or watch actions expose them.
- Added `report-package` as a first-class dashboard action for draft skills, so the skill report and live-run feed can launch `selftune create report` directly and label the resulting benchmark artifact separately from baseline, publish, and watch runs.
- `create publish --watch` now attaches a structured watch summary to that same package-evaluation payload, and the live-run screen renders watch snapshot counts, invocation-type totals, rollback state, and grade-watch deltas from that shared measured contract.
- Clarified the public CLI docs and shipped `Create` workflow so agents can rely on both the raw nested `watch_result` payload and the normalized `package_evaluation.watch` block when they parse publish-with-watch results.
- `selftune evolve` and `selftune evolve body` no longer reject proposals before measured validation solely because model-reported confidence is low; `--confidence` now acts as a review threshold and adaptive-gate risk signal instead of a hard pre-validation stop.
- The shared package-evaluation payload now also includes grading baseline versus recent grading deltas when that data exists, so `create report` and `create publish --json` can show observed execution-quality movement next to replay, baseline, and watch evidence.
- The local dashboard now parses and renders that same `package_evaluation.grading` block in live-run summaries, so draft package report and publish flows expose measured grading movement without requiring raw JSON inspection.
- The latest package-evaluation summary is now stored canonically in SQLite and mirrored to `~/.selftune/package-evaluations/<skill>.json`, so draft report/publish/watch flows can reuse one measured artifact instead of treating package evaluation as stdout-only output.
- Draft-package readiness and `create check` now honor the latest stored package-evaluation status, so a measured `replay_failed` or `baseline_failed` result keeps the skill blocked on the corresponding package gate instead of surfacing a false `ready to publish` state just because the older replay or baseline artifacts exist.
- The shared package-evaluation payload now also carries deterministic unit test results and representative failing tests when that evidence exists, so `create report`, `create publish --json`, and the live-run UI can review the latest measured test run alongside replay, baseline, grading, and watch evidence.
- Draft-package readiness and `create check` now also honor the latest failed deterministic unit-test run when one exists, so stored test failures keep the draft blocked on rerunning unit tests instead of treating test-file presence alone as publish-ready proof.
- Stored package-evaluation artifacts now include a bounded package fingerprint, and draft-package readiness only trusts those replay/baseline results when the fingerprint still matches the current package tree, so stale failed measurements stop blocking edited drafts just because they share the same skill name.
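For agents parsing publish-with-watch output, a consumer might prefer the normalized `package_evaluation.watch` block and fall back to the raw `watch_result` payload. The sketch below assumes both keys sit at the top level of the JSON result and that `recommended_command` appears inside the raw watch payload; only the key names come from the entries above, and the exact nesting is a guess. `extract_watch` and `next_command` are hypothetical helpers.

```python
from typing import Optional


def extract_watch(payload: dict) -> Optional[dict]:
    """Prefer the normalized watch block, falling back to the raw nested payload."""
    normalized = (payload.get("package_evaluation") or {}).get("watch")
    if normalized is not None:
        return normalized
    return payload.get("watch_result")


def next_command(payload: dict) -> Optional[str]:
    """Read the machine-readable recommendation from the raw watch payload, if present."""
    raw = payload.get("watch_result") or {}
    return raw.get("recommended_command")
```

Reading both shapes defensively keeps a parser working whether a given result carries one block, both, or neither.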
- Fixed dashboard child-process action context for `report-package`, so `create report` and `verify` now stream live progress and metrics events into the live-run screen instead of silently dropping them when the action context is read from environment variables.
- Added `selftune create init` as the clean-slate authoring path for new skills.
- Added `selftune create scaffold --from-workflow ...` as the workflow-derived authoring path, and upgraded `selftune workflows scaffold` to emit the same package shape for backward compatibility.
- Package drafts now include `SKILL.md`, `workflows/default.md`, `references/overview.md`, empty `scripts/` and `assets/` directories, plus a `selftune.create.json` manifest.
- Added `selftune create check` to run Agent Skills spec validation first and then compute selftune-specific package readiness for evals, unit tests, replay, and baseline.
- Added `selftune create replay`, `selftune create baseline`, `selftune create status`, and `selftune create publish` so the draft-package path now reaches all the way through replay validation, lift measurement, and handoff into the existing evolve/watch surfaces.
- Added package-mode replay staging so runtime replay can read workflow/reference files inside the staged skill package without treating them as unrelated paths.
- The local dashboard now surfaces draft packages before they have live telemetry, shows package-local create readiness on the skill report, and routes dashboard replay/baseline/publish actions through the draft-aware create commands automatically.
- `selftune create check` now recommends `create replay`, `create baseline`, and `create publish` for draft-package next steps instead of the older generic evolve/grade commands, keeping package-tree staging consistent from CLI output through the dashboard.
- Hardened the local dashboard draft-package views so the exported OSS app typechecks cleanly when create-readiness data is optional, preserving the draft-package panels in shipped builds.
- Fixed `selftune workflows scaffold --write` so fresh workflow-derived packages are written through the shared draft-package writer instead of pre-creating the directory and tripping the overwrite guard.
- Draft-package dashboard actions now start eval generation with `--auto-synthetic`, so cold-start skills can bootstrap eval sets from the dashboard instead of attempting empty log-based generation.
- Added agent workflow docs and public CLI docs so agents can route package authoring requests to the full command surface.
- Added GitHub App installation binding for cloud orgs so a team can associate a GitHub installation with its registry workspace.
- Added GitHub-backed registry connection APIs for listing accessible repos, connecting a repo to a registry entry, disconnecting it, and requesting manual sync.
- Added immediate manual sync publishing so a connected repo path is packaged from GitHub, archived, and pushed into the registry as a GitHub-sourced version without waiting on a background worker.
- Added webhook-driven auto-publish for default-branch pushes and matching Git tags so connected repos now flow into the registry without manual sync.
- Added a dashboard GitHub settings flow with installation binding, repo discovery, monorepo path selection, and connection management controls.
- Added Tier A GitHub write-back with org-level policy, per-connection opt-in, persisted publish attempts, and optional commit status/check-run updates for successful, skipped, and failed publishes.
- Added direct `selftune registry install github:owner/repo[@ref][//path]` support so skills can be installed straight from GitHub with monorepo path discovery when the cloud registry is not part of the flow.
- Fixed direct root installs from GitHub so a missing `name:` in root-level `SKILL.md` falls back to the actual repository name instead of the temporary clone directory name.
- Restored the expected indentation in `selftune registry --help` so the usage block matches the rest of the CLI help formatting.
- Polished the cloud GitHub settings experience with branded action buttons, clearer installation action states, a consolidated production setup runbook, and lowercase `selftune` branding on key cloud surfaces.
- Added signed GitHub webhook intake plus registry source metadata fields so GitHub-origin publishes can be tracked separately from CLI-pushed versions.
- Hardened GitHub webhook handling so tag patterns reject unsafe multi-wildcard shapes and webhook deliveries return immediately while publish processing continues asynchronously.
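The `github:owner/repo[@ref][//path]` install spec above has two optional parts, which the following sketch pulls apart with a regular expression. It only illustrates the documented shape; `parse_github_spec` and `GitHubSpec` are hypothetical names, and the CLI's actual parser may accept or reject edge cases differently (for example, a ref that itself contains `//`).

```python
import re
from dataclasses import dataclass
from typing import Optional

# owner and repo stop at "/", "@", or whitespace; the ref is matched lazily so a
# trailing "//path" is captured as the path rather than swallowed by the ref.
_SPEC = re.compile(
    r"github:(?P<owner>[^/@\s]+)/(?P<repo>[^/@\s]+)"
    r"(?:@(?P<ref>.+?))?"
    r"(?://(?P<path>.+))?"
)


@dataclass
class GitHubSpec:
    owner: str
    repo: str
    ref: Optional[str]
    path: Optional[str]


def parse_github_spec(spec: str) -> GitHubSpec:
    """Split an install spec into owner, repo, optional ref, and optional monorepo path."""
    match = _SPEC.fullmatch(spec)
    if match is None:
        raise ValueError(f"not a github: install spec: {spec!r}")
    return GitHubSpec(**match.groupdict())
```

For instance, `parse_github_spec("github:acme/skills@main//packages/review")` yields owner `acme`, repo `skills`, ref `main`, and path `packages/review`, where `acme/skills` is a made-up repository used only for the example.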
- Moved canonical eval sets, generated unit tests, and unit-test run results into SQLite as the primary local source of truth for creator-loop readiness.
- Kept mirroring those artifacts into the legacy `~/.selftune/eval-sets/` and `~/.selftune/unit-tests/` JSON files so existing file-based workflows and commands still work during the transition.
- Updated readiness/status surfaces to prefer SQLite-backed artifacts instead of depending on filesystem existence checks.
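The "SQLite as source of truth, JSON as mirror" pattern can be sketched with Python's standard `sqlite3` module. The table name, column layout, and helper names here are invented for illustration and are not selftune's schema; the point is only that reads go to the database while the JSON file is written as a side copy for file-based tooling.

```python
import json
import sqlite3
from pathlib import Path
from typing import Optional


def _ensure_table(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one JSON payload per skill.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eval_sets (skill TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )


def save_eval_set(conn: sqlite3.Connection, mirror_dir: Path, skill: str, eval_set: list) -> Path:
    """Write the eval set to SQLite first, then mirror it to <mirror_dir>/<skill>.json."""
    payload = json.dumps(eval_set, sort_keys=True)
    _ensure_table(conn)
    conn.execute(
        "INSERT OR REPLACE INTO eval_sets (skill, payload) VALUES (?, ?)",
        (skill, payload),
    )
    conn.commit()
    mirror_dir.mkdir(parents=True, exist_ok=True)
    mirror_path = mirror_dir / f"{skill}.json"
    mirror_path.write_text(payload, encoding="utf-8")
    return mirror_path


def load_eval_set(conn: sqlite3.Connection, skill: str) -> Optional[list]:
    """Read from SQLite only, so readiness never depends on the mirror file existing."""
    _ensure_table(conn)
    row = conn.execute("SELECT payload FROM eval_sets WHERE skill = ?", (skill,)).fetchone()
    return json.loads(row[0]) if row else None
```

Because `load_eval_set` never touches the filesystem, deleting the mirrored JSON would not change what a readiness check sees in this sketch.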
- Updated dashboard-triggered `generate-evals` to pass the canonical `~/.selftune/eval-sets/<skill>.json` output path explicitly instead of relying on a relative fallback filename.
- Updated dashboard-triggered `generate-unit-tests` to pass the canonical `~/.selftune/unit-tests/<skill>.json` path explicitly as well, keeping readiness artifacts out of the repo working directory.
- Fixed local dashboard rollback actions to spawn `selftune evolve rollback` with the expected proposal arguments, matching the actual CLI command surface.
- Added a dashboard regression test that asserts the rollback action uses the `evolve rollback` subcommand shape.
- Removed the forced background fill from the sticky `Evolution` heading in the shared skill report evidence rail so proposal views keep the intended transparent panel treatment while scrolling.
- Added a shared dashboard action instrumentation layer so creator-loop commands can emit structured step progress, LLM call progress, and provider-normalized runtime metadata without hard-coding the dashboard to one provider.
- Wired `selftune eval generate` and `selftune eval unit-test --generate` into that shared observer path so the live-run screen can show load/build/write steps plus provider/model/duration updates instead of only terminal output.
- Generalized the live-run UI from replay-only wording to a broader action-progress surface while keeping replay as the richest source of token and cost detail.
- Added cached update availability metadata to the local dashboard health surface so the dashboard can tell the difference between up-to-date, auto-update-capable installs and manual-refresh source-tree installs.
- Added a passive `Update available` status chip in the local dashboard footer plus a dedicated update panel on `/status`, keeping version visibility available without polluting live creator-loop transcripts.
- Fixed proposal selection so opening a proposal link no longer gets overwritten by an automatic fallback selection.
- Removed eager proposal auto-focus during initial load to keep deep links stable.
- Kept readiness-driven action prioritization aligned with the active proposal focus state so child action sections no longer shift unexpectedly.
- Suppressed unsupported auto-update chatter during local source-tree runs so dashboard-triggered creator-loop actions no longer flood the live log with manual refresh instructions.
- Updated OpenCode ingest to support the current SQLite schema, including `time_created` timestamps and JSON-backed message rows, instead of assuming legacy `created`/`content` columns.
- Added a live action feed in the local dashboard so creator-loop runs show start, progress, and finish states instead of only appearing after the next data refresh.
- Added a dedicated live-run screen for creator-loop actions so replay dry-runs can stream output, show parsed lift summaries, and display model/platform/token context beside the terminal log.
- Added structured replay metrics to the live dashboard stream so Claude runtime replay now reports per-run platform, model, token, cost, and duration data in real time instead of only terminal text.
- Added per-eval replay progress streaming and SSE backfill so the live-run screen can show `eval n/N`, query snippets, and pass/fail evidence even when you open the page after the run has already started.
- Added dashboard action buttons for the main creator loop on skill reports: generate evals, generate unit tests, replay dry-run, baseline measurement, deploy, and watch.
- Added a shared local action stream so supported terminal-run `selftune` commands also appear in the dashboard without being launched from the UI.
- Fixed replay dry-runs so validated `evolve --dry-run` runs surface as success in the live dashboard feed even when the CLI exits non-zero to avoid accidental deployment.
- Repaired the OSS publish pipeline so npm releases can still generate SBOMs, GitHub tags, and enriched release notes even when a publish partially succeeds.
- Blocked cloud dashboard indexing and added changelog coverage enforcement so shipped product changes are documented before they merge.
- Opened registry publishing and rollback to Pro plans so solo skill creators can publish and iterate without upgrading to Team first.
- Tightened the local dashboard skill report around proposal deep links, kept proposal-focused layouts stable while report data loads, prevented raw ENOENT errors during SPA reloads, and restored full-width creator loop layout on overview.
- Unified cloud and OSS skill report styling around the shared trust status language by restoring trust panel order, removing leftover success-green treatments, and switching trust badges to the app-wide dot-and-pill status treatment.
- Added universal hook adapters for Codex, OpenCode, and Cline so selftune can capture real-time telemetry beyond Claude Code.
- Added cold-start suspicion and Claude runtime replay validation to make trigger diagnostics more trustworthy when a skill has little history.
- Hardened OpenCode installation so hook setup follows current plugin and config behavior instead of relying on rejected config keys.
- See the OSS releases for package artifacts and per-version compare links.
- Overhauled dashboard, trust, and creator-facing contribution surfaces so health signals are easier to interpret during active iteration.
- Tightened the autonomous evolve and audit path to close reliability gaps in proposal rollout and monitoring.
- Added CLI auto-update, richer structured errors, description quality scoring, and unblock suggestions for faster operator recovery.
- Added full skill body evolution so selftune can refine routing tables and larger skill bodies instead of only short descriptions.
- Added synthetic eval generation to help new skills bootstrap without waiting for a large session history.
- Introduced cheaper validation loops, activation rules, specialized agents, and a live local dashboard server for faster iteration.
- Read more in the evolution concept guide and the dashboard command reference.
- Added `selftune status` and `selftune last` so you can check skill health without opening the full dashboard.
- Added a local dashboard and Claude transcript backfill to make retroactive analysis practical on existing projects.
- Added opt-in community export so you can share anonymized signals back to the ecosystem.
- Shipped the initial CLI with `init`, `grade`, `eval`, `evolve`, `watch`, `doctor`, and platform ingest commands.
- Added Claude Code hooks for prompt capture, skill evaluation, and end-of-session telemetry.
- Introduced the initial observe → detect → evolve → watch loop that the rest of the product builds on today.