An Agent Company is only useful if it improves outcomes in practice. The right eval loop tests both company structure and actual task execution.

What to evaluate

A good evaluation set covers more than final prose output. Depending on the company, test:
  • import preview quality
  • company graph resolution
  • skill attachment behavior
  • task execution quality
  • output artifacts
  • token and time cost

Start with realistic test cases

Each eval case should include:
  • a realistic user or operator prompt
  • the company or repo path being evaluated
  • expected outputs or behaviors
  • optional input files
Examples:
  • import an Agent Company into a new environment and inspect the preview tree
  • attach engineering skills to an agent and compare desired vs actual state
  • execute a recurring planning task with and without the company
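A case definition like the one above can be sketched as a small record type. This is a minimal sketch: the field names (`prompt`, `company_path`, `expected`, `input_files`) and the example paths are illustrative assumptions, not part of any real Agent Company API.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    # All field names here are hypothetical, chosen to mirror the
    # checklist above: prompt, path under test, expectations, inputs.
    prompt: str                 # realistic user or operator prompt
    company_path: str           # company or repo path being evaluated
    expected: list[str]         # expected outputs or behaviors
    input_files: list[str] = field(default_factory=list)  # optional

cases = [
    EvalCase(
        prompt="Import this Agent Company and show the preview tree",
        company_path="companies/engineering",
        expected=["preview tree lists all manifests"],
    ),
    EvalCase(
        prompt="Attach engineering skills to the planner agent",
        company_path="companies/engineering",
        expected=["desired state matches actual state"],
    ),
]
```

Keeping cases as plain data makes it easy to run the same set with and without a company, as described next.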

Compare against a baseline

Run each case at least two ways:
  • with the current company
  • without the company or with the previous version
This tells you whether the company is adding value rather than just consuming more context.
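A minimal A/B harness for this comparison might look like the sketch below. `run_case` is a stub standing in for real task execution; its return values and the pass/fail behavior are placeholders, not a real runner.

```python
def run_case(case, company=None):
    # Stub: a real harness would execute the task here and measure
    # actual pass/fail and token usage. These values are placeholders.
    passed = company is not None
    tokens = 1200 if company else 800
    return passed, tokens

def compare(cases, company):
    # Run every case both ways and return the two pass rates.
    with_runs = [run_case(c, company=company) for c in cases]
    without_runs = [run_case(c, company=None) for c in cases]
    rate = lambda runs: sum(p for p, _ in runs) / len(runs)
    return rate(with_runs), rate(without_runs)
```

The delta between the two rates, weighed against the extra tokens consumed, is the signal that the company is earning its context cost.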

Write objective assertions first

Prefer checks like:
  • expected manifests were discovered
  • skill shortnames resolved correctly
  • import preview shows the intended create or update actions
  • a report file or artifact exists
  • the output includes required sections
Add human review after that for broader questions like usefulness, clarity, or whether the output reflects the intended company behavior.
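The objective checks above can be expressed as small predicate functions that return a plain boolean, leaving no room for judgment calls. The section names and paths below are illustrative assumptions.

```python
import os

def check_artifact_exists(path):
    # Objective: either the report file is on disk or it is not.
    return os.path.exists(path)

def check_required_sections(text, sections):
    # Objective: every required section heading appears in the output.
    return all(section in text for section in sections)

# Hypothetical report output with the sections an eval might require.
report = "## Summary\nThe import succeeded.\n## Next steps\nPin skills."
```

Because each check is a pure pass/fail predicate, failures can be counted and categorized automatically before any human review begins.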

Track cost and drift

Collect per-run data such as:
  • pass rate
  • failure category
  • duration
  • total tokens
  • whether the adapter or runtime state matched the company intent
That last point matters because desired state in the manifests may diverge from actual runtime state.
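One lightweight way to collect this per-run data is a CSV log with one row per run. The column names here, including the `state_drift` flag for manifest-versus-runtime divergence, are assumptions for illustration.

```python
import csv
import io

# One row per eval run; "state_drift" records whether actual runtime
# state diverged from the desired state declared in the manifests.
FIELDS = ["case_id", "passed", "failure_category",
          "duration_s", "total_tokens", "state_drift"]

def log_runs(runs):
    # Serialize run records to CSV so drift and cost trends are easy
    # to diff across eval rounds.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(runs)
    return buf.getvalue()
```

Appending one such file per eval round makes cost regressions and creeping drift visible as a simple diff between rounds.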

Use failures to refine the company

Read failures at three levels:
  • company design: wrong boundary between company, team, agent, and skill
  • instructions: unclear role behavior or missing defaults
  • tooling: weak import preview, weak pinning, weak sync visibility
If the same logic is being reinvented in every run, that is usually a sign to improve the company structure, instructions, or bundled references.
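The three-level triage can be roughed out as a keyword mapping from a failure note to a level. The keyword lists below are illustrative, not a real taxonomy; a real triage step would likely be done by a reviewer or a grading model.

```python
# Map each failure level to hypothetical keywords drawn from the
# three levels above: design, instructions, tooling.
LEVELS = {
    "company design": ["boundary", "team", "ownership"],
    "instructions": ["unclear", "missing default", "role"],
    "tooling": ["preview", "pinning", "sync"],
}

def triage(note):
    # Return the first level whose keywords appear in the failure note.
    note = note.lower()
    for level, keywords in LEVELS.items():
        if any(keyword in note for keyword in keywords):
            return level
    return "unclassified"
```

Even a crude classifier like this turns a pile of failure notes into counts per level, which points at whether to redraw boundaries, rewrite instructions, or fix tooling first.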

The loop

  1. run the eval set with and without the company
  2. grade objective assertions
  3. review outputs and execution traces
  4. tighten manifests, descriptions, or bundled resources
  5. rerun and compare the delta
Stop when the Agent Company improves outcomes consistently and the extra context cost is justified.
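The loop above, including its stopping condition, can be sketched as a driver function. `run_suite`, `refine`, and the `margin` threshold are all assumptions standing in for the real eval harness and the manual tightening step.

```python
def eval_loop(run_suite, refine, max_rounds=5, margin=0.1):
    # run_suite: returns (pass rate with company, pass rate without).
    # refine: tightens manifests, descriptions, or bundled resources.
    # Stop once the company beats the baseline by at least `margin`,
    # a stand-in for "the extra context cost is justified".
    for round_number in range(max_rounds):
        with_company, without_company = run_suite()
        if with_company - without_company >= margin:
            return round_number, with_company
        refine()
    return max_rounds, with_company
```

Bounding the loop with `max_rounds` keeps refinement from continuing indefinitely when the company is not converging on a clear win.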