An Agent Company is only useful if it improves outcomes in practice. The right eval loop tests both company structure and actual task execution.

What to evaluate

A good evaluation set covers more than final prose output. Depending on the company, test:
  • import preview quality
  • company graph resolution
  • skill attachment behavior
  • task execution quality
  • output artifacts
  • token and time cost

Start with realistic test cases

Each eval case should include:
  • a realistic user or operator prompt
  • the company or repo path being evaluated
  • expected outputs or behaviors
  • optional input files
Examples:
  • import an Agent Company into a new environment and inspect the preview tree
  • attach engineering skills to an agent and compare desired vs actual state
  • execute a recurring planning task with and without the company
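A case definition like the one above can be sketched as a small record type. This is a minimal sketch: the field names (`prompt`, `company_path`, `expected`, `input_files`) and the example paths are illustrative assumptions, not part of any real Agent Company API.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    # All field names here are hypothetical, chosen to mirror the
    # checklist above: prompt, path under test, expectations, inputs.
    prompt: str                 # realistic user or operator prompt
    company_path: str           # company or repo path being evaluated
    expected: list[str]         # expected outputs or behaviors
    input_files: list[str] = field(default_factory=list)  # optional

cases = [
    EvalCase(
        prompt="Import this Agent Company and show the preview tree",
        company_path="companies/engineering",
        expected=["preview tree lists all manifests"],
    ),
    EvalCase(
        prompt="Attach engineering skills to the planner agent",
        company_path="companies/engineering",
        expected=["desired state matches actual state"],
    ),
]
```

Keeping cases as plain data makes it easy to run the same set with and without a company, as described next.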

Compare against a baseline

Run each case at least two ways:
  • with the current company
  • without the company or with the previous version
This tells you whether the company is adding value rather than just consuming more context.
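A minimal A/B harness for this comparison might look like the sketch below. `run_case` is a stub standing in for real task execution; its return values and the pass/fail behavior are placeholders, not a real runner.

```python
def run_case(case, company=None):
    # Stub: a real harness would execute the task here and measure
    # actual pass/fail and token usage. These values are placeholders.
    passed = company is not None
    tokens = 1200 if company else 800
    return passed, tokens

def compare(cases, company):
    # Run every case both ways and return the two pass rates.
    with_runs = [run_case(c, company=company) for c in cases]
    without_runs = [run_case(c, company=None) for c in cases]
    rate = lambda runs: sum(p for p, _ in runs) / len(runs)
    return rate(with_runs), rate(without_runs)
```

The delta between the two rates, weighed against the extra tokens consumed, is the signal that the company is earning its context cost.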

Write objective assertions first

Prefer checks like:
  • expected manifests were discovered
  • skill shortnames resolved correctly
  • import preview shows the intended create or update actions
  • a report file or artifact exists
  • the output includes required sections
Add human review after that for broader questions like usefulness, clarity, or whether the output reflects the intended company behavior.
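The objective checks above can be expressed as small predicate functions that return a plain boolean, leaving no room for judgment calls. The section names and paths below are illustrative assumptions.

```python
import os

def check_artifact_exists(path):
    # Objective: either the report file is on disk or it is not.
    return os.path.exists(path)

def check_required_sections(text, sections):
    # Objective: every required section heading appears in the output.
    return all(section in text for section in sections)

# Hypothetical report output with the sections an eval might require.
report = "## Summary\nThe import succeeded.\n## Next steps\nPin skills."
```

Because each check is a pure pass/fail predicate, failures can be counted and categorized automatically before any human review begins.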

Track cost and drift

Collect per-run data such as:
  • pass rate
  • failure category
  • duration
  • total tokens
  • whether the adapter or runtime state matched the company intent
That last point matters because desired state in the manifests may diverge from actual runtime state.
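One lightweight way to collect this per-run data is a CSV log with one row per run. The column names here, including the `state_drift` flag for manifest-versus-runtime divergence, are assumptions for illustration.

```python
import csv
import io

# One row per eval run; "state_drift" records whether actual runtime
# state diverged from the desired state declared in the manifests.
FIELDS = ["case_id", "passed", "failure_category",
          "duration_s", "total_tokens", "state_drift"]

def log_runs(runs):
    # Serialize run records to CSV so drift and cost trends are easy
    # to diff across eval rounds.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(runs)
    return buf.getvalue()
```

Appending one such file per eval round makes cost regressions and creeping drift visible as a simple diff between rounds.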

Use failures to refine the company

Read failures at three levels:
  • company design: wrong boundary between company, team, agent, and skill
  • instructions: unclear role behavior or missing defaults
  • tooling: weak import preview, weak pinning, weak sync visibility
If the same logic is being reinvented in every run, that is usually a sign to improve the company structure, instructions, or bundled references.
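The three-level triage can be roughed out as a keyword mapping from a failure note to a level. The keyword lists below are illustrative, not a real taxonomy; a real triage step would likely be done by a reviewer or a grading model.

```python
# Map each failure level to hypothetical keywords drawn from the
# three levels above: design, instructions, tooling.
LEVELS = {
    "company design": ["boundary", "team", "ownership"],
    "instructions": ["unclear", "missing default", "role"],
    "tooling": ["preview", "pinning", "sync"],
}

def triage(note):
    # Return the first level whose keywords appear in the failure note.
    note = note.lower()
    for level, keywords in LEVELS.items():
        if any(keyword in note for keyword in keywords):
            return level
    return "unclassified"
```

Even a crude classifier like this turns a pile of failure notes into counts per level, which points at whether to redraw boundaries, rewrite instructions, or fix tooling first.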

The loop

  1. run the eval set with and without the company
  2. grade objective assertions
  3. review outputs and execution traces
  4. tighten manifests, descriptions, or bundled resources
  5. rerun and compare the delta
Stop when the Agent Company improves outcomes consistently and the extra context cost is justified.
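The loop above, including its stopping condition, can be sketched as a driver function. `run_suite`, `refine`, and the `margin` threshold are all assumptions standing in for the real eval harness and the manual tightening step.

```python
def eval_loop(run_suite, refine, max_rounds=5, margin=0.1):
    # run_suite: returns (pass rate with company, pass rate without).
    # refine: tightens manifests, descriptions, or bundled resources.
    # Stop once the company beats the baseline by at least `margin`,
    # a stand-in for "the extra context cost is justified".
    for round_number in range(max_rounds):
        with_company, without_company = run_suite()
        if with_company - without_company >= margin:
            return round_number, with_company
        refine()
    return max_rounds, with_company
```

Bounding the loop with `max_rounds` keeps refinement from continuing indefinitely when the company is not converging on a clear win.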