Private pilot now forming

Agent Chaos

Deterministic chaos testing for tool-using AI agents.

If an agent can’t handle failure in staging, it won’t survive production.

Agent Chaos is a reliability testing harness for teams building AI agents that call tools, APIs, and external systems. It helps engineering, platform, and QA teams test how agents behave when the real world gets messy: timeouts, schema drift, contradictory tool data, partial payloads, and adversarial outputs.

Run the same suite against a baseline and a candidate build. If reliability drops, fail the release before customers find the failure.

Instead of relying on one-off prompt checks or anecdotal demos, Agent Chaos gives teams a deterministic way to simulate failure, measure agent reliability, and catch regressions before launch. Run repeatable suites, generate scorecards, compare against baselines, and use the results as a real release gate in CI.

Test the failures demos miss

Agent Chaos runs your agent through realistic multi-step workflow tests under controlled failure conditions. Every run is reproducible, so teams can replay failures, debug them quickly, and verify fixes with confidence.
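As a rough illustration of why a fixed seed makes a run replayable, the sketch below derives a per-step fault plan from a seed. It is not the Agent Chaos implementation: plan_faults, PlannedFault, the fault names, and the 0.3 fault rate are hypothetical, chosen only to show that the same seed always yields the same sequence of injected failures.

```python
import random
from dataclasses import dataclass

# Hypothetical fault names, loosely mirroring the failure types described above.
FAULTS = ("none", "timeout", "partial_payload", "schema_drift")


@dataclass(frozen=True)
class PlannedFault:
    step: int   # index of the workflow step
    fault: str  # which fault (if any) to inject at that step


def plan_faults(seed: int, steps: int, fault_rate: float = 0.3) -> list[PlannedFault]:
    """Build a fault plan that depends only on the seed and the arguments."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    plan = []
    for step in range(steps):
        if rng.random() < fault_rate:
            fault = rng.choice(FAULTS[1:])  # pick one of the real faults
        else:
            fault = "none"
        plan.append(PlannedFault(step, fault))
    return plan


# Replaying with the same seed reproduces the same plan, step for step.
first_run = plan_faults(seed=20260514, steps=6)
replay = plan_faults(seed=20260514, steps=6)
```

Because the plan is a pure function of the seed, a failing run can be reported as a seed plus a step index and replayed exactly while a fix is verified.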

What it tests

tool timeouts and retry storms
schema drift and partial payloads
contradictory tool responses
unsafe trust in tool-provided text
regression against previous reliability baselines
optional OpenAI-compatible BudgetGuard proxy controls for budget, rate, and safety policy
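To make two of these failure types concrete, here is a minimal sketch of how a tool response could be mutated to simulate schema drift and a partial payload. The function names, the ticket fields, and the required-field check are assumptions made for illustration; they are not the injectors Agent Chaos ships.

```python
import copy
from typing import Any


def rename_field(payload: dict[str, Any], old: str, new: str) -> dict[str, Any]:
    """Schema drift: return a copy in which one field has been renamed."""
    drifted = copy.deepcopy(payload)
    if old in drifted:
        drifted[new] = drifted.pop(old)
    return drifted


def truncate_payload(payload: dict[str, Any], keep: int) -> dict[str, Any]:
    """Partial payload: return a copy holding only the first `keep` fields."""
    kept_keys = list(payload)[:keep]
    return {key: copy.deepcopy(payload[key]) for key in kept_keys}


def missing_fields(payload: dict[str, Any], required: list[str]) -> list[str]:
    """List the required fields that a tool response no longer provides."""
    return [name for name in required if name not in payload]


# Hypothetical tool response for a support-ticket lookup.
ticket = {"ticket_id": "T-1", "status": "open", "priority": "high"}
required = ["ticket_id", "status", "priority"]

drifted = rename_field(ticket, "status", "state")  # "status" silently becomes "state"
partial = truncate_payload(ticket, keep=1)         # only "ticket_id" survives
```

An agent under test would receive drifted or partial instead of ticket; the interesting question is whether it notices the missing fields or carries on as if the data were complete.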

Why it matters

Most agent failures are not simple Q&A quality problems. They happen inside tool loops: retries spiral, schemas drift, APIs return partial data, or an agent trusts the wrong tool output at the wrong time. These failures are expensive, hard to reproduce, and usually discovered too late.

Agent Chaos is built to make those failure modes testable.

With Agent Chaos, teams can

standardize reliability gates across agent workflows
catch regressions before merge or deployment
reproduce intermittent failures with fixed seeds
understand why an agent failed, not just that it failed
add budget, rate, and policy controls around model usage

Who it’s for

platform teams building shared agent infrastructure
QA and automation teams responsible for release quality
product teams shipping agents that depend on tools and external APIs

Private pilot

We are looking for teams shipping agents that depend on tools, APIs, or external systems. A private pilot includes one workflow suite, one baseline report, one CI release gate, and a review of the failure modes uncovered.

The package installs as hilo-agent-chaos, while the CLI remains agent-chaos.

Pilot scope includes:

packaged built-in suites for common workflow categories
CI-ready reporting and regression controls
signed suite packs for controlled distribution
policy-aware proxy deployment for budget and governance controls
documentation, examples, and launch materials for early adopters

What a pilot produces

The first milestone is a working reliability gate your team can rerun: one representative workflow, a fixed-seed failure suite, a baseline score, and a report that explains the regressions worth fixing.

Buyer-facing output

baseline scorecard for a real agent workflow
replayable failures with seed, injector, and step details
Markdown or HTML report for engineering review
CI threshold that can block a risky release
follow-up recommendations for retry, schema, policy, and budget controls

suite: support_triage_v1
seed: 20260514
baseline reliability: 0.91
candidate reliability: 0.78
release gate: fail

regressions:
- timeout recovery dropped 18%
- partial payload handling dropped 11%
- unsafe tool-text trust observed in 2 runs
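The gate decision in a report like the one above can be sketched as a simple threshold check on the drop from baseline to candidate. The function names and the 0.05 maximum allowed drop are assumptions for illustration; the actual threshold is something each team would set, and this is not the Agent Chaos CLI.

```python
def reliability_drop(baseline: float, candidate: float) -> float:
    """Drop in reliability from baseline to candidate (negative if it improved)."""
    return baseline - candidate


def release_gate(baseline: float, candidate: float, max_drop: float = 0.05) -> str:
    """Return "pass" or "fail"; max_drop is an assumed, team-chosen threshold."""
    if reliability_drop(baseline, candidate) > max_drop:
        return "fail"
    return "pass"


def exit_code(decision: str) -> int:
    """Map the decision to a process exit code so a CI job can block the release."""
    return 0 if decision == "pass" else 1


# Scores taken from the sample report: 0.91 baseline, 0.78 candidate.
decision = release_gate(baseline=0.91, candidate=0.78)
```

With these numbers the drop is about 0.13, so decision is "fail", and a CI step that ends with sys.exit(exit_code(decision)) would stop the pipeline.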

Agent Chaos Reliability Pilot

The pilot is packaged as a focused reliability engagement: one representative agent workflow, a fixed-seed chaos suite, a baseline scorecard, and a CI-ready release gate recommendation.

View the Agent Chaos pilot package

Request Agent Chaos pilot access

Tell us about the workflow, tools, and release gate you want to test first. We’ll follow up about pilot fit, scope, and next steps.
