Agent Chaos
Deterministic chaos testing for tool-using AI agents.
If an agent can’t handle failure in staging, it won’t survive production.
Agent Chaos is a reliability testing harness for teams building AI agents that call tools, APIs, and external systems. It helps engineering, platform, and QA teams test how agents behave when the real world gets messy: timeouts, schema drift, contradictory tool data, partial payloads, and adversarial outputs.
Run the same suite against a baseline and a candidate build. If reliability drops, fail the release before customers find the failure.
Instead of relying on one-off prompt checks or anecdotal demos, Agent Chaos gives teams a deterministic way to simulate failure, measure agent reliability, and catch regressions before launch. Suites are repeatable, each run produces a scorecard, and the comparison against a baseline can serve as a release gate in CI.
Test the failures demos miss
Agent Chaos runs your agent through realistic multi-step workflow tests under controlled failure conditions. Every run is reproducible, so teams can replay failures, debug them quickly, and verify fixes with confidence.
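To make the reproducibility claim concrete, here is a minimal sketch of how a fixed seed can pin down which failures hit which workflow steps. It is an illustration only, not Agent Chaos's actual API: `plan_faults`, `PlannedFault`, the fault names, and the `fault_rate` default are all hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical fault names, chosen for illustration only.
FAULTS = ("none", "timeout", "partial_payload", "schema_drift")


@dataclass(frozen=True)
class PlannedFault:
    step: int
    fault: str


def plan_faults(seed: int, steps: int, fault_rate: float = 0.3) -> list[PlannedFault]:
    """Decide up front which fault (if any) is injected at each step."""
    # A local RNG keeps the plan independent of any other random state.
    rng = random.Random(seed)
    plan = []
    for step in range(steps):
        if rng.random() < fault_rate:
            fault = rng.choice(FAULTS[1:])  # pick one of the real faults
        else:
            fault = "none"
        plan.append(PlannedFault(step, fault))
    return plan


# The same seed always yields the same plan, so a failing run can be replayed.
first = plan_faults(seed=20260514, steps=6)
replay = plan_faults(seed=20260514, steps=6)
```

Because the plan depends only on the seed and the step count, a failure seen in CI can be replayed locally with the same seed and then rerun after a fix.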
What it tests
- tool timeouts and retry storms
- schema drift and partial payloads
- contradictory tool responses
- unsafe trust in tool-provided text
- regression against previous reliability baselines
- optional OpenAI-compatible BudgetGuard proxy controls for budget, rate, and safety policy
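As an example of the first item in the list above, a timeout test can wrap a tool so that chosen calls fail, then check that the caller recovers within a bounded number of attempts. The sketch below assumes nothing about Agent Chaos internals: `FlakyTool`, `call_with_retry_budget`, and `lookup_order` are hypothetical names.

```python
from typing import Callable


class FlakyTool:
    """Wraps a tool function and raises TimeoutError on chosen call numbers."""

    def __init__(self, tool: Callable[[str], dict], fail_on_calls: set[int]):
        self.tool = tool
        self.fail_on_calls = fail_on_calls
        self.calls = 0  # total calls seen, including injected failures

    def __call__(self, query: str) -> dict:
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise TimeoutError(f"injected timeout on call {self.calls}")
        return self.tool(query)


def call_with_retry_budget(tool: Callable[[str], dict], query: str, max_attempts: int = 3) -> dict:
    """Retry on timeout, but never exceed max_attempts (no retry storm)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(query)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure


def lookup_order(query: str) -> dict:
    # Stand-in for a real tool call.
    return {"order_id": query, "status": "shipped"}


flaky = FlakyTool(lookup_order, fail_on_calls={1, 2})
result = call_with_retry_budget(flaky, "A1")
```

Here the first two calls time out and the third succeeds, so the wrapper's call counter doubles as evidence that the retry budget was respected.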
Why it matters
Most agent failures are not failures of simple Q&A quality. They happen inside tool loops: retries spiral, schemas drift, APIs return partial data, or an agent trusts the wrong tool output at the wrong time. These failures are expensive, hard to reproduce, and usually discovered too late.
Agent Chaos is built to make those failure modes testable.
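One way to picture a testable failure mode: take a well-formed tool payload, mutate it to mimic a partial payload or schema drift, and check what a consumer would find missing. This is a rough sketch under assumed names (`drop_field`, `rename_field`, `missing_fields`, and the sample payload are hypothetical), not a description of how Agent Chaos implements its injectors.

```python
import copy


def drop_field(payload: dict, field: str) -> dict:
    """Simulate a partial payload by removing one field."""
    mutated = copy.deepcopy(payload)  # never mutate the caller's payload
    mutated.pop(field, None)
    return mutated


def rename_field(payload: dict, old: str, new: str) -> dict:
    """Simulate schema drift by renaming one field."""
    mutated = copy.deepcopy(payload)
    if old in mutated:
        mutated[new] = mutated.pop(old)
    return mutated


def missing_fields(payload: dict, required: set[str]) -> list[str]:
    """Return the required fields that are absent from the payload."""
    return sorted(required - set(payload))


base = {"order_id": "A1", "status": "shipped", "eta_days": 2}
required = {"order_id", "status", "eta_days"}

partial = drop_field(base, "eta_days")
drifted = rename_field(base, "status", "state")
```

An agent under test would receive `partial` or `drifted` in place of `base`; a check like `missing_fields` gives the harness a precise statement of what was broken at that step.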
With Agent Chaos, teams can
- standardize reliability gates across agent workflows
- catch regressions before merge or deployment
- reproduce intermittent failures with fixed seeds
- understand why an agent failed, not just that it failed
- add budget, rate, and policy controls around model usage
Who it’s for
- platform teams building shared agent infrastructure
- QA and automation teams responsible for release quality
- product teams shipping agents that depend on tools and external APIs
Private pilot
We are looking for teams shipping agents that depend on tools, APIs, or external systems. A private pilot includes one workflow suite, one baseline report, one CI release gate, and a review of the failure modes uncovered.
The package installs as hilo-agent-chaos, while the CLI remains agent-chaos.
Pilot scope includes:
- packaged built-in suites for common workflow categories
- CI-ready reporting and regression controls
- signed suite packs for controlled distribution
- policy-aware proxy deployment for budget and governance controls
- documentation, examples, and launch materials for early adopters
What a pilot produces
The first milestone is a working reliability gate your team can rerun: one representative workflow, a fixed-seed failure suite, a baseline score, and a report that explains the regressions worth fixing.
What your team receives
- baseline scorecard for a real agent workflow
- replayable failures with seed, injector, and step details
- Markdown or HTML report for engineering review
- CI threshold that can block a risky release
- follow-up recommendations for retry, schema, policy, and budget controls
suite: support_triage_v1
seed: 20260514
baseline reliability: 0.91
candidate reliability: 0.78
release gate: fail
regressions:
- timeout recovery dropped 18%
- partial payload handling dropped 11%
- unsafe tool-text trust observed in 2 runs
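The pass/fail decision in the summary above can be thought of as a threshold on the drop from baseline to candidate. The sketch below shows that logic in isolation; `release_gate`, `GateResult`, and the `max_drop` default of 0.05 are assumptions made for illustration, not the tool's documented behavior or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    drop: float  # baseline minus candidate, rounded to 4 decimal places


def release_gate(baseline: float, candidate: float, max_drop: float = 0.05) -> GateResult:
    """Fail the gate when reliability drops by more than max_drop."""
    drop = round(baseline - candidate, 4)
    return GateResult(passed=drop <= max_drop, drop=drop)


def exit_code(result: GateResult) -> int:
    """Map the gate result to a process exit code for CI (0 = pass)."""
    return 0 if result.passed else 1


# Scores taken from the summary above: 0.91 baseline, 0.78 candidate.
gate = release_gate(baseline=0.91, candidate=0.78)
```

With these scores the drop is 0.13, which exceeds the assumed 0.05 tolerance, so the gate fails. A CI job could pass `exit_code(gate)` to `sys.exit` to block the release.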
Agent Chaos Reliability Pilot
The pilot is a focused reliability engagement: one representative agent workflow, a fixed-seed chaos suite, a baseline scorecard, and a CI-ready release gate recommendation.
Request Agent Chaos pilot access
Tell us about the workflow, tools, and release gate you want to test first. We’ll follow up about pilot fit, scope, and next steps.