The agent that wouldn't stop retrying
A single retry loop cost us $2,400 in model calls overnight. The bug was one line. The guardrail was the part we were missing.
At 11:42pm on a Thursday, an agent we’d been running for three weeks went into a retry loop and didn’t come out. By the time our budget alarm fired at 3:17am, we’d burned through the entire month’s model budget plus a chunk of the next one. The agent’s task was trivial: summarize an inbound ticket and post it to a Slack channel.
What actually happened
The ticket parser hit a ValidationError on a field it had never seen before. The agent's on_error handler was configured to retry with exponential back-off. The back-off delay capped at 8 seconds, but there was no ceiling on the number of attempts.
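A capped back-off without an attempt ceiling is exactly the wrong combination: the delay stops growing, but nothing stops the loop. A minimal sketch of that shape (the function name is illustrative, not our framework's API):

```python
import itertools

def backoff_delays(base: float, cap: float):
    """Exponential back-off with a delay cap but no attempt ceiling —
    the shape of the loop that ran all night."""
    for attempt in itertools.count():
        yield min(base * 2 ** attempt, cap)

# The delay plateaus at the cap after a few attempts; nothing ever stops it.
delays = list(itertools.islice(backoff_delays(base=1, cap=8), 8))
print(delays)  # [1, 2, 4, 8, 8, 8, 8, 8]
```

Once the delay plateaus, the agent settles into a steady one-call-every-8-seconds rhythm, which is slow enough to evade rate limits and fast enough to drain a budget by morning.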
```python
# agents/ticket_summary.py
@agent(name="ticket_summary")
class TicketSummary:
    retry = ExponentialBackoff(base=1, cap=8)  # no max_attempts
    on_error = Retry(retry)
    budget = None  # no budget — this is the bug
```
The fix was three lines. The harder question is why the reviewer (me) approved the PR with budget = None in it, and the answer isn’t “I missed it” — it’s that our review checklist didn’t require it.
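For concreteness, here is what the repaired config looks like. The framework types are stubbed with dataclasses because the real `ExponentialBackoff` and `Budget` signatures aren't shown above; the field names and dollar amounts are illustrative assumptions:

```python
from dataclasses import dataclass

# Stand-ins for the framework types; the real signatures are assumptions.
@dataclass
class ExponentialBackoff:
    base: float
    cap: float
    max_attempts: int  # required: the retry ceiling that was missing

@dataclass
class Budget:
    per_invocation_usd: float
    per_day_usd: float

# The repaired config: a retry ceiling and an explicit budget.
retry = ExponentialBackoff(base=1, cap=8, max_attempts=5)
budget = Budget(per_invocation_usd=0.50, per_day_usd=25.00)
```

The point of making `max_attempts` a required constructor argument is that the old config is now impossible to write: leaving it out is a `TypeError` at construction, not a surprise at 3am.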
The guardrail
We now enforce three properties on every agent at registration time, not at runtime:
- A hard dollar ceiling per invocation and per day; hitting either one makes the agent fail loudly instead of spending more.
- A max-attempts count that cannot be None.
- An idempotency key on every external effect the agent can produce.
The cheapest guardrail is the one a human can’t forget to add, because it’s a type error if it’s missing.
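A minimal sketch of what "a type error if it's missing" can look like, assuming a registry built on a frozen dataclass; `AgentSpec`, `REGISTRY`, and `register` are illustrative names, not our real code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSpec:
    """Every field is required and validated when the spec is constructed,
    at import/registration time — not the first time the agent runs."""
    name: str
    max_attempts: int           # cannot be None: it is not Optional
    per_invocation_usd: float   # hard dollar ceiling per call
    per_day_usd: float          # hard dollar ceiling per day
    idempotency_key_field: str  # which field keys each external effect

    def __post_init__(self):
        if self.max_attempts < 1:
            raise ValueError(f"{self.name}: max_attempts must be >= 1")
        if self.per_invocation_usd <= 0 or self.per_day_usd <= 0:
            raise ValueError(f"{self.name}: budget ceilings must be positive")

REGISTRY: dict[str, AgentSpec] = {}

def register(spec: AgentSpec) -> AgentSpec:
    # Registration happens at import time, so a bad config fails CI, not prod.
    REGISTRY[spec.name] = spec
    return spec

register(AgentSpec(
    name="ticket_summary",
    max_attempts=5,
    per_invocation_usd=0.50,
    per_day_usd=25.00,
    idempotency_key_field="ticket_id",
))
```

Omitting any field is a `TypeError` from the dataclass machinery itself, and a zero or negative ceiling is rejected in `__post_init__`, so the unsafe configuration simply cannot be registered.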
What I’d tell past-me
Mocked tests will pass. Dry-runs will pass. The only thing that catches runaway retry loops is a registration-time invariant that makes the unsafe configuration unrepresentable. We’ve since ported the same pattern to our prompt-cache invalidation pipeline and two internal tools.
The postmortem template we now use is in /ops/postmortem-template. It’s short enough to fill out at 4am, which is the only time it actually matters.