Incoming · 04/22/26 · 18:56 PDT

Field notes from a late-night terminal.

Systems notes, incident write-ups, and agent-pipeline teardowns from an AI solutions architect. No thinkpieces. No roadmaps. Just things that broke, things that worked, and the guardrails I wish I'd had first.

Recent signals

1 / 1
Entry preview — /ai/agents/0001
23:42 · 04/19/26 · /ai/agents · entry 0001 · ~7 min

The agent that wouldn't stop retrying

A single retry loop cost us $2,400 in model calls overnight. The bug was one line. The guardrail was the part we were missing.

At 11:42pm on a Thursday, an agent we’d been running for three weeks went into a retry loop and didn’t come out. By the time our budget alarm fired at 3:17am, we’d burned through the entire month’s model budget plus a chunk of the next one. The agent’s task was trivial: summarize an inbound ticket and post it to a Slack channel.

What actually happened

The ticket parser hit a ValidationError on a field it had never seen before. The agent’s on_error handler was configured to retry with a back-off. The back-off capped at 8 seconds. There was no retry ceiling.

# agents/ticket_summary.py
@agent(name="ticket_summary")
class TicketSummary:
    retry = ExponentialBackoff(base=1, cap=8)   # no max_attempts
    on_error = Retry(retry)
    budget = None                               # no budget — this is the bug

The fix was three lines. The harder question is why the reviewer (me) approved the PR with budget = None in it, and the answer isn’t “I missed it” — it’s that our review checklist didn’t require it.

The guardrail

We now enforce three properties on every agent at registration time, not at runtime:

  • A hard dollar ceiling per invocation and per day, below which the agent is allowed to fail loudly.
  • A max-attempts count that cannot be None.
  • An idempotency key on every external effect the agent can produce.

The cheapest guardrail is the one a human can’t forget to add, because it’s a type error if it’s missing.

What I’d tell past-me

Mocked tests will pass. Dry-runs will pass. The only thing that catches runaway retry loops is a registration-time invariant that makes the unsafe configuration unrepresentable. We’ve since ported the same pattern to our prompt-cache invalidation pipeline and two internal tools.


The postmortem template we now use is in /ops/postmortem-template. It’s short enough to fill out at 4am, which is the only time it actually matters.