My production AI agent stack, six months in

What I run, where it breaks, and the unglamorous glue code that determines whether any of this actually ships.

Mar 28, 2026 · 9 min read

Six months ago I promised myself I would stop writing about agents and start running them. What happened instead is that I ended up doing both. The stack I've built since October is now handling a real chunk of my daily work. It has also broken in more interesting ways than anything I've shipped in the last decade.

This is not a tutorial. It is a field report from someone who bet a meaningful share of their week on the thesis that agents are production software now, and who wants to tell you which parts of that thesis have held.

The shape of the thing

I run three agents in steady state and spin up a fourth on demand. The first reads my inbox, files every email into one of a dozen buckets, drafts replies to anything routine, and flags anything that requires me. The second tracks every supplier, contract, and invoice across a construction project I'm in the middle of, and feeds me a single daily brief with the real decisions I need to make. The third watches the operating metrics of the companies I've invested in and surfaces anomalies before a founder has to tell me about them. The fourth, ad hoc, writes code.

Between them they touch email, calendar, Slack, GitHub, a dozen SaaS APIs, a pile of PDFs, and a SQLite database that I keep refusing to migrate to Postgres.

The part that surprised me, and the reason I'm writing this, is that the hard work turned out to be almost entirely outside the model.

What the model does well (less than I expected)

The model is the cheap part now. I pay roughly €600 a month for a combination of frontier models and the agent layer on top, and the marginal cost of adding another agent is a rounding error. Those economics did not hold twelve months ago, and the change is the single biggest unlock of the last year.

What I did not expect is how quickly "model choice" became uninteresting. I run most of my traffic on one provider and fall back to another when the first one has a bad day. The model picks up context, runs tools, summarises documents, writes code, and does all of that well enough that it stopped being the bottleneck by month two.

Model progress is still real and still useful. But it stopped being the limiting factor for anything I was trying to ship.

What breaks (all of it, constantly)

The limiting factor, it turns out, is everything downstream of the model.

State management. My agents need memory across days and weeks. Not vector-search-over-chat-history memory. Actual structured state: "here are the 18 active supplier negotiations and their status, here is the contract you approved last Thursday, here is the payment that cleared on Monday." Building that turned out to be about 70% of the work, and the thing that broke most often.
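
One way to picture "actual structured state" is a pair of tables: one holding the current status of each item, one holding an append-only history. The sketch below assumes SQLite (which this stack already uses); the table names, columns, and status values are hypothetical and purely illustrative, not the schema I actually run.

```python
import sqlite3

# Hypothetical schema: current status per supplier, plus an append-only event log.
SCHEMA = """
CREATE TABLE IF NOT EXISTS negotiations (
    supplier   TEXT PRIMARY KEY,
    status     TEXT NOT NULL
               CHECK (status IN ('open', 'awaiting_reply', 'approved', 'closed')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS negotiation_events (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    supplier TEXT NOT NULL REFERENCES negotiations(supplier),
    status   TEXT NOT NULL,
    note     TEXT,
    at       TEXT NOT NULL DEFAULT (datetime('now'))
);
"""


def open_state(path=":memory:"):
    """Open (or create) the state database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def set_status(conn, supplier, status, note=""):
    """Upsert the current status and log the change in a single transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO negotiations (supplier, status) VALUES (?, ?) "
            "ON CONFLICT(supplier) DO UPDATE SET "
            "status = excluded.status, updated_at = datetime('now')",
            (supplier, status),
        )
        conn.execute(
            "INSERT INTO negotiation_events (supplier, status, note) VALUES (?, ?, ?)",
            (supplier, status, note),
        )


def active(conn):
    """Return (supplier, status) for every negotiation that is not closed."""
    rows = conn.execute(
        "SELECT supplier, status FROM negotiations "
        "WHERE status != 'closed' ORDER BY supplier"
    )
    return rows.fetchall()
```

The point of the CHECK constraint is that the agent cannot invent a new status: a write with an unknown value fails loudly instead of quietly corrupting the picture the daily brief is built from.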

Write paths. Agents that only read are toys. Agents that can write to the world are useful, and also dangerous. Every time I let an agent send an email, transfer money, or change a database row, I have to care about idempotency, audit trails, rollback, and confirmation loops. Every one of those was something I had to build myself. None of the frameworks I tried had a serious answer.
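
To make the idempotency and audit-trail point concrete, here is a minimal sketch of a "claim first, then act" ledger. It assumes SQLite and a caller-supplied `effect` function that performs the real side effect; `open_ledger`, `perform_once`, and the `actions` table are hypothetical names for illustration, not a library API.

```python
import hashlib
import json
import sqlite3


def open_ledger(path=":memory:"):
    """Create the (hypothetical) actions table that doubles as an audit trail."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS actions ("
        " idem_key TEXT PRIMARY KEY,"
        " kind     TEXT NOT NULL,"
        " payload  TEXT NOT NULL,"
        " result   TEXT,"  # NULL means claimed but not yet confirmed
        " at       TEXT NOT NULL DEFAULT (datetime('now')))"
    )
    return conn


def idempotency_key(kind, payload):
    """Derive a stable key from the action kind and a canonical JSON payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{kind}:{canonical}".encode("utf-8")).hexdigest()


def perform_once(conn, kind, payload, effect):
    """Run effect(payload) at most once per (kind, payload).

    The row is inserted before the side effect runs. If the process dies
    in between, the next attempt sees an unfinished claim and refuses to
    repeat the action instead of, say, sending the email twice.
    """
    key = idempotency_key(kind, payload)
    try:
        with conn:
            conn.execute(
                "INSERT INTO actions (idem_key, kind, payload) VALUES (?, ?, ?)",
                (key, kind, json.dumps(payload, sort_keys=True)),
            )
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT result FROM actions WHERE idem_key = ?", (key,)
        ).fetchone()
        if row[0] is None:
            raise RuntimeError("action was claimed but never confirmed; ask a human")
        return json.loads(row[0])  # already done: return the recorded result

    result = effect(payload)  # the real write to the outside world
    with conn:
        conn.execute(
            "UPDATE actions SET result = ? WHERE idem_key = ?",
            (json.dumps(result), key),
        )
    return result
```

The trade-off in this sketch is deliberate: it prefers "never twice, sometimes stuck" over "never stuck, sometimes twice", which is the right bias for emails and payments. Rollback and confirmation loops are out of scope here.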

The tool surface. Most of the things I wanted my agents to do did not have APIs. Or they had APIs that were rate-limited, underdocumented, or missing exactly the endpoint I needed. A non-trivial amount of my agent code is wrappers around systems that were built assuming a human would click through a UI.

Cost and latency drift. Costs do not creep up on you; they spike. An agent that was running for €2 a day last month will be running for €40 a day this month because it started looping on a tool error and nobody noticed. Latency does the same thing. I instrument everything now. I have a small daily digest that tells me how many tokens each agent is burning, and on what. If I had not built it, I would be looking at a five-figure bill by now.
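
A rough sketch of the kind of accounting that catches a runaway loop: record every model call against an agent and a task, enforce a per-agent daily cap, and render a digest sorted by spend. The class name, the cap values, and the per-1k-token price are all assumptions for illustration; real pricing differs by provider and by input versus output tokens.

```python
from collections import defaultdict
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when an agent's recorded spend passes its daily cap."""


@dataclass
class CostLedger:
    daily_cap_eur: dict  # agent name -> cap in EUR (hypothetical values)
    spent: dict = field(default_factory=lambda: defaultdict(float))
    by_task: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, agent, task, tokens, eur_per_1k_tokens):
        """Add one call's cost; raise if the agent is now over its cap."""
        cost = tokens / 1000 * eur_per_1k_tokens
        self.spent[agent] += cost
        self.by_task[(agent, task)] += cost
        cap = self.daily_cap_eur.get(agent, float("inf"))
        if self.spent[agent] > cap:
            raise BudgetExceeded(
                f"{agent} spent EUR {self.spent[agent]:.2f}, cap is EUR {cap:.2f}"
            )
        return cost

    def digest(self):
        """One line per (agent, task), most expensive first."""
        ranked = sorted(self.by_task.items(), key=lambda kv: -kv[1])
        return "\n".join(
            f"{agent}/{task}: EUR {cost:.2f}" for (agent, task), cost in ranked
        )
```

Raising an exception rather than logging a warning is the design choice that matters: a loop that is stopped at the cap costs the cap, while a loop that is merely reported costs whatever it costs until someone reads the report.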

Prompt rot. Prompts that worked perfectly at launch start going wrong after a few weeks. Either the model updates, or the data shape shifts, or the task drifts into edges the original prompt never handled. I now version every prompt and diff the outputs on a sample set weekly. Treat prompts like code or pay the cost of treating them like documentation.
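
"Version every prompt and diff the outputs" can be as small as the sketch below. It assumes you keep a fixed sample set and a stored baseline of known-good outputs; `run` stands in for whatever function calls the model with the current prompt, and all names here are hypothetical.

```python
import difflib
import hashlib


def prompt_version(prompt: str) -> str:
    """Short content hash, so any edit to the prompt yields a new version id."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def regressions(samples, baseline, run):
    """Return {sample_id: unified diff} for every sample whose output changed.

    samples:  {sample_id: input text}
    baseline: {sample_id: output recorded when the prompt was known to be good}
    run:      callable taking the input text and returning the current output
    """
    changed = {}
    for sample_id, text in samples.items():
        new = run(text)
        old = baseline[sample_id]
        if new != old:
            diff = difflib.unified_diff(
                old.splitlines(),
                new.splitlines(),
                fromfile="baseline",
                tofile="current",
                lineterm="",
            )
            changed[sample_id] = "\n".join(diff)
    return changed
```

Exact string comparison is a blunt instrument for free-text outputs; it fits best where the agent emits something structured, such as a bucket label or a JSON decision. For prose outputs you would need a looser comparison, which this sketch does not attempt.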

The unglamorous glue

Here is the list of things I actually spend time on when I add a new agent. This is also the list of things no framework I've tried does well:

Identity and secrets. Every agent needs credentials scoped to just what it touches. If the email agent gets compromised, it should not be able to touch the bank account.
Rate limiting and backoff. Not the LLM's rate limit. The external tool's rate limit. The one that returns a 429 without raising anything, which the model then parses as a success.
Retries with memory. If the agent fails on step 14 of a 20-step plan, I want it to resume, not restart.
Input sanitisation. A supplier sent me an invoice last month with a prompt injection buried in the line items. It tried to tell my agent to forward the full contract to an external address. It did not work, because I had written a paranoid parser, but it was a useful reminder that email is user input.
Observability. For every agent run I want to know: what was the input, what did the model output, which tools were called, what did they return, what was the final action. Stored, queryable, grep-able.
Escalation. Every agent has a clear "when to ask the human" rule and a channel to ask me on. If I do not see the question within two minutes, the agent stops and waits.
Kill switches. Every agent has a way to be frozen instantly if it starts behaving badly. Not "redeploy after a code fix." A literal off switch.
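
Two items from the list above are small enough to sketch together: treating a 429 as a failure before the model ever sees it, and a kill switch that is just a file on disk. The sketch assumes each tool call is wrapped in a function returning `(status_code, body)`; `call_tool`, `check_kill_switch`, and the freeze-directory convention are hypothetical, not part of any framework.

```python
import os
import random
import time


class ToolError(Exception):
    """The external tool failed; the model should never see this as success."""


class Frozen(Exception):
    """The agent has been switched off by a human."""


def check_kill_switch(agent, freeze_dir):
    """Raise Frozen if a file named after the agent exists in freeze_dir.

    Freezing an agent is then `touch <freeze_dir>/<agent>`: no deploy needed.
    """
    if os.path.exists(os.path.join(freeze_dir, agent)):
        raise Frozen(agent)


def call_tool(fn, *, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() -> (status, body); retry 429s with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        status, body = fn()
        if 200 <= status < 300:
            return body
        if status != 429:
            raise ToolError(f"tool returned HTTP {status}")
        if attempt < max_attempts - 1:
            # 1x, 2x, 4x ... the base delay, plus jitter to avoid lockstep retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise ToolError(f"still rate-limited after {max_attempts} attempts")
```

In use, `check_kill_switch` would run at the top of every agent step and before every write. The `sleep` parameter exists only so the backoff can be tested without waiting.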

This is most of the work. The prompt is the last 20%.

What I wish existed

If I were writing a cheque this year for an infrastructure company, I would write it against one of these:

A proper durable execution engine for agents. Something that treats a multi-step agent run as a workflow with checkpoints, not a single LLM call. Think: the Temporal of agents.
Observability that understands agent semantics. Not trace spans. Actual "which tool did the agent pick and why, and what did the model think would happen next."
A real cost governance layer. Budget caps per agent, per task, per day. Alerting. Auto-throttle when an agent starts drifting into a runaway loop.
Safe tool invocation that is not a research project. Signed permissions, scoped credentials, confirmation loops, sandboxing. All the things security has figured out for the rest of infrastructure, ported to agent workflows.
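
For a sense of what "a workflow with checkpoints" means at its very smallest, here is a sketch that persists progress after every step so a rerun resumes where the last one failed. It is nowhere near a durable execution engine: it assumes a single process, steps that are plain Python callables taking the list of earlier results, and results that are JSON-serialisable. `run_plan` and the checkpoint file layout are hypothetical.

```python
import json
import os


def run_plan(steps, checkpoint_path):
    """Run steps in order, resuming from checkpoint_path if it exists.

    Each step is called with the list of results from earlier steps.
    """
    state = {"next": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next"], len(steps)):
        result = steps[i](state["results"])  # may raise; checkpoint stays at step i
        state["results"].append(result)
        state["next"] = i + 1
        # Write to a temp file, then rename, so a crash never leaves a half-written file.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, checkpoint_path)

    return state["results"]
```

A step that fails partway through will be run again on resume, so steps with side effects still need their own idempotency. That gap, along with timers, concurrency, and versioning of the plan itself, is exactly what a real engine would have to cover.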

A lot of teams are claiming to build these. Very few are actually shipping them at the level a grown-up production team would use.

What I would tell someone starting now

Do not start with an agent framework. Start with the smallest, most annoying task you do every day that has a clean input and a clean output. Write a 60-line script that does it. Plug the script into a model. Watch it fail. Fix the failures. Only when you have shipped three of those, and they have been running for a month without intervention, think about abstractions.

The teams I see succeeding are doing an unglamorous version of this. The teams that started by choosing a framework and then went looking for a task to apply it to are mostly still searching.

Six months in, the single biggest thing I have learned is that "agentic" is a lot closer to "well-instrumented long-running background job" than anything futuristic. The work looks more like SRE than like AI research. The founders I want to back in this space are the ones who already know that.

The second biggest thing I have learned is that once you have this infrastructure, it compounds. Every new agent is cheaper to build than the last, because the glue is already there. My fourth agent took me two hours. My first took three weeks.

That curve is the real reason I think this wave is different. Not the model. The infrastructure that sits beneath it.