How to build an AI agent that actually ships

First, what an agent actually is

A chatbot answers. An agent acts. The difference is tools: an agent can look things up, call APIs, write to databases, and check its own work — then decide what to do next based on what it found. That decide–act–observe cycle is the whole trick:

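def run_agent(model, run_tool, conversation, available_tools):
    # model and run_tool are callables you supply; a response is any object
    # with .tool (None when no tool is requested) and .arguments.
    while True:   # loops until the task is done
        response = model(conversation, available_tools)   # decide
        if response.tool is not None:
            result = run_tool(response.tool, response.arguments)   # act
            conversation.append({"role": "tool", "content": result})   # observe
        else:
            return response   # the agent believes it's finished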

That's a real production loop, minus error handling. If a diagram of your "agent architecture" doesn't reduce to this, it's usually a workflow with extra steps — which is fine, but call it that.

Step 1 — Start from the task, not the model

The biggest agent failures we've seen were scoping failures, not model failures. Before writing a line of code, answer three questions:

Is the task verifiable? "Resolve this support ticket" can be checked. "Make the customer happy" cannot. Agents thrive on tasks where success is observable.
What's the blast radius of a mistake? Drafting an email a human reviews: low. Issuing refunds automatically: high. Start low, earn your way up.
Would a human expert need the same tools? If a person couldn't do the job with the access you're giving the agent, the agent can't either. No model reads minds or invisible databases.

Step 2 — Pick a model, and budget for it

Use the most capable model you can afford for reasoning-heavy steps, and a cheaper, faster one for high-volume simple steps — classification, extraction, routing. A surprising amount of "agent" work is the second kind. Splitting by difficulty routinely cuts inference cost by more than half without hurting outcomes.

Two practical notes: keep the model swappable behind one interface (providers leapfrog each other every few months), and measure latency end-to-end — users experience the loop, not a single call.
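
One way to keep the model swappable and split work by difficulty is a small router behind a single entry point. The sketch below is illustrative only: `ModelRouter`, `SIMPLE_STEPS` and the stub models are hypothetical names, and a "model" is assumed to be any callable from prompt text to reply text, with real provider calls wrapped to fit that shape.

```python
from dataclasses import dataclass
from typing import Callable

# A "model" here is just a callable from prompt text to reply text.
# Real provider SDK calls would be wrapped to fit this shape (assumption).
ModelFn = Callable[[str], str]

# Hypothetical step kinds considered simple enough for the cheaper model.
SIMPLE_STEPS = {"classify", "extract", "route"}


@dataclass
class ModelRouter:
    strong: ModelFn  # most capable model you can afford
    cheap: ModelFn   # faster, cheaper model for high-volume steps

    def pick(self, step_kind: str) -> ModelFn:
        """Choose a model by step difficulty."""
        return self.cheap if step_kind in SIMPLE_STEPS else self.strong

    def complete(self, step_kind: str, prompt: str) -> str:
        """Single entry point, so swapping providers touches one place."""
        return self.pick(step_kind)(prompt)


# Stub models standing in for real provider calls.
def fake_strong(prompt: str) -> str:
    return f"strong:{prompt}"


def fake_cheap(prompt: str) -> str:
    return f"cheap:{prompt}"


router = ModelRouter(strong=fake_strong, cheap=fake_cheap)
```

Because every call goes through `complete`, replacing a provider means changing one wrapper function rather than every call site.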

Step 3 — Design tools like you design APIs

Tools are where agents are won or lost. The model only sees each tool's name, description and parameters — so write those like documentation for a sharp new hire:

Few and focused beats many and vague. Five well-named tools outperform twenty overlapping ones. Overlap makes the model dither.
Return errors the model can act on. "order_id not found — IDs look like TM-1234" lets the agent self-correct. A bare 500 teaches it nothing.
Make read and write tools obviously different. We name them get_* / search_* vs create_* / update_*, and gate the writes (more below).
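
As a rough illustration of these three rules, here is what a single read tool might look like. Everything named here is an assumption for the example: the `ORDERS` store, the `get_order` function, and the dict shape of the tool definition, which will differ by provider.

```python
import re

# Hypothetical in-memory order store standing in for a real backend.
ORDERS = {"TM-1234": {"status": "shipped"}}

# What the model sees: name, description, parameters. The exact schema
# format depends on your provider; this dict shape is an assumption.
GET_ORDER_TOOL = {
    "name": "get_order",
    "description": "Look up one order by ID. Read-only. IDs look like TM-1234.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

ORDER_ID_PATTERN = re.compile(r"TM-\d{4}")


def get_order(order_id: str) -> dict:
    """Return the order, or an error message the model can act on."""
    if not ORDER_ID_PATTERN.fullmatch(order_id):
        return {"error": f"'{order_id}' is not a valid order_id. IDs look like TM-1234."}
    order = ORDERS.get(order_id)
    if order is None:
        return {"error": f"order_id {order_id} not found. IDs look like TM-1234."}
    return {"order_id": order_id, **order}


def is_write_tool(tool_name: str) -> bool:
    """Naming convention from the text: create_* / update_* are writes."""
    return tool_name.startswith(("create_", "update_"))
```

Both error paths tell the model what a valid ID looks like, which gives it something concrete to try on the next turn.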

Step 4 — Memory: less than you think

Most agents need exactly two kinds of memory: the conversation itself (short-term, and it needs no extra infrastructure) and retrieval over your knowledge (long-term: documents, tickets, product data behind a search tool). Fancy episodic memory architectures are rarely the bottleneck. What matters more is context discipline: summarise or truncate old tool results so the loop doesn't drown in its own history. Long, cluttered contexts degrade reasoning well before you hit the token limit.
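
Context discipline can start very simply. The sketch below truncates every tool result except the most recent few; the message shape (`role` and `content` keys), the function name `trim_tool_results` and the default limits are all assumptions for illustration. Summarising with a cheap model would be a natural next step, but plain truncation is shown here.

```python
from typing import Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def trim_tool_results(
    conversation: List[Message],
    keep_recent: int = 3,
    max_chars: int = 200,
) -> List[Message]:
    """Return a copy where all but the newest `keep_recent` tool results
    are truncated to `max_chars` characters. Other messages are untouched."""
    tool_positions = [i for i, m in enumerate(conversation) if m["role"] == "tool"]
    # Guard keep_recent == 0: a [-0:] slice would select every position.
    recent = set(tool_positions[-keep_recent:]) if keep_recent > 0 else set()
    trimmed = []
    for i, message in enumerate(conversation):
        content = message["content"]
        if message["role"] == "tool" and i not in recent and len(content) > max_chars:
            content = content[:max_chars] + " [truncated]"
        trimmed.append({**message, "content": content})
    return trimmed
```

The function returns a new list, so the full history stays available for logging while the model sees the trimmed version.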

Step 5 — Guardrails are the product

An agent without guardrails isn't bold — it's just untested.

The layers we ship with every agent, in order of importance:

Permission boundaries. The agent's credentials can only touch what it should. Enforced in infrastructure, not in the prompt — prompts are requests, IAM is law.
Human approval on irreversible actions. Sending money, deleting records, emailing customers: the agent prepares, a person confirms. Relax this only with evidence.
Step and budget caps. Cap loop iterations and spend per task. A confused agent should fail fast and escalate, not retry for an hour.
Output checks. Validate structure, scan for leaked PII, and verify claims against the tool results actually returned — that last one quietly catches most hallucination.
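
Two of these layers, human approval and step and budget caps, fit naturally into the loop itself. The following is a minimal sketch under stated assumptions: the `Response` shape, the `Escalate` exception, the `approve` callback and the per-call `cost` field are hypothetical, and write tools are detected by the create_* / update_* naming convention. Permission boundaries are not shown, since they belong in infrastructure rather than in this code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Response:
    tool: Optional[str] = None        # tool the model wants to call, if any
    arguments: Optional[dict] = None
    text: str = ""                    # final answer when no tool is requested
    cost: float = 0.0                 # assumed: spend for this call, set by your model wrapper


class Escalate(Exception):
    """Raised when the agent should stop and hand off to a person."""


def run_guarded(
    model: Callable[[List[dict]], Response],
    tools: Dict[str, Callable[..., str]],
    conversation: List[dict],
    approve: Callable[[str, dict], bool],
    max_steps: int = 10,
    max_spend: float = 1.0,
) -> Response:
    spent = 0.0
    for _ in range(max_steps):                      # step cap
        response = model(conversation)
        spent += response.cost
        if spent > max_spend:                       # budget cap
            raise Escalate(f"budget cap exceeded: {spent:.2f} > {max_spend:.2f}")
        if response.tool is None:
            return response
        args = response.arguments or {}
        is_write = response.tool.startswith(("create_", "update_"))
        if response.tool not in tools:
            result = f"error: unknown tool {response.tool}"
        elif is_write and not approve(response.tool, args):
            # The agent prepared the action; a person declined it.
            result = "error: a human reviewer declined this action"
        else:
            result = tools[response.tool](**args)
        conversation.append({"role": "tool", "content": result})
    raise Escalate(f"step cap of {max_steps} reached without a final answer")
```

Raising `Escalate` instead of returning a partial answer makes the fail-fast path explicit: the caller has to decide who picks the task up.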

Step 6 — Evals before launch, traces after

Build a test set of 30–50 real scenarios — including the ugly ones: ambiguous requests, missing data, users trying to break it. Score outcomes ("was the ticket resolved correctly?"), not vibes. Run it on every prompt or tool change; agents regress in ways code review won't catch.
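
An eval harness does not need to be elaborate to be useful. Here is a minimal sketch, assuming the agent can be called as a function from request text to final answer; `Scenario`, `run_evals`, the stub agent and the two sample scenarios are hypothetical and stand in for your real test set.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    name: str
    request: str
    check: Callable[[str], bool]  # outcome check on the agent's final answer


def run_evals(agent: Callable[[str], str], scenarios: List[Scenario]) -> dict:
    """Run every scenario and report the pass rate plus the failures by name."""
    failures = []
    for scenario in scenarios:
        try:
            passed = scenario.check(agent(scenario.request))
        except Exception:
            passed = False  # a crash counts as a failed scenario
        if not passed:
            failures.append(scenario.name)
    total = len(scenarios)
    return {
        "pass_rate": (total - len(failures)) / total if total else 0.0,
        "failures": failures,
    }


# Stub agent and two illustrative scenarios (both hypothetical).
def stub_agent(request: str) -> str:
    if "TM-" in request:
        return "Order TM-1234 has shipped."
    return "Which order do you mean? Please share the order ID."


SCENARIOS = [
    Scenario("known order", "Where is TM-1234?", lambda out: "shipped" in out),
    Scenario("missing id", "Where is my order?", lambda out: "order ID" in out),
]

report = run_evals(stub_agent, SCENARIOS)
```

Running this on every prompt or tool change and comparing the `failures` list against the previous run is usually enough to spot a regression early.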

In production, log every step of every loop — the full trace of thoughts, tool calls and results. When an agent misbehaves (one will), the trace turns a mystery into a bug report.
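
A trace can be as plain as one JSON line per step. The sketch below assumes nothing about your logging stack: `TraceLogger` and the record fields (`run_id`, `step`, `kind`, `payload`, `ts`) are hypothetical names, and the sink is any writable text stream.

```python
import io
import json
import time
import uuid
from typing import Any, TextIO


class TraceLogger:
    """Append one JSON line per loop step so a run can be inspected later."""

    def __init__(self, sink: TextIO, run_id: str = "") -> None:
        self.sink = sink
        self.run_id = run_id or uuid.uuid4().hex
        self.step = 0

    def log(self, kind: str, payload: Any) -> None:
        self.step += 1
        record = {
            "run_id": self.run_id,
            "step": self.step,
            "kind": kind,        # e.g. "model_response", "tool_call", "tool_result"
            "payload": payload,
            "ts": time.time(),
        }
        # default=str keeps logging from failing on non-JSON values.
        self.sink.write(json.dumps(record, default=str) + "\n")


# Usage with an in-memory sink; a file or log shipper would work the same way.
buffer = io.StringIO()
trace = TraceLogger(buffer, run_id="demo")
trace.log("tool_call", {"tool": "get_order", "arguments": {"order_id": "TM-1234"}})
trace.log("tool_result", {"status": "shipped"})
```

Keying every record by `run_id` and `step` means a misbehaving run can be pulled out and read in order.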

◆ ◆ ◆

The failure modes nobody warns you about

The over-eager agent acts when it should ask. Fix the tool descriptions and add confirmation gates — don't just plead in the system prompt.
The loop that never ends keeps retrying a failing tool forever. The fix is step caps plus error messages that say why the call failed.
The silent degrader — a provider model update subtly shifts behaviour. Your eval suite is the smoke alarm; without one you find out from customers.
The demo–production gap — flawless on five happy-path examples, lost on real input. Only real-world evals close it.

Start with one narrow, verifiable, low-blast-radius task. Ship it with guardrails, watch the traces, widen the scope as it earns trust. That's the unglamorous version — it's also the one that works.
