First, what an agent actually is
A chatbot answers. An agent acts. The difference is tools: an agent can look things up, call APIs, write to databases, and check its own work — then decide what to do next based on what it found. That decide–act–observe cycle is the whole trick:
while task is not done:
response = model(conversation, available_tools)
if response requests a tool:
result = run_tool(response.tool, response.arguments)
append result to conversation
else:
return response # the agent believes it's finished
That's a real production loop, minus error handling. If a diagram of your "agent architecture" doesn't reduce to this, it's usually a workflow with extra steps — which is fine, but call it that.
Step 1 — Start from the task, not the model
The biggest agent failures we've seen were scoping failures, not model failures. Before writing a line of code, answer three questions:
- Is the task verifiable? "Resolve this support ticket" can be checked. "Make the customer happy" cannot. Agents thrive on tasks where success is observable.
- What's the blast radius of a mistake? Drafting an email a human reviews: low. Issuing refunds automatically: high. Start low, earn your way up.
- Would a human expert need the same tools? If a person couldn't do the job with the access you're giving the agent, the agent can't either. No model reads minds or invisible databases.
Step 2 — Pick a model, and budget for it
Use the most capable model you can afford for reasoning-heavy steps, and a cheaper, faster one for high-volume simple steps — classification, extraction, routing. A surprising amount of "agent" work is the second kind. Splitting by difficulty routinely cuts inference cost by more than half without hurting outcomes.
Two practical notes: keep the model swappable behind one interface (providers leapfrog each other every few months), and measure latency end-to-end — users experience the loop, not a single call.
Step 3 — Design tools like you design APIs
Tools are where agents are won or lost. The model only sees each tool's name, description and parameters — so write those like documentation for a sharp new hire:
- Few and focused beats many and vague. Five well-named tools outperform twenty overlapping ones. Overlap makes the model dither.
- Return errors the model can act on.
"order_id not found — IDs look like TM-1234"lets the agent self-correct. A bare500teaches it nothing. - Make read and write tools obviously different. We name them
get_*/search_*vscreate_*/update_*, and gate the writes (more below).
Step 4 — Memory: less than you think
Most agents need exactly two kinds of memory: the conversation itself (short-term, free) and retrieval over your knowledge (long-term — documents, tickets, product data behind a search tool). Fancy episodic memory architectures are rarely the bottleneck. What matters more is context discipline: summarise or truncate old tool results so the loop doesn't drown in its own history. Long, cluttered contexts degrade reasoning well before you hit the token limit.
Step 5 — Guardrails are the product
An agent without guardrails isn't bold — it's just untested.
The layers we ship with every agent, in order of importance:
- Permission boundaries. The agent's credentials can only touch what it should. Enforced in infrastructure, not in the prompt — prompts are requests, IAM is law.
- Human approval on irreversible actions. Sending money, deleting records, emailing customers: the agent prepares, a person confirms. Relax this only with evidence.
- Step and budget caps. Cap loop iterations and spend per task. A confused agent should fail fast and escalate, not retry for an hour.
- Output checks. Validate structure, scan for leaked PII, and verify claims against the tool results actually returned — that last one quietly catches most hallucination.
Step 6 — Evals before launch, traces after
Build a test set of 30–50 real scenarios — including the ugly ones: ambiguous requests, missing data, users trying to break it. Score outcomes ("was the ticket resolved correctly?"), not vibes. Run it on every prompt or tool change; agents regress in ways code review won't catch.
In production, log every step of every loop — the full trace of thoughts, tool calls and results. When an agent misbehaves (one will), the trace turns a mystery into a bug report.
The failure modes nobody warns you about
- The over-eager agent acts when it should ask. Fix the tool descriptions and add confirmation gates — don't just plead in the system prompt.
- The loop that never ends — retrying a failing tool forever. Step caps plus error messages that say why the call failed.
- The silent degrader — a provider model update subtly shifts behaviour. Your eval suite is the smoke alarm; without one you find out from customers.
- The demo–production gap — flawless on five happy-path examples, lost on real input. Only real-world evals close it.
Start with one narrow, verifiable, low-blast-radius task. Ship it with guardrails, watch the traces, widen the scope as it earns trust. That's the unglamorous version — it's also the one that works.
Want an agent in production, not in a deck?
Our AI Studio designs, builds and operates production agents — model selection, tools, guardrails, evals, the lot. Handcrafted, not generated.