TL;DR
Agents today are basically just chatbots with a to-do list. Real agents (the kind that reliably plan, act, and execute) need structure, constraints, and continuous evaluation. The recipe: write explicit specs → separate planning from execution → ground every action → verify outputs → add a human in the loop for uncertainty. Don’t treat your agent like a science experiment; treat it like production software.
A quick note
Amazon just announced 30,000 layoffs. UPS cut 48,000. Both cited AI as a key driver, and it’s not stopping there.
If you were recently affected and want to pivot into an AI-related role, I want to help. I’m putting together a small effort to connect people with resources, mentors, and companies hiring for AI-skilled roles.
I’ve been in the AI space for 8+ years, worked with ML systems at Meta, founded an AI education non-profit that reached 70,000 people, and now run an AI testing platform where I see firsthand how companies are implementing AI and reshaping their approach to business.
If that sounds useful, you can fill out the form below. I’ll share what I learn as I help people navigate this shift.
Why “good” agents are rare
Building an AI agent isn’t about letting a model “figure things out.” It’s about engineering a controlled system where the model can plan, act, and learn — safely.
Agents fail for predictable reasons:
Ambiguous goals. The model doesn’t know what success looks like.
Unbounded tools. It can call anything, in any order.
No feedback loops. Mistakes compound without correction.
No monitoring. You don’t know when (or why) it broke.
What you get: hallucinated actions, endless loops, or confidently wrong behavior.
What you want: a system that can reason, act, verify, and recover, all under supervision.
The agent stack (layered like any production system)
1. Spec first — write the job description
Before any prompt or code, define:
Goal: what outcome is the agent responsible for?
Allowed tools: which APIs or actions can it use?
Constraints: what it can’t do (e.g., “never approve payments > $1k”).
Evidence requirements: every action must cite where it got the data.
This becomes your agent spec: the contract between you and the system.
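Here’s a minimal sketch of what that spec could look like as code (the AgentSpec class and its field names are illustrative, not a standard):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSpec:
    """The contract between you and the agent system (illustrative shape)."""
    goal: str                      # the outcome the agent is responsible for
    allowed_tools: list[str]       # the only APIs/actions it may call
    constraints: list[str]         # hard "never do X" rules
    require_evidence: bool = True  # every action must cite where it got the data

spec = AgentSpec(
    goal="Reconcile the weekly ledger and draft transfers for approval",
    allowed_tools=["fetch_ledger", "check_balance", "draft_transfer"],
    constraints=["never approve payments > $1k", "never act without evidence_refs"],
)
print(spec)
```

Keeping the spec in code, not just in a prompt, means the rest of the system can enforce it mechanically.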
2. Plan, then execute (never both at once)
Separate planning (thinking) from doing (acting).
Example:
1️⃣ Planner: “I’ll fetch the ledger, check balances, and draft a transfer.”
2️⃣ Executor: runs those steps deterministically through verified APIs.
This separation gives you reproducibility, auditability, and safety.
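A toy sketch of the split in Python; the tool names mirror the example above, and everything else is an assumption:

```python
def fetch_ledger(ctx: dict) -> dict:
    ctx["ledger"] = {"balance": 4200}  # stand-in for a verified API call
    return ctx

def check_balance(ctx: dict) -> dict:
    ctx["can_transfer"] = ctx["ledger"]["balance"] > 1000
    return ctx

# The executor may only call tools named in the spec.
ALLOWED_TOOLS = {"fetch_ledger": fetch_ledger, "check_balance": check_balance}

def plan(goal: str) -> list[str]:
    # In production an LLM proposes this list; hard-coded here for clarity.
    return ["fetch_ledger", "check_balance"]

def execute(steps: list[str]) -> dict:
    ctx = {}
    for step in steps:
        if step not in ALLOWED_TOOLS:
            raise PermissionError(f"Tool not in spec: {step}")  # fail closed
        ctx = ALLOWED_TOOLS[step](ctx)  # deterministic, auditable call
    return ctx

print(execute(plan("check whether the account can cover a transfer")))
```

Because the plan is just data, you can log it, diff it, and replay it, which is exactly where the reproducibility and auditability come from.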
3. Ground every action in truth
Agents hallucinate just like models do — they make up actions when context is missing.
So ground everything:
Connect to deterministic APIs for facts (ledger, CRM, docs).
Force references: every action output must include evidence_refs.
Ban open-ended calls like “search the web and decide.”
If it can’t verify, it shouldn’t act.
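As a sketch, that gate can be as simple as this (the action shape and the evidence_refs key follow the convention above; the reference format is made up):

```python
def is_grounded(action: dict) -> bool:
    # An action counts as grounded only if it carries at least one
    # non-empty evidence reference.
    refs = action.get("evidence_refs", [])
    return len(refs) > 0 and all(isinstance(r, str) and r for r in refs)

action = {
    "type": "draft_transfer",
    "amount": 250,
    "evidence_refs": ["ledger:txn-1042", "crm:account-77"],
}

if not is_grounded(action):
    raise ValueError("Refusing to act: no verifiable evidence attached")
```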
4. Verify before you trust
Use structured outputs and post-checks:
Schema validation (does it match the spec?)
Sanity checks (does the number make sense?)
Approval gates (manual or rule-based for sensitive actions)
Verification slows you down slightly, but it prevents silent failure loops.
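Here’s how those three checks might chain together; the schema, the $1k cap (from the spec example earlier), and the sensitive-action list are all assumptions:

```python
REQUIRED_FIELDS = {"type": str, "amount": (int, float), "evidence_refs": list}

def validate_schema(action: dict) -> bool:
    # Does the output match the spec's shape?
    return all(isinstance(action.get(k), t) for k, t in REQUIRED_FIELDS.items())

def sanity_check(action: dict) -> bool:
    # Does the number make sense? Mirrors the "never > $1k" constraint.
    return 0 < action["amount"] <= 1_000

SENSITIVE = {"draft_transfer", "send_email"}  # assumed list of risky actions

def verify(action: dict) -> str:
    if not validate_schema(action) or not sanity_check(action):
        return "reject"
    return "hold_for_approval" if action["type"] in SENSITIVE else "execute"

draft = {"type": "draft_transfer", "amount": 250,
         "evidence_refs": ["ledger:txn-1042"]}
print(verify(draft))  # hold_for_approval
```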
5. Evals and monitoring (the secret sauce)
Agents evolve over time, so your evals must evolve with them.
Build evals that measure:
Task completion rate — % of plans that finish successfully
Safety violations — unauthorized actions or skipped verifications
Fallback rate — % routed to human review
Mean time to recovery — how fast failures are caught and corrected
Run replay evals on production logs weekly to catch behavior drift early.
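A toy version of those metrics computed from logs; the record shape is an assumption:

```python
# Three logged runs, each flagged for completion, safety, and fallback.
logs = [
    {"completed": True,  "violation": False, "fallback": False, "mins_to_recover": 0},
    {"completed": False, "violation": False, "fallback": True,  "mins_to_recover": 12},
    {"completed": True,  "violation": True,  "fallback": False, "mins_to_recover": 45},
]

n = len(logs)
print(f"task completion rate: {sum(r['completed'] for r in logs) / n:.0%}")
print(f"safety violations:    {sum(r['violation'] for r in logs)}")
print(f"fallback rate:        {sum(r['fallback'] for r in logs) / n:.0%}")

# Mean time to recovery, averaged over the runs that actually failed.
failures = [r for r in logs if not r["completed"] or r["violation"]]
if failures:
    mttr = sum(r["mins_to_recover"] for r in failures) / len(failures)
    print(f"mean time to recovery: {mttr:.0f} min")
```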
6. Human in the loop (the ultimate safety rail)
Don’t aim for full autonomy first.
Route low-confidence or high-risk tasks to humans — ideally with AI drafting the action for approval.
You’ll get reliability and user trust.
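One possible routing rule, with made-up thresholds and risk categories:

```python
CONFIDENCE_THRESHOLD = 0.85          # assumed cutoff, tune per task
HIGH_RISK = {"payments", "deletions", "external_email"}  # assumed categories

def route(action_type: str, confidence: float) -> str:
    # Low confidence OR high risk goes to a person; the AI still drafts it.
    if confidence < CONFIDENCE_THRESHOLD or action_type in HIGH_RISK:
        return "human_review"
    return "auto_execute"

print(route("internal_summary", 0.95))  # auto_execute
print(route("payments", 0.99))          # human_review: risk overrides confidence
```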
Small example — what good looks like
A “Meeting Notes Agent” that:
Fetches meeting transcript from your DB
Plans three steps: summary → action list → follow-up email
Executes only if all referenced data is verified
Sends the draft for approval
Logs every step (inputs, outputs, evidence)
That’s an agent you can ship without fear.
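In code, the skeleton of that agent might look like this; every function is a hypothetical stand-in, abridged to one step so the shape stays visible:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")

def fetch_transcript(meeting_id: int) -> dict:
    # Stand-in for the DB read.
    return {"text": "Q3 planning discussion...",
            "evidence_refs": [f"db:meeting-{meeting_id}"]}

def summarize(doc: dict) -> dict:
    # Stand-in for the LLM summarization step.
    return {"summary": "Agreed on Q3 goals and owners.",
            "evidence_refs": doc["evidence_refs"]}

def run(meeting_id: int) -> dict:
    doc = fetch_transcript(meeting_id)
    out = None
    for step in [summarize]:  # abridged plan: summary → action list → email
        out = step(doc)
        if not out.get("evidence_refs"):
            raise ValueError("Unverified step output; aborting")  # ground it
        logging.info("step=%s output=%s", step.__name__, json.dumps(out))
    return {"status": "pending_approval", "draft": out}  # human approves the send

print(run(42))
```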
Quick checklist
Define a spec: goal, tools, constraints, evidence.
Split planner vs executor.
Add schema + output validation.
Require evidence for every fact.
Route low-confidence steps to humans.
Add regular evals + production logging.
Monitor safety and success metrics.
Final note
Agents aren’t dumb; all they need is structure and constraints. Structure creates reliability. Constraints create trust. And that’s how you get a good agent.
Agents should be boringly predictable before they’re impressively autonomous.
Start simple, measure everything, and evolve with evals.
👉 If you found this issue useful, share it with a teammate or founder navigating AI adoption.
And subscribe to AI Ready for weekly lessons on how leaders are making AI real at scale.
Until next time,
Haroon
