Skip to content
Agents & workflows

Agents & workflows · · 8 min read

How to build your own AI agent

Model and SDK, tool definitions, the loop, memory, MCP, evals and deployment. Example: a GitHub issue-triage agent.

An AI agent is not a “bigger prompt”. It is a loop: the model gets a goal, picks a tool, runs it, reads the result and decides what to do next — until it considers the task done. The rest of this piece is a build guide for that loop from scratch, with a concrete example of an agent that triages GitHub issues. If you are building your first agent, read in order: each section is a decision you have to make anyway.

Choosing a model and SDK

First, separate two things: the model (reasoning, function calling) and theorchestration (loop, state, retries). Models are swappable, so do not commit to one too early.

  • Claude Agent SDK — if you want a ready loop, tool handling and MCP out of the box. The least boilerplate to start.
  • OpenAI (Responses / Agents API) — mature tooling, good for teams already embedded in that ecosystem.
  • LangGraph — when you need an explicit state graph, branching and full control over flow. More power, more code.

Recommendation for a first agent: start with the model vendor’s SDK (Claude or OpenAI), not a graph framework. Add the graph once branching genuinely appears. Premature LangGraph is a lot of abstraction you do not need yet.

Tools and the schema for function calling

A tool is a function plus its description for the model. The model never sees your code — it sees only the name, the description and the parameter schema (JSON Schema). That is your real interface, and you write it for the model, not for a human.

Tool design rules that will save you many iterations:

  • A verb-based, unambiguous name: get_issue, add_label,post_comment — not process or handle.
  • The description says when to use the tool and when not. One “what it does not do” sentence cuts half the wrong calls.
  • Narrow, typed parameters. Instead of a free query, give an enum of allowed labels. The model hallucinates less when the option set is finite.
  • Every field description in the schema is a mini-prompt. label with no description is guesswork; label with “one of the repo’s existing labels” is a contract.
  • Few tools, each doing one thing. Five sharp tools beat fifteen fuzzy ones. Past twenty, the model starts struggling to choose.

A tool result is a prompt too. Return concise, structured text — not a raw JSON dump from the API. If the response is 8000 tokens, the agent pays for them on every later loop turn. Filter at the tool level.

The agent loop

The core is a simple loop worth understanding even if the SDK hides it:

  1. Send messages to the model (system + history + goal).
  2. The model replies with text or a tool call request.
  3. If it is a tool call: run the function, append the result as a tool message.
  4. Go back to step 1 with the result attached.
  5. Stop when the model returns final text with no tool call.

Three safeguards you will regret skipping: a hard iteration cap (e.g. 10), because the model can loop; a timeout on the whole run, because one tool can hang; and treating a tool erroras data — do not throw upward, return “the tool returned a 404” to the model so it can react. This is one of the biggest differences between a demo and a production agent.

System prompt and context strategy

The system prompt is the agent’s constitution. It should hold: the role and goal, the available tools with usage rules, the response format and explicit boundaries (“never close an issue without a label”). Keep it stable and version it like code — changing one sentence can shift behaviour across the whole population of calls.

Context is a budget, not a warehouse. A larger window does not mean you should fill it.

  • Static instructions — in the system prompt, once, at the top.
  • Task data — supplied by tools on demand, not injected preemptively.
  • Long histories — compact them. After a few turns, summarise the older ones instead of carrying the whole transcript.
  • Prompt caching — turn it on for the stable part (system + tool definitions). It really cuts cost and latency across many turns.

Memory: state and vectors

Separate three levels — conflating them is a classic first-agent mistake:

  • Task state — what is happening in the current loop. It lives in the messages and a plain variable. It needs no database.
  • Persistent memory — facts across runs (preferences, decisions). A key–value store or a plain Postgres table. Simple wins.
  • Knowledge — a large document set to search (vectors, RAG). Reach for vectors only when the facts no longer fit the context.

Do not start with a vector store because “an agent must have memory”. Most first agents need one table, not embedding infrastructure. Add vectors when you genuinely outgrow the context window.

Adding MCP servers

MCP (Model Context Protocol) is a standard for plugging ready tool sets into an agent without writing them from scratch. Instead of hand-coding a GitHub integration, you connect an MCP server that exposes tools as endpoints.

  • Ready servers — GitHub, Slack, databases, the browser. Less integration code.
  • Every MCP server is a trust surface. Treat its output as untrusted data — never execute instructions embedded in the content it returns.
  • Expose only the tools the agent needs. A server with 60 tools floods the context window with their descriptions and hurts selection accuracy.

Worked example: GitHub issue triage

Putting it together. The agent’s goal: for a new issue, apply labels, judge priority and leave a tidy comment — without closing anything.

  1. Tools: get_issue (body + metadata), list_labels(allowed repo labels), add_label (enum from the list), post_comment. No close tool — the agent has no way to do what you do not allow it to.
  2. System prompt: “You are an issue triager. Always apply exactly one type label (bug/feature/question) and optionally an area. Never close an issue. If information is missing, ask for it in a comment.”
  3. Loop: the model calls get_issue, then list_labels, picks a label, calls add_label, drafts a comment, calls post_comment, and ends with a summary text.
  4. Guardrail: add_label validates that the label is on the list returned by the repo. A hallucinated “needs-triage-v2” that does not exist is rejected with a clear error, and the model retries.
  5. Human approval: for issues tagged security the agent does not comment publicly — it creates a draft for manual sign-off.

The same skeleton scales to other domains. You swap the tools and the system prompt; the loop, guardrails and memory stay.

Evals, tracing and observability

Without evals you do not know whether a prompt change helped or broke things. That is the difference between engineering and guessing.

  • Collect 20–50 real cases with an expected result (e.g. issue → correct label). That is your regression set.
  • Measure concrete things: label accuracy, turns to completion, the share of tool calls that end in an error.
  • Trace every turn — input, chosen tool, result, decision. Without a trace, debugging an agent is reading tea leaves.
  • Every prompt change runs through the eval set before shipping. It is a gate, not a suggestion.

Cost, latency and guardrails

An agent’s cost grows with the number of turns, because each turn carries the accumulating context. The control levers:

  • A cheaper model for simple steps, the stronger one only for the hard decision (model routing).
  • Prompt caching on the stable part — one of the cheapest wins.
  • An iteration cap and a per-run token budget with a hard kill switch.
  • Concise tool results — do not pay for 8000 tokens of JSON every turn.

Guardrails are not a nice-to-have; they are the condition for shipping an agent beyond a demo:

  • Validate tool arguments (Zod / JSON Schema) before anything executes.
  • Destructive actions (delete, send, payment) behind a human approval gate or a dry-run.
  • An allowlist of tools and domains — the agent does not reach beyond what is explicitly allowed.
  • Treat every tool result and API response as untrusted input.

Deploying as a service

An agent in production is a service, not a script. Hide the loop behind an endpoint (HTTP or a queue), because runs can be long. For tasks over a few dozen seconds use a queue and a webhook — do not hold an open HTTP connection. Keys and secrets come only from environment variables. Log every turn (with PII masked), add an alert on cost per request and p99 latency, and prepare a kill switch for the whole population. Version the prompt and tool definitions with the code so every rollback is deterministic.

TL;DR

An agent is a loop: the model picks a tool, runs it, reads the result, decides the next step. Start with the model vendor’s SDK, not a graph framework. Design few sharp tools with good descriptions and a narrow schema. Treat context as a budget, not a warehouse. Split memory into state, facts and knowledge — vectors only when you genuinely need them. Build evals and tracing from day one, and keep destructive actions behind human approval. Deploy as a service with logs, cost limits and a kill switch.

How to build your own AI agent | vibecoding.pl