Function Calling and Tool Use: Patterns That Survive Production

The tool call is where the agent meets reality

A model that only writes text cannot do much harm or much good. The moment you give it tools (function calling, the ability to invoke your code with arguments it chose), it becomes an agent that acts on your systems. That is the whole value, and it is also where the hard problems live. Most of the agent bugs we have chased, and most of the security findings we have written about, sit at the boundary where a tool call crosses from the model’s intention into your infrastructure.

The patterns below are the ones that separate a demo from something you can run in production. None of them is exotic. All of them are the kind of thing you learn the second time, after the first version surprised you.

Design the tool like an API, because it is one

A tool definition is a contract you are handing to a caller you do not fully control. Give it the same care you would give a public endpoint. Name it for what it does, type its arguments, and constrain them where the constraint is real: an enum instead of a free string, a required field instead of a hopeful one. Models fill structured schemas more reliably than they fill prose instructions, so the schema is doing real work, not decoration.
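As a minimal sketch of what a typed contract can look like, the block below declares one tool with a JSON-Schema-style argument schema and checks a call against it before anything runs. The tool name `create_ticket`, the `input_schema` key, and the `validate_arguments` helper are assumptions made for illustration; real runtimes differ in field names, and a full JSON Schema validator covers far more than this.

```python
from typing import Any

# Hypothetical tool definition; the name and fields are illustrative only.
CREATE_TICKET_TOOL = {
    "name": "create_ticket",
    "description": "Create a support ticket in the internal tracker.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            # An enum instead of a free string: the constraint is real.
            "priority": {"type": "string", "enum": ["low", "normal", "high"]},
        },
        "required": ["title", "priority"],
        "additionalProperties": False,
    },
}


def validate_arguments(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call is acceptable.

    Deliberately small: handles required fields, unknown fields,
    string types, and enums only.
    """
    problems = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties", True) is False:
                problems.append(f"unknown field: {name}")
            continue
        spec = props[name]
        if spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{name} must be a string")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems
```

Rejecting a call with a readable list of problems gives the model something concrete to correct, which is the same reason the schema is worth writing in the first place.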

The description deserves special attention, and not only for clarity. The model reads the description as an instruction and acts on it. If any part of your tool surface comes from outside (a third-party server, a plugin, an integration someone installed), then those descriptions are untrusted input that reaches the model’s reasoning directly. We covered the full version of this in tool poisoning. The short form: write your own descriptions carefully, and treat anyone else’s as data to be reviewed, pinned, and re-checked.
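One way to make "reviewed, pinned, and re-checked" concrete is to hash each definition at review time and compare on every load. This is a sketch under assumptions: the `fingerprint` and `check_pins` helpers and the dictionary layout of a tool are hypothetical, and where the pins are stored is left out.

```python
import hashlib
import json


def fingerprint(tool: dict) -> str:
    """Stable SHA-256 over the parts of a definition the model actually sees."""
    canonical = json.dumps(
        {
            "name": tool["name"],
            "description": tool["description"],
            "input_schema": tool.get("input_schema", {}),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_pins(tools: list[dict], pins: dict[str, str]) -> dict[str, str]:
    """Map each tool name to 'ok', 'changed', or 'unreviewed'."""
    status = {}
    for tool in tools:
        pinned = pins.get(tool["name"])
        if pinned is None:
            status[tool["name"]] = "unreviewed"  # never approved by a person
        elif pinned == fingerprint(tool):
            status[tool["name"]] = "ok"
        else:
            status[tool["name"]] = "changed"  # description or schema drifted
    return status
```

A runtime could refuse to expose anything that is not `ok`, so a description that changes after review never reaches the model silently.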

Make every tool safe to call twice

Production means retries. A network blip, a timeout, a transient error, and your runtime runs the tool again. If the tool writes data, sends a message, or moves money, a blind retry doubles the side effect. We hit exactly this in our own CLI, where the retry path would re-run a tool on any error, including the failures that should never be retried.

Two rules keep this sane. First, only retry errors that are actually transient, and classify them rather than retrying everything. A bad request is not going to succeed on the third attempt. Second, make mutating tools idempotent where you can, with an idempotency key or a check-then-act, so that a retry is a no-op instead of a duplicate. A read-only tool can be retried freely. A tool with side effects has to earn it.
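The two rules can be sketched together. Everything named here is an assumption for illustration: the `TransientError` and `PermanentError` classes, the `call_with_retry` helper, and the in-memory `PaymentTool` stand in for whatever your runtime and backing service provide.

```python
import time


class TransientError(Exception):
    """Timeouts, dropped connections, rate limits: worth another attempt."""


class PermanentError(Exception):
    """Bad requests, permission failures: retrying cannot help."""


def call_with_retry(fn, *, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry only errors classified as transient, with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
        # PermanentError and anything unclassified propagate immediately.


class PaymentTool:
    """Hypothetical mutating tool that deduplicates on an idempotency key."""

    def __init__(self):
        self._completed = {}  # idempotency key -> stored result
        self.charges = []     # stands in for the real side effect

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._completed:
            # A retry of work already done returns the first result unchanged.
            return self._completed[idempotency_key]
        self.charges.append(amount_cents)
        result = {"status": "charged", "amount_cents": amount_cents}
        self._completed[idempotency_key] = result
        return result
```

The key has to be chosen once per logical action, before the first attempt, so that every retry of that action carries the same key.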

Treat parallel and sensitive as different categories

Agents are faster when independent tool calls run at once, and read-only tools are the natural candidates: search, read a file, query a record. Mark those parallel-safe and let them fan out. A tool that changes state is a different animal. Running several state changes concurrently, or interleaving them with reads that assume a stable world, produces the kind of race that only shows up under load. Keep mutating tools serial unless you have a specific reason and a specific safeguard.
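A scheduler that honors a parallel-safe marking might look like the sketch below. The `Tool` dataclass and `execute_batch` function are hypothetical, and running all reads before any write is one possible policy chosen to keep the example short, not a recommendation for every workload.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    run: Callable[..., Any]
    parallel_safe: bool = False  # conservative default: assume it mutates


def execute_batch(calls: list[tuple[Tool, dict]]) -> list[Any]:
    """Fan out parallel-safe calls, then run the rest one at a time.

    Results come back in the original call order.
    """
    results: list[Any] = [None] * len(calls)
    safe = [(i, t, a) for i, (t, a) in enumerate(calls) if t.parallel_safe]
    serial = [(i, t, a) for i, (t, a) in enumerate(calls) if not t.parallel_safe]

    # Read-only calls run concurrently and all finish before any write starts.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {i: pool.submit(t.run, **a) for i, t, a in safe}
        for i, future in futures.items():
            results[i] = future.result()

    # Mutating calls run serially, in the order the model issued them.
    for i, t, a in serial:
        results[i] = t.run(**a)
    return results
```

Because the flag defaults to `False`, a tool nobody has classified is treated as mutating, which fails slow rather than fails racy.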

The sensitive category is separate again. Anything that exfiltrates data, reaches outside the perimeter, or is hard to reverse should sit behind a confirmation, not behind the model’s judgment. The model deciding it is fine to run a command is not where you want the only gate, because by then the decision is made. Put a human, or an explicit policy, at the boundary where the action actually happens.
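A gate at the execution boundary can be as small as the sketch below. The tool names in `SENSITIVE_TOOLS`, the `gated_call` helper, and the `approve` callback (a human prompt or a policy check) are assumptions for illustration.

```python
from typing import Any, Callable

# Hypothetical names; in practice this comes from reviewed configuration.
SENSITIVE_TOOLS = {"send_email", "delete_branch"}


def gated_call(
    name: str,
    run: Callable[..., Any],
    args: dict,
    approve: Callable[[str, dict], bool],
) -> dict:
    """Execute a tool, asking for approval first if it is marked sensitive.

    The check happens here, at the point of execution, regardless of what
    the model concluded about whether the action was fine.
    """
    if name in SENSITIVE_TOOLS and not approve(name, args):
        # Return the denial as a readable result so the agent can adapt.
        return {"ok": False, "error": f"{name} was denied; no action was taken"}
    return {"ok": True, "result": run(**args)}
```

The approver sees the concrete tool name and arguments, which is the information a person or policy needs to decide.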

Surface failures as data, not as silence

When a tool fails, the result the model sees matters. Return the error as a tool result the model can read and recover from, with enough detail to choose a different approach, rather than swallowing it or crashing the loop. An agent that is told “that path does not exist” can try another. An agent that gets an empty result, or a success flag on a failed call, will confidently build on sand. We found a version of this in our own loop, where a failed tool was recorded as a success because the error flag never propagated. Fixing that is unglamorous and it is the difference between an agent that self-corrects and one that compounds a mistake.
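The sketch below shows one shape for this: a wrapper that always returns a result the model can read, with the error flag set explicitly. The `run_tool` wrapper, the `read_file` tool, and the `is_error` and `content` field names are assumptions; use whatever result format your runtime expects.

```python
def read_file(path: str) -> str:
    """Hypothetical read-only tool."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def run_tool(fn, args: dict) -> dict:
    """Run a tool and return a result the model can read either way.

    Failures carry is_error=True and a message with enough detail to pick
    a different approach; they never come back as empty successes.
    """
    try:
        return {"is_error": False, "content": str(fn(**args))}
    except FileNotFoundError as exc:
        return {"is_error": True, "content": f"path does not exist: {exc.filename}"}
    except Exception as exc:  # keep the loop alive, but say what happened
        return {"is_error": True, "content": f"{type(exc).__name__}: {exc}"}
```

Whatever records the result downstream has to carry `is_error` through unchanged; dropping it there reproduces the failed-call-recorded-as-success bug described above.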

Start with bash, promote to dedicated tools when you need a hook

A general execution tool, a shell or a code runner, gives the model broad reach with very little tool surface to maintain. That is the right starting point for breadth. The reason to promote an action to its own dedicated tool is that you need a hook the shell cannot give you: a place to gate it behind approval, a custom way to render it in the UI, an audit record with typed fields, or a parallel-safe marking the scheduler can trust. The rule of thumb is to reach for a dedicated tool when an action needs to be governed, shown, recorded, or parallelized, and to leave the rest to general execution.

Do not put a hundred tool schemas in the context

As the tool set grows, loading every definition into every request gets expensive and starts to confuse the model about which tool to pick. Beyond a couple dozen tools, discovery beats enumeration: let the agent search for the tools relevant to the request and load only those, rather than carrying the whole catalog every turn. This keeps the prompt small, keeps tool selection sharp, and lets a large integration library scale without taxing every call.
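Discovery can start very simply. The sketch below ranks tools by keyword overlap with the request and returns only the top few; the `search_tools` function and the catalog layout are hypothetical, and a production version would more likely use embeddings or a proper search index and filter out stop words.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_tools(catalog: list[dict], query: str, limit: int = 5) -> list[dict]:
    """Return up to `limit` tools whose name or description overlaps the query.

    Naive keyword overlap: common words count as matches too.
    """
    wanted = tokenize(query)
    scored = []
    for tool in catalog:
        overlap = len(wanted & tokenize(tool["name"] + " " + tool["description"]))
        if overlap:
            scored.append((overlap, tool["name"], tool))
    # Highest overlap first; ties broken by name so results are stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [tool for _, _, tool in scored[:limit]]
```

Only the definitions this returns go into the request, so the prompt cost tracks the task rather than the size of the catalog.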

Log the surface, not only the conversation

If something goes wrong, you will want to answer two questions: which tools could the agent reach, and which did it call with what arguments. Both belong in the audit trail. The conversation alone does not tell you that a tool description changed last Tuesday or that a connector was added without review. The tool surface is part of the agent’s behavior, and observing it is part of running the agent responsibly.
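A minimal sketch of an audit trail that answers both questions follows. The event names, field names, and helper functions are assumptions for illustration; the point is only that the reachable surface and the individual calls are both written down.

```python
import hashlib
import json
import time


def surface_snapshot(tools: list[dict]) -> dict:
    """What the agent could reach: each tool name with a hash of its definition."""
    return {
        "event": "tool_surface",
        "tools": {
            t["name"]: hashlib.sha256(
                json.dumps(t, sort_keys=True).encode("utf-8")
            ).hexdigest()
            for t in tools
        },
    }


def call_record(name: str, args: dict, is_error: bool) -> dict:
    """What the agent actually did: one tool call with its arguments."""
    return {"event": "tool_call", "tool": name, "arguments": args, "is_error": is_error}


def write_event(stream, event: dict, clock=time.time) -> None:
    """Append one timestamped JSON line to any object with a write() method."""
    stream.write(json.dumps({"ts": clock(), **event}, sort_keys=True) + "\n")
```

Writing a snapshot at the start of each session means a changed description or a newly added connector shows up as a different hash or a new key when two sessions are compared.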

The pattern under the patterns

Every item here is a version of the same idea. The tool call is the point where a probabilistic system touches a deterministic one, and the engineering is about making that boundary safe, observable, and recoverable. Type the contract, make actions safe to repeat, separate the dangerous from the routine, return failures the model can use, gate what should be gated, and keep a record of the whole surface. Get those right and tool use becomes the dependable core of an agent. Get them wrong and it becomes the place every incident report points to. This is the work we put into the tool layer of the Calliope CLI and Workbench, because in an agent, the tool call is not a detail. It is the product.
