ENGINEERING

Harness Engineering: the underrated discipline of production AI

Kaushik ASPJune 6, 20265 min read

A million lines of code, zero written by a human hand.

That was the field report from OpenAI’s Ryan Lopopolo in February 2026. That same week, Mitchell Hashimoto (creator of Terraform) named the discipline that made that scale possible: Harness Engineering. They used different words. They described the same skill.

The industry, which had been doing this work without a name for over a year, finally had one: harness engineering. The discourse caught up fast. But the actual practice? Most enterprise teams still don't have anyone whose job it is.

How we got here: prompt → context → harness

The progression matters because it explains why this skill is needed now and wasn't two years ago.

2022–2024: prompt engineering. The skill was writing better instructions, getting a sharper answer from a single model interaction. Most teams got reasonably good at it.

2025: context engineering.The skill expanded to the information environment: what the model sees, how it's retrieved, how the context window is managed.

2026: harness engineering. The lens widened again. The skill is now building and maintaining the entire system that governs how an agent behaves across every run, every session, every deployment — not one interaction, not one context window, the whole environment. Constraints, feedback loops, tool interfaces, subagent architecture, failure recovery, and the mechanisms that make every lesson compound automatically instead of disappearing into a closed ticket.

Three concentric boxes showing how Prompt Engineering sits inside Context Engineering, which in turn sits inside Harness Engineering — each layer wrapping and extending the one before it.

Each step required more engineering discipline than the last. This one requires the most for now.

What practicing this skill actually looks like, technically

Here's what harness engineering looks like at the level where it actually matters.

Reading trajectories for patterns, not bugs

Most teams debug agents the wrong way. They find a bad run, study it closely, write a fix, and move on. This feels productive but produces almost no durable improvement.

The right practice is pattern scanning at volume. You get more signal from skimming a hundred trajectories than studying ten closely. You are not looking for what went wrong in this run. You are building a mental index of what goes wrong across runs. This might include the same tool being misused in ways that individually look harmless, the same subagent mysteriously not getting invoked in a specific class of situations, or the same reasoning step failing whenever the input has a particular shape.

The Discipline: Skim until something feels off and then zoom in. The day after you ship any change (a new prompt, tool, or constraint), read thirty trajectories that hit that code path.

Redesigning tools instead of stacking instructions

The insight most enterprise teams have not fully absorbed: a tool is not an API endpoint. It is a prompt fragment. Every part of a tool interface (the name, the description, the parameter names, the parameter descriptions, and the return format) is text the model reads while deciding what to do. The tool is not separate from the prompt. It is part of the prompt. This means every design decision about a tool interface directly affects agent behaviour.

The name is more load bearing than most teams realise. An agent deciding whether to call a tool is pattern matching on the name against the task at hand. A name that is slightly too generic, too clever, or too close to another tool name will cause misrouting that looks like a model failure but is actually an interface failure. Significant quality improvements have come from nothing but renaming a tool.

Parameter names are instructions rather than labels. A parameter named query gets filled differently than one named search_query or natural_language_question. The model uses the parameter name as a hint about what format the value should take. Name parameters the way you would want them filled.

Descriptions: These should describe when to use the tool instead of just what it does.

Bad:“Fetches weather data for a given city.”
Good:“Use this when the user asks about current conditions in a specific location. Do not use for historical averages.”

The Principle:When a tool is misused, do not “tell the agent harder” in the system prompt. Redesign the tool interface until the correct usage is the structurally obvious one.

Encoding lessons permanently

A failure class that does not get enough attention: prompt systems have no type system. Which creates a dangerous class of silent failures. In traditional software, a missing part causes a crash; in prompt composition, the agent just produces “plausible but wrong” output.

The Problem: Prompt composition has no contract between parts. An addon block might assume a base block is present, but there is no mechanism to verify this.

The Properties of these Bugs:

Silent: No alert fires and observable signals look normal.
Slow: You only find them when a user complains or during a manual review.
Trivial to fix: Usually a one-line change, but the cost lies in the detection time.

The Solution: Treat every prompt composition as an integration test.

Snapshot fully assembled prompts.
Diff them across changes.
Evaluate against semantic completeness to ensure the output reflects the intended context.

Why it's trending but still underrated

The discourse found this skill fast. Hashimoto named it, openAI proved it at scale, Martin Fowler published a formal taxonomy, the academic papers followed. Every serious engineering blog has covered it.

And yet many enterprise teams don't have anyone whose actual job is practicing it. A proper harness engineer on the ground!

The skill gets diffused into existing roles: a senior engineer here, someone who used to do prompt work there, without being recognized as a discipline that requires sustained, dedicated practice to compound.

Teams know they need better agents. They don't always understand that what they need is someone

Doing trajectory reads every day
Redesigning tool interfaces when the pattern shows up three times
Encoding every lesson into something the system inherits automatically rather than something one engineer carries in their head and the work suffers if they take a break.

The concept is trending. The dedicated, compounding practice is underrated.

What happens to teams without it

I've been on ~20 calls with AI engineering teams in the last two weeks. Some of the sharpest people shipping agents in production right now. When I ask what's actually slowing them down, almost everyone lands in the same place eventually.

It usually sounds something like:

We see the failure. We ship a fix. Three weeks later, someone else on the team is fixing the same thing again.

Their observability is working, but only storing traces, not flagging the silent failures. The ones when seen standalone look perfectly fine, but with the right context, you know it is a silent failure.

The lessons from each failure don't come back into the system. They live in a Linear ticket someone closed, a Slack message that scrolled past, or in one engineer's head until that engineer takes a week off.

This is exactly the failure mode that harness engineering as a sustained practice is designed to prevent. The lesson doesn't survive because it was never encoded somewhere the system inherits. It was patched once, by one person, in a way that doesn't propagate forward.

A trace can tell you what broke. An eval can tell you how often. Neither closes the loop back into the agent automatically. Closing that loop: permanently, structurally, in a way that compounds is the skill.

Engineering teams today are doing vibe evals. Evals that represent maybe 80% of their traffic, written against the cases they already understood. What they don't fit is the long tail, and in agent-based products the long tail is where everything compounds. That's the part the harness was supposed to catch, encode, and make structurally impossible to repeat. It never gets the chance because it never makes it into an eval in the first place.

This is the real work on harness engineering. Not the obvious failures, those get caught. It's building the mechanisms that surface what the long tail is doing, and encoding the findings into the system. The encodings that system can inherits permanently. That's what closes the loop.

It's also the thing we're quietly working on at BentoLabs.ai

Learn more about how we are solving the complexities of production AI.

Production AI agents need research infrastructure, not iteration