Most organizations measure AI adoption by which models or tools they use. That tells you almost nothing about whether those tools produce reliable outcomes. The teams winning with AI are not the ones with the best models. They are the ones who have stopped treating AI output as a deliverable and started treating it as an unverified input to a system that catches errors.
The discipline does not disappear — it relocates.
This post introduces The Harness Model, a maturity matrix for AI engineering practices across ten dimensions. It is not a certification framework or a prescriptive roadmap. It is a diagnostic tool — a structured way for engineering teams to assess where they are and decide where to invest next.
The visual format is inspired by the Cloud Native Maturity Matrix by Container Solutions: same dimension-based grid approach, applied to AI engineering practices. This is a Q1 2026 snapshot. The field is moving fast. We expect to revise this quarterly as practices evolve and community feedback arrives.
The Harness Thesis
A convergence is happening across the AI engineering community: when AI writes the code, the craft shifts to designing the system that controls AI. Multiple practitioners have arrived at this insight from different angles. The individual concepts — harness engineering, context engineering, loop progression — have been described separately. Our contribution is the synthesis: a structured assessment model that maps where a team stands across all these dimensions simultaneously.
A harness is the system of context, constraints, and verification that wraps an AI agent. It is what turns raw model output into something you can trust. The harness includes the documentation agents read, the linters that enforce boundaries, the tests that catch regressions, and the feedback loops that improve all of these over time.
Mitchell Hashimoto described the core practice: anytime an agent makes a mistake, you engineer a solution so it never makes that mistake again. He called this harness engineering. OpenAI's team reported a striking demonstration: over five months, three engineers used Codex to build a million-line codebase with zero manually written code — a dogfooding result from the team that built the agent, not yet reproduced by customer teams at that scale. Their harness had three layers — context engineering that made the repository itself the system of record, architectural enforcement through custom linters and structural tests, and garbage collection agents that fought entropy before it compounded.
Birgitta Boeckeler observed that harnesses might become the new service templates — standardized starting points for common application topologies. She also raised the critical brownfield question: which harnessing techniques work for existing applications, and which only work for systems built with a harness in mind?
Chad Fowler placed this in historical context: when XP replaced phase-gate development, rigor relocated from documents to tests. His formula for the AI era: probabilistic inside, deterministic at the edges — let AI generate freely within boundaries that are strict, explicit, and mechanically enforced.
Kief Morris mapped the progression of how humans relate to agents. Start outside the loop (vibe coding — agent runs, human hopes). Move to in the loop (human reviews every line). Then on the loop (human builds the harness that controls agents). Finally, the agentic flywheel (human directs agents to improve the harness itself). Each step shifts human effort from execution to system design — from writing code to writing the system that writes code.
Taken together, these five sources describe different entry points into the same conclusion: reliable AI outcomes require a designed system around the agent, not a better agent.
The better your harness, the more you can trust the output. The worse your harness, the more time you spend reviewing, fixing, and re-prompting — which is often slower than writing the code yourself. METR's 2025 RCT found experienced open-source developers using AI tools were 19% slower on average, while believing they were faster. The harness thesis is exactly what explains that gap: without the system around the agent, the review-fix-reprompt tax dominates.
One principle runs through every stage of this model: accountability stays with the human. Maturity is not a transfer of responsibility to agents; it is the humans earning the right to delegate more execution because the system around the agents has been engineered to catch what they get wrong. When an agent ships a bug, it is still a human who answers for it — and that framing should shape every decision about where to invest next.
A related principle shapes the matrix itself: agents are first-class team members. If a piece of information — a backlog item, an architectural decision, a runbook, a metric — is available to humans but not to agents, the harness has a hole in it. "It sits in my slide deck" is the anti-pattern. Everything the team works from should be equally reachable by the humans and the agents on that team.
The Maturity Model
The Harness Model defines five stages across ten dimensions. Each stage describes a dominant interaction paradigm between humans and AI agents — not a maturity score to optimize. Most organizations today are at Stage 1 or 2. Stages 4 and 5 are aspirational endpoints, not expected norms. Stage 5 is our informed speculation — no validated production examples exist yet, and we included it because extrapolating where this is heading was too interesting to leave out.
We organize the ten dimensions into four clusters that form a causal story: Foundation (inputs — what agents work with), Governance (constraints — what keeps agents safe), Delivery (execution — how work gets done), and Outcomes & Learning (measurement and evolution — how results are assessed and how the system improves). Reading the matrix top-to-bottom follows this causal flow — inputs feed into constraints, which shape execution, which produces outcomes that feed the next iteration — while reading left-to-right shows progression through the five maturity stages.
A candid note before the table: we typically start greenfield teams at Stage 3 and have reached some Stage 4 characteristics in select dimensions — never all at once. If that sounds modest, it is meant to.
| Dimension | No AI Process | Chatbot-Assisted | Human-in-the-Loop | Systematic Harness | Agentic Flywheel |
|---|---|---|---|---|---|
| Context Engineering | N/A | Copy-paste snippets into chat | Human maintains AGENTS.md; manual context loading | Repository as system of record; progressive disclosure | Human designs context structure; agents maintain and evolve it |
| Team (Humans + Agents) | Large specialist team; humans execute; standard tools | Same specialist team; humans execute with AI suggestions | Leaner delivery-minded team; agents use file/terminal | Small generalist team per initiative; agents run full stack | Same generalists span multiple initiatives; agents self-provision |
| Security & Trust | No AI-specific security concerns | Human reviews AI output for obvious issues | AI output runs through existing SAST/DAST tooling | Scoped, auditable agent access; AI-specific threat modeling | Agents enforce and evolve policies; human governs trust boundaries |
| Architectural Governance | No AI constraints | Basic rules in prompt files | Human-enforced boundaries; constrained solution space | Custom linters with remediation; taste invariants as code | Harnesses as org-wide templates; human defines taste |
| Human-Agent Interaction | N/A | Chatbot Q&A | Agent generates; human reviews every output line-by-line | Human on the loop; builds harness | Agents propose harness improvements; human directs evolution |
| Workflow & Process | No AI in workflow | Occasional AI for specific tasks | Daily agent use; human delegates parallel work | Always-running agents; agent-first; human sets priorities | Agents handle full delivery cycle; human steers outcomes |
| Reliability & Operations | Manual runbooks; engineers triage and mitigate incidents | Same ops team; agents draft queries; humans mitigate | Structured runbooks; agents analyze alerts; humans mitigate | Agents auto-triage low-severity alerts; humans handle critical incidents | Agents auto-remediate known failures; humans own novel incidents |
| Verification & Quality | Manual testing | Agent runs tests; human checks | Custom linters and structural tests; human reviews results | Agent-to-agent review with quality scoring; human defines criteria | Agents detect and fix regressions; human defines standards |
| Knowledge & Feedback Loops | Tribal knowledge; docs in wikis | README-level docs | Human structures docs in repo; retrospectives capture agent lessons | Versioned plans and quality grades in repo; agent failures feed the harness | Agents maintain docs and capture learnings; human curates strategy |
| Planning & Decision-Making | Manual boards; experience-driven decisions | AI helps write tickets and research options | Agent triages issues; human prioritizes; decisions validated via PoCs | Decision signals equally reachable by humans and agents; humans validate with cheap prototypes | Agents propose initiatives; humans set direction from measured results |
Most teams today are at Stage 1 or 2. Stage 4 is aspirational; Stage 5 is informed speculation with no validated production examples yet.
Foundation: Context and Team
Context Engineering is how effectively agents receive the right information at inference time. The progression moves from copy-pasting snippets into a chat window to making the repository itself the single source of truth — with progressive disclosure (agents start with a map of the repo and pull in detailed documents only when a task needs them, rather than loading an encyclopedia up front). The key transition: at Stage 3 a human manually curates AGENTS.md. At Stage 4, the repository structure is the context, and documentation linters enforce that it stays current.
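One way progressive disclosure can look in practice is a top-level AGENTS.md that acts as a map rather than an encyclopedia. This sketch is illustrative — the module names, file paths, and commands are hypothetical, not a prescribed layout:

```markdown
# AGENTS.md — repository map (read this first)

## Layout
- `services/billing/` — invoicing and payment flows; details in `services/billing/AGENTS.md`
- `services/identity/` — authn/authz; details in `services/identity/AGENTS.md`
- `libs/shared/` — cross-service utilities; services must not depend on each other directly

## Conventions
- Run `make lint test` before proposing any change.
- Architectural decisions live in `docs/adr/`; read the relevant ADR before touching a module.

## When you need more
Pull in the module-level AGENTS.md for the code you are changing.
Do not load other modules' docs unless the task crosses a boundary.
```

The top-level file stays small enough to sit in every prompt; the detailed documents are only loaded when a task touches them — which is the "map first, encyclopedia on demand" shape of Stage 4.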
Team (Humans + Agents) describes the team as a single composition of humans and agents, because treating agents as second-class citizens is exactly the framing this model rejects. The row moves along two axes simultaneously: team shape (large specialist → business-oriented delivery → small generalist per initiative → generalist across multiple initiatives) and agent capability tier (none → suggestion-only → file/terminal access → ephemeral full-stack environments with browser automation and local observability → agents self-provisioning their own environments). Early pioneers built the full-stack environments from scratch — OpenAI's integration of the Chrome DevTools Protocol, which lets agents drive a real browser alongside code execution, is one example.
The human role evolves along this dimension from execution, to delivery leadership, to multi-initiative oversight. A nuance worth stating plainly: "AI provides specialist depth on demand" is only half true. Agents genuinely provide technical depth — frameworks, patterns, languages — on demand. They do not replace domain depth: the unique business rules, customer workflows, and market dynamics that give a product its edge. In complex-domain businesses, architects need to move toward close customer relationships as agents absorb execution work, not away from them. This is where the "human role evolves, never disappears" principle is most visible.
The Stage 3→4 jump is where both axes move together: teams get leaner because agent capability crosses the full-stack threshold. The trade-off is real — you pay for that leaner team with investment in the platform agents now run on (ephemeral environments, browser automation, observability) and in the governance that makes that leverage safe. Teams that try to shrink without paying either cost end up with the same number of humans and less agent trust.
Governance: Security and Architecture
Security & Trust governs how agent access, AI-generated code security, and supply chain integrity are managed. The progression follows a Zero Trust model: start with no AI-specific concerns, move through existing security tooling — SAST (Static Application Security Testing) and DAST (Dynamic Application Security Testing) applied to agent output just like human-written code — and eventually reach scoped, auditable agent access with AI-specific threat modeling. At higher stages, supply chain verification for agent tools becomes critical — treat tool descriptions and AI skills as potential attack vectors, not just convenience features.
Architectural Governance defines how constraints and design taste are enforced on agents. We use taste in the sense of refined judgment about trade-offs: which abstractions age well, which coupling is acceptable, which patterns reduce long-term risk. In an AI era where generation is cheap, taste is emerging as the new moat — an intangible, durable advantage that survives when everyone has the same models. "Taste invariants as code" means encoding that judgment into linters and structural tests so agents cannot unknowingly violate it.
Basic prompt rules give way to human-enforced boundaries, then to custom linters with remediation instructions injected directly into agent context. The OpenAI team's practice is instructive: they enforce module boundaries and data shape validation mechanically, but leave implementation choices to the agent. At Stage 5, harnesses would become organizational templates — the new golden path — and the human's central job would be defining the taste the harness enforces.
The Stage 3→4 jump is where the cost shows up: turning tribal taste into enforced linters requires senior engineers to name and codify judgment they have always held implicitly. That is slow work, and teams that skip it end up at Stage 3 forever — agents inside boundaries the humans cannot articulate, held together by constant review.
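As a minimal sketch of a "taste invariant as code": a structural check that walks Python sources with the standard `ast` module, flags imports that violate a declared dependency direction, and emits a remediation instruction an agent can act on. The layer names and the rule itself are illustrative assumptions, not a real rule set:

```python
import ast
from pathlib import Path

# Illustrative rule: code under app/domain/ must not import from app.api.
# A real rule set would be larger and live next to the architecture docs.
FORBIDDEN = {"app/domain": ["app.api"]}

REMEDIATION = (
    "Boundary violation: {src} imports {mod}. The domain layer must not "
    "depend on the API layer. Move the shared type into app/domain, or "
    "invert the dependency behind an interface defined in app/domain."
)

def imported_modules(source: str) -> list[str]:
    """Collect dotted module names from import statements in one file."""
    mods = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            mods.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.append(node.module)
    return mods

def check_boundaries(root: Path) -> list[str]:
    """Return one remediation message per violation found under root."""
    violations = []
    for prefix, banned in FORBIDDEN.items():
        for path in (root / prefix).rglob("*.py"):
            for mod in imported_modules(path.read_text()):
                if any(mod == b or mod.startswith(b + ".") for b in banned):
                    violations.append(REMEDIATION.format(src=path, mod=mod))
    return violations
```

The point of the `REMEDIATION` string is that it is written for injection into agent context: the agent receives the fix, not just the failure.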
Delivery: Interaction, Workflow, Reliability & Operations
Human-Agent Interaction is the division of labor and trust boundary between humans and agents. This maps directly to Morris's loop progression: from chatbot Q&A (outside the loop) through line-by-line review (in the loop) to building the harness (on the loop) and eventually directing agents that improve the harness (flywheel). The common failure at Stage 3: teams that review every line of generated code become a bottleneck and conclude AI "doesn't save time." Remember that accountability stays with the human at every stage — the loop describes where the human engages, not whether they are on the hook.
Workflow & Process describes how humans and agents coordinate daily work. Hashimoto's journey is a practical illustration: from occasional chatbot use to "always have an agent running." The progression from "occasional AI for specific tasks" to "agent-first delivery" changes what engineering management looks like. At Stage 4, the question is not "should we use AI for this?" but "why would a human do this instead of an agent?"
Reliability & Operations covers what happens after deployment — runbooks, alerts, incident triage, mitigation, and remediation. If Verification & Quality asks "did we build it right?", this dimension asks "is it running right?"
Two axes progress together along this row. The runbook axis moves from tribal knowledge, to markdown-in-Git, to LLM-ready structured runbooks, to agent-executable playbooks — a senior SRE's ability to read dense observability queries becomes a shared capability as agents draft and simplify queries for the rest of the team. The alert-handling axis moves from manual query writing, through AI-drafted queries, to auto-analysis on alert that surfaces initial context within minutes, to severity-based auto-triage with human checkpoints on P1/P2 incidents, and finally to proactive anomaly detection that mitigates before customer impact.
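The jump from markdown-in-Git to agent-executable playbooks can be as small as making each runbook step machine-readable. A hypothetical schema — the field names, commands, and namespaces below are illustrative, not a real tool's format:

```yaml
# runbooks/high-queue-depth.yaml — hypothetical agent-executable playbook
alert: queue_depth_high
severity_ceiling: P3        # agent may auto-triage P3 and below; P1/P2 page a human
steps:
  - id: gather-context
    run: "kubectl -n billing get pods -o wide"
    attach_output: true     # surfaced to the on-call human within minutes
  - id: check-consumer-lag
    run: "kafka-consumer-groups --describe --group billing-workers"
    attach_output: true
  - id: mitigate
    run: "kubectl -n billing scale deploy billing-workers --replicas=6"
    requires_approval: true # mitigation stays behind a human checkpoint
rollback:
  run: "kubectl -n billing scale deploy billing-workers --replicas=3"
```

Note where the human checkpoints sit: context-gathering runs unattended, mitigation requires approval, and the severity ceiling encodes the governance decision rather than leaving it to the agent's judgment.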
The human role shifts from executing incidents, to mitigating with AI assist, to handling only critical incidents, to owning only novel failure modes. The Stage 3→4 jump here is not "agents get smarter" — it is "agents get trusted with low-severity triage end-to-end." That is a governance decision backed by better runbooks and observability, not blind delegation, and it reflects the principle that maturity is trust earned, not control removed.
Outcomes & Learning: Verification, Knowledge, Planning
Verification & Quality is how correctness is judged in agent-produced output. Early stages rely on human eyes. Later stages shift to programmatic verification — custom linters, structural tests, agent-to-agent review with quality scoring. Fowler's principle applies here: if generation gets easier, judgment must get stricter. A team at Stage 4 has encoded "what good looks like" into tooling, not tribal knowledge.
Knowledge & Feedback Loops tracks how teams capture, maintain, and evolve shared knowledge from agent interactions. The critical shift is from docs in wikis to versioned artifacts in the repository. At Stage 4, plans are first-class artifacts with progress and decision logs. Agent failures generate systematic feedback that improves the harness. Important data points — quality grades, benchmark results, regression signals — live in the repo so results from one agent iteration can be compared against previous iterations. Knowledge that lives only in Slack threads or people's heads effectively does not exist for agents.
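A minimal sketch of quality grades as versioned artifacts, assuming a hypothetical `harness/grades.json` tracked in Git: each agent iteration appends its score, and a drop beyond tolerance versus the previous iteration is flagged as a regression.

```python
import json
from pathlib import Path

GRADES_FILE = Path("harness/grades.json")  # hypothetical path, committed to the repo

def record_grade(task: str, score: float, tolerance: float = 0.05) -> bool:
    """Append a quality score for a task; return False if it regressed
    beyond tolerance relative to the previous iteration."""
    history = json.loads(GRADES_FILE.read_text()) if GRADES_FILE.exists() else {}
    previous = history.get(task, [])
    regressed = bool(previous) and score < previous[-1] - tolerance
    history[task] = previous + [score]
    GRADES_FILE.parent.mkdir(parents=True, exist_ok=True)
    GRADES_FILE.write_text(json.dumps(history, indent=2))
    return not regressed
```

Because the file is versioned, a regression shows up in the diff of the change that caused it — which is what lets agent failures feed the harness instead of evaporating.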
Planning & Decision-Making shifts from heuristic-driven decisions and manual ticket boards to data-informed, continuously validated approaches. The underlying principle is equal access: every signal a human uses to make decisions — the backlog, metrics, logs, traces, fitness functions, customer feedback — must be equally reachable by agents. At Stage 4, agents propose and score items against those signals; humans validate the important decisions by building cheap prototypes and measuring real outcomes. The shift is from "we think this is right" to "we built something small and measured."
As team composition becomes more generalist, the Outcomes & Learning cluster is what prevents that generalism from becoming shallow: the measurement and feedback machinery is where taste is refined and where agents earn the trust they are given.
Brownfield Reality Check
All the source articles lean greenfield. Stages 1 through 3 apply equally to brownfield and greenfield systems — chatbot use, human review, and basic linting work regardless of codebase age. Divergence starts at Stage 4.
First, a definition. When we say "brownfield" in this article we mean the common case: a legacy codebase carrying real tech debt, with patchy documentation, weak automation, and processes shaped more by history than design. A well-maintained long-lived codebase is a different beast — its established templates and copy-paste patterns can actually accelerate agents, giving them more to imitate than a greenfield project where everything is built from scratch. To be clear, we believe the blocker is not age but the absence of maturity: debt plus disorder.
Systematic harness engineering assumes a level of architectural cleanliness that debt-laden legacy codebases rarely have. Module boundaries need to be clear. Dependency directions need to be enforceable. Test coverage needs to be sufficient for agents to verify their own work.
Harness-readiness checklist for brownfield systems. Before building a harness on top of a legacy codebase, work through these three questions:
- Modularization: Can you name, in a sentence, what each module owns? If ownership is ambiguous, agents will generate code that belongs nowhere and everywhere.
- Boundary enforcement: Are module dependencies explicit and directional? If not, agents will create new coupling faster than humans can untangle it.
- Test coverage: Can agents verify their own changes without human intervention? Gaps in coverage become gaps in agent autonomy.
Teams that skip harness-readiness and jump straight to agentic workflows typically regress within weeks — agent-generated entropy compounds faster than manual cleanup can address it. This is why Stage 4 remains largely unvalidated for brownfield systems as of Q1 2026. What we know works: Stages 1-3 apply equally, and AI can accelerate the harness-readiness work itself. Boeckeler's analogy is apt: it is like running a static analysis tool on a codebase that has never had one. You will drown in alerts unless you tame the foundation first.
Mixed Maturity and Regression
We have seen this pattern firsthand: one team scored Stage 4 on Context Engineering — structured AGENTS.md files, repo-level documentation wired into every prompt — yet sat at Stage 2 on Verification, relying entirely on manual code review to catch agent mistakes. In practice, this looked like agents producing architecturally sound code that still broke integration tests nobody ran automatically. Once the team prioritized automated verification — wiring CI checks into the agent feedback loop — their context investment finally compounded: agents could self-correct instead of waiting for a human to spot failures. The weaker dimension was the bottleneck, not the stronger one.
Mixed maturity across dimensions is the norm, not a failure. When prioritizing, start with the dimension that creates the most pain or blocks progress in others.
Regression is also real. Last year, many teams stuck at Stage 2 abandoned AI entirely and regressed to Stage 1. This year, the most common struggle is overcoming Stage 3 — teams have justified fear of giving too much control to AI without sufficient verification and governance in place. The harness is exactly what addresses that fear: you do not hand over control, you build the system that makes delegation safe.
Where Are You? A Conversation Starter
This model is not a certification. It is a diagnostic tool — best used as a team conversation starter, not a solo assessment. Grab a whiteboard, walk through the dimensions, and see where disagreements surface. Those disagreements are the signal. A practical starting point: circulate the matrix to your senior engineers before your next staff meeting. Ask each person to score one or two dimensions independently. Bring the disagreements to the meeting — you do not need consensus, you need the conversation.
Foundation
- What does your agent read before it starts a task? A single AGENTS.md? Structured documentation? The repo itself?
- Is every piece of information your humans rely on equally reachable by your agents, or does some of it live in slide decks, private notes, and DM threads?
- If you removed one specialist from the team, could agents cover the technical gap today? And who is close enough to the customer to cover the domain gap?
Governance
- What can your AI agent access? Is that access scoped and auditable, or "whatever the developer's token allows"?
- If an agent violates an architectural boundary, how long before someone notices? Minutes, days, or the next quarterly review?
- Which pieces of your design taste are encoded as linters or structural tests, and which still live only in senior engineers' heads?
Delivery
- How much of your day involves an agent running on a task? Zero percent? Ten? Fifty?
- When an incident fires at 3am, does an agent surface initial context within minutes, or does a human start from a blank terminal?
- When an agent ships something that breaks in production, is it clear that the human on call still owns the outcome?
Outcomes & Learning
- Could a new team member find your verification criteria in code, or only in someone's head?
- Where does your team's knowledge live — in Slack threads, wiki pages, or versioned artifacts in the repository?
- Are the signals you use to decide what to build next — metrics, traces, fitness functions, customer feedback — equally available to your agents?
If you are not sure which stage your team is at for a given dimension, that ambiguity is itself a signal. It usually means you are between stages — and the uncertainty points to where investment would have the most impact.
As mentioned earlier, we typically start greenfield teams at Stage 3 and have reached some Stage 4 characteristics in select dimensions. For brownfield teams, getting to Stage 2 is usually straightforward, but moving to Stage 3 takes hard work from the whole team — and so far we have not progressed beyond that for brownfield systems. If that sounds familiar, you are in good company.
Start Where You Are
The harness is the through-line. Whether you are writing your first AGENTS.md or building custom linters with remediation instructions, you are doing harness engineering. The question is not whether to build a harness. It is how deliberately you build it — and remembering, at every stage, that the accountability for the outcome still lives with a human.
This is a Q1 2026 snapshot — a starting point that will evolve as practices mature and community feedback arrives. Pick one dimension where you feel the most pain. Identify what stage you are at and what the next stage looks like. Build one thing that moves you forward. That is harness engineering. If your team is just getting started, our guide on how to start with AI-assisted development covers the practical first steps.