Agentic DevOps
How a squad of agents takes a refined milestone from inbox to merged code
Last month a backend developer on our team pasted a Linear project URL into their AI tool. The pipeline picked up every issue in the active milestone and spun up one squad per issue, in parallel. Each Developer agent read its issue, already refined, with the why, the approach, the patterns to reuse, and the tests all in it. It loaded the documentation the issue referenced and started building. There was no planning round and no approval step: the issues had arrived as detailed specs.
The squads executed in parallel. PRs opened independently. Two review agents per PR ran concurrently: one checking acceptance criteria, one checking code quality. When both reviewers approved a PR, the squad merged it. One PR was flagged for an error-handling inconsistency; the Developer agent fixed it, and the PR auto-merged after the second review pass.
Same day, the milestone was done. The developer barely touched it after pasting the URL. The work that used to fill their day had already happened upstream, during refinement, which was the topic of our previous article.
If the build no longer needs a human steering every step, what is the new shape of the pipeline, and what new failure mode does it produce?
This is the third piece in a five-part series. The first argued documentation is context infrastructure. The second showed how product management changes when refinement happens in real time with agents. This piece is about what those refined milestones feed into: a build pipeline where most of the work between refined milestone and merged code is done by agents, not humans.
DevOps moved up a layer
For a long time, DevOps mostly meant automating deployment. Pipelines, infrastructure-as-code, observability: the work that happens after code is written.
The work that happens before code is written, taking a feature description and producing an implementation, has remained almost entirely manual. A developer reads the spec, asks questions, plans the approach, writes the code, tests it, opens a PR, addresses review comments, and merges. That whole sequence has had no equivalent of CI/CD.
Agents change that. Replacing the developer was never the point; that framing misses what’s actually happening. What agents take over is the part of the build that used to need continuous human steering. The developer no longer drives every step; they hand off a refined milestone and watch it land.
We use the term agentic DevOps to describe this. It’s the build process for the build process: the pipeline that takes a refined milestone and produces working, reviewed, merged code with the same kind of structural reliability that CI/CD brought to deployment.
The old DevOps layer still runs unchanged. Agents produce merged code; CD takes it from there. We don’t give agents direct write access to infrastructure: every deploy, every migration, every terraform apply goes through the same pipeline humans always used, with the same gates and approvals it always had. The point is containment. An agent that hallucinates a destructive command can’t execute it, because the only path from main to production is the CD pipeline, and the agents don’t have that path.
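One way to picture that containment, as a minimal sketch. The action names and the two principals below are hypothetical and do not describe our actual permission model; the point is only that deploy-type actions are never granted to an agent, so a hallucinated command has nothing to run against.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical action names, for illustration only.
    READ_REPO = "read_repo"
    PUSH_BRANCH = "push_branch"
    OPEN_PR = "open_pr"
    MERGE_PR = "merge_pr"
    DEPLOY = "deploy"
    RUN_MIGRATION = "run_migration"
    TERRAFORM_APPLY = "terraform_apply"


# Assumed grants: agents stop at merged code; only CD touches infrastructure.
GRANTS = {
    "agent": frozenset(
        {Action.READ_REPO, Action.PUSH_BRANCH, Action.OPEN_PR, Action.MERGE_PR}
    ),
    "cd_pipeline": frozenset(
        {Action.DEPLOY, Action.RUN_MIGRATION, Action.TERRAFORM_APPLY}
    ),
}


def is_allowed(principal: str, action: Action) -> bool:
    """Deny by default: unknown principals and ungranted actions are refused."""
    return action in GRANTS.get(principal, frozenset())
```

In this sketch `is_allowed("agent", Action.TERRAFORM_APPLY)` is false because the grant does not exist at all. Nothing here depends on the agent deciding to behave.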
The squad
When a milestone enters the pipeline, one squad spins up per issue. Each squad has the same composition; they run in parallel.
The Developer agent does the work. It reads the refined issue (the approach, the patterns to reuse, and the dependencies were all identified during refinement), loads the documentation the issue references, and writes code that follows those patterns.
The QA agent writes the tests. The BDD scenarios in the refined issue are already the test specification, with each acceptance criterion mapping to one or more tests, so the QA agent turns those scenarios into runnable tests as implementation proceeds.
The Code Reviewer agent examines the PR after implementation. It checks pattern consistency, flags architectural drift, and verifies that conventions are followed. It cannot edit the code. Its tool access is read-only by design.
The QA Specialist agent also examines the PR, but for a different question: do the tests actually verify the behavior described in the BDD scenarios? It’s possible to write tests that pass without actually testing the right thing. The QA Specialist catches that.
These agents run independently. They don’t coordinate with each other directly. They coordinate through the artifacts they produce: the refined issue, the code, the tests, the review comments. Across squads the same is true: each squad’s PR is independent of the others. Communication happens through commits and PR comments, not through cross-agent messages.
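The squad shape described above can be sketched roughly as follows. This is an illustration under assumptions, not our Claude Code configuration: the role names, the `can_edit_code` flag, the branch naming, and `run_squad` are all hypothetical stand-ins. It shows two things only: every squad has the same four roles with read-only reviewers, and squads fan out one per issue without sharing state.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRole:
    name: str
    can_edit_code: bool  # reviewers are read-only by design


# Same composition for every squad (hypothetical role identifiers).
SQUAD_ROLES = (
    AgentRole("developer", can_edit_code=True),
    AgentRole("qa", can_edit_code=True),  # writes the tests
    AgentRole("code_reviewer", can_edit_code=False),
    AgentRole("qa_specialist", can_edit_code=False),
)


@dataclass(frozen=True)
class PullRequest:
    issue_id: str
    branch: str


def run_squad(issue_id: str) -> PullRequest:
    """Stand-in for one squad's work; its only output is an artifact (the PR)."""
    return PullRequest(issue_id=issue_id, branch=f"agent/{issue_id.lower()}")


def run_milestone(issue_ids: list[str]) -> list[PullRequest]:
    """One squad per issue, in parallel; squads exchange no messages."""
    if not issue_ids:
        return []
    with ThreadPoolExecutor(max_workers=len(issue_ids)) as pool:
        return list(pool.map(run_squad, issue_ids))
```

Because `run_squad` takes only an issue and returns only a PR, there is no channel through which one squad could depend on another's in-flight state.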
Tests as the spec
Acceptance criteria in our Linear issues are written in Gherkin: given the user is on the dashboard, when they click export, then a CSV downloads with the last 30 days of activity. The format is deliberate. Anyone can read it: a product manager, a designer, a customer success representative, anyone who shouldn’t have to read code to understand what we’re shipping.
The QA agent turns those scenarios into tests. Each criterion becomes one or more tests, written as implementation proceeds, and the Developer agent has to make them pass. This is TDD with the test moved up into the issue itself. The spec, the test, and the acceptance criterion become one artifact, owned by whoever wrote the issue.
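To make the mapping concrete, here is a sketch of how the export scenario could become a runnable test. The function `export_activity_csv`, its signature, the column names, and the inclusive 30-day boundary are assumptions made for the example; the real implementation and test framework may look quite different.

```python
import csv
import io
from datetime import date, timedelta


def export_activity_csv(activity, today):
    """Hypothetical code under test: CSV of activity from the last 30 days."""
    cutoff = today - timedelta(days=30)  # boundary handling is an assumption
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["day", "event"])
    for day, event in activity:
        if cutoff <= day <= today:
            writer.writerow([day.isoformat(), event])
    return buffer.getvalue()


def test_export_downloads_last_30_days():
    # Given the user is on the dashboard, with activity inside and outside the window
    today = date(2024, 6, 30)
    activity = [
        (today - timedelta(days=45), "login"),
        (today - timedelta(days=3), "export"),
    ]
    # When they click export
    downloaded = export_activity_csv(activity, today)
    # Then a CSV downloads with the last 30 days of activity
    rows = list(csv.reader(io.StringIO(downloaded)))
    assert rows == [["day", "event"], ["2024-06-27", "export"]]
```

The Given, When, and Then comments carry the scenario text unchanged, so someone who cannot read the code can still see which criterion the test claims to cover.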
This discipline matters more once agents write the code. When a human wrote it, the test checked work that already had judgment behind it. When an agent writes it, the test is the only thing that says what the code should do. If the criteria aren’t readable outside engineering, no one can check the agent’s work, and all that’s left is trusting the agents got it right.
So readability is a hard requirement. Every acceptance criterion has to be verifiable by someone with no access to code, logs, or the database. A criterion that fails this test gets rewritten before the issue enters the pipeline. Projects and milestones follow the same rule: a project description says what done looks like for the whole effort, in the same plain language as the per-issue scenarios.
Readable criteria do more than let people check the result. When the whole team can read what a feature should do, getting it right stops being engineering’s job alone. A product manager, a designer, or a customer success representative can catch a wrong scenario while it’s still a sentence, before it becomes wrong code. We think this shared understanding is what keeps a product sound over time: the definition of done lives where the whole team can see it and fix it, not inside one engineer’s head.
Why there’s no plan step
You might expect a planning phase here: the squads read their issues, produce an implementation plan, a human approves it, and only then does code get written. There isn’t one. The planning already happened upstream, during refinement.
By the time a milestone enters this pipeline, every issue in it is a complete specification. The previous piece described how those get built: the why, the who, the technical approach the Architect agent settled on, the patterns the Developer agent found in the codebase, and the BDD scenarios the QA agent turned into a test plan. The plan is the issue. The squads execute it instead of working it out again.
This is the structural payoff of front-loading judgment into refinement. The expensive thinking (is this worth building, how should it fit the system, what does done look like) was done once, with a human in the room, and written down. The build pipeline inherits those decisions rather than re-making them. It also means there’s no human checkpoint between a refined milestone and merged code. We chose that deliberately, and we come back to it below.
Implementation
Each Developer agent implements its issue. It writes code in the patterns the refined issue points to: the same function structure, the same error-handling approach, the same database conventions. Tests get written against the BDD scenarios from the issue.
Linting runs automatically on save. Tests run before commit. Format checks run on every file write. These guardrails operate regardless of agent reasoning. An agent can’t skip them.
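A minimal sketch of what "an agent can’t skip them" means structurally. The three checks below are toy stand-ins (a real setup would call an actual linter, formatter, and test runner), and `guarded_commit` is a hypothetical name. What matters is that the commit path runs every check unconditionally and exposes no skip option.

```python
from typing import Callable

Check = Callable[[str], bool]

# Toy stand-ins for real tooling; each returns True when the source passes.
CHECKS: dict[str, Check] = {
    "format": lambda source: all(
        line == line.rstrip() for line in source.splitlines()
    ),
    "lint": lambda source: "\t" not in source,
    "tests": lambda source: "assert False" not in source,
}


def guarded_commit(
    source: str, checks: dict[str, Check] = CHECKS
) -> tuple[bool, list[str]]:
    """Run every check, always. There is deliberately no skip flag to set."""
    failed = [name for name, check in checks.items() if not check(source)]
    return (not failed, failed)
```

The agent's reasoning never enters this function: whatever it believes about its own code, the result depends only on the source it tries to commit.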
When the context infrastructure is precise and the codebase is clean, this is where it clicks: close to one prompt, one feature, no rework. The agents have everything they need (the what, the why, the how, the test plan, the patterns to follow) and produce complete, convention-compliant implementations across the milestone.
When the context infrastructure is messy (stale ADRs, ambiguous conventions, inconsistent patterns in the codebase), implementation drifts. Agents pick one of several plausible interpretations and run with it, and the developer spends time correcting the choice. That looks like an agent failure, but it’s a context failure, and implementation is just where it surfaces.
That’s why the layer underneath this piece matters. Agentic DevOps doesn’t work without the documentation layer piece one described, or without the shared configuration layer we’ll cover in the next piece. The reliability of this pipeline is downstream of the precision of those layers.
Review and merge
When implementation completes, each PR opens. Two review agents examine it in parallel.
The QA Specialist validates that the implementation matches the acceptance criteria. Does each criterion have a corresponding test? Do the tests actually verify the behavior described in the scenarios? Are there acceptance criteria the implementation missed entirely?
The Code Reviewer checks code quality and convention compliance. Do the implementation patterns match the codebase? Are there architectural violations that conflict with existing ADRs? Is the code consistent with the approach the refined issue laid out?
Both reviews happen in parallel and produce comments on the PR. The Developer agent reads the comments, fixes valid issues, and pushes new commits. The cycle repeats until both reviewers approve.
When both reviewers approve, the squad merges. There’s no human merge checkpoint, and for feature work no human checkpoint anywhere in the pipeline. The issue was refined and signed off upstream; the reviewers agreed; the merge happens. Across the milestone, PRs land independently as each one passes review.
The bug-fix variant
The same pipeline handles bug fixes, but the unit is one issue at a time, not a milestone, and the trigger is different.
When a Sentry alert fires, Sentry’s Linear integration auto-creates a Linear issue from the alert. Sentry’s Slack integration posts the alert to a triage channel at the same time, with the Linear issue link attached. From Slack, a human triggers the squad to start work on that issue.
That trigger is the human moment in this variant. The auto-created Linear issue isn’t refined yet: it’s a stub with the stack trace and the alert metadata. Triggering from Slack tells the squad to take the stub into the pipeline. Here the asymmetry with feature work shows. A feature issue arrives refined, so the squad goes straight to implementation. A bug stub isn’t refined, so it needs a diagnosis pass first: an observability-flavored step produces a fix hypothesis and a fix plan, implementation writes the fix, review verifies it, and the squad auto-merges.
The Slack trigger is also the noise filter. Sometimes the alert is real and the hypothesis lands. Sometimes the alert is noise, or the root cause is elsewhere than where the stack trace points. The human in Slack decides whether to trigger the squad in the first place. After that, it’s the same pipeline.
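The asymmetry between the two entry paths fits in a few lines. This is only an illustration: the stage names and the `pipeline_stages` function are hypothetical, not identifiers from our tooling.

```python
def pipeline_stages(kind: str, human_triggered: bool = False) -> list[str]:
    """Stages an issue passes through, by kind (hypothetical stage names)."""
    build = ["implementation", "review", "merge"]
    if kind == "feature":
        # Arrives refined, so the squad goes straight to implementation.
        return build
    if kind == "bug":
        if not human_triggered:
            # An untriggered stub is treated as possible noise: nothing runs.
            return []
        # A stub is not refined, so diagnosis comes first.
        return ["diagnosis"] + build
    raise ValueError(f"unknown issue kind: {kind!r}")
```

Past the first stage the two lists are identical, which is the sense in which the bug path is the same pipeline.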
We’re early on this variant. We’ve shipped a small number of agent-fixed bugs this way. The pipeline works; the volume isn’t yet enough to claim it scales.
Where humans are still central
For feature work, the build pipeline has no human checkpoint at all. That sounds reckless until you remember where the judgment went: upstream, into refinement. By the time a milestone reaches this pipeline, a human has already decided the work is worth doing, agreed the approach, and signed off on what done looks like for every issue. The “we shouldn’t do it that way because of X” conversation, the kind of domain judgment that’s hard to encode in documentation, happens during refinement, while the issue is being shaped, well before any code is written. The build pipeline inherits a decision that was already made.
So the human moments left in the pipeline itself are about bugs and bad outcomes, not features in flight:
Slack trigger for bug fixes. Sentry alerts auto-create Linear issues, but a human chooses to trigger the squad from Slack, framing the issue with whatever local context the responder has. It’s also where a bug stub gets the diagnosis a refined feature issue already carries.
Recovery from bad merges. This is the one human moment the pipeline doesn’t schedule. When something slips through both reviews, we deal with it after the fact, since the flow has no built-in stage for catching it. What that costs, and whether it holds at larger volume, is the open question we come back to below.
Everything else (implementation, testing, reviewing, merging) is agent-driven. That’s a lot of work that used to require continuous human steering and now happens on its own.
The honest framing: humans are still in the loop, but the loop is shorter and sits at higher-leverage points. The human decision concentrates in refinement. After that, auto-merge runs on the trust we’ve placed in the two reviewers.
The harness, not the model
We use Claude Code for the squads. The configuration is ergonomic, the agent definition format is concise, and having one provider behind every agent keeps coordination simple. But the tool is the incidental part. What the last several sections actually described is a harness: the squad composition, the read-only reviewers, the guardrails that run on save and before commit, the refined issue that feeds all of it. Birgitta Böckeler calls the work of building it “harness engineering” (https://martinfowler.com/articles/harness-engineering.html), and frames the agent itself as Model + Harness: the model does the reasoning, and the harness is everything around it, the guides that steer an agent before it acts and the sensors that catch it after.
We think this is where software engineering is heading. The durable craft is less about prompt-wrangling or picking the best model this month, and more about harness engineering: building the bespoke scaffolding that turns a general model into a reliable teammate for your codebase. You rent the model and swap it as better ones appear. The harness is the part you own and keep improving.
And because the harness is separable from the model, the model becomes the part you optimize. With OpenCode or a similar orchestration layer, you can craft a tool that assigns a different model per role: a top-tier reasoning model like Opus 4.8 for the roles where judgment is expensive, the Architect weighing feasibility or the QA agent designing the test strategy; Sonnet for the Developer agent writing the code, which is largely mechanical once the spec is precise; a smaller, cheaper model for the Slack-notification agent that just formats text; a local model for any agent handling sensitive data that can’t leave your network. Mixing public and local models this way gives you a real handle on cost, performance, and provider lock-in, and you get it without rebuilding anything: each role keeps its place in the harness and simply runs on a different model.
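As a rough sketch of per-role assignment: the tier names, role keys, and model identifiers below are placeholders invented for illustration. They are not real model names and not OpenCode's configuration format. The sketch only shows that the role-to-model mapping can live in one small table outside the rest of the harness.

```python
# Placeholder identifiers; swap in whatever your orchestration layer expects.
TIER_MODEL = {
    "frontier": "provider/frontier-model",
    "mid": "provider/mid-model",
    "small": "provider/small-model",
    "local": "local/on-prem-model",
}

# Hypothetical role keys, following the split described above.
ROLE_TIER = {
    "architect": "frontier",     # judgment is expensive
    "qa": "frontier",            # designs the test strategy
    "developer": "mid",          # largely mechanical once the spec is precise
    "slack_notifier": "small",   # just formats text
}


def model_for(role: str, handles_sensitive_data: bool = False) -> str:
    """Pick a model for a role; sensitive data always stays on the local model."""
    tier = "local" if handles_sensitive_data else ROLE_TIER[role]
    return TIER_MODEL[tier]
```

Changing which model a role runs on is then a one-line edit to `TIER_MODEL` or `ROLE_TIER`, with no change to the role's place in the pipeline.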
What we don’t know yet
Failure modes we haven’t hit at scale. Our pipeline runs cleanly for clear, well-scoped milestones, and two things keep it that way. Ambiguity is resolved upstream, at refinement, before an issue ever enters the pipeline. And the complexity score keeps issues small: a change that would sprawl across dozens of files gets split into separate issues rather than handed to one squad, which is what keeps any single squad’s blast radius contained. What we haven’t tested is where that slicing breaks down: changes that resist clean decomposition, and cross-issue dependencies across a milestone that refinement didn’t anticipate. Our scope so far is too small to claim those work.
Bug-fix volume. The bug path is wired: Sentry collects every error and auto-creates a standardized Linear issue, and a human can trigger a squad from Slack to fix it. But most of our pipeline experience is still feature work, and the volume of agent-fixed bugs remains low. We don’t have enough data to claim it holds across the full range of production-incident conditions.
Recovery from bad merges. With no plan checkpoint and auto-merge on approval, the two review agents are the only review gate between a refined issue and main. A PR that passed both reviews but introduced a regression lands without a final human gate before merge. We revert and triage manually when that happens. That works at our scale. We don’t know how it scales, and the regression-cost-vs-merge-speed tradeoff is the most exposed edge of this design.
Cost. Multi-agent workflows consume significant compute. A single milestone involves multiple model calls per issue (a Developer agent, a QA agent, and two reviewers, plus the fix cycles between them), each processing thousands of tokens of documentation context. The economics are manageable for a small team. We don’t have data on what happens at higher volume.
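A back-of-the-envelope way to reason about that growth, as a sketch only. The formula assumes each fix cycle costs one Developer run plus two re-reviews, and that every run reads roughly the same amount of context; both are simplifications. Any numbers passed in are illustrative inputs, not measurements from our pipeline.

```python
def estimate_milestone_tokens(
    issues: int, fix_cycles_per_issue: int, context_tokens_per_run: int
) -> int:
    """Rough floor on context tokens read across a milestone."""
    first_pass_runs = 4     # Developer, QA, Code Reviewer, QA Specialist
    runs_per_fix_cycle = 3  # assumed: one Developer fix plus two re-reviews
    runs_per_issue = first_pass_runs + fix_cycles_per_issue * runs_per_fix_cycle
    return issues * runs_per_issue * context_tokens_per_run
```

Since a single agent run usually involves many model calls, the result is better read as a lower bound than as a forecast. The useful part is the shape: cost scales with issues times agent runs times context size, so fix cycles and documentation size are the levers.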
The pipeline’s reliability is real, and we’d defend the basic claim that agentic DevOps shortens the path from refined milestone to merged code by an order of magnitude. But the limits are real too. This sits earlier in the stack than deployment automation, and it’s less stable.
Next in this series, in two weeks: AI Config Is Infrastructure. Why your team’s AI assistants should be peer-reviewed, version-controlled, and shared like any other code.




