THE SHIFT FROM SUGGESTION TO EXECUTION
The original Copilot model was fundamentally a completion engine: given the cursor position and surrounding file, predict what comes next. It was useful and fast, but passive. The engineer remained the executor. Every line the model proposed had to be accepted, modified, or rejected keystroke by keystroke. The model had no concept of the task — only of the token stream.
Coding agents invert that dynamic. Tools like Claude Code, Copilot Workspace, Cursor's Agent mode, and Devin don't complete lines — they complete tasks. You describe what you want: implement this feature, fix this bug, migrate this API, add tests for this module. The agent reads the relevant files, writes code, runs linters and test suites, reads the output, and iterates. The engineer reviews a diff, not individual lines.
What makes this possible is the combination of large context windows, reliable tool use, and the reasoning models that emerged through 2024 and 2025. An agent that can hold an entire codebase section in context, call a bash tool to run tests, parse the failure output, and reason about what to change next — that's qualitatively different from an autocomplete engine. The architecture changed what the product can do.
WHAT THE BENCHMARK NUMBERS ACTUALLY SHOW
SWE-bench Verified became the standard evaluation for coding agents: a set of real GitHub issues from open-source Python projects, where the agent must produce a patch that makes the failing tests pass without breaking existing ones. In early 2024, the best systems solved around 15–20% of issues. By late 2025, leading agents were clearing 50–65% on the verified subset — a dramatic improvement that still means failing on one in three to one in two real-world issues.
The benchmark reveals both the capability and the ceiling. Coding agents are genuinely good at well-specified bugs with existing test coverage: the failure mode is clear, the fix is localised, and the correctness criterion is executable. They struggle with ambiguous requirements, cross-cutting architectural changes, and issues where the "correct" fix involves understanding domain context that isn't in the code. SWE-bench skews toward the former category, which is why the numbers look impressive while real-world hit rates are lower.
More revealing than headline scores is looking at where agents fail. They tend to over-fit to the visible tests — passing the specified assertions while breaking adjacent behaviour. They sometimes introduce surface-correct patches that paper over root causes. They occasionally loop on incorrect approaches rather than recognising the need to step back and reconsider the architecture. These failure modes are exactly the ones that code review exists to catch, which points directly to where the human still needs to be.
WHERE THE CURRENT GENERATION BREAKS DOWN
Scope creep is the most common failure in practice. Coding agents are optimistic: given a task, they will attempt it, and if the first approach doesn't work, they'll try another. Without explicit constraints, an agent tasked with "fix the login bug" might end up refactoring the session management system, migrating the auth library, and updating half the test suite — all in service of the original fix, but with a blast radius that's difficult to review and risky to merge.
Long-horizon task coherence remains fragile. Agents handle well-bounded tasks cleanly but lose track of state and intent over multi-hour sessions involving many files and tool calls. They may contradict earlier decisions, re-introduce code they just deleted, or forget a constraint that was stated at the start. The practical mitigation is shorter task scopes and explicit checkpointing — treating agents like junior developers who need clear sub-tasks rather than open-ended projects.
Security and correctness in unfamiliar code are genuine concerns. Agents don't know what they don't know. Introduced to a codebase with unusual conventions, implicit invariants, or security-sensitive code paths, they'll apply general best practices that may be wrong in context. A function that looks like a safe string concatenation to a model might be building a query that requires parameterisation. Agents are not currently equipped to flag their own uncertainty in these cases with the reliability that matters for production code.
THE HUMAN-IN-THE-LOOP QUESTION
The practical question for engineering teams isn't "can we use AI coding agents?" — it's "at what points in the workflow does human review provide irreplaceable value?" For most teams in 2026, the answer is: code review of diffs, architectural decisions, and anything touching security, data handling, or external APIs. The agent generates the candidate solution; the engineer evaluates whether it's actually correct and appropriate.
The risk of removing human review isn't the dramatic science-fiction scenario where the agent deploys malicious code autonomously. It's the mundane scenario where a technically correct patch that passes CI introduces a subtle regression, adds a latent security issue, or solves the stated problem while making the codebase harder to maintain. These are exactly the things a competent reviewer catches — and exactly the things an agent optimising for test passage will miss.
Some teams are experimenting with fully autonomous loops: agent writes code, agent reviews its own output, agent opens the PR, human reviews only before merge. This works for small, well-defined tasks in codebases with strong test coverage. It breaks down for anything requiring judgment about what the code should do rather than whether it does what was asked. The line between those two categories is the honest boundary of where current agents operate safely without a human in the loop.
WHAT ENGINEERING TEAMS SHOULD DO TODAY
The teams getting the most leverage from coding agents in 2026 share a common pattern: they use agents for tasks that are well-specified, reversible, and verifiable. New feature modules with clear API contracts, test generation for existing code, dependency upgrades with test suites that catch regressions, documentation generation from source — these are the high-value, low-risk applications. They avoid using agents for tasks that require contextual judgment, touch production data paths, or involve code where the correctness criterion isn't expressed as a test.
Invest in making your codebase agent-legible. This means descriptive file and function names, clear module boundaries, comprehensive test coverage, and documented conventions. Agents navigate codebases the way a new developer does — they rely heavily on naming, comments, and test structure to understand intent. A codebase that's easy for a new engineer to understand is a codebase that coding agents work well in. The refactoring pays dividends beyond the agent use case.
Treat agent output the same way you'd treat a junior developer's first PR: competent on the mechanics, requiring review on the judgment. Set up your workflows so agent-generated code goes through the same review gates as human-written code. Don't approve PRs faster just because an AI wrote them — the review process exists to catch exactly the kinds of errors that confident but inexperienced contributors make, and the current generation of coding agents is nothing if not confident.