Why We Stopped Asking AI to Generate Patches and Put It Inside Docker Instead

A lot of AI coding automation starts with the same pattern:

Collect repository context, build a prompt, call a model API, parse the response, apply the returned patch.

That works for small changes.

If the task is “rename this function,” “add a simple test,” or “update this config,” a direct model call can be enough. The model does not need much context, and the risk is limited.

But once the task needs real codebase understanding, that approach starts to fall apart.

The model only knows what you send it. If you miss a helper, fixture, interface, config file, or existing test pattern, the output can be wrong in subtle ways. If you send too much, the prompt becomes noisy and the model starts spending attention on irrelevant files.

We were asking the model to act like a developer, but giving it a flattened, incomplete snapshot of the repo.

So we changed the setup.

Instead of asking a model outside the repository to generate a patch, we started running Claude Code inside a Docker container.

That turned out to be a much better fit.

The problem with API-style code generation

Our old workflow looked clean:

Pick relevant files.
Build a prompt.
Ask the model for code.
Parse the response.
Apply the patch.
Run checks.

The weak point was context selection.

For real code changes, “relevant context” is hard to know upfront. A file may depend on a helper two directories away. Tests may use project-specific fixtures. A convention may only be obvious after reading several similar files.

If we did not include that context, the model guessed.

Sometimes it guessed well. Often it produced code that looked plausible but did not fit the repository.

It might use the wrong mock style.

It might put tests in the wrong folder.

It might invent a pattern the project did not use.

It might ignore a framework convention that was obvious only from nearby files.

Patch application also created problems.

The model might return a malformed diff, stale line numbers, incomplete file contents, or prose mixed with code blocks. Even when the idea was good, turning the response into real file edits was fragile.

That felt backwards. The model was capable of reasoning about code, but we were forcing it to communicate through a brittle patch format.
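To make the fragility concrete, here is a minimal sketch of the kind of response parsing an API-style workflow depends on. The function name `extract_diff` and the convention that the model returns one fenced block tagged `diff` or `patch` are assumptions for illustration, not the parser we actually ran.

```python
import re

# Built this way so the example can itself sit inside a fenced block.
FENCE = "`" * 3

# Matches a fenced block tagged as diff or patch (a hypothetical response convention).
_DIFF_BLOCK = re.compile(FENCE + r"(?:diff|patch)\n(.*?)" + FENCE, re.DOTALL)


def extract_diff(response: str) -> str:
    """Pull a single unified diff out of a free-form model response.

    Raises ValueError when there is no diff block, when there are several
    competing blocks, or when the block lacks file or hunk headers.
    """
    blocks = _DIFF_BLOCK.findall(response)
    if not blocks:
        raise ValueError("no fenced diff block found in response")
    if len(blocks) > 1:
        raise ValueError("multiple diff blocks; cannot tell which one to apply")

    diff = blocks[0]
    has_old = re.search(r"^--- ", diff, re.MULTILINE)
    has_new = re.search(r"^\+\+\+ ", diff, re.MULTILINE)
    if not (has_old and has_new):
        raise ValueError("diff block has no file headers")
    if not re.search(r"^@@ .* @@", diff, re.MULTILINE):
        raise ValueError("diff block has no hunk headers")
    return diff
```

Even a response that passes every check here can still fail to apply, because nothing in the text proves the hunk line numbers match the current file. That only shows up later, for example when `git apply --check` rejects the patch.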

What changed with Claude Code in Docker

The new workflow is simpler:

Claude Code gets a disposable workspace inside a Docker container.

It can inspect the repository, read files, follow patterns, edit code directly, and run lightweight checks when available.

At the end, we collect the actual git diff.

That difference matters.

Instead of asking, “Did the model format the patch correctly?” we ask, “What changed in the working tree?”
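Collecting "what changed in the working tree" can be very small. The sketch below assumes the agent edits files but does not create commits, so staging everything and diffing the index against the starting commit captures the whole change. The helper names `run_git` and `collect_diff` are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str], str], str]


def run_git(args: Sequence[str], cwd: str) -> str:
    """Run a git command in cwd and return its stdout, raising on failure."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def collect_diff(workdir: str, run: Runner = run_git) -> str:
    """Return everything changed in workdir relative to the checked-out commit.

    Staging first means new and deleted files appear in the diff too,
    not only modifications to files that were already tracked.
    """
    run(["add", "--all"], workdir)
    return run(["diff", "--cached", "--binary"], workdir)
```

The `run` parameter exists only so the function can be exercised without a real repository. In a real job it would be left at its default.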

That is much closer to how normal development works.

A developer does not receive five pasted files and return a diff in a chat message. A developer opens the repo, searches, reads related files, makes edits, runs checks, and submits a diff.

Claude Code inside Docker gives the agent a version of that workflow.

Docker gives us the boundary around it.

Each run starts from a clean image and a clean checkout. Claude Code can work inside the container, but it does not directly modify the host repository. When the run finishes, we inspect the diff, collect logs, validate the output, and throw the workspace away.
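One way to wire up a run like this is sketched below. It assumes the orchestrator has already created a scratch checkout for the job and bind-mounts that directory into the container. The image name, the `run-agent` entry point, and the memory limit are placeholders, not the real configuration.

```python
from typing import Sequence


def build_docker_command(
    image: str,
    workspace: str,
    agent_cmd: Sequence[str],
    env_names: Sequence[str] = (),
    memory: str = "4g",
) -> list[str]:
    """Assemble a `docker run` invocation for one disposable agent job.

    `workspace` is a scratch checkout created for this run, not a
    developer's working copy, so the host repository is never touched.
    """
    cmd = [
        "docker", "run",
        "--rm",                                  # remove the container when the run ends
        "--memory", memory,                      # cap memory per job
        "--volume", f"{workspace}:/workspace",   # scratch checkout only
        "--workdir", "/workspace",
    ]
    for name in env_names:
        # `--env NAME` without a value forwards the variable from the
        # caller's environment, so the value is not written into the command.
        cmd += ["--env", name]
    return [*cmd, image, *agent_cmd]


command = build_docker_command(
    image="agent-image:1.0",                     # hypothetical image tag
    workspace="/tmp/run-42",                     # hypothetical scratch checkout
    agent_cmd=["run-agent", "--task", "task.json"],  # hypothetical entry point
    env_names=["AGENT_API_KEY"],
)
```

The resulting list can be passed straight to `subprocess.run`. After the container exits, the orchestrator reads the diff from the scratch checkout and deletes the directory.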

The agent gets freedom to work.

The system keeps control.

Why Docker helped

Docker gave us three useful things: isolation, repeatability, and auditability.

We could bake the image with the tools we wanted:

Claude Code CLI
Git
language runtimes
package managers
Python and Node dependencies
repository analysis helpers
pinned tool versions

That made CI behavior more predictable. Instead of depending on whatever happened to be installed on a runner, the agent started from the same environment every time.
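Pinning only helps if drift is noticed. A small startup check inside the container is one way to do that. The sketch below is an assumption about how such a check might look: the tool list and version numbers are made up, and it relies on each tool answering `--version` with a dotted version number somewhere in its output.

```python
import re
import subprocess
from typing import Callable, Mapping

# Hypothetical pins; a real image would list its own tools and versions.
EXPECTED = {"git": "2.43", "node": "20.11", "python3": "3.12"}


def probe_version(tool: str) -> str:
    """Ask a tool for its version string via `<tool> --version`."""
    out = subprocess.run(
        [tool, "--version"], capture_output=True, text=True, check=True
    )
    return out.stdout.strip()


def find_drift(
    expected: Mapping[str, str],
    probe: Callable[[str], str] = probe_version,
) -> dict[str, str]:
    """Return a message for every tool that is missing or off its pin."""
    drift = {}
    for tool, pinned in expected.items():
        try:
            reported = probe(tool)
        except (OSError, subprocess.CalledProcessError) as exc:
            drift[tool] = f"not runnable: {exc}"
            continue
        match = re.search(r"\d+(?:\.\d+)+", reported)
        version = match.group(0) if match else ""
        # Accept the pin itself or any patch release under it.
        if not (version == pinned or version.startswith(pinned + ".")):
            drift[tool] = f"expected {pinned}.x, found {reported!r}"
    return drift
```

Failing the job early when `find_drift(EXPECTED)` is non-empty keeps an accidental image change from showing up later as a confusing agent failure.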

It also made failures safer.

Claude Code could explore and edit inside the container, but the final artifact was just a diff. Our orchestration layer could inspect it, reject unsafe changes, run quality gates, and decide whether to publish it.

That was the important lesson:

Docker did not make the AI trustworthy. Docker made the AI’s work easier to contain, reproduce, and review.

That is the right mental model.

We still had to guide the agent

Giving Claude Code full repository access helped, but it created a new problem: exploration.
The agent could now read anything, which meant it could also spend its time reading the wrong things.

If you drop an agent into a large repo with vague instructions, it may spend too much time wandering. It can inspect irrelevant files, chase unnecessary patterns, or burn time building a map it does not need.

So we gave it an evidence pack before it started.

The evidence pack included:

task summary
changed or relevant files
nearby tests
expected behavior
allowed edit targets
implementation hints
test-design guidance
examples of existing patterns

Claude Code could still inspect more files if needed, but it started with a map.
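The evidence pack maps naturally onto a small data structure. The sketch below mirrors the items listed above. The class name, field names, and plain-text rendering are illustrative choices, not the format we actually used.

```python
from dataclasses import dataclass, field


@dataclass
class EvidencePack:
    """Starting context handed to the agent (field names are illustrative)."""

    task_summary: str
    expected_behavior: str
    relevant_files: list[str] = field(default_factory=list)
    nearby_tests: list[str] = field(default_factory=list)
    allowed_edit_targets: list[str] = field(default_factory=list)
    implementation_hints: list[str] = field(default_factory=list)
    test_design_guidance: list[str] = field(default_factory=list)
    pattern_examples: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the pack as plain text for the agent's starting prompt."""
        sections = [
            ("Task", [self.task_summary]),
            ("Expected behavior", [self.expected_behavior]),
            ("Relevant files", self.relevant_files),
            ("Nearby tests", self.nearby_tests),
            ("Allowed edit targets", self.allowed_edit_targets),
            ("Implementation hints", self.implementation_hints),
            ("Test-design guidance", self.test_design_guidance),
            ("Existing patterns to follow", self.pattern_examples),
        ]
        lines: list[str] = []
        for title, items in sections:
            if not items:
                continue  # leave out sections with nothing to say
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
            lines.append("")
        lines.append("You may read other files if needed; edit only the allowed targets.")
        return "\n".join(lines)


pack = EvidencePack(
    task_summary="Add retry handling to the invoice export job",
    expected_behavior="Transient HTTP 503 responses are retried up to three times",
    relevant_files=["src/billing/export.py"],
    allowed_edit_targets=["src/billing/", "tests/billing/"],
)
prompt_text = pack.render()
```

Keeping the allowed edit targets in the pack is useful twice: the agent sees them as guidance, and the validation layer can later enforce the same list against the diff.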

That balance worked best.

Do not trap the agent inside a fixed prompt.

Do not let it wander blindly either.

Give it enough context to start in the right area, then let it investigate.

What got better

The biggest improvement was repository awareness.

Claude Code could look around. It could read nearby files, inspect existing tests, follow naming conventions, and copy local patterns. That reduced the number of changes that looked technically valid but did not belong in the codebase.

The second improvement was patch quality.

Because Claude Code edited files directly, we saw fewer malformed patches and fewer errors from translating a model response into file edits. The output was a normal working tree diff, not a fragile response format we had to parse.

The third improvement was debugging.

Before, when a result was bad, it was hard to know why. Did we send the wrong context? Did the prompt fail? Did the model misunderstand? Did our patch parser break?

With the container approach, the failure was easier to inspect. We could see which files changed, what commands ran, what checks failed, and what diff was produced.

That did not make debugging effortless, but it made failures more concrete.
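A per-run report is one way to keep those facts together. The sketch below assumes the orchestrator already has the list of changed files (for example from `git diff --cached --name-only`), a record of each command with its exit code, and the final diff. The structure and names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CommandRecord:
    argv: list[str]
    exit_code: int


def build_run_report(
    changed_files: list[str],
    commands: list[CommandRecord],
    diff: str,
) -> str:
    """Summarize one agent run as JSON for later inspection."""
    report = {
        "changed_files": sorted(changed_files),
        "commands": [asdict(c) for c in commands],
        # Commands with a non-zero exit code are the first place to look.
        "failed_commands": [c.argv for c in commands if c.exit_code != 0],
        "diff_line_count": len(diff.splitlines()),
    }
    return json.dumps(report, indent=2)
```

Stored next to the raw logs, a report like this answers the first debugging questions without anyone having to rerun the job.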

What still hurt

This setup is not magic.

The Docker image needs maintenance. CLI versions, runtimes, package managers, and helper tools need to be pinned and updated deliberately.

Image size also becomes a real issue. The more languages and tools you add, the heavier the image gets. A generic image should cover common cases, not every possible runtime.

Jobs can also be slower than direct API calls. Agentic coding involves reading, searching, editing, and sometimes running checks. That takes time.

And Docker does not solve every build problem. Some repositories need special compilers, private registries, licensed tools, old SDKs, or OS-specific dependencies. In those cases, the container can still help generate and statically validate code, but full verification may need repo-specific CI.

Credentials also need discipline. Secrets should never be baked into the image. They should be injected at runtime, scoped tightly, and kept out of logs and subprocesses whenever possible.
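Two small habits cover much of this: pass subprocesses only the variables they need, and scrub known secret values from anything that gets logged. The sketch below shows both. The function names and the variable names in the example are hypothetical.

```python
import os
from typing import Iterable, Mapping, Optional


def scoped_env(
    allowed: Iterable[str],
    source: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Build a minimal environment containing only explicitly allowed names."""
    source = os.environ if source is None else source
    return {name: source[name] for name in allowed if name in source}


def redact(text: str, secrets: Iterable[str]) -> str:
    """Replace known secret values in text before it is logged."""
    for secret in secrets:
        if secret:  # an empty string would match everywhere
            text = text.replace(secret, "[REDACTED]")
    return text


host_env = {
    "PATH": "/usr/bin",
    "AGENT_API_KEY": "sk-test-123",   # made-up value for the example
    "DEPLOY_TOKEN": "tok-999",        # not needed by the check, so not passed on
}
check_env = scoped_env(["PATH", "AGENT_API_KEY"], source=host_env)
safe_line = redact("auth failed for sk-test-123", [host_env["AGENT_API_KEY"]])
```

The dictionary returned by `scoped_env` can be passed as `env=` to `subprocess.run`, so a lint or test command started during the run never sees tokens it has no use for.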

The key design rule

The most important decision was this:

Claude Code writes the patch. It does not decide whether the patch is acceptable.

Our orchestration layer still owns everything around the agent, from setup through policy:

repo checkout
prompt construction
context gathering
diff collection
secret scanning
unsafe file rejection
quality gates
final publishing

Claude Code is the coding worker.

The validation layer is the authority.

That separation matters because an agent with repo access is powerful. It needs boundaries. Docker is one boundary, but it is not enough by itself. You still need diff validation, allowed edit scopes, secret scanning, logs, and CI checks.
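Allowed edit scopes and unsafe file rejection can be enforced on the list of changed paths before anyone looks at the content. The sketch below assumes the paths come from the collected diff (for example via `git diff --cached --name-only`). The blocked patterns are an example policy, not ours.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Sequence

# Hypothetical policy: files the agent should never change, whatever the task.
BLOCKED_PATTERNS = [".github/*", "*.pem", ".env", ".env.*", "Dockerfile"]


def validate_changed_paths(
    changed: Sequence[str],
    allowed_globs: Sequence[str],
    blocked: Sequence[str] = BLOCKED_PATTERNS,
) -> list[str]:
    """Return one violation message per path that breaks the edit policy."""
    violations = []
    for path in changed:
        pure = PurePosixPath(path)
        if path.startswith("/") or ".." in pure.parts:
            violations.append(f"{path}: escapes the repository root")
        elif any(
            fnmatchcase(path, pat) or fnmatchcase(pure.name, pat) for pat in blocked
        ):
            violations.append(f"{path}: matches a blocked pattern")
        elif not any(fnmatchcase(path, pat) for pat in allowed_globs):
            violations.append(f"{path}: outside the allowed edit scope")
    return violations


violations = validate_changed_paths(
    changed=[
        "src/billing/invoice.py",
        "tests/billing/test_invoice.py",
        ".github/workflows/ci.yml",
        "docs/notes.md",
        "../outside.txt",
    ],
    allowed_globs=["src/billing/*", "tests/billing/*"],
)
```

An empty list means the diff stayed in scope. Anything else is a reason to reject the run without publishing. Note that `*` in these patterns also matches `/`, so `src/billing/*` covers nested directories.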

No single layer should be trusted completely.
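As an example of one more layer, here is a minimal sketch of a secret scan that looks only at lines the diff adds. The patterns are illustrative and far from complete. A real pipeline would more likely call a dedicated scanner, and this is not a description of the one we ran.

```python
import re

# Illustrative patterns only; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "AWS access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hard-coded credential": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_added_lines(diff: str) -> list[tuple[int, str]]:
    """Return (diff line number, finding label) for suspicious added lines."""
    findings = []
    for number, line in enumerate(diff.splitlines(), start=1):
        # Only added lines matter; "+++" lines are file headers.
        # (Simplification: an added line whose content starts with "++" is skipped too.)
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


sample_diff = "\n".join([
    "diff --git a/settings.py b/settings.py",
    "--- a/settings.py",
    "+++ b/settings.py",
    "@@ -1,2 +1,3 @@",
    " DEBUG = False",
    "-API_KEY = os.environ['API_KEY']",
    "+API_KEY = 'abcd1234efgh5678'",
    "+TIMEOUT = 30",
])
findings = scan_added_lines(sample_diff)
```

Here `findings` flags the added `API_KEY` line and nothing else. Any finding should block publishing, regardless of how good the rest of the diff looks.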

Final takeaway

Generating patches through a direct model API call works for simple tasks, but it becomes brittle when codebase understanding matters.

The main problem is not that the model cannot write code. The problem is that real code changes require navigation, convention discovery, editing, testing, and review.

Putting Claude Code inside Docker gave it a real workspace while keeping the run isolated and disposable.

The container provided control.

The evidence pack provided direction.

Claude Code produced the diff.

The validation layer decided whether that diff was safe to use.

That was the real win.

Not “AI writes code and we trust it.”

More like:

AI works in a clean checkout, produces a diff, and the system treats that diff like any other untrusted contribution.