Closing the loop on iOS

Every iOS engineer has shipped a feature with a green test suite that turned out to be broken at runtime. It is not a failure of the tests. Unit tests check the unit. They cannot check what the user sees in the running app. The textbook answer to that gap is integration or end-to-end tests. The honest answer about iOS specifically is that they are expensive to write and maintain, which is why most projects (mine included) have less coverage there than they should.

When I’m at the keyboard, this is mostly fine. I run the app, look at the simulator, notice the empty screen, fix the subscription, move on. The thirty-second sanity check fills the gap the unit suite cannot cover. When I started letting an agent ship code, that thirty-second sanity check disappeared. The agent runs the build, runs the tests, reads a fully green suite, commits, moves on. Whatever was broken stays broken until I open the worktree to review and run the app myself.

The pattern showed up early and kept showing up. The agent would finish a ticket, declare the work done, point at a fully green suite as proof the feature was working. I would open the simulator and the new screen would be empty. A view model was not subscribing to the right publisher. Every unit test passed because every unit was correct in isolation. The bug lived in the wiring between them, which is exactly where unit tests are not designed to reach.

So I started thinking about what would actually close that gap. The goal was to make the agent verify its own work the way I would in the thirty-second sanity check: open the app, navigate to the screen it just changed, drive it like a user would, and check that the thing renders, responds, and behaves the way the requirements said it should. End-to-end testing, run by the agent, against the running app.

This post is what I built to make that possible. The reason it needs a whole post and not a footnote is that on the web most of it is free. A backend agent has containers, a headless browser, curl, structured logs, and ten cheap ways to poke a service and read what came back. iOS hands you almost none of that out of the box. The platform makes you build it yourself. Most of what follows is how I built it, why each piece is shaped the way it is, and what the agent could only do once it was wired up.

Pair-programming and delegation are different jobs

There are two modes of agent work, and they need different things from the platform underneath.

The first mode is the one Apple shipped in Xcode 26.3. The agent is a pair-programmer. You stay at the keyboard, you watch what it does, you accept or reject each step, you intervene the moment something looks wrong. The agent’s feedback loop is short because you are the feedback loop. If the screen is blank, you see it and say so.

The second mode is delegation. You hand the agent a plan and walk away. It implements the work, exercises it, commits, and stops. You come back later to review what shipped. The agent does not have you sitting next to it in real time. Whatever a human would normally notice mid-implementation, it has to notice itself.

Most of what I built is for the second mode, and only for one slice of it. Planning is still mine because that is where I want to own the architecture and the API decisions. Review is still mine because that is where I want to actually understand what shipped before it merges. The agent owns the middle: take a plan, write the code, exercise it, stop at a commit. One slice in the middle running unattended while the bookends stay mine.

The reason that one unattended slice needs infrastructure that pair-programming mode does not is the same reason this post exists. When I’m at the keyboard, the simulator is right there and so am I. When the agent is implementing alone, the simulator might as well not exist unless I built it something it can drive. Every gap a human would fill with attention has to be filled with a tool, a script, or a piece of test infrastructure the agent can use without me. The rest of this post is the gaps I had to fill, and how.

Layer one: Playbook scenarios for visual fidelity

The first gap is that the agent cannot see. It writes a SwiftUI view, runs the test that asserts the view exists, and has no way to know whether the thing on screen looks anything like the Figma. A view that compiles, renders without crashing, and matches its snapshot can still be visually wrong in five different ways the test does not check.

The community already solved this for humans. Playbook¹, the Swift library inspired by Storybook, lets you register scenarios for individual views and browse them in a separate app target. You point it at a view, you give it the data it needs, and Playbook gives you back a screen-isolated render with no app boot, no navigation, no real backend in the loop. This is what designers and engineers have been using for years to iterate on a single component without launching the whole app around it.

What turned this into agent infrastructure was treating scenarios as part of every ticket. Every screen the agent touches gets a scenario registered, with the data state the requirements describe. A scenario for the error state shows the error state. A scenario for the loaded-with-data state shows that. Adding the scenario is a normal part of the implementation phase, not a one-time setup task somewhere else.

The shape of a scenario looks something like this.

import SwiftUI
import Dependencies

enum ScenarioKey: String, CaseIterable {
    case homeEmpty
    case homeWithError
    case homeLoaded
}

@MainActor
struct AppScenarios {
    static func view(for key: ScenarioKey) -> AnyView {
        switch key {
        case .homeEmpty:
            // the .empty and .loaded mock variants below are illustrative
            return AnyView(
                withDependencies {
                    $0.feedClient = .mock(.empty)
                } operation: {
                    HomeView()
                }
            )
        case .homeWithError:
            return AnyView(
                withDependencies {
                    $0.feedClient = .mock(.networkError)
                } operation: {
                    HomeView()
                }
            )
        case .homeLoaded:
            return AnyView(
                withDependencies {
                    $0.feedClient = .mock(.loaded)
                } operation: {
                    HomeView()
                }
            )
        }
    }
}

The mocks live behind the same dependency boundary the real app uses, so the view inside the scenario is the real view rendered with controlled data, not a test stub. The agent can read the file, see how an existing case is wired up, and add a new one in the same shape.
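
To make the .mock(.networkError) part concrete: the mock is just another value of the same client type the live app injects, built from fixtures. A minimal sketch of that shape, with placeholder names (FeedItem, FeedMockBehavior, the specific error), not the exact types from my project:

import Foundation

// Illustrative sketch only: FeedItem, FeedMockBehavior, and the error
// choice are placeholders, not the shapes from my project.
struct FeedItem: Equatable {
    let title: String
    static let fixtures = [FeedItem(title: "First post"), FeedItem(title: "Second post")]
}

struct FeedClient {
    var fetch: @Sendable () async throws -> [FeedItem]
}

enum FeedMockBehavior { case empty, loaded, networkError }

extension FeedClient {
    static let live = FeedClient(fetch: { /* real networking lives here */ [] })

    static func mock(_ behavior: FeedMockBehavior) -> FeedClient {
        FeedClient(fetch: {
            switch behavior {
            case .empty:        return []
            case .loaded:       return FeedItem.fixtures
            case .networkError: throw URLError(.notConnectedToInternet)
            }
        })
    }
}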

Now the agent’s loop has eyes. Build the Playbook target, install it on the simulator, launch it with the scenario name passed in as an environment variable, wait for the screen to render, snapshot the UI, compare against the Figma export the planner already attached to the plan. If the snapshot is way off, the agent has something concrete to go fix. If it matches, that is one verified piece of the feature.
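
The launch side is a small amount of glue in the scenario target's entry point. A sketch, assuming the scenario name arrives in an environment variable called SCENARIO and that the target renders the requested scenario full screen (the variable and the ScenarioApp name are illustrative):

import SwiftUI

@main
struct ScenarioApp: App {
    var body: some Scene {
        WindowGroup {
            // SCENARIO is an assumed variable name; simctl forwards caller
            // environment variables prefixed with SIMCTL_CHILD_ to the app.
            if let raw = ProcessInfo.processInfo.environment["SCENARIO"],
               let key = ScenarioKey(rawValue: raw) {
                AppScenarios.view(for: key)
            } else {
                Text("Pass SCENARIO=<key> to render a scenario")
            }
        }
    }
}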

A note on what this is not. Playbook scenarios are not a substitute for SwiftUI Previews and they are not a substitute for snapshot tests. Previews are where I work on the view in Xcode while I am building it. Xcode 26.3 added a way for the in-Xcode Claude integration to capture and iterate on Previews, which is great for that mode. They still do not run outside Xcode, though, which means an agent driving a simulator from a terminal cannot reach them at all. Snapshot tests are pixel-level regression checks. Playbook scenarios live in a real app target the agent can build, install, launch with arbitrary inputs, drive, and screenshot without Xcode in the loop, which is what delegation mode needs. Prefire² is adjacent work that auto-generates snapshot tests and a playbook view from #Preview blocks, and if you live mostly inside Previews it is probably the easier on-ramp.

What this does not catch is anything that requires navigation, anything that depends on the real backend’s behavior, and anything where the bug lives in the wiring between screens. For all of that I needed a second layer.

Layer two: launching the real app directly into a controlled state

Playbook gets the agent into one screen at a time. The next problem is harder. Many bugs only show up once the app is actually running with its real navigation, its real state machines, and its real coordinators between screens. The agent needs a way to launch the real app, skip the normal boot flow, and land directly in the state the requirements describe. Without it, the agent has to drive through onboarding, login, and a series of taps every time it wants to verify a screen. That is slow, eats tokens on screen-by-screen interactions that have nothing to do with what is being tested, and creates a failure mode where the agent burns its budget trying to reach the right screen and gives up before it gets there.

The piece that does this is a single Swift file in the app target, only compiled in debug builds, that reads command-line arguments and environment variables at launch and rewires the app’s dependencies before the first view appears. I called mine TestLaunchConfig. The convention for invoking it looks like this.

xcrun simctl launch booted com.example.staging \
  --test-destination home \
  --test-mock feedClient:networkError

--test-destination home tells the app to skip onboarding and the splash and present the home screen as if the user were a returning, signed-in user. --test-mock feedClient:networkError swaps the live feed client for a mock that returns a network error. The app boots with that combination. The home screen renders through the real production code path, but with a feed client that fails on first call, so the agent sees the genuine error state the user would see if their connection dropped at the worst moment.

The shape of the file is small. It parses the arguments and environment, applies the overrides via withDependencies, and constructs the app inside that scope.

import Dependencies
import SwiftUI

#if DEBUG
enum TestLaunchConfig {
    // Called from the app initializer before the first view appears
    // (the function name here is illustrative). With no test arguments
    // the parsers fall back to the normal boot path.
    @MainActor
    static func makeRootView() -> some View {
        let destination = parseDestination(
            args: CommandLine.arguments,
            env: ProcessInfo.processInfo.environment
        )
        let mocks = parseMocks(
            args: CommandLine.arguments,
            env: ProcessInfo.processInfo.environment
        )

        return withDependencies { deps in
            for mock in mocks { mock.apply(to: &deps) }
        } operation: {
            AppView(initialDestination: destination)
        }
    }
}
#endif

In production the file is not compiled in. In debug it sits behind #if DEBUG and the app initializer calls into it. If no test arguments are passed, the app boots normally. If they are passed, the dependency stack and the initial destination get overridden before anything else runs.
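
For the --test-mock feedClient:networkError convention, the parsing side can be as small as a lookup table. This is a sketch, with MockOverride and the matching logic standing in for whatever the real file does; the only requirement is that a spec string becomes a closure that mutates the dependency values. parseDestination works the same way for --test-destination.

import Dependencies

// Illustrative sketch of the --test-mock side only; names are placeholders.
struct MockOverride {
    let transform: (inout DependencyValues) -> Void
    func apply(to values: inout DependencyValues) { transform(&values) }
}

func parseMocks(args: [String], env: [String: String]) -> [MockOverride] {
    // Every value that follows a --test-mock flag, e.g. "feedClient:networkError".
    let specs = zip(args, args.dropFirst())
        .filter { $0.0 == "--test-mock" }
        .map { $0.1 }

    return specs.compactMap { spec -> MockOverride? in
        let parts = spec.split(separator: ":").map(String.init)
        guard parts.count == 2 else { return nil }
        switch (parts[0], parts[1]) {
        case ("feedClient", "networkError"):
            return MockOverride { $0.feedClient = .mock(.networkError) }
        // further dependency:behavior pairs are matched the same way
        default:
            return nil
        }
    }
}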

What this enables is the part that does not exist on iOS out of the box. The agent can launch the real app directly into the state the requirements describe and drive it the way a user would. The piece that makes the driving possible is XcodeBuildMCP with its UI automation tools turned on. It exposes tap, swipe, type-text, screenshot, and log capture as tools the agent can call directly from its terminal session. The agent taps a button, takes a screenshot of the result, captures the simulator’s logs while interacting, and checks that the observed behavior matches the expectations. End-to-end testing, run by the agent, against the running app.

This is fundamentally different from XCUITest. XCUITest lives in its own test target, runs in a separate process, and is designed for a test runner. TestLaunchConfig lets the agent skip the test harness entirely. It is not a test running against the app. It is the app, running with mocks plugged in at the production boundary, driven directly through the simulator interface the agent already uses for everything else. The agent can also choose not to mock when the scenario calls for the real backend, for example when verifying an auth expiration flow or anything that depends on real server timing. Mocks are the default, the real backend is one decision away.

The cost of this is that you need a clean dependency boundary in the app already. Most modern iOS projects have one through swift-dependencies, the Factory library, or a hand-rolled DI container. If you do not, this is the largest single investment in this whole post and the one that pays back the most. Without it, every behavioral test starts from the splash screen and is fragile, because half of what it touches is real network and state the agent cannot control.
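
With swift-dependencies, the boundary for the feed client from the earlier examples is the standard key plus a DependencyValues accessor; FeedLoader below is an illustrative call site, not code from my app:

import Dependencies

// The key/value pair that makes $0.feedClient resolvable; this is the
// standard swift-dependencies registration pattern.
private enum FeedClientKey: DependencyKey {
    static let liveValue = FeedClient.live
}

extension DependencyValues {
    var feedClient: FeedClient {
        get { self[FeedClientKey.self] }
        set { self[FeedClientKey.self] = newValue }
    }
}

// Production call sites read it through the property wrapper, so a
// launch-time override never touches this code.
struct FeedLoader {
    @Dependency(\.feedClient) private var feedClient

    func load() async throws -> [FeedItem] {
        try await feedClient.fetch()
    }
}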

That is the second layer. Playbook gives the agent isolated views to look at. TestLaunchConfig gives it the running app with controlled state to drive. Together they cover the visual and behavioral gaps that unit tests cannot reach. What is left is making the loop around them fast and legible enough that the agent actually uses them inside an unattended run. Slow builds, noisy xcodebuild output, and parallel worktrees stomping on each other’s caches will eat the agent’s context budget alive without the right tooling around them. That is the next section.

What the agent reads to know what is happening

A delegated agent only knows what its tools tell it. It cannot peek at the simulator the way I can. It cannot scroll through Xcode’s build navigator. It reads strings: build output, runtime logs, crash reports, test results. Every one of those channels on iOS is, by default, hostile to an agent. They are verbose, repetitive, designed for human eyes scrolling through Xcode’s interface. Most of the work in this section is making them legible.

Start with build output. A typical xcodebuild invocation produces hundreds of lines of output for a successful incremental build and thousands for a clean build. Most of it is “compiling Foo.swift”, “compiling Bar.swift”, phase progress, deprecation noise that has been there for years. The signal is in the few lines where errors and warnings actually live. If the agent reads the raw output, two things go wrong: the compile error gets buried under five hundred lines of irrelevant compilation progress, and the context budget gets eaten by the noise. I had a script called xc-errors for this before xcsift³ existed, doing essentially the same thing in less polished form. Łukasz Domaradzki shipped xcsift as a Swift command-line tool that turns xcodebuild output into structured JSON, surfacing only errors, warnings, linker failures, and test results. If you are starting fresh today, use his. The token reduction is real and the structured output is easier for an agent to reason about than raw text.

Runtime logs are the next channel. Build output tells the agent if compilation worked. Runtime logs tell it whether the running app is doing what it is supposed to. XcodeBuildMCP exposes log capture as start_sim_log_cap and stop_sim_log_cap, and the agent uses these around every behavioral interaction: tap a button, capture the logs that came out, look for the markers the requirements care about. What makes this useful is os.Logger instrumentation already in production code at the right places. A view model that logs feed.loaded when its data arrives gives the agent a runtime signal that the data arrived. Without that line, the agent has no way to know whether the loading state finished or got stuck. Most production iOS code has some of this, but the gaps where it does not exist are exactly where the agent will struggle to verify behavior. Investing in production logging on the paths that matter is a multiplier for the whole loop.
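
The instrumentation itself is plain os.Logger. A sketch of the kind of marker I mean, with an illustrative subsystem string; what matters is that the message is a stable token the agent can grep for in the captured logs:

import os

// Subsystem/category are illustrative; the messages are the stable
// markers the agent greps for in captured simulator logs.
private let feedLog = Logger(subsystem: "com.example.staging", category: "feed")

func loadFeed(using client: FeedClient) async -> [FeedItem] {
    do {
        let items = try await client.fetch()
        feedLog.info("feed.loaded count=\(items.count, privacy: .public)")
        return items
    } catch {
        feedLog.error("feed.failed \(String(describing: error), privacy: .public)")
        return []
    }
}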

Production logging covers the planned paths. When the agent is implementing a feature and hits something unexpected, it sometimes needs more visibility than the existing instrumentation gives. The skill for that is to add temporary Logger.error calls in the path under investigation, run the verification, read the captured logs, and remove the new logs before committing. The plan-executor’s references document this as a defined sequence: add, capture, read, remove. One constraint matters. During the dedicated behavioral verification phase at the end of implementation, the agent cannot add new debug logs. It can only read what is already there. That keeps the verification phase a real test of the implementation rather than a nudged version where the agent instruments what it wants verified. The full reasoning sits in the next section.

When the app crashes during a behavioral check, log capture goes silent and the agent has to recover from a different signal entirely. The simulator writes a crash report as an .ips file to ~/Library/Logs/DiagnosticReports/. The file is JSON, with a one-line header followed by the body, and the fields the agent cares about are predictable: exception.type, exception.signal, and either lastExceptionBacktrace or the faulting thread’s frames in threads. The verification skill includes a short Python recipe for parsing the file and extracting those fields. The most common cause of these crashes for me is initialization code that runs before the first log statement gets a chance to fire, which is exactly the kind of failure log capture cannot show because the app is dead before the logs would help.
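
The recipe in the skill is Python; here is the same extraction sketched in Swift to stay consistent with the rest of the post. The function name is illustrative, and the fields are the ones named above:

import Foundation

// An .ips report is a one-line JSON header followed by a JSON body;
// skip the header line and read the fields out of the body.
func summarizeCrash(at path: String) throws -> String {
    let raw = try String(contentsOfFile: path, encoding: .utf8)
    guard let headerEnd = raw.firstIndex(of: "\n") else { return "not an .ips report" }
    let bodyData = Data(raw[raw.index(after: headerEnd)...].utf8)
    let body = try JSONSerialization.jsonObject(with: bodyData) as? [String: Any] ?? [:]

    let exception = body["exception"] as? [String: Any]
    let type = exception?["type"] as? String ?? "unknown"
    let signal = exception?["signal"] as? String ?? "unknown"
    let faultingThread = body["faultingThread"] as? Int ?? 0
    return "exception.type=\(type) exception.signal=\(signal) faultingThread=\(faultingThread)"
}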

All of this lives in a single project-level reference, docs/IOS_BUILD_TEST_COMMANDS.md. Build commands, runtime patterns, the difference between “build for tests” and “build to actually launch the app” (the second one needs CODE_SIGNING_ALLOWED left at its default, otherwise the app crashes silently from the agent’s perspective), the log capture flow, the crash report recipe, how to build one module instead of the whole project, how to run one test target instead of the whole suite, and the flags that make incremental builds faster.

We have all been there. You tell the agent to implement something, walk away for fifteen minutes, come back to find it still hammering on a compile error, a code signing issue, or a stale build cache that is hiding the real failure. It is not fireworks, but writing this doc is one of the cheapest, highest-impact things you can do for an agent on iOS. The recipe is simple. Tell the agent to try running the app and the tests, watch what it stumbles on, ask it what would have saved time if it had been documented, and save its answers into the file. Repeat once a week for a month. The doc grows by the kinds of friction your specific project actually produces, which is different from anyone else’s project. The kinds of things that end up in there: the build-for-testing plus test-without-building split that lets you rerun tests without recompiling everything. The flags that skip index store and macro validation when you do not need them. Lint and format only the files you changed, not the whole project. None of those are exotic. They are the kind of trick that lives in someone’s dotfiles, that nobody bothered to write down for the agent.

In my case the project is iOS plus KMP, which adds a second tier of failure modes. The Gradle cache going out of sync. The shared framework built for the production app when the agent wanted to run staging. The build succeeding while the wrong artifact ends up in the simulator. Every one of those was a thirty-minute detour the first time it happened and a one-line entry in the doc the second time. The specific entries are not the point. The point is that the agent does not get to discover the same gotcha twice.

The first post in this series argued that developer experience investments feed agent experience one to one. This is the most concrete example of that. The doc is for the agent now.

Behavioral verification: the methodology that ties it together

The previous sections gave the agent eyes, hands, and a way to read what is happening. This section is about the methodology that uses all of it. The hardest problem in delegated agent work is not getting the agent to verify its own code. It is getting the agent to verify its own code without quietly nudging the verification to match what it just wrote.

When the same agent designs the test, writes the code, and runs the test, it has implicit knowledge of the implementation in its context. That knowledge leaks. Test design drifts toward what the code does instead of what the requirements asked for. The agent does not do this maliciously. It does it the same way humans do, except humans usually catch themselves and pull back. Agents do not, and they will quietly produce a verification that confirms whatever they just built. The verification passes. The agent moves on. The bug ships.

The shape I landed on is three stages, each running as a separate agent in its own context, each with a different set of tools available to it. The discipline comes from what each stage cannot see, not from telling each stage to be unbiased. Tool boundaries are the discipline. Prose is not.

That means three different tool lists, one per stage:

Stage A (scenarios):  Linear MCP, WebFetch, Write
Stage B (annotation): Read, Grep, Glob, Write
Stage C (execution):  XcodeBuildMCP, Bash, Read, Edit, Write

Stage A designs scenarios. Its inputs are the Linear ticket, the Figma references, the Requirements and Visual Behavior sections of the plan. Its tool surface includes Linear MCP, web fetch, and file write. It cannot read source code. It cannot read tests. It cannot grep the repo. It produces a scenarios document: a numbered list of behaviors the implementation must exhibit, with the criteria for each one stated in user-observable terms, not implementation terms. “When the user opens the home screen offline, an offline message replaces the feed” is allowed. “When feedClient.fetch() throws .networkError and the view model maps it to .offlineState” is not.

Stage B annotates. It reads the frozen Stage A scenarios and adds the mechanical details Stage C will need: which TestLaunchConfig destination to use, which mock to plug in, which UI element to tap, what coordinate or accessibility identifier corresponds to that element, which log line to look for. Stage B reads code now. That is what it is for. The hard rule that holds this together is that Stage B is not allowed to modify, soften, or remove any scenario from Stage A. If Stage B disagrees with a scenario, or finds the requirements unclear, it records that disagreement in a separate Blockers section. The scenario itself stays exactly as Stage A wrote it. If the scenario is wrong, Stage C will fail it, and the failure will surface a real disagreement that I (or the agent in a later iteration) have to deal with.

Stage C executes. It runs in a fresh context, with no memory of the implementation phase and no memory of the scenario design, just the verify-plan from Stage B and the simulator. Its tool surface is XcodeBuildMCP and the file system. It cannot spawn other agents. It cannot edit source code, so there is no path to “tweak the implementation a tiny bit so the scenario passes.” It cannot add new debug logs (the constraint from the previous section), so it cannot move the goalposts of what counts as success. It boots the app with the launch arguments Stage B prescribed, drives the simulator, captures logs, takes screenshots, compares against the expected outcomes, and writes a per-scenario pass or fail.

When Stage C reports failures, the executor reads the verify-plan and treats those failures as a new set of requirements to satisfy. It goes back to the implementation phase, fixes what it can, and re-runs the full verification pipeline. The cycle is bounded. After a fixed number of rounds without convergence, the executor stops, writes a summary of the unresolved failures to the session file, and surfaces them for me to look at when I review the worktree. I would rather have it stop and hand me the problem than burn tokens spinning on something it cannot solve. The point of the budget is to make the agent honest about its own limits, not to extend its runtime indefinitely.

The reason this works is the same reason prose discipline does not. Telling a language model to be unbiased is the same as telling it to be a good verifier. It will agree. It will then produce biased output anyway, because under any kind of pressure it falls back to whatever produces the most coherent next token, and the most coherent next token is the one that matches what it spent the last hour writing. Take away its ability to read what it just wrote and the bias has nowhere to land. It can only verify against the requirements, because that is the only context the tools give it.

The immutability rule between Stage A and Stage B is the part that took me the longest to land on. The temptation, when Stage B reads the code, is to rationalize a scenario: the scenario asks for X but the actual data flow is Y, so let me update the scenario to reflect what the code does. Allow that even once and the whole methodology collapses into the single-agent self-confirmation it was supposed to prevent. The hard rule that scenarios are frozen at Stage A is what keeps the verification honest, even at the cost of occasionally running into a Stage A scenario that is genuinely wrong and produces a false-positive failure. I would rather deal with the occasional false positive than with a system that has quietly stopped catching real bugs.

This is the methodology. Not a tool, not a script, a discipline enforced by tool boundaries instead of prose. It is the piece that made delegation viable. Without it, every fully-green run carried the same nagging question of whether the agent had quietly rewritten the requirements to match what it built. With it, the run either passes a verification it could not influence (sometimes on the first try, sometimes after a few iterations), or the agent stops and hands me the unresolved failures with the evidence attached. Either way, no commit goes to the branch carrying a verification the agent was free to bend.

What it does not catch, and how I keep it running

Nothing I have described catches everything. Some of the gaps are obvious in retrospect. Others surfaced only after running the skills on real tickets and watching what the agent’s loop kept missing during my review.

Simulator flakiness is the most consistent one. The simulator is a heavy, stateful machine, and sometimes it does not respond to a tap, sometimes a launch hangs for thirty seconds and then comes back, sometimes the accessibility tree disappears and the agent cannot find the button it knows is there. The agent has retry logic for the obvious cases (relaunch the simulator, rebuild the app, try again), but flakiness eats time and tokens, and there is no clean fix. I budget for it.

The second gap is anything that depends on timing pressure. Stage C drives the simulator at human speed. It taps a button, waits for the response, captures the result. Bugs that need rapid double-taps to surface, race conditions between view appearance and a network response, anything where the order of events matters at sub-second granularity, the agent will miss. For those I still need to put real eyes on the app, usually during code review.

The third gap is whatever the scenarios did not cover. If the requirements were written for light mode and the agent verified light mode, dark mode regressions slip through. If the scenarios test the happy path and one error path, the second error path nobody thought about will surface in production. The verification is exactly as good as the scenarios going into Stage A. Better requirements still matter, and that part stays mine.

The fourth gap is the gap between simulator and device. The agent runs everything on the simulator because that is what it can drive. Some bugs only surface on a physical device under real conditions: thermal throttling, real battery drain, weak network, real haptics. None of this replaces the testing I still do myself, especially on projects without a dedicated QA team where I am the last line. The agent is another pair of eyes and hands on what can be verified in the simulator. The on-device pass, the exploratory check before merging, the smoke test on real hardware before something goes to TestFlight, those stay mine. The agent does not skip them, and neither do I.

Those are the gaps in the verification itself. There is one more thing worth mentioning here, less a gap and more an operational requirement. Running multiple agents in parallel only works if each one has its own worktree, its own simulator, and its own DerivedData. Without that, two agents share the build cache, race each other for the simulator, and produce a kind of corrupted soup that is harder to debug than either run on its own. The plan-executor handles the worktree creation as part of phase one, writes the simulator UUID into a per-worktree file so the verification phases pick the right one, and uses Claude Code’s lifecycle hooks to clean up when a session ends. The setup is not glamorous, but it is what lets me have an agent working in one branch while I am working in another, without either side noticing.

Closing

What I built was specific to my project in places (the SPM module structure, the Figma tooling I happen to have), but the shape is portable. Two layers, Playbook and TestLaunchConfig, for visibility into the running app. A handful of tools and docs to make the signals the agent reads legible. A three-stage methodology where the discipline lives in tool boundaries, not in prose. Together, those are what made delegation viable on iOS for me. Before I had them, I could not walk away from a ticket. After, I could.

If you want to take one piece of this and try it on your own project, the cheapest single investment is TestLaunchConfig. A clean dependency boundary, a debug-only file that swaps mocks at launch, the simctl invocation to land directly in the state you want. That alone changes what the agent can verify on its own. The other layers fit on top.

Try something. Tell me what breaks.


Footnotes

  1. Playbook is a Swift library from DeNA inspired by Storybook — https://github.com/playbook-ui/playbook-ios.

  2. Prefire generates snapshot tests and a Playbook view from #Preview blocks, adjacent to what I built but solving a different problem — https://github.com/BarredEwe/Prefire.

  3. xcsift parses xcodebuild output into structured JSON for agents — https://github.com/ldomaradzki/xcsift.