Developer Experience is Agent Experience
A few weeks ago I read Jamon Holmgren’s piece on running a night-shift agent against his projects [1]. One line stayed with me: “I’m spending far less time babysitting, much more time thinking about the problems I need to solve, and my productivity soars.” That’s the shape of work I wanted. Not more output. More distance between me and the plumbing.
Jamon is in the React Native and web world. I’m in iOS. And the question I started turning over wasn’t “can an agent implement a Linear ticket for me?” Anyone can demo that. The real question was whether I could build a system that was fire and forget. Type plan-executor VE-123, walk away for an hour, come back to a feature that was implemented, exercised in the simulator, and committed to a branch. No user intervention after launch. The agent closes its own loops, catches its own regressions, drives the simulator to verify the thing actually works at runtime, and stops itself when it goes off the rails.
So I spent about a month building my own tools. This post is the first in a series about them. I’m going to walk through the system, share the reasoning behind each move, and eventually point you at an open-sourced version you can take and adapt. This first post is the map. Later posts go deep on the specific pieces.
I should say up front: I’ve been staring at this system so long I can no longer tell whether it’s novel or obvious. That’s what happens when you build something in your own basement for a month. I’ll let you decide.
Though the system itself is mine, the foundation under it isn’t. What an agent needs to work reliably on a real iOS codebase was already sitting there, built by a community that’s been quietly investing in developer experience for years. My job was mostly to notice what was there and wire it together.
A DX investment you already made
We’ve cared about developer experience as a community for a long time. If you’ve been doing iOS for a while, chances are you’ve watched at least one of Krzysztof Zabłocki’s talks or used one of the open source tools he’s shipped. Point-Free’s libraries and videos have probably shaped how you think about architecture and testing in Swift. Antoine van der Lee’s RocketSim is a quiet “how did we work without this” tool once you’ve installed it. I could keep this list going for a long time and still miss people who deserve credit.
Linters, formatters, previews, modularization, design systems, the whole discipline of making it pleasant to work inside a codebase. Those investments weren’t cheap. They also weren’t, strictly speaking, just for us anymore.
Every piece of that DX work turns out to feed agent experience one-to-one. A well-modularized project is easier for an agent to reason about. Fast build times shorten the agent’s feedback loop. A design system with named tokens stops the agent from hallucinating hex codes. Tools that render any single screen in isolation without booting the whole app (Playbook [2] on iOS, Storybook on the web) give the agent something it can actually see. But rendering isn’t running. Launch arguments paired with mockable dependencies let the agent boot the app straight to the screen it changed, drive it like a user would, and verify the feature actually works. How that’s wired up, and how I taught the agent to drive it, gets its own post later in the series.
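The launch-argument idea is simple enough to sketch. This is a minimal, hypothetical version: the flag names, `AppDependencies`, and the service types are all invented for illustration, not taken from my project.

```swift
// Hypothetical sketch: a dependency graph chosen from launch arguments so an
// agent can boot the app straight into a known, mocked state.
protocol UserService {}
struct LiveUserService: UserService {}
struct MockUserService: UserService {}

struct AppDependencies {
    let userService: UserService
    let initialScreen: String?   // screen the app should deep-link into on boot
}

func makeDependencies(arguments: [String]) -> AppDependencies {
    // Swap the real service for a mock when the agent asks for it.
    let userService: UserService
    if arguments.contains("--mock-user") {
        userService = MockUserService()
    } else {
        userService = LiveUserService()
    }
    // Optional "--screen <name>" pair selects the screen under test.
    var initialScreen: String? = nil
    if let i = arguments.firstIndex(of: "--screen"), i + 1 < arguments.count {
        initialScreen = arguments[i + 1]
    }
    return AppDependencies(userService: userService, initialScreen: initialScreen)
}
```

In the app itself you’d feed this `ProcessInfo.processInfo.arguments`; the agent sets the same arguments when it launches the build in the simulator, so every run starts from controlled state instead of whatever the last session left behind.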
We were investing in agent experience for years before we had agents worth giving the gift to. The return on that investment has never been higher. And the teams that cut corners on DX in the name of shipping faster are going to feel it now, because the agent will trip on every shortcut they took.
What I mean by skills, and why iOS makes this harder
With that foundation in the ground, the rest of this post is what I actually built on top of it. A Claude Code skill is a scoped set of instructions the agent loads when a task matches. Think of skills as specialists: one for planning an issue, one for executing a plan, one for reviewing code. Orchestration is chaining them: plan hands off to executor, executor hands off to review, review hands off to a session summary that feeds back into the skills themselves. Nothing exotic. Just roles with handoffs.
The bar I was aiming for is worth being explicit about. I didn’t want a system that’s useful with careful supervision. I wanted one I could launch and walk away from. Every place where the agent would otherwise stop and ask me a question became a design problem to solve: the agent needs tools to answer the question itself, or documentation to look it up, or permission to make the call and keep going. After the plan is approved, I don’t touch the keyboard again until the feature is in a commit.
The part that makes this harder on iOS than on web is the feedback loop. A backend agent has containers and headless browsers and curl. It can spin up a real version of the thing, poke it, read structured output, know whether its change worked. iOS hands you almost none of that out of the box. You can build that loop yourself, and most of this series is about how, but the platform doesn’t make it free. Our world is Xcode, simulators, and build times measured in tens of seconds at best. A simulator is a heavy, stateful beast that takes seconds to boot and real minutes to exercise. There’s no headless mode an agent can drive in 200 milliseconds and parse JSON from.
The cost compounds when you try to run more than one agent at a time. A backend team spins up five containers on one laptop and nobody notices. On iOS, parallelism hits a wall fast. Every worktree is gigabytes of repo, more gigabytes of DerivedData, plus its own simulator state and Xcode index. Stack a few of those, throw in Apple’s simulator “optimizations,” and the machine starts swap-thrashing not long after. And that’s after writing enough glue to stop agents from stepping on each other’s simulators and build artifacts. Fullstack devs get containerization: 256 megabytes, boots in a second, tear it down, spin another one up. iOS engineers get gigabytes per worktree, and Apple’s official scaling advice is “here’s a Mac for four grand, go buy one.” That’s the gap. Every piece of the system I’m going to describe exists, in some way, to narrow it.
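The glue that keeps parallel agents off each other’s simulators and build artifacts amounts to giving each worktree its own namespace. A toy sketch of the idea, with a naming scheme I made up for illustration:

```swift
// Hypothetical sketch of per-worktree isolation: derive a dedicated simulator
// name and DerivedData path from the worktree's directory name, so two agents
// never share a simulator or a build cache. The scheme is an assumption, not
// the real setup.
func isolationConfig(worktreePath: String) -> (simulatorName: String, derivedDataPath: String) {
    // Use the worktree's last path component as a stable slug.
    let slug = worktreePath.split(separator: "/").last.map(String.init) ?? "main"
    return ("agent-\(slug)", "\(worktreePath)/DerivedData")
}
```

The simulator would then be created under that name (e.g. via `xcrun simctl create`) and `xcodebuild` pointed at the private cache with `-derivedDataPath`, so each agent’s builds and simulator state stay in its own sandbox, at the cost of the gigabytes described above.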
The system at a glance
In rough order of how things fire on a real ticket:
- linear-issue-planner: turns a Linear issue into an implementation plan via a structured interview with me. The deliverable is a plan detailed enough that a fresh conversation could execute it without asking follow-up questions. The “with me” part is deliberate. Planning is where I want to own architecture and API decisions, and locking those in up front is what makes the execution phase deterministic enough to walk away from.
- plan-executor: takes the plan and implements it phase by phase. Builds, tests, fixes, repeats. Catches its own errors early, because a compile error in phase two is a fourteen-second fix, and the same error compounding through phase five is a tangled mess. Along the way it pulls in a few specialists:
- figma-to-swiftui: converts designs to SwiftUI views using the project’s DesignSystem tokens. Stops the agent from inventing hex values and dumping hardcoded colors into views.
- behavioral-verification: a two-stage pipeline that runs after implementation, effectively an E2E test suite the agent both generates and executes. Stage one designs scenarios from the requirements alone, with no code access. This is black-box testing in the literal sense, so test design isn’t biased by what the agent just wrote. Stage two adds mechanical details (coordinates, mocks, launch arguments) and the executor drives the simulator through them, tapping, swiping, taking screenshots, comparing to Figma. Playbook handles visual scenarios; the real app booted with controlled state handles behavioral ones. The result: a per-scenario pass/fail that catches what the unit and integration suites miss.
- a review panel: four parallel reviewer agents reading the diff from different angles: architecture, test coverage, plan adherence, runtime correctness. A fifth, cross-model reviewer running through the Codex CLI gets pulled in when it’s installed, because a single-model panel is an echo chamber. The model that wrote the code rationalizes the code, and a different training run catches what the first one talked itself into.
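The two-stage split in behavioral-verification can be sketched as two record types: stage one carries only requirement-level intent, stage two layers the mechanics on top. All names here are illustrative, not the actual skill’s schema:

```swift
// Stage one: designed from the requirements alone, with no code access,
// so the test design isn't biased by the implementation.
struct Scenario {
    let name: String
    let steps: [String]        // human-readable actions, no coordinates yet
    let expectation: String
}

// Stage two: the same scenario with mechanical details filled in, ready
// for the executor to drive the simulator.
struct ExecutableScenario {
    let base: Scenario
    let launchArguments: [String]   // boot the app into a controlled state
    let taps: [(x: Int, y: Int)]    // concrete gesture coordinates
}

// The executor's output: a per-scenario verdict.
enum Verdict {
    case pass
    case fail(reason: String)
}
```

Keeping the stage-one records free of coordinates and mocks is what makes them genuinely black-box; stage two can be regenerated whenever the UI moves without redesigning the scenarios themselves.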
The next post walks through how all of these came to be, and a few of them get deeper treatment in the posts after that. The point here is the shape, not the depth. A set of specialists, each narrow enough to do one thing reliably, chained into a pipeline that runs unattended from plan to commit. No checkpoints I have to answer. No “please test this manually” at the end. That’s the shape of the thing.
Underneath all of this is a layer that doesn’t fire on a ticket but makes the whole system debuggable: every run writes a structured session summary, a filtered build and test log, and a per-tool-call trace. Reading those files between runs is what tells me where the agent stalled, gave up early, or burned context, and what drives the next round of skill changes. Without it, improving the skills is guesswork.
What it’s done to my working days turned out to matter more than the mechanics.
Why I got my sanity back
Before any of this existed, I used to run four parallel conversations. Four worktrees, four agents, four tickets, my attention bouncing between them like I was running a small call center. It felt productive, and maybe it was. I never ran the evals to know either way. What I’m sure of is that I was cooked before I’d even taken a lunch break. Signing off at the end of the day didn’t fix it either. I needed a deliberate wind-down session before my brain actually went quiet.
Multitasking was the buzzword of the 2010s. I remember when it shipped in iOS 4 and felt like a huge deal at the time. By mid-decade, everyone was bragging about juggling six tabs and three Slack channels and a standup at the same time. Then the research came out and the industry quietly backed off: focus beats split attention, deep work beats shallow. And now I watch the same industry walking right back into the exact same trap, this time with agents. More worktrees, more parallel sessions, more streams of output to supervise. Same exhaustion, dressed up in a new outfit.
Simon Willison wrote something around the time I was noticing this: that he’d be drained by 11 a.m. and it was starting to worry him [3]. It landed, because I was exactly there. I’d close the laptop at midday with the vague sense that I’d produced a lot and done nothing of quality. That’s what cognitive overload feels like right before it stops being cognitive and starts being something you take to a therapist.
The system I’m describing in this series is the thing I built to stop doing that. The shape of a working day now is: one issue at a time, a planning skill I work with slowly to produce a really good plan, then plan-executor VE-123 and I walk away for an hour. The attention I used to spend babysitting build logs goes somewhere that matters more: meetings I’m actually present in, business requirements I can actually think through, the bigger picture of where the product is going that I couldn’t see when I was stretched across four worktrees. Other times I just step away from the laptop, make coffee, take a break. Either way I come back later and review the worktree. One plan done well beats four kicked off in parallel. It’s the same lesson we already learned about tabs, re-learned with agents.
That’s the split I’ve landed on. Let the agent do the coding. It’s good at it, with the iOS-specific help I had to build to make it reliable. That frees us up for what we’re still better at: the software engineering decisions, and the time spent with people figuring out what we’re actually trying to build.
Night shift is the same autonomy bar stretched across multiple tickets. Once plan-executor VE-123 runs unattended for one ticket, running it for ten is mostly a bash loop with rate-limit backoff and draft-PR creation. The best proof that this is real and not a productivity pitch is that I even built a night shift mode and I don’t actually use it. The machinery is all there: Linear integration, a dual-mode executor, draft PRs waiting in the morning. I choose not to run it overnight. Post four in this series is about why. The short version: knowing an agent is churning on my code while I’m signed off costs something I’m not willing to pay yet. Not forever, probably. I’ll get there at my own pace. Agents are good at writing code when you give them tools to verify their work. They’re not yet the right shape to replace sleep.
What’s coming
This is the intro. Four more posts are queued.
A colleague of mine saw the pull request where I added these skills to the project I’m working on (plus five thousand lines, minus three hundred) and asked how the fuck I wrote it. I didn’t, not in one sitting. The next post is the origin walkthrough, a month from a single .md file to what’s there today, with each novel move surfacing as a named moment in the timeline.
After that, the technical heart of the series: closing the feedback loop on iOS. Why “169/169 tests passed” didn’t save me from a broken feature, what Playbook and XcodeBuildMCP and screenshots do together, and the two-phase verification methodology that keeps the loop honest. If you only read one post in this series, it’ll be that one.
Then the “I built a night shift and kept working days” post I just teased. Then a field report on tuning a skill when Opus 4.7 shipped: the targeted guardrails I added to the planner, and the two “obvious” improvements that didn’t survive a closer look.
The skills are open-sourced [4]. They’re specific to my project in places (iOS plus KMP, our Figma setup), but a big chunk is general. There’s also a prompt in the repo you can feed to an agent to adapt the skills to your own codebase. Give a man a fish and he eats for a day; teach the agent to fish and you save yourself a lot of typing.
Try something. Tell me what breaks.
Footnotes
[1] https://jamon.dev/night-shift — Jamon Holmgren on running an overnight agent against his projects.
[2] Playbook is a Swift library from DeNA inspired by Storybook — https://github.com/playbook-ui/playbook-ios.
[3] Simon Willison’s blog — https://simonwillison.net (cite specific post when located).
[4] Repo link goes here once the skills are public.