Developer Experience is Agent Experience
A few weeks ago I read Jamon Holmgren’s piece on running a night-shift agent against his projects [1]. One line stayed with me: “I’m spending far less time babysitting, much more time thinking about the problems I need to solve, and my productivity soars.” That’s the shape of work I wanted. Not more output. More distance between me and the plumbing.
Jamon is in the React Native and web world. I’m in iOS. And the question I started turning over wasn’t “can an agent implement a Linear ticket for me?” Anyone can demo that. The real question was whether I could build a system that was fire and forget. Type plan-executor VE-123, walk away for an hour, come back to a feature that was implemented, exercised in the simulator, and committed to a branch. No user intervention after launch. The agent closes its own loops, catches its own regressions, drives the simulator to verify the thing actually works at runtime, and stops itself when it goes off the rails.
So I spent about a month building my own tools. This post is the first in a series about them. I’m going to walk through the system, share the reasoning behind each move, and eventually point you at an open-sourced version you can take and adapt. This first post is the map. Later posts go deep on the specific pieces.
I should say up front: I’ve been staring at this system so long I can no longer tell whether it’s novel or obvious. That’s what happens when you build something in your own basement for a month. I’ll let you decide.
Though the system itself is mine, the foundation under it isn’t. What an agent needs to work reliably on a real iOS codebase was already sitting there, built by a community that’s been quietly investing in developer experience for years. My job was mostly to notice what was there and wire it together.
A DX investment you already made
We’ve cared about developer experience as a community for a long time. If you’ve been doing iOS for a while, chances are you’ve watched at least one of Krzysztof Zabłocki’s talks or used one of the open source tools he’s shipped. Point-Free’s libraries and videos have probably shaped how you think about architecture and testing in Swift. Antoine van der Lee’s RocketSim is a quiet “how did we work without this” tool once you’ve installed it. I could keep this list going for a long time and still miss people who deserve credit.
Linters, formatters, previews, modularization, design systems, the whole discipline of making it pleasant to work inside a codebase. Those investments weren’t cheap. They also weren’t, strictly speaking, just for us anymore.
Every piece of that DX work turns out to feed agent experience one-to-one. A well-modularized project is easier for an agent to reason about. Fast build times shorten the agent’s feedback loop. A design system with named tokens stops the agent from hallucinating hex codes. Tools that render any single screen in isolation without booting the whole app (Playbook [2] on iOS, Storybook on the web) give the agent something it can actually see. But rendering isn’t running. Launch arguments paired with mockable dependencies let the agent boot the app straight to the screen it changed, drive it like a user would, and verify the feature actually works. How that’s wired up, and how I taught the agent to drive it, gets its own post later in the series.
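The launch-argument idea is simple enough to sketch. This is a minimal, hypothetical version: the flag names, `AppDependencies`, and the service types are all invented for illustration, not taken from my project.

```swift
// Hypothetical sketch: a dependency graph chosen from launch arguments so an
// agent can boot the app straight into a known, mocked state.
protocol UserService {}
struct LiveUserService: UserService {}
struct MockUserService: UserService {}

struct AppDependencies {
    let userService: UserService
    let initialScreen: String?   // screen the app should deep-link into on boot
}

func makeDependencies(arguments: [String]) -> AppDependencies {
    // Swap the real service for a mock when the agent asks for it.
    let userService: UserService
    if arguments.contains("--mock-user") {
        userService = MockUserService()
    } else {
        userService = LiveUserService()
    }
    // Optional "--screen <name>" pair selects the screen under test.
    var initialScreen: String? = nil
    if let i = arguments.firstIndex(of: "--screen"), i + 1 < arguments.count {
        initialScreen = arguments[i + 1]
    }
    return AppDependencies(userService: userService, initialScreen: initialScreen)
}
```

In the app itself you’d feed this `ProcessInfo.processInfo.arguments`; the agent sets the same arguments when it launches the build in the simulator, so every run starts from controlled state instead of whatever the last session left behind.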
We were investing in agent experience for years before we had agents worth giving the gift to. The return on that investment has never been higher. And the teams that cut corners on DX in the name of shipping faster are going to feel it now, because the agent will trip on every shortcut they took.
What I mean by skills, and why iOS makes this harder
With that foundation in the ground, the rest of this post is what I actually built on top of it. A Claude Code skill is a scoped set of instructions the agent loads when a task matches. Think of skills as specialists: one for planning an issue, one for executing a plan, one for reviewing code. Orchestration is chaining them: plan hands off to executor, executor hands off to review, review hands off to a session summary that feeds back into the skills themselves. Nothing exotic. Just roles with handoffs.
The bar I was aiming for is worth being explicit about. I didn’t want a system that’s useful with careful supervision. I wanted one I could launch and walk away from. Every place where the agent would otherwise stop and ask me a question became a design problem to solve: the agent needs tools to answer the question itself, or documentation to look it up, or permission to make the call and keep going. After the plan is approved, I don’t touch the keyboard again until the feature is in a commit.
The part that makes this harder on iOS than on web is the feedback loop. A backend agent has containers and headless browsers and curl. It can spin up a real version of the thing, poke it, read structured output, know whether its change worked. iOS hands you almost none of that out of the box. You can build that loop yourself, and most of this series is about how, but the platform doesn’t make it free. Our world is Xcode, simulators, and build times measured in tens of seconds at best. A simulator is a heavy, stateful beast that takes seconds to boot and real minutes to exercise. There’s no headless mode an agent can drive in 200 milliseconds and parse JSON from.
The cost compounds when you try to run more than one agent at a time. A backend team spins up five containers on one laptop and nobody notices. On iOS, parallelism hits a wall fast. Every worktree is gigabytes of repo, more gigabytes of DerivedData, plus its own simulator state and Xcode index. Stack a few of those, throw in Apple’s simulator “optimizations,” and the machine starts swap-thrashing not long after. And that’s after writing enough glue to stop agents from stepping on each other’s simulators and build artifacts. Fullstack devs get containerization: 256 megabytes, boots in a second, tear it down, spin another one up. iOS engineers get gigabytes per worktree, and Apple’s official scaling advice is “here’s a Mac for four grand, go buy one.” That’s the gap. Every piece of the system I’m going to describe exists, in some way, to narrow it.
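The glue that keeps parallel agents off each other’s simulators and build artifacts amounts to giving each worktree its own namespace. A toy sketch of the idea, with a naming scheme I made up for illustration:

```swift
// Hypothetical sketch of per-worktree isolation: derive a dedicated simulator
// name and DerivedData path from the worktree's directory name, so two agents
// never share a simulator or a build cache. The scheme is an assumption, not
// the real setup.
func isolationConfig(worktreePath: String) -> (simulatorName: String, derivedDataPath: String) {
    // Use the worktree's last path component as a stable slug.
    let slug = worktreePath.split(separator: "/").last.map(String.init) ?? "main"
    return ("agent-\(slug)", "\(worktreePath)/DerivedData")
}
```

The simulator would then be created under that name (e.g. via `xcrun simctl create`) and `xcodebuild` pointed at the private cache with `-derivedDataPath`, so each agent’s builds and simulator state stay in its own sandbox, at the cost of the gigabytes described above.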
The system at a glance
In rough order of how things fire on a real ticket:
- linear-issue-planner: turns a Linear issue into an implementation plan via a structured interview with me. The deliverable is a plan detailed enough that a fresh conversation could execute it without asking follow-up questions. The “with me” part is deliberate. Planning is where I want to own architecture and API decisions, and locking those in up front is what makes the execution phase deterministic enough to walk away from.
- plan-executor: takes the plan and implements it phase by phase. Builds, tests, fixes, repeats. Catches its own errors early, because a compile error in phase two is a fourteen-second fix, and the same error compounding through phase five is a tangled mess. Along the way it pulls in a few specialists:
- figma-to-swiftui: converts designs to SwiftUI views using the project’s DesignSystem tokens. Stops the agent from inventing hex values and dumping hardcoded colors into views.
- behavioral-verification: a two-stage pipeline that runs after implementation, effectively an E2E test suite the agent both generates and executes. Stage one designs scenarios from the requirements alone, with no code access. This is black-box testing in the literal sense, so test design isn’t biased by what the agent just wrote. Stage two adds mechanical details (coordinates, mocks, launch arguments) and the executor drives the simulator through them, tapping, swiping, taking screenshots, comparing to Figma. Playbook handles visual scenarios; the real app booted with controlled state handles behavioral ones. The result: a per-scenario pass/fail that catches what the unit and integration suites miss.
- a review panel: four parallel reviewer agents reading the diff from different angles: architecture, test coverage, plan adherence, runtime correctness. A fifth, cross-model reviewer running through the Codex CLI gets pulled in when it’s installed, because a single-model panel is an echo chamber. The model that wrote the code rationalizes the code, and a different training run catches what the first one talked itself into.
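The two-stage split in behavioral-verification can be sketched as two record types: stage one carries only requirement-level intent, stage two layers the mechanics on top. All names here are illustrative, not the actual skill’s schema:

```swift
// Stage one: designed from the requirements alone, with no code access,
// so the test design isn't biased by the implementation.
struct Scenario {
    let name: String
    let steps: [String]        // human-readable actions, no coordinates yet
    let expectation: String
}

// Stage two: the same scenario with mechanical details filled in, ready
// for the executor to drive the simulator.
struct ExecutableScenario {
    let base: Scenario
    let launchArguments: [String]   // boot the app into a controlled state
    let taps: [(x: Int, y: Int)]    // concrete gesture coordinates
}

// The executor's output: a per-scenario verdict.
enum Verdict {
    case pass
    case fail(reason: String)
}
```

Keeping the stage-one records free of coordinates and mocks is what makes them genuinely black-box; stage two can be regenerated whenever the UI moves without redesigning the scenarios themselves.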
The next post walks through how all of these came to be, and a few of them get deeper treatment in the posts after that. The point here is the shape, not the depth. A set of specialists, each narrow enough to do one thing reliably, chained into a pipeline that runs unattended from plan to commit. No checkpoints I have to answer. No “please test this manually” at the end. That’s the shape of the thing.
Underneath all of this is a layer that doesn’t fire on a ticket but makes the whole system debuggable: every run writes a structured session summary, a filtered build and test log, and a per-tool-call trace. Reading those files between runs is what tells me where the agent stalled, gave up early, or burned context, and what drives the next round of skill changes. Without it, improving the skills is guesswork.
What it’s done to my working days turned out to matter more than the mechanics.
Why I got my sanity back
Before any of this existed, I used to run four parallel conversations. Four worktrees, four agents, four tickets, my attention bouncing between them like I was running a small call center. It felt productive, and maybe it was. I never ran the evals to know either way. What I’m sure of is that I was cooked before I’d even taken a lunch break. Signing off at the end of the day didn’t fix it either. I needed a deliberate wind-down session before my brain actually went quiet.
Multitasking was the buzzword of the 2010s. I remember when it shipped in iOS 4 and felt like a huge deal at the time. By mid-decade, everyone was bragging about juggling six tabs and three Slack channels and a standup at the same time. Then the research came out and the industry quietly backed off: focus beats split attention, deep work beats shallow. And now I watch the same industry walking right back into the exact same trap, this time with agents. More worktrees, more parallel sessions, more streams of output to supervise. Same exhaustion, dressed up in a new outfit.
Simon Willison wrote something around the time I was noticing this: that he’d be drained by 11 a.m. and it was starting to worry him [3]. It landed, because I was exactly there. I’d close the laptop at midday with the vague sense that I’d produced a lot and done nothing of quality. That’s what cognitive overload feels like right before it stops being cognitive and starts being something you take to a therapist.
The system I’m describing in this series is the thing I built to stop doing that. The shape of a working day now is: one issue at a time, a planning skill I work with slowly to produce a really good plan, then plan-executor VE-123 and I walk away for an hour. The attention I used to spend babysitting build logs goes somewhere that matters more: meetings I’m actually present in, business requirements I can actually think through, the bigger picture of where the product is going that I couldn’t see when I was stretched across four worktrees. Other times I just step away from the laptop, make coffee, take a break. Either way I come back later and review the worktree. One plan done well beats four kicked off in parallel. It’s the same lesson we already learned about tabs, re-learned with agents.
That’s the split I’ve landed on. Let the agent do the coding. It’s good at it, with the iOS-specific help I had to build to make it reliable. That frees us up for what we’re still better at: the software engineering decisions, and the time spent with people figuring out what we’re actually trying to build.
Night shift is the same autonomy bar stretched across multiple tickets. Once plan-executor VE-123 runs unattended for one ticket, running it for ten is mostly a bash loop with rate-limit backoff and draft-PR creation. The best proof that this is real and not a productivity pitch is that I even built a night shift mode and I don’t actually use it. The machinery is all there: Linear integration, a dual-mode executor, draft PRs waiting in the morning. I choose not to run it overnight. Post four in this series is about why. The short version: knowing an agent is churning on my code while I’m signed off costs something I’m not willing to pay yet. Not forever, probably. I’ll get there at my own pace. Agents are good at writing code when you give them tools to verify their work. They’re not yet the right shape to replace sleep.
What’s coming
This is the intro. Four more posts are queued.
A colleague of mine saw the pull request where I added these skills to the project I’m working on (plus five thousand lines, minus three hundred) and asked how the fuck I wrote it. I didn’t, not in one sitting. The next post is the origin walkthrough, a month from a single .md file to what’s there today, with each novel move surfacing as a named moment in the timeline.
After that, the technical heart of the series: closing the feedback loop on iOS. Why “169/169 tests passed” didn’t save me from a broken feature, what Playbook and XcodeBuildMCP and screenshots do together, and the two-phase verification methodology that keeps the loop honest. If you only read one post in this series, it’ll be that one.
Then the “I built a night shift and kept working days” post I just teased. Then a field report on tuning a skill when Opus 4.7 shipped: the targeted guardrails I added to the planner, and the two “obvious” improvements that didn’t survive a closer look.
The skills are open-sourced [4]. They’re specific to my project in places (iOS plus KMP, our Figma setup), but a big chunk is general. There’s also a prompt in the repo you can feed to an agent to adapt the skills to your own codebase. Give a man a fish and he eats for a day; teach the agent to fish and you save yourself a lot of typing.
Try something. Tell me what breaks.
Footnotes
[1] https://jamon.dev/night-shift — Jamon Holmgren on running an overnight agent against his projects.
[2] Playbook is a Swift library from DeNA inspired by Storybook — https://github.com/playbook-ui/playbook-ios.
[3] Simon Willison’s blog — https://simonwillison.net (cite specific post when located).
[4] Repo link goes here once the skills are public.