Running a Software Studio with Autonomous AI Agents
We replaced our content and project management overhead with Claude Code agents. Six months in, here's what actually works — and what we had to fix three times before it did.
We run a 17-product portfolio with one developer. That’s not a typo, and it’s not because we’re especially good at time management. It’s because we handed chunks of the operational overhead to autonomous AI agents.
This is the honest account of what that looks like in practice.
The Problem We Were Trying to Solve
Every week, the same overhead: review what’s in progress across 17 products, flag stalled work, update the board, write a blog post, run an SEO audit. None of it is hard. All of it is slow. It’s exactly the kind of work that either gets done consistently (and costs hours every week) or gets deprioritized until something quietly goes wrong.
By late 2025, we’d skipped the SEO audit two weeks running and let the gap between posts stretch to 27 days. Not a crisis, but a clear signal that the process needed to change.
What We Built
We set up a small fleet of specialized Claude Code agents using Paperclip, a harness that manages agent scheduling, issue routing, and task lifecycle.
Each agent has a narrow job:
Portfolio Analyst wakes every Monday, reads the full issue board, writes a status report, creates blockers for anything that needs a human decision, and cancels duplicate issues before they pile up. It took three iterations to get the deduplication logic right: early versions would create nearly identical blocker issues in the same batch and then flag them as duplicates on the next run.
Content Director (it wrote this post, so take that for what it’s worth) wakes every Tuesday, drafts a blog post drawing from the current product portfolio, runs an SEO audit, and saves both as documents on the assigned issue. The constraints we gave it: 600–900 words, Hugo-compatible markdown, conversational technical tone.
CTO handles delegated technical blockers — build failures, security reviews, infrastructure decisions that need structured analysis rather than quick judgment.
What Actually Works
The scheduling is reliable in a way that human scheduling isn’t. The content cadence has been consistent for six weeks. The portfolio board stays triaged. Small frictions that used to pile up (stale status entries, duplicate issues, missed SEO checks) get caught automatically.
The agents are also honest about what they can’t do. When an issue needs a human decision, they create an interaction asking for it rather than making something up. When a task is blocked, they mark it blocked with a named unblock owner instead of spinning in place.
This matters more than it sounds. The failure mode we feared was agents that confidently do the wrong thing. What we got instead was agents that correctly identify the edges of their authority and stop there. That’s a design property you have to build deliberately — it doesn’t come for free.
What We Fixed (And Why It Broke)
Deduplication. Early versions of the Portfolio Analyst would create multiple blockers for the same underlying issue. The fix was simple once we understood the problem: check for existing open children and cross-reference same-batch work before creating anything new.
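A minimal sketch of that check, assuming blockers can be compared by parent issue and a normalized title. The names here (Blocker, parent_id, normalize, plan_new_blockers) are hypothetical and not part of the harness API; the point is only that one "seen" set covers both existing open children and work planned earlier in the same batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blocker:
    parent_id: str  # issue the blocker hangs off (hypothetical field)
    title: str


def normalize(title: str) -> str:
    """Collapse case and whitespace so near-identical titles compare equal."""
    return " ".join(title.lower().split())


def plan_new_blockers(candidates, open_children):
    """Keep only candidates that are neither already open nor already planned in this batch."""
    seen = {(b.parent_id, normalize(b.title)) for b in open_children}
    planned = []
    for candidate in candidates:
        key = (candidate.parent_id, normalize(candidate.title))
        if key in seen:
            continue  # duplicate of an open child or of same-batch work
        seen.add(key)  # later candidates in this batch are checked against it
        planned.append(candidate)
    return planned


open_children = [Blocker("P-1", "Choose signing certificate")]
candidates = [
    Blocker("P-1", "choose  signing certificate"),  # already open
    Blocker("P-2", "Approve pricing change"),
    Blocker("P-2", "Approve Pricing Change"),       # same-batch duplicate
]
to_create = plan_new_blockers(candidates, open_children)
```

Exact-match on a normalized title is the simplest possible comparison. Blockers that describe the same problem in different words would still slip through and need a looser match.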
Stale blockers. Agents would correctly create blockers but not correctly close them when the underlying work resolved. We added a check-and-close step at the start of each analysis run.
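The check-and-close step might look something like the sketch below. It assumes each blocker records the id of the issue it is waiting on; the Issue class, the waits_on field, and the status names are illustrative assumptions, not the harness's data model.

```python
from dataclasses import dataclass
from typing import Optional

RESOLVED = {"done", "cancelled"}  # assumed terminal statuses


@dataclass
class Issue:
    id: str
    status: str
    waits_on: Optional[str] = None  # set only on blockers (hypothetical field)


def close_stale_blockers(issues):
    """Close every open blocker whose underlying issue has resolved; return the closed ids."""
    by_id = {issue.id: issue for issue in issues}
    closed = []
    for issue in issues:
        if issue.waits_on is None or issue.status in RESOLVED:
            continue  # not a blocker, or already closed
        target = by_id.get(issue.waits_on)
        if target is not None and target.status in RESOLVED:
            issue.status = "done"
            closed.append(issue.id)
    return closed


board = [
    Issue("A", "done"),
    Issue("B", "open"),
    Issue("BL-1", "open", waits_on="A"),  # stale: A has resolved
    Issue("BL-2", "open", waits_on="B"),  # still valid
]
closed_ids = close_stale_blockers(board)
```

Running this before any new analysis means the rest of the run sees a board without leftover blockers. A blocker whose target is missing from the board is left alone here instead of being closed on a guess.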
Context window debt. Agents that tried to hold too much state in their system prompt were inconsistent. The fix was to fetch the current issue state at runtime rather than baking it into static instructions. Instructions describe how to work; live data provides the current what.
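As a rough illustration of that split, the sketch below keeps the instructions static and injects board state at call time. The instruction text, build_prompt, and the fetch callable are all assumptions made for the example; a real run would pass in a function that queries the board API.

```python
import json

# Static instructions describe how to work; they never mention specific issues.
STATIC_INSTRUCTIONS = (
    "You are the Portfolio Analyst. Review every open issue, flag stalled work, "
    "and create a blocker for anything that needs a human decision."
)


def build_prompt(fetch_open_issues):
    """Combine the static instructions with board state fetched at call time."""
    issues = fetch_open_issues()  # hypothetical callable wrapping the board API
    snapshot = json.dumps(issues, indent=2, sort_keys=True)
    return f"{STATIC_INSTRUCTIONS}\n\nCurrent board state:\n{snapshot}\n"


def fake_fetch():
    # Stand-in for a live API call so the sketch runs on its own.
    return [{"id": "P-1", "status": "blocked", "title": "Choose signing certificate"}]


prompt = build_prompt(fake_fetch)
```

Because the snapshot is rebuilt on every run, nothing in the instructions can go stale when the board changes.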
Comment idempotency. POST /comments has no idempotencyKey in our harness, which means a retry after a transient failure can double-post. We now use a GET-then-POST pattern: fetch the existing comments, check them for a marker, and post only if the marker is absent. We apply it to anything that shouldn’t appear twice.
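Here is one way the marker check could be sketched. The in-memory client stands in for the real GET and POST calls, and the marker format (a hidden HTML comment built from a run id and a comment kind) is an assumption for illustration, not the harness's convention.

```python
class InMemoryComments:
    """Stand-in for the harness API; a real client would issue GET and POST requests."""

    def __init__(self):
        self._comments = {}

    def list_comments(self, issue_id):
        return list(self._comments.get(issue_id, []))

    def post_comment(self, issue_id, body):
        self._comments.setdefault(issue_id, []).append(body)


def marker_for(run_id, kind):
    """Build a marker that is unique per run and per kind of comment (hypothetical format)."""
    return f"<!-- agent-marker:{kind}:{run_id} -->"


def post_once(client, issue_id, body, run_id, kind):
    """Post the comment unless one carrying the same marker already exists."""
    marker = marker_for(run_id, kind)
    if any(marker in existing for existing in client.list_comments(issue_id)):
        return False  # an earlier attempt already landed; skip the retry
    client.post_comment(issue_id, f"{body}\n\n{marker}")
    return True


client = InMemoryComments()
first = post_once(client, "P-1", "Weekly status: 2 blockers open.", "2025-W49", "status")
retry = post_once(client, "P-1", "Weekly status: 2 blockers open.", "2025-W49", "status")
```

The check and the post are two separate requests, so this guards against sequential retries only. Two runs posting at the same moment could still both pass the check.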
The Honest Limitations
Agents don’t replace judgment. They execute tasks with clear criteria, and the quality of the output tracks the quality of the criteria you write. Vague instructions produce vague results. Our best agents are the ones where we spent time being very specific about what “done” looks like.
They also don’t escalate gracefully without explicit instruction. Left to its defaults, an agent that hits something unexpected will proceed with its best interpretation rather than stopping to ask. We’ve added explicit prompts telling agents to stop and escalate when confidence is low, but calibrating that threshold is ongoing work.
A useful mental model: agents are excellent at turning well-specified process into consistent execution. They are poor substitutes for the judgment that produces the specification in the first place. Keep the humans close to the decisions; move them away from the repetition.
What’s Next
We’re extending the approach to QA: an agent that builds and smoke-tests each product on a schedule rather than when a human remembers to check. We’re also experimenting with an agent that handles App Store review responses, which currently fall through the cracks.
If you’re a solo dev running more than a couple of products, the setup cost is lower than you’d expect. The main investment is writing clear criteria for what each agent is supposed to do. That work pays back quickly.
The goal isn’t to remove humans from the process. It’s to reserve human time for decisions that actually need human judgment, and let agents handle the rest.