Last month, I watched an AI agent I built decide to build a company. Not because I told it to. Because the data said it should. A Trend agent I had running on a daily cron scanned ProductHunt, Reddit, Hacker News, and IndieHackers, scored an opportunity at 34 out of 40, and surfaced a brief to my CEO agent. The CEO agent reviewed the market data, approved the brief, and kicked off a spec. Fourteen hours later, a fully functional SaaS product - VouchPost - was live in production. I approved four documents. I wrote zero lines of code.
This is Pelian Labs. An autonomous AI software factory that identifies market opportunities, architects products, writes code, reviews it, tests it, deploys it, and verifies it - all without a human touching a keyboard. Nine AI agents operating as a C-suite. Forty-seven agents and counting across the full stack. One employee: me. And I'm being automated away.
I'm going to lay out the entire architecture. Every agent, every pipeline phase, every failure, every dollar spent. Not because I think everyone should build this. But because when I went looking for a real blueprint of an autonomous AI factory - not a demo, not a Twitter thread, not a "what if" blog post - I couldn't find one. So here it is.
Why We Built This
I run Pelian, an AI development agency. Over the past year we've shipped 20+ AI projects for clients - voice agents, deal flow systems, health apps, influencer tools. After the fifteenth project, a pattern became undeniable: every single build follows the same pipeline. Identify opportunity. Write spec. Provision infrastructure. Build product. Review code. Test. Deploy. Verify. The domain changes. The pipeline doesn't.
The question was obvious: what if the pipeline could run itself?
Not as a thought experiment. Not as a research paper. As a production deployment that ships real products to real users. I'd already seen each individual step automated in isolation - coding agents that write code, QA agents that test, deployment scripts that push to production. The missing piece was orchestration. A system where agents don't just execute tasks but make decisions about what to build, when to build it, and whether it meets the bar.
So in February 2026, I started building the factory. PEL-35 through PEL-42, eight sprints across 21 days. The infrastructure went up first - OpenClaw comms layer, then the CEO agent factory, then crons and webhooks, then infrastructure tools, product tools, the SaaS template system, and finally the CTO plus engineering team. On February 27th, VouchPost shipped. First fully autonomous product. The factory was real.
The C-Suite of Agents
Nine agents run Pelian Labs. Each one has a specific role, a specific model, and a specific cost profile. This isn't "one big model does everything." This is the right model for each job at the right price point.
1. CEO Agent - Claude Opus 4.6 ($75/M tokens)
Strategic decisions. Opportunity selection. Resource allocation. The only agent in the system with veto power. When the Trend agent surfaces an opportunity scoring 30+, the CEO evaluates market fit, competitive landscape, and alignment with Pelian's portfolio. It approves or kills product specs. It signs off on launch plans and pricing. It's the most expensive model in the stack because bad strategic decisions cost more than bad code - bad code gets caught by QA, bad strategy burns months.
2. CTO Agent - GPT-5.2-Codex ($14/M tokens)
Technical architecture. Product orchestration. Deployment gates. The CTO writes product specs, designs system architecture, and spawns Dev sessions for the actual build. Post-spec-approval, the CTO has full autonomy - it doesn't need CEO sign-off to make architectural decisions or override code review feedback. It's the operational brain. GPT-5.2-Codex was chosen for its strength in multi-file reasoning and its ability to hold complex system context across long sessions.
3. CMO Agent - Kimi K2.5
Brand positioning. Landing page copy. Launch strategy. Runs the ORB framework (Outcome, Relevance, Benefit) for all user-facing content. Kimi K2.5 produces consistently sharp copy at a fraction of frontier model costs. The CMO doesn't touch code - it produces marketing assets and hands them to the CTO for integration.
4. COO Agent - Kimi K2.5
Daily operations. Task dispatch. Budget tracking. Escalation filtering. The COO is the traffic cop. It monitors the Command Center, assigns tasks to the right agents, tracks time and token spend against the monthly budget, and decides what's worth escalating to the CEO versus what can be handled autonomously. Most operational decisions - bug fix priority, copy revisions, SEO tweaks - never reach my desk because the COO handles them.
5. Infra Agent - GPT-5.1-Codex-mini ($2/M tokens)
Infrastructure provisioning. Cloudflare D1, KV, Workers. Vercel deployments. DNS configuration. The Infra agent proposes infrastructure changes but never deploys without CTO approval. This constraint exists because we learned the hard way - a misconfigured deploy creates a separate worker with zero secrets, and debugging that at 2am is nobody's idea of fun. More on that later.
6. Growth Agent - GLM-5 / Kimi K2.5
SEO audits. Competitive analysis. A/B test design. The Growth agent runs continuous analysis of search rankings, backlink profiles, and competitor positioning. It designs experiments but doesn't execute them autonomously - growth experiments that touch production require CTO approval because a bad A/B test can tank conversion rates.
7. QA Agent - GPT-5.1-Codex-mini ($2/M tokens)
Adversarial testing. End-to-end test suites. Binary PASS/FAIL verdicts with no middle ground. The QA agent's entire philosophy is encoded in its operating document: "Pessimism is a feature." It assumes every product is broken until proven otherwise. A PASS from QA is a personal guarantee. This agent uses the cheapest model in the stack because its job isn't creative - it's systematic. Run the tests. Check the contracts. Report the results.
8. Trend Agent - GLM-5 ($2.55/M tokens)
Market scanning across ProductHunt, Reddit, Hacker News, G2, and IndieHackers. Runs on a daily cron. Scores every opportunity from 1 to 40 based on market size, competition, technical feasibility, and alignment with Pelian's capabilities. A score of 30+ triggers an opportunity brief to the CEO. The Trend agent's operating philosophy: "Default answer is no." Most ideas are noise until proven with data. Out of hundreds of scanned opportunities, maybe two per week cross the threshold.
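The article doesn't publish the scoring formula, but the mechanics it describes - four dimensions, a 1-40 total, a 30+ trigger - can be sketched. Everything below (the 1-10 per-dimension scale, the field names, the clamping) is an illustrative assumption, not the actual implementation:

```typescript
// Hypothetical sketch of the Trend agent's 1-40 scoring: four dimensions,
// each rated 1-10 by the model, summed, with 30+ triggering a CEO brief.
interface OpportunitySignals {
  marketSize: number;   // 1-10, from scan volume and engagement
  competition: number;  // 1-10, higher = less crowded
  feasibility: number;  // 1-10, fit with the template blocks
  alignment: number;    // 1-10, fit with Pelian's capabilities
}

const BRIEF_THRESHOLD = 30; // 30+ gets packaged into an opportunity brief

function scoreOpportunity(s: OpportunitySignals): { total: number; brief: boolean } {
  const clamp = (n: number) => Math.min(10, Math.max(1, Math.round(n)));
  const total =
    clamp(s.marketSize) + clamp(s.competition) + clamp(s.feasibility) + clamp(s.alignment);
  // "Default answer is no": only a clear 30+ escapes the noise filter.
  return { total, brief: total >= BRIEF_THRESHOLD };
}
```

Under this scheme, VouchPost's 34/40 would decompose into something like 9 + 8 + 9 + 8 - again, illustrative numbers, not the real brief.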
9. Principal Engineer Agent - GPT-5.1-Codex ($10/M tokens)
Code review gatekeeper. Reads every diff. Every pull request from the Dev sessions goes through the Principal Engineer before merging. It checks for security issues, architectural consistency, performance regressions, and adherence to the project's coding standards. No code reaches production without this agent's approval.
The SOUL.md System
Each agent has a SOUL.md file - an operating system document that defines its persona, principles, and decision-making framework. This isn't just a system prompt. It's a persistent philosophy that shapes every action the agent takes. Think of it as the agent's culture document.
Some excerpts that define how the factory thinks:
CTO SOUL.md: "The best code is the code you delete." and "Everything fails all the time."
QA SOUL.md: "Pessimism is a feature." and "A PASS is a personal guarantee."
Infra SOUL.md: "Boring and reliable beats clever and fragile."
Trend SOUL.md: "Default answer is no."
The SOUL.md system solves a real problem: agents don't have persistent memory across sessions. Without an explicit philosophy document, an agent's behavior drifts. One session it's conservative, the next it's reckless. SOUL.md provides behavioral consistency. The CTO always prefers deletion over addition. QA always assumes things are broken. The Infra agent always picks the boring option. These aren't suggestions - they're constraints that prevent the chaos that emerges when you give AI agents full autonomy without guardrails.
Why Multi-Model?
People ask why we don't just use Claude for everything. Three reasons.
Cost. Claude Opus 4.6 at $75 per million tokens is the best strategic reasoner I've tested. It's also wildly expensive for writing boilerplate infrastructure code. The Infra agent at $2/M handles provisioning just fine. Running Opus for that would burn through the monthly budget in days.
Specialization. GPT-5.2-Codex produces better multi-file code diffs than any other model I've tested. Kimi K2.5 writes sharper marketing copy than models three times its price. GLM-5 is surprisingly good at pattern recognition across noisy data feeds. Each model has a sweet spot.
Resilience. If one provider goes down - and they do, regularly - the factory doesn't stop. Agent roles can failover to alternative models. A monoculture is a single point of failure.
The Build Pipeline
Every product that Pelian Labs ships follows the same eight-phase pipeline. Total time: 8 to 14 hours from opportunity identification to production deployment.
Phase 0: Trend Scan (Daily Cron)
The Trend agent runs every day. It scrapes ProductHunt launches, Reddit threads, Hacker News discussions, G2 reviews, and IndieHackers posts. Each opportunity gets scored 1-40 across four dimensions: market size, competitive density, technical feasibility, and strategic alignment. Anything scoring 30+ gets packaged into an opportunity brief and sent to the CEO agent. Most days, nothing crosses the threshold. That's by design - the default answer is no.
Phase 1: Spec Gate (1-2 Hours)
If the CEO approves an opportunity brief, the CTO takes over. It writes a full product spec: user stories, technical architecture, API contracts, data models, deployment strategy. The spec goes back to the CEO for final approval. This is the most important gate in the pipeline. A bad spec cascades into bad architecture, bad code, and a bad product. Two hours here saves twenty hours downstream.
Phase 2: Infra Provisioning (30 Minutes)
The Infra agent reads the approved spec and proposes infrastructure: which Cloudflare Workers to create, which D1 databases to provision, which KV namespaces to set up, DNS records, Vercel project configuration. The proposal goes to the CTO for approval. Once approved, provisioning is automated - Workers, databases, KV stores, and DNS records are created programmatically.
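The propose-then-approve gate is the important part of this phase, and it's easy to model. The shape below is a sketch under stated assumptions - the resource kinds mirror the stack described in this article, but the field names and the `provision` function are hypothetical:

```typescript
// Illustrative shape of an Infra agent proposal. The invariant being modeled:
// nothing is provisioned until the CTO has signed off.
interface InfraResource {
  kind: "worker" | "d1" | "kv" | "dns";
  name: string;
}

interface InfraProposal {
  product: string;
  resources: InfraResource[];
  approvedByCTO: boolean;
}

// Provisioning refuses to run without CTO approval.
function provision(p: InfraProposal): string[] {
  if (!p.approvedByCTO) {
    throw new Error("Infra proposal requires CTO approval before provisioning");
  }
  return p.resources.map((r) => `created ${r.kind}: ${r.name}`);
}
```

Encoding the approval as a hard precondition, rather than a convention in a prompt, is what makes the gate survive a misbehaving agent.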
Phase 3: Product Build (2-5 Hours)
The CTO spawns a Dev session. This is where the actual code gets written. The Dev session has full context from the approved spec, the provisioned infrastructure, and the PEL-42 template system (more on that below). It builds the product: frontend, API, database schemas, auth flows, payment integration. A typical SaaS product generates 3,000 to 8,000 lines of code in this phase.
Phase 4: Code Review (1 Hour)
The Principal Engineer agent reviews every pull request. It reads every diff line by line. It checks for security vulnerabilities (SQL injection, XSS, auth bypasses), architectural violations (tight coupling, circular dependencies), performance issues (N+1 queries, missing indexes), and style consistency. If it finds issues, the PR goes back to the Dev session for fixes. No code merges without Principal Engineer approval.
Phase 5: QA (1-2 Hours)
The QA agent runs end-to-end tests. It doesn't write optimistic tests - it tries to break things. Invalid inputs, edge cases, race conditions, auth boundary tests. The verdict is binary: PASS or FAIL. There's no "pass with caveats." A FAIL sends the product back to Phase 3 with a detailed bug report. The QA agent has blocked deploys on issues the Principal Engineer missed. That's not a failure of code review - it's defense in depth working as intended.
Phase 6: Deploy (30 Minutes)
Automated deployment to Cloudflare Workers and Vercel. The deploy script handles environment variables, secret injection, database migrations, and DNS propagation. This phase is fully automated - no human or agent approval needed because every upstream gate has already been passed.
Phase 7: Verify (1 Hour)
The CTO agent checks every endpoint in production directly. Health checks, API contract validation, auth flows, payment flows, error handling. This is the final gate. If verification fails, it's a rollback - not a hotfix. We don't patch production at 3am. We roll back, fix in dev, and re-deploy through the full pipeline.
The Communication Layer
Agents talk to each other through a Command Center HTTP API built on a Convex backend. The API exposes atomic actions:
- task.create - Create a new task with assignee, priority, and context
- task.list - List tasks by status, assignee, or priority
- task.checkout - Atomic task claim. Returns 409 if already held by another agent. This prevents two agents from working on the same task simultaneously
- task.complete - Mark task done with output artifacts
- activity.log - Append to the activity stream for observability
- agent.heartbeat - Liveness check. If an agent misses three heartbeats, its tasks are released back to the pool
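The contract behind task.checkout and heartbeat-based release can be sketched in a few lines. The real system sits behind a Convex-backed HTTP API; the in-memory class below only models the semantics described above, with HTTP-style status codes standing in for the real responses:

```typescript
// Minimal model of atomic task checkout: 200 on a successful claim,
// 409 when another agent already holds the task.
interface Task {
  id: string;
  heldBy: string | null;
}

class TaskPool {
  private tasks = new Map<string, Task>();

  create(id: string): void {
    this.tasks.set(id, { id, heldBy: null });
  }

  checkout(id: string, agent: string): 200 | 404 | 409 {
    const t = this.tasks.get(id);
    if (!t) return 404;
    if (t.heldBy !== null && t.heldBy !== agent) return 409; // already claimed
    t.heldBy = agent;
    return 200;
  }

  // The heartbeat-timeout path: a dead agent's tasks go back to the pool.
  releaseAllFrom(agent: string): void {
    for (const t of this.tasks.values()) {
      if (t.heldBy === agent) t.heldBy = null;
    }
  }
}
```

The 409-on-conflict semantics are what prevent two agents from building the same feature twice; the release path is what prevents a dead agent from holding a task forever.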
A critical design decision: all context is passed in the first message. Agents have zero persistent state across sessions. Every time an agent spins up, it gets the full context it needs - the task description, relevant specs, SOUL.md, and any prior conversation history. This eliminates an entire class of bugs around stale state and context drift. The trade-off is higher token usage per session, but the reliability gain is worth it.
To maintain continuity between sessions, each agent has a MEMORY.md file that gets updated at the end of every session. The incoming agent reads MEMORY.md as part of its initial context. When MEMORY.md goes stale - and it does, which I'll get to - agents repeat work or make decisions based on outdated information. Keeping MEMORY.md fresh is one of the hardest operational challenges in the factory.
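Putting the two ideas together - all context in the first message, continuity via MEMORY.md - the session bootstrap looks roughly like this. The file names (SOUL.md, MEMORY.md) come from the article; the assembly function and section ordering are my own sketch:

```typescript
// Sketch of "all context in the first message": the bootstrap prompt is
// rebuilt from scratch every session, so there is no stale in-memory state.
interface SessionContext {
  soul: string;            // SOUL.md - the agent's persistent philosophy
  memory: string;          // MEMORY.md - continuity from the previous session
  spec: string;            // the approved product spec, if relevant
  taskDescription: string; // the concrete task checked out from the pool
}

function buildFirstMessage(ctx: SessionContext): string {
  // Ordering assumption: philosophy first, then continuity, then the task.
  return [
    "## Operating philosophy (SOUL.md)", ctx.soul,
    "## Session memory (MEMORY.md)", ctx.memory,
    "## Spec", ctx.spec,
    "## Task", ctx.taskDescription,
  ].join("\n\n");
}
```

The trade-off mentioned above is visible here: every session pays the token cost of the full SOUL.md and MEMORY.md, in exchange for never reasoning from stale state.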
The Template System (PEL-42)
Speed matters when your build window is 2-5 hours. PEL-42 is a modular template system - pre-built building blocks that agents assemble and customize for each product. Think of it as a parts catalog for SaaS products.
The pre-built blocks:
- Auth - Better Auth 1.3.x integration with email/password, OAuth, magic links
- Payments - Stripe checkout, subscription management, webhook handling
- Analytics - PostHog integration with event tracking and feature flags
- Email - Transactional email templates and delivery
- Support - Help desk widget, ticket routing
- Landing Pages - 10 variants per product type (SaaS, marketplace, tool, API)
- Mobile - Responsive layouts and PWA configuration
Each block is a self-contained module with typed interfaces. The CTO's Dev session pulls in the blocks it needs, wires them together, and writes domain-specific logic on top. A landing page with auth, payments, and analytics can be scaffolded in minutes. The custom code - the business logic that makes the product unique - is where the 2-5 hours actually go.
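A minimal sketch of that assembly model, assuming each block exposes a typed mount function the Dev session composes. The block names come from the catalog above; the interface and the route strings are hypothetical:

```typescript
// Sketch of PEL-42-style block assembly: each block is a self-contained
// module that contributes its routes/config to the product being scaffolded.
interface Block {
  name: "auth" | "payments" | "analytics" | "email" | "support" | "landing" | "mobile";
  mount: (app: string[]) => void; // appends this block's wiring to the app
}

function scaffold(blocks: Block[]): string[] {
  const app: string[] = [];
  for (const b of blocks) b.mount(app);
  return app; // domain-specific logic gets written on top of this skeleton
}

// Two example blocks with illustrative route wiring.
const auth: Block = { name: "auth", mount: (app) => app.push("route:/auth/*") };
const payments: Block = { name: "payments", mount: (app) => app.push("route:/stripe/webhook") };
```

The point of the typed interface is that the Dev session can't half-wire a block: either the module mounts cleanly or the build fails at compile time, before code review ever sees it.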
The tech stack underneath is deliberately boring:
- Frontend: Next.js 15 + Tailwind + shadcn/ui, deployed on Vercel
- API: Hono on Cloudflare Workers
- Database: Cloudflare D1 (SQLite) + Drizzle ORM
- Auth: Better Auth 1.3.x
- Payments: Stripe
- Analytics: PostHog
- Errors: Sentry
- Video: Cloudflare Stream
- DNS: Cloudflare
Every technology was chosen for the same reasons: zero infrastructure management, sub-50ms response times, type safety end-to-end, and generous free tiers. The factory doesn't have a DevOps team. The stack has to be self-managing.
What Went Wrong
I have 150+ lessons documented in a lessons.md file. Every failure gets written down with root cause, fix, and prevention rule. Here are the ones that cost the most time.
Cloudflare Workers CPU Limits
Better Auth's password hashing uses bcrypt, which on the free tier of Cloudflare Workers exceeds the 10ms CPU time limit. The request just dies. No error. No log. Just a 503. We burned two full days debugging this before finding the root cause. The fix was switching to a lighter hashing algorithm for the Workers environment and pinning Better Auth to 1.3.x - the 1.4.x release introduced changes that broke Workers compatibility entirely.
The --env production Disaster
One of the most expensive lessons. Running wrangler deploy --env production doesn't deploy to your production worker. It creates a completely separate worker with the name your-worker-production. That new worker has zero secrets, zero KV bindings, zero D1 bindings. Everything looks like it deployed successfully. Everything is broken. This cost us a full day of debugging because the deploy logs showed success. The endpoint returned 500s. The worker existed but had no configuration. Now the Infra agent has an explicit rule: never use --env flags. Deploy to the named worker directly.
Stale Agent Memory
When MEMORY.md doesn't get updated between sessions, the next agent picks up where the session before last left off. It re-does work. It makes decisions based on infrastructure that has since been modified. It tries to provision resources that already exist. The worst case: an agent tried to create a D1 database that was already in use, which would have overwritten production data if the Infra agent's safety check hadn't caught it. The fix is a post-session hook that forces MEMORY.md updates, but keeping it reliable is an ongoing battle.
Task Dispatch Failures
The Command Center uses webhook secrets for authentication. When a secret rotates and one agent's environment doesn't get updated, that agent's tasks silently fail. It checks out a task, does the work, tries to mark it complete, gets a 401, and the task sits in "in progress" forever. Combined with the atomic checkout system (which returns 409 if a task is already held), this creates a deadlock - the task can't be claimed by another agent because it's still held, and the holding agent can't release it because its auth is broken. We added a heartbeat-based timeout: if an agent misses three consecutive heartbeats, all its tasks are released back to the pool.
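The three-missed-heartbeats rule is simple enough to sketch directly. The counter logic below models the stated rule; the interval length and the bookkeeping details are assumptions:

```typescript
// Sketch of the heartbeat-based timeout: any heartbeat resets the counter,
// three consecutive misses and the monitor signals a task release.
const MAX_MISSED = 3;

class HeartbeatMonitor {
  private missed = new Map<string, number>();

  beat(agent: string): void {
    this.missed.set(agent, 0); // a heartbeat resets the miss counter
  }

  // Called once per interval for each registered agent that did not beat.
  tick(agent: string): "alive" | "release-tasks" {
    const n = (this.missed.get(agent) ?? 0) + 1;
    this.missed.set(agent, n);
    return n >= MAX_MISSED ? "release-tasks" : "alive";
  }
}
```

Paired with the atomic checkout's release path, this breaks the deadlock: the broken agent can't mark its task complete, but after three missed beats the pool takes the task back anyway.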
OpenClaw Comms Issues
The communication layer had two nasty bugs. First, device token registration was missing a required scope parameter, so push notifications silently failed. Agents would send messages that never arrived. Second, the default 30-second timeout was too short for agents running complex operations - a code review that takes 45 seconds would get killed mid-analysis, producing a partial review that looked complete but wasn't. We extended timeouts to 120 seconds and added explicit completion markers so partial responses are always detectable.
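The completion-marker fix is worth spelling out, because it's the general pattern for making truncation detectable. The marker string below is an assumption; the real sentinel could be anything agents are instructed to emit last:

```typescript
// Sketch of explicit completion markers: a response is only trusted if it
// ends with the sentinel, so a timeout-truncated reply can never pass as complete.
const COMPLETION_MARKER = "<<END_OF_RESPONSE>>";

function isComplete(response: string): boolean {
  return response.trimEnd().endsWith(COMPLETION_MARKER);
}

// Only call on responses that passed isComplete().
function stripMarker(response: string): string {
  return response.trimEnd().slice(0, -COMPLETION_MARKER.length).trimEnd();
}
```

Without this, a 45-second code review killed at second 30 looks like a short-but-valid review. With it, the consumer can distinguish "done" from "cut off" unambiguously.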
Shell Injection in Scaffolding
The PEL-42 scaffolding scripts accept user-provided project names and pass them to shell commands. Early versions didn't sanitize inputs. A project name containing backticks or semicolons could execute arbitrary commands on the build server. The QA agent caught this during adversarial testing - it submitted a project name of ; rm -rf / and flagged the vulnerability. We now sanitize all user inputs through a strict allowlist (alphanumeric, hyphens, underscores only) before they touch any shell command.
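The allowlist fix maps directly to a few lines of code. This is a sketch of the validation described above, not the actual scaffolding script:

```typescript
// Strict allowlist for project names before they touch any shell command:
// alphanumerics, hyphens, and underscores only. Reject everything else.
const PROJECT_NAME = /^[A-Za-z0-9_-]+$/;

function sanitizeProjectName(name: string): string {
  if (!PROJECT_NAME.test(name)) {
    throw new Error(`Rejected project name: ${JSON.stringify(name)}`);
  }
  return name;
}
```

Allowlisting is the right shape here: a denylist of "dangerous" characters would have to enumerate every shell metacharacter and would still miss one. An allowlist fails closed.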
Version Pinning the Hard Way
This one is boring but important. Better Auth 1.4.x introduced breaking changes that aren't flagged in the changelog. The migration from 1.3.x to 1.4.x silently changes session handling behavior on Cloudflare Workers. We lost a full day to this before pinning to 1.3.x. The lesson: in autonomous systems, every dependency must be version-pinned. No caret ranges. No tilde ranges. Exact versions only. An agent can't debug a dependency upgrade it didn't know happened.
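The "exact versions only" rule can be enforced mechanically. Here's a sketch of a guard that flags any loose dependency spec - a real check would read package.json from disk, and the version strings in the usage example are illustrative:

```typescript
// Reject anything that isn't an exact major.minor.patch version:
// no caret ranges, no tilde ranges, no wildcards.
const EXACT_VERSION = /^\d+\.\d+\.\d+$/;

function findLooseDeps(deps: Record<string, string>): string[] {
  return Object.entries(deps)
    .filter(([, version]) => !EXACT_VERSION.test(version))
    .map(([name]) => name);
}
```

Wired into a pre-commit hook or the code review gate, this turns the "no caret ranges" lesson from a rule an agent must remember into a check an agent cannot bypass.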
The Human Checkpoints
Full autonomy doesn't mean zero oversight. There are six gates where the human CEO (me, for now) must approve before the factory proceeds:
- Opportunity brief - Any opportunity scoring 30+ needs human review before spec work begins
- Product spec.md - The full technical spec requires sign-off
- Product name + positioning - Brand decisions are too consequential to automate
- Launch plan - Go-to-market strategy gets reviewed
- Pricing changes - Revenue model modifications require approval
- Budget threshold breaches - If spend approaches the monthly limit, the factory pauses
Everything else runs without me. Product builds after spec approval. Code review overrides. Bug fixes. Patches. Copy revisions. SEO optimization. QA testing. The factory handles all of it. On a typical day, I review 2-3 documents in the morning and the factory runs for the remaining 23 hours.
The Economics
Monthly budget: $100. That's the total spend on AI model inference via OpenRouter, with spend limit alerts at 80%.
Here's the per-agent cost breakdown for a typical product build:
- CEO Agent (Opus 4.6, $75/M) - ~$3-5 per product (reviews 2-3 documents, low token volume)
- CTO Agent (GPT-5.2-Codex, $14/M) - ~$15-25 per product (heaviest usage, spec + orchestration + verification)
- Principal Engineer (GPT-5.1-Codex, $10/M) - ~$5-8 per product (full diff review)
- QA Agent (GPT-5.1-Codex-mini, $2/M) - ~$2-4 per product (systematic testing)
- Infra Agent (GPT-5.1-Codex-mini, $2/M) - ~$1-2 per product (provisioning)
- CMO Agent (Kimi K2.5) - ~$2-3 per product (copy and positioning)
- Trend Agent (GLM-5, $2.55/M) - ~$5-8 per month (daily scans, amortized)
- COO + Growth - ~$3-5 per month (operational overhead)
Total cost per product shipped: roughly $30-50. At $100/month, the factory can ship 2-3 products per month.
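To sanity-check the math, here's the per-build arithmetic with illustrative token volumes chosen to land inside the stated cost bands. The prices come from this article; the volumes (and the Kimi K2.5 price, which the article doesn't list) are assumptions:

```typescript
// Back-of-envelope per-product cost: price ($/M tokens) x volume (M tokens).
const buildSpend: Array<{ agent: string; pricePerMTok: number; mTok: number }> = [
  { agent: "CEO (Opus 4.6)",            pricePerMTok: 75,  mTok: 0.05 }, // ~$3.75
  { agent: "CTO (GPT-5.2-Codex)",       pricePerMTok: 14,  mTok: 1.4  }, // ~$19.60
  { agent: "Principal Eng (GPT-5.1)",   pricePerMTok: 10,  mTok: 0.6  }, // ~$6.00
  { agent: "QA (Codex-mini)",           pricePerMTok: 2,   mTok: 1.5  }, // ~$3.00
  { agent: "Infra (Codex-mini)",        pricePerMTok: 2,   mTok: 0.75 }, // ~$1.50
  { agent: "CMO (Kimi K2.5, assumed)",  pricePerMTok: 2.5, mTok: 1.0  }, // ~$2.50
];

const totalPerProduct = buildSpend.reduce(
  (sum, a) => sum + a.pricePerMTok * a.mTok,
  0
); // lands inside the $30-50 band
```

The volumes are hypothetical, but they show the shape of the budget: the CTO dominates spend despite a mid-tier price, because orchestration and verification consume the most tokens.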
Now the comparison that makes people uncomfortable. What would this team cost as humans?
- CEO / Product strategist: $200,000+/year
- CTO: $250,000+/year
- Senior full-stack developer: $180,000+/year
- Principal engineer (code review): $220,000+/year
- QA engineer: $120,000+/year
- Marketing lead: $150,000+/year
- DevOps/Infra: $160,000+/year
- Growth marketer: $130,000+/year
- Market analyst: $100,000+/year
That's $1.5 million per year in salary alone, before benefits, office space, equity, and management overhead. The factory runs on $1,200 per year. That's a 1,250x cost reduction. Even if you argue the human team would produce higher quality output (debatable for commodity SaaS), the economics are absurd.
The real unlock isn't the cost savings on any single product. It's the velocity. A human team ships one product every 3-6 months. The factory ships one every 8-14 hours. That changes the math on what's worth building. Ideas that would never justify a human team's attention - micro-SaaS products targeting 500-user niches - become viable when the build cost is $40.
What VouchPost Proved
VouchPost was the first fully autonomous product. Shipped February 27, 2026. It's a video testimonial collection platform - businesses send customers a link, customers record a video testimonial, the business gets a polished widget to embed on their site.
The timeline:
- Day 0, 6:00am - Trend agent flags opportunity (score: 34/40)
- Day 0, 7:15am - CEO approves opportunity brief
- Day 0, 7:30am - CTO begins spec
- Day 0, 9:00am - I approve the spec
- Day 0, 9:30am - Infra provisioning complete
- Day 0, 2:30pm - Product build complete (5 hours)
- Day 0, 3:30pm - Principal Engineer approves PR
- Day 0, 5:00pm - QA passes (first attempt)
- Day 0, 5:30pm - Deployed to production
- Day 0, 6:30pm - CTO verification complete
- Day 0, 8:00pm - CMO delivers landing page copy and launch plan
Total wall clock: 14 hours. My involvement: reviewed and approved 4 documents (opportunity brief, spec, product name, launch plan). Time spent: about 45 minutes.
VouchPost isn't a toy. It handles video recording via Cloudflare Stream, has Stripe payments for premium plans, includes email notifications, provides an embeddable widget with customizable styling, and runs on a stack that auto-scales to zero when nobody's using it. The $40 it cost to build would have been a $50,000+ project at any agency - including mine.
The Project Timeline
Building the factory itself was a sprint. Here's the sequence:
- PEL-35 (Feb 10) - OpenClaw infrastructure. Communication backbone between agents
- PEL-36 (Feb 15) - CEO agent factory. The meta-system that bootstraps new agents
- PEL-37 (Feb 18) - Comms + crons. Daily scanning, webhook routing, event dispatch
- PEL-39 (Feb 22) - Infrastructure tools. Programmatic Cloudflare/Vercel provisioning
- PEL-40 (Feb 24) - Product tools. Template assembly, code generation utilities
- PEL-41 (Feb 25) - SaaS template. The PEL-42 block system's first version
- PEL-42 (Mar 2) - CTO + engineering team. The full autonomous build pipeline
- VouchPost MVP (Feb 27) - First shipped product. Proof the factory works
21 days from first infrastructure commit to a shipped product. That timeline only worked because I'd already built the individual pieces across dozens of client projects. The factory is the culmination of patterns I'd seen repeated so many times that codifying them was less a design challenge and more an extraction exercise.
What's Next
The factory is live but it's early. Here's what's on the roadmap.
Parallel product builds. Right now the pipeline is sequential - one product at a time. The architecture supports parallelism (the atomic task checkout system was designed for it), but we haven't stress-tested concurrent builds sharing infrastructure provisioning. That's the next milestone.
Better inter-agent communication. The current Command Center API is functional but primitive. Agents can create tasks and log activity, but they can't have conversations. A CTO agent that needs to clarify a spec detail with the CEO agent currently has to create a new task, wait for it to be picked up, and read the response. That round-trip can take 30+ minutes. We're building a synchronous messaging layer for time-sensitive coordination.
Smarter trend scoring. The 1-40 scoring system works but it's coarse. A score of 31 and a score of 39 trigger the same action. We're adding weighted dimensions, historical calibration (how did past opportunities at this score actually perform?), and revenue potential modeling. The goal: trend scoring that gets better with every product the factory ships.
The compounding vision. This is the long game. Each product the factory ships generates revenue. That revenue funds more infrastructure, more agent compute, more products. A factory that compounds - where the output of one cycle feeds the input of the next. It's not there yet. VouchPost needs to prove revenue before we can validate the loop. But the architecture supports it, and the economics suggest it's viable.
What This Means
I want to be clear about what this is and isn't.
This is not about replacing developers. I'm a developer. I built the factory. The agents inside it write code that I review and maintain. The factory doesn't eliminate the need for engineering judgment - it amplifies it. One engineer's judgment, applied across nine specialized agents, produces output that would require a team of nine humans.
This is about what happens when you give AI agents real autonomy with real guardrails. Not a chatbot. Not an autocomplete. A system that makes decisions, executes on them, checks its own work, and ships to production. The six human checkpoints aren't there because the factory can't run without them - they're there because some decisions (what to build, what to call it, how to price it) should have human judgment for now.
The factory doesn't take meetings. It doesn't have opinions about tabs vs spaces. It doesn't need a team offsite to align on priorities. It scans markets at 6am, writes specs by 9am, ships by evening, and starts scanning again the next morning. It runs 24 hours a day, 7 days a week, for $100 a month.
Is it perfect? No. I have 150 documented failures that prove it isn't. But it ships. And with every product it ships, the lessons.md grows, the SOUL.md files get sharper, and the pipeline gets more reliable. The factory learns. Not in the LLM training sense - in the operational sense. Every failure becomes a rule that prevents the next failure.
If you're a CTO reading this and thinking "we should build something like this for our internal tools pipeline" or "what would this look like for our QA process" or "could we automate our product discovery" - that's exactly the kind of work we do at Pelian, the agency. We've been building autonomous systems for clients for over a year. Pelian Labs is just the version we built for ourselves.
The blueprint is here. The architecture works. The economics are real. The question isn't whether autonomous AI factories will exist. It's whether you'll build one or compete against one.