Stop Paying Genius Rates for Junior Work: My 24/7 AI Team Across Four Providers
One Claude Max subscription used to run my whole agent team — until they went always-on. Here's the org chart I built instead, matching each LLM to the role it deserves.
For a while, one Claude Max subscription ran my entire AI engineering team. It was glorious. Flat fee, all-you-can-eat tokens, no meter ticking in the back of my head. Then the team grew past five agents and started running around the clock — and the honeymoon ended fast.
The subscription cliff
Here's the thing nobody tells you about scaling an agent team: a subscription is a buffet, but the API is a taxi with the meter running.
While my agents were occasional and bursty, the subscription absorbed everything. The moment they became always-on — 24/7, dozens of PRs, continuous review loops — I was no longer snacking. I was running a factory. This is what that looks like by mid-week:
Weekly limit: 100% used, and the session meter already climbing again. Once agents run around the clock, you hit this wall fast.
So I did the math on "just buy API credits." For the same workload, metered API cost me roughly 50x what the subscription did. Scaled across a 24/7 team, that's a couple thousand USD a month flowing to one vendor. Hard no. Not at pre-revenue, not for grunt work, not when there's a smarter way to spend.
The reframe: I don't have a model, I have an org
The expensive mistake is treating every agent as an identical clone, all dialing the same premium model for every keystroke.
Real companies don't do this. You don't hire a principal architect to do data entry. You don't put your most creative thinker on QA. You match the person to the role, and you match the role to the budget. Headcount is allocated, not maximized.
So I stopped buying "the best model" and built an actual org chart — then assigned a different LLM to each role based on what that role genuinely needs.
The real org chart: a CEO Assistant up top, a Production Manager coordinating dev engineers running Kimi and MiniMax, codex-based reviewers and a root-cause researcher off to the side. Every box is a role, and every role gets the model it deserves.
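A minimal sketch of how an org chart like this could be written down as data. Treat it as an illustration, not the actual configuration: the tier labels, the model identifier strings, the role name "Dev Engineer A", and the helpers `model_for` and `headcount_by_model` are all hypothetical, and placing the CEO Assistant and Production Manager in the leadership tier is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    tier: str   # "leadership", "labor", or "qa" (hypothetical labels)
    model: str  # placeholder identifier, not a real API model name


# Role names follow the org chart where the text gives them; the rest is illustrative.
ORG_CHART = [
    Role("CEO Assistant", "leadership", "claude-opus"),       # tier is an assumption
    Role("Production Manager", "leadership", "claude-opus"),  # tier is an assumption
    Role("Dev Engineer A", "labor", "kimi"),                  # hypothetical role name
    Role("Dev Engineer B", "labor", "minimax"),
    Role("Code Reviewer", "qa", "gpt"),
]


def model_for(role_name: str) -> str:
    """Return the model assigned to a role, or raise KeyError if unknown."""
    for role in ORG_CHART:
        if role.name == role_name:
            return role.model
    raise KeyError(f"no such role: {role_name}")


def headcount_by_model() -> dict[str, int]:
    """Count how many roles are assigned to each model."""
    counts: dict[str, int] = {}
    for role in ORG_CHART:
        counts[role.model] = counts.get(role.model, 0) + 1
    return counts
```

Keeping the assignment in one table makes it cheap to move a role to a different model later without touching the agents themselves.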
Leadership → Claude Opus
The planners and leads — the agents deciding what to build and how the pieces fit together — run on Claude Opus.
This is where you want the model that holds the whole board in its head and thinks divergently. In my experience Claude is the strongest at big-picture reasoning and generating options before committing to one. It's also the most expensive model I use — and that's fine, because leadership is a small headcount. You have one architect, not twenty. Spending premium tokens on the decisions that steer everything downstream is exactly where premium tokens belong.
Labor → Kimi & MiniMax
The bulk of any engineering team's work is just... writing the code. Implementing a spec that somebody smarter already designed.
For that, I use Kimi and MiniMax. They're not as sharp as Claude or GPT — I won't pretend otherwise — but they write code like a competent junior, and they're dramatically cheaper. Each one has exactly one job and a hard boundary around it.
Dev Engineer B runs MiniMax and is scoped to "implements specs only; NO review / RCA / decisions / delegation." Here it notices that two other PRs have already superseded its own, and it closes its PR instead of blindly rebasing a duplicate. Junior, but not careless.
When the hard thinking has already happened upstream, you don't need a genius to type out the implementation. You need throughput at a price you can afford to run nonstop. This is where the volume lives, so this is where the cost discipline matters most.
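To make the cost logic concrete, here is a toy calculation comparing an all-premium team with a tiered one. Every number in it is made up for illustration: the per-million-token prices and the monthly token volumes are assumptions, not real vendor pricing or measured usage.

```python
# Hypothetical prices in USD per million tokens (NOT real vendor pricing).
PRICE_PER_MTOK = {"premium": 30.0, "mid": 10.0, "budget": 2.0}

# Hypothetical monthly volume per tier, in millions of tokens.
MONTHLY_MTOK = {"leadership": 20, "labor": 300, "qa": 80}


def monthly_cost(assignment: dict[str, str]) -> float:
    """Sum the monthly cost given a tier -> price-class assignment."""
    return sum(
        MONTHLY_MTOK[tier] * PRICE_PER_MTOK[price_class]
        for tier, price_class in assignment.items()
    )


# Every tier on the premium model.
all_premium = monthly_cost(
    {"leadership": "premium", "labor": "premium", "qa": "premium"}
)

# Premium only for leadership; cheaper classes for labor and QA.
tiered = monthly_cost({"leadership": "premium", "labor": "budget", "qa": "mid"})
```

With these invented inputs, most of the saving comes from moving the labor tier off the premium model, because that is where the token volume sits.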
QA → OpenAI GPT
Review is a completely different skill from creation. For a reviewer you don't want creative; you want careful, rule-following, allergic to shortcuts.
GPT models (running on the codex runtime) follow the SOP. I give them a tight, boring mandate and they stick to it instead of wandering off to redesign things.
The Code Reviewer's standing orders: review PRs, file concrete issues if it spots a gap, report idle — and explicitly NEVER triage backlog, implement features, or file RFCs that aren't direct PR follow-ups. Discipline over creativity.
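One way to make a mandate like this enforceable rather than advisory is an explicit allowlist per role, checked before an agent acts. The sketch below is hypothetical: the action names, the `ALLOWED` table, and the `is_allowed` and `authorize` helpers are invented for illustration and only paraphrase the two mandates quoted above.

```python
# Hypothetical action vocabulary paraphrasing the two mandates in the text.
ALLOWED = {
    # Closing its own superseded PR is assumed to be in scope for a dev engineer.
    "Dev Engineer B": {"implement_spec", "close_own_pr"},
    "Code Reviewer": {"review_pr", "file_issue", "report_idle"},
}


class OutOfScope(Exception):
    """Raised when a role attempts an action outside its mandate."""


def is_allowed(role: str, action: str) -> bool:
    """True only if the action is explicitly listed for the role."""
    return action in ALLOWED.get(role, set())


def authorize(role: str, action: str) -> None:
    """Raise OutOfScope unless the role is permitted to take the action."""
    if not is_allowed(role, action):
        raise OutOfScope(f"{role} may not {action}")
```

An allowlist fails closed: an action nobody thought to list is denied by default, which matches the "hard boundary" framing better than a list of forbidden actions would.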
And this isn't theater. Here's the loop catching a real security bug, end to end, with me nowhere in the room:
The reviewer blocks the PR because the production create-path was storing bot tokens and webhook secrets in plaintext — the encryption call had been dropped before persistence. The dev agent goes back, restores the single EncryptSensitiveFields call, and re-submits. A genuine plaintext-secrets leak, caught and fixed without me touching it.
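The loop described above can be sketched as a review gate that alternates with a fix step until the reviewer has no blocking issues left. This is a toy model under stated assumptions: `PullRequest`, `review`, `fix`, and `review_loop` are hypothetical names, and a single boolean stands in for "the create path calls EncryptSensitiveFields before saving".

```python
from dataclasses import dataclass, field

PLAINTEXT_SECRETS = "secrets persisted in plaintext"


@dataclass
class PullRequest:
    # Toy stand-in for whether the encryption call runs before persistence.
    encrypts_before_save: bool
    log: list[str] = field(default_factory=list)


def review(pr: PullRequest) -> list[str]:
    """Reviewer: return blocking issues; an empty list means approve."""
    return [] if pr.encrypts_before_save else [PLAINTEXT_SECRETS]


def fix(pr: PullRequest, issues: list[str]) -> None:
    """Dev agent: address only the issues the reviewer filed."""
    if PLAINTEXT_SECRETS in issues:
        pr.encrypts_before_save = True  # restore the dropped encryption call


def review_loop(pr: PullRequest, max_rounds: int = 3) -> bool:
    """Alternate review and fix until approved or the rounds run out."""
    for _ in range(max_rounds):
        issues = review(pr)
        if not issues:
            pr.log.append("approved")
            return True
        pr.log.append("blocked")
        fix(pr, issues)
    return False
```

The `max_rounds` cap is a design choice for the sketch: without it, a reviewer and a dev agent that disagree could loop indefinitely on a metered API.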
A good reviewer doesn't need to be the most imaginative agent in the room. It needs to be the most disciplined. Pairing a divergent planner with a convergent reviewer is half the reason the whole thing stays stable.
What this actually buys me
Three things, and they compound.
Cost falls off a cliff, in the right direction. The expensive model only runs where the expense is justified. Everything else runs cheap. The bill stops scaling at premium rates with every agent added to the team.
Specialization. Each layer does what it's actually good at — divergent thinking up top, raw throughput in the middle, disciplined review at the gate.
Resilience. No single provider holds my entire operation hostage. After getting my GitHub org banned two days before an investor demo, I learned that lesson the expensive way: redundancy isn't overhead, it's insurance. The same logic applies to model vendors.
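The redundancy argument can be sketched as a fallback chain: try the preferred provider for a role and move to the next one if it fails. Everything below is hypothetical, including the provider labels and the `call_with_fallback` helper; the provider functions are stubs, not real SDK calls.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised when a provider cannot serve a request."""


def call_with_fallback(
    prompt: str,
    providers: list[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real API clients.
def primary_down(prompt: str) -> str:
    raise ProviderError("account suspended")


def backup_ok(prompt: str) -> str:
    return f"done: {prompt}"
```

A call such as `call_with_fallback("implement spec", [("primary", primary_down), ("backup", backup_ok)])` skips the failing stub and returns the backup's response along with its name, so the caller can log which vendor actually did the work.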
The takeaway
Stop asking "which model is the best." Start asking "which model for which job."
The unit of intelligence isn't the model — it's the org. Once you see your agents as a company with roles and a budget instead of a swarm of identical geniuses, the architecture and the economics both fall into place.
That, incidentally, is the whole bet behind what I'm building at Molecules AI — an operating system for organizations of AI agents, where composing a team like this is the default, not a hack I had to invent under cost pressure.