When people hear “67-agent AI organization,” I get one of two reactions: some people light up, some roll their eyes. Both are reasonable, because the term is doing a lot of work, and most “agent” products on the market today are not what we mean by it.
So this post is what we mean.
What an agent is in BBB
An agent in BBB has four properties:
- A persona — a markdown document describing the agent’s role, decision principles, what they own, and what they refuse to do. Aiko’s persona has a section titled “Never fabricate a regulatory citation” because the consequence of fabricating one is real legal exposure. Lena’s persona has a section about how to think across QAR, AED, SAR, and USD without conflating them.
- A skillset — a library of named skills the agent can pull from. A “skill” is roughly: a structured prompt template + a validation rubric + a few-shot example pack. Yara’s skillset includes `email-sequence-designer`, `case-study-writer`, and `microcopy-generator`. Argus’s includes `stride-threat-modeling`, `jailbreak-atlas-curator`, and `honeytoken-deployer`.
- A calibration history — every drill the agent has ever run, scored 0–10 by Juana (our quality officer) and a second judge. The history determines the agent’s tier (we have five). Higher tier = more complex tasks routed to that agent.
- A tier — 1 (apprentice) through 5 (graduated). Promotion happens at three consecutive scores ≥9. Demotion at two consecutive ≤4. The fleet calibrates itself; we don’t ship a “v2” of an agent — we let the drill data do the work.
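The promotion/demotion rule is small enough to show. Here’s a minimal sketch (the function name and shapes are illustrative, not our production code):

```python
def update_tier(tier: int, history: list[float]) -> int:
    """Apply the calibration rule to an agent's drill-score history.

    Hypothetical sketch: promote on three consecutive scores >= 9,
    demote on two consecutive scores <= 4, otherwise hold steady.
    """
    if len(history) >= 3 and all(s >= 9 for s in history[-3:]):
        return min(tier + 1, 5)  # promote, capped at Tier 5 (graduated)
    if len(history) >= 2 and all(s <= 4 for s in history[-2:]):
        return max(tier - 1, 1)  # demote, floored at Tier 1 (apprentice)
    return tier
```

The point of encoding it this bluntly is that nothing else moves an agent between tiers: no manual overrides, just drill data.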
The drill harness
The thing that makes the system actually self-improve is a separate process called the drill harness. It runs in the background, picks up where the academy schedule says to go, and assigns each agent a daily drill from a daily_drills.yaml of curated scenarios.
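For a concrete picture, a curated scenario entry in `daily_drills.yaml` might look something like this (a hypothetical sketch; every field name and value here is illustrative, not our real schema):

```yaml
# Illustrative shape only — not the production daily_drills.yaml schema.
- id: drill-legal-001
  agent: aiko
  scenario: "Customer asks whether their data-retention clause survives termination."
  rubric: [correctness, completeness, brand_voice]
  min_tier: 3   # the lowest tier this drill is meant to exercise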
The drill is graded twice. Juana grades against the rubric (correctness, completeness, brand voice). A second judge — currently aya-expanse:8b — grades for coherence, usefulness, audience match, and padding/filler. The two scores are averaged; if they disagree by more than 3 points, the drill is flagged for human mentor review.
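The grading rule is easy to state in code. A minimal sketch (names are hypothetical):

```python
def combine_grades(juana: float, second: float,
                   flag_threshold: float = 3.0) -> tuple[float, bool]:
    """Average the two judge scores; flag for human mentor review
    when the judges disagree by more than the threshold."""
    needs_review = abs(juana - second) > flag_threshold
    return (juana + second) / 2, needs_review
```

So `combine_grades(8, 4)` averages to 6.0 but flags the drill, while `combine_grades(9, 7)` passes quietly at 8.0.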
This is the layer that caught the parse bug last week. (Quick story: the second judge was emitting `**Score:** 8/10` in markdown bold; our parser was strict-matching `SCORE: 8` and returning 0 on every dual-graded drill. 16 departments were stuck at Tier 1 for weeks. The fix took an afternoon; the parser now handles every common output variant, and 11 departments retroactively got promoted to Tier 3+.)
How an agent gets work
When an inbound message arrives at BBB — say, a tweet at a customer’s account — it goes through this routing:
- Inbound pipeline receives it, extracts the text, scores complaint severity (0–100), and classifies intent into one of 13 routing classes.
- Routing matrix maps the class to the right department. Billing → Finance. Privacy → Legal. Performance → Engineering.
- Department’s lead agent picks up the case. For Customer Care that’s Aiko. For Legal it’s also Aiko (different role). For Finance it’s Lena.
- Lead delegates to specialists if needed — drafter for the reply, complaint-case opener for the workflow, audit-logger for the trail.
- All work passes through approval gate by default. Human reviews, edits, approves, denies. Crisis cases auto-route for human review regardless of policy.
- Outbound poster picks up approved drafts, posts to the right platform, marks the case customer-acked.
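The routing above boils down to a couple of lookup tables plus the approval-gate rule. A minimal sketch, using only the departments named above and a hypothetical crisis threshold (the real matrix covers all 13 classes):

```python
# Illustrative subset of the routing matrix and lead assignments.
ROUTING_MATRIX = {
    "billing": "finance",
    "privacy": "legal",
}
DEPARTMENT_LEADS = {
    "finance": "Lena",
    "legal": "Aiko",
    "customer_care": "Aiko",
}
CRISIS_THRESHOLD = 80  # hypothetical cut-off on the 0-100 severity score

def route(intent_class: str, severity: int, auto_publish: bool = False) -> dict:
    """Map an inbound message to a department, lead agent, and review path."""
    department = ROUTING_MATRIX.get(intent_class, "customer_care")
    lead = DEPARTMENT_LEADS[department]
    # Everything waits for human approval unless auto-publish is opted in,
    # and crisis-severity cases go to a human regardless of that policy.
    needs_human = severity >= CRISIS_THRESHOLD or not auto_publish
    return {"department": department, "lead": lead, "human_review": needs_human}
```

Note that the default is always human review; `auto_publish` is the opt-in exception, and severity can override even that.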
Every step writes to the audit log. Every step is keyed by your tenant ID. Every step is encrypted at rest with your tenant’s DEK (data encryption key).
What the agents don’t do
Worth saying out loud:
- They don’t decide on your behalf. Every reply, decision, contract redline waits for human approval by default. Auto-publish is opt-in per channel.
- They don’t have memory of other tenants. Multi-tenant row-level security (RLS) guarantees this at the database layer. Yara on tenant A literally cannot read Yara on tenant B’s brand voice; they’re in different rows of the same table.
- They don’t replace your team. The math we use internally is: one BBB department replaces the first hire in that function — the one most SMEs can’t justify hiring. Your existing team gets leveraged 10× by the fleet, not replaced by it.
- They don’t make things up about your business. Pre-flight catches fabrications before drafts reach you. If the agent doesn’t know your real customer count, it won’t invent one.
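As a toy illustration of the pre-flight idea, here’s a check that flags numeric claims a draft makes that aren’t in the tenant’s fact store (the real pre-flight covers far more than numbers; the function and its shapes are illustrative):

```python
import re

def preflight_numbers(draft: str, known_facts: set[str]) -> list[str]:
    """Return numeric claims in the draft that don't appear in the
    tenant's verified fact store. An empty list means the draft passes
    this (toy) check."""
    claims = re.findall(r"\d[\d,]*\+?", draft)
    return [c for c in claims if c not in known_facts]
```

If the fact store says the tenant has “10,000+” customers and the draft claims “50,000”, the check returns the fabricated figure instead of letting the draft through.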
The bet, again
The bet behind BBB is that the operating system matters more than the agents. Anyone can make a Yara — a brand-voice drafter. Few can make her work alongside fourteen other specialists on the same multi-tenant platform with audit-by-default and per-tenant encryption.
That platform is the whole point. The agents are the visible part.
This blog will cover what we’re learning about both, as we ship.