Architecture · Draft
Agentic Orchestration Model
Overview
Digital media operations decompose into specialized agents supervised by a central orchestrator. Agents call domain services and platform connectors; they do not hold long-lived credentials directly — services inject scoped tokens per request.
Deployment: agents run on Vertex AI Agent Engine; models are served from Vertex AI Model Garden (see GCP deployment topology — AI & agent runtime). Cloud Run hosts the orchestrator control plane, connectors, and HITL API.
Agent hierarchy
Five layers (L0–L4), each with a defined model tier (see Model catalog):
| Layer | Agents | Responsibility | Default model tier |
|---|---|---|---|
| L0 — Router | Model router, intent classifier, playbook selector | Pick model + thinking level per task; never runs tools directly | T0 — gemini-3.1-flash-lite (minimal thinking) |
| L1 — Orchestrator | Orchestrator | Lifecycle state machine, event routing, escalation, re-plan triggers | T2 — gemini-3.1-pro-preview |
| L2 — Domain | Onboarding, Media Plan, Execution, Optimization, Reporting | End-to-end domain workflows with tool calling | T1–T2 (see roster below) |
| L3 — Sub-agents | Per-platform builders, feed validators, bid calculators, report summarizers | Single-purpose tool loops; narrow context; each paired with a QC sub-agent | T0–T1 |
| L4 — QC / checker sub-agents | One checker per high-stakes main task (see pairing table) | Independent Q/A verification; can block, request correction, or escalate | T0–T2 (Sonnet only on escalation) |
Agent roster
| Agent | Triggers | Outputs | Human gate | Suggested model |
|---|---|---|---|---|
| Orchestrator | Events, schedules, human approvals | Task routing, state machine transitions | Escalations only | gemini-3.1-pro-preview |
| Router / classifier | Every inbound agent task | Model tier, thinking level, playbook ID | None | gemini-3.1-flash-lite (minimal) |
| Onboarding | New client intake | Account map, tracking checklist, verification status | BM access grant, agency billing profile | gemini-3.5-flash (medium) |
| Media Plan | Brief, budget, vertical rules | Master plan, track drafts (always-on, branding, engagement), event drafts (special days) | Plan approval per track | gemini-3.1-pro-preview / gemini-3.5-flash (events) |
| Plan Revise | Drift threshold, reporting signal, client request | Per-track vN+1 revise/replan + diff vs manifest slice | Same as plan approval (per `track_id`) | gemini-3.1-pro-preview |
| Execution | Approved plan version | Campaign structures on platforms | Launch confirmation if policy requires | gemini-3.5-flash (medium; high for multi-platform) |
| Optimization | Performance deltas, rules | Bid/budget/audience/creative changes | Changes above threshold | gemini-3.5-flash (medium) |
| Reporting | Schedule, ad-hoc request | Reports, anomalies, opt changelog, plan drift, revise recommendations | None (read-only) | gemini-3.1-flash-lite (low) + Batch for digests |
| QC — Plan validator | Plan draft ready | Pass/fail + structured diffs vs brief | Blocks approval on fail | gemini-3.5-flash (high) |
| QC — Compliance | Creative / copy before launch | Policy flags (health, education, claims) | Blocks launch on fail | gemini-3.5-flash (high) |
| QC — Tracking health | Pre-launch, daily sweep | Green/amber/red checklist | Blocks spend increases on red | gemini-3.5-flash (low) |
| QC — Spend guardrails | Optimization proposals | Approve / escalate vs approved plan caps | Escalates above threshold | gemini-3.5-flash (medium) |
State machine (client lifecycle)
Orchestration patterns
1. Plan–execute separation
- Agents never execute spend against an unapproved plan version.
- Plan document is immutable once approved; changes create `plan_vN+1`.
2. Guardrails (hard limits)
Examples (exact values configured per tenant):
- Max daily budget delta without approval: e.g. 20%
- Blocked actions: delete conversion actions, change billing account
- Required checks before launch: tracking health = green, feed errors = 0 critical
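The hard limits above could be evaluated in code before any mutation is applied. The following is a minimal sketch under stated assumptions: `GuardrailPolicy`, `Proposal`, the action names, and the three verdict strings are hypothetical; only the 20% delta, the two blocked actions, and the launch checks come from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailPolicy:
    # Example values from the list above; real values are configured per tenant.
    max_daily_budget_delta: float = 0.20
    blocked_actions: frozenset = frozenset(
        {"delete_conversion_action", "change_billing_account"}  # hypothetical names
    )


@dataclass(frozen=True)
class Proposal:
    action: str
    current_daily_budget: float = 0.0
    proposed_daily_budget: float = 0.0


def check_proposal(p: Proposal, policy: GuardrailPolicy) -> str:
    """Return 'blocked', 'needs_approval', or 'allowed'."""
    if p.action in policy.blocked_actions:
        return "blocked"
    if p.action == "set_daily_budget":
        if p.current_daily_budget <= 0:
            # Assumption: no baseline means the relative delta is undefined,
            # so the change goes to a human.
            return "needs_approval"
        delta = abs(p.proposed_daily_budget - p.current_daily_budget) / p.current_daily_budget
        if delta > policy.max_daily_budget_delta:
            return "needs_approval"
    return "allowed"


def launch_ready(tracking_health: str, critical_feed_errors: int) -> bool:
    """Required checks before launch: tracking green and zero critical feed errors."""
    return tracking_health == "green" and critical_feed_errors == 0
```

Because the verdict is computed outside the model, a prompt cannot talk its way past a blocked action.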
3. Tool access
Each agent has an allowlist of tools (service APIs):
Onboarding Agent → onboarding.createAccount, verification.request, bm.linkAsset
Plan Agent → planning.draftMaster, planning.draftTrack, planning.draftEvent, planning.validate
Plan Revise Agent → planning.revise, planning.diffManifest, planning.recommendRevise (per `track_id`)
Execution Agent → execution.applyPlan, execution.pauseCampaign, execution.syncManifest
Optimization Agent → optimization.propose, optimization.apply, optimization.computeDrift
Reporting Agent → reporting.changelog, reporting.planDrift, reporting.recommendRevise
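One way to enforce the allowlist is a deny-by-default lookup in front of every tool call. This is only a sketch: the agent keys and the `authorize_tool_call` helper are hypothetical, while the tool names are copied from the list above.

```python
ALLOWLIST = {
    "onboarding": {"onboarding.createAccount", "verification.request", "bm.linkAsset"},
    "plan": {"planning.draftMaster", "planning.draftTrack", "planning.draftEvent", "planning.validate"},
    "plan_revise": {"planning.revise", "planning.diffManifest", "planning.recommendRevise"},
    "execution": {"execution.applyPlan", "execution.pauseCampaign", "execution.syncManifest"},
    "optimization": {"optimization.propose", "optimization.apply", "optimization.computeDrift"},
    "reporting": {"reporting.changelog", "reporting.planDrift", "reporting.recommendRevise"},
}


class ToolNotAllowed(PermissionError):
    """Raised when an agent requests a tool outside its allowlist."""


def authorize_tool_call(agent: str, tool: str) -> None:
    # Unknown agents get an empty allowlist, so the check fails closed.
    if tool not in ALLOWLIST.get(agent, frozenset()):
        raise ToolNotAllowed(f"{agent} may not call {tool}")
```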
4. Observability
Every agent run produces:
- `run_id`, `agent`, `tenant_id`, `track_id`, `plan_version`
- Input context hash (no PII in logs)
- Tool calls and outcomes
- Human escalation reason if blocked
QC correction loops are logged on every pass/fail and correction attempt — not only when A8 fires. See QC loop telemetry.
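A possible shape for the per-run record, sketched under assumptions: the hash algorithm for the input context is not specified here, so SHA-256 over canonical JSON is assumed, and `context_hash` / `run_record` are hypothetical helper names.

```python
import hashlib
import json


def context_hash(context: dict) -> str:
    """Hash the input context so logs can be joined without storing its content."""
    # Canonical form: sorted keys and fixed separators make the hash key-order independent.
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_record(run_id, agent, tenant_id, track_id, plan_version, context):
    """Build the log record for one agent run; the raw context is never included."""
    return {
        "run_id": run_id,
        "agent": agent,
        "tenant_id": tenant_id,
        "track_id": track_id,
        "plan_version": plan_version,
        "input_context_hash": context_hash(context),
        "tool_calls": [],           # appended as (tool, outcome) pairs during the run
        "escalation_reason": None,  # set only if the run is blocked
    }
```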
5. Per-task QC gates and correction loops
Every mutable or client-visible main task runs through a checker sub-agent (Q/A gate) before the orchestrator marks it complete or applies platform mutations. Read-only tasks (e.g. daily KPI digest) skip QC unless an anomaly is detected.
Main task → QC checker pairing
| Main task (L2 / L3) | QC checker (L4) | Blocks on fail? |
|---|---|---|
| `plan.draft` / `plan.track.draft` / `plan.event.draft` | `qc.plan` | Yes; no human approval until pass |
| `plan.revise` (per track) | `qc.plan` | Yes; per `track_id` |
| `exec.campaign.build` (creative included) | `qc.compliance` | Yes; no launch |
| `exec.campaign.build` (structure only) | `qc.tracking` + `qc.plan` (slice vs approved plan) | Yes |
| `exec.campaign.mutate` | `qc.spend` + `qc.tracking` | Yes if over cap or tracking red |
| `opt.cycle` (per track) | `qc.spend` + drift check | Yes if over guardrail; may emit `plan.revise.recommended` |
| `onboard.step` (tracking-related) | `qc.tracking` | Yes before go-live |
| `feed.validate` | `qc.plan` (catalog vs brief) | Yes if critical errors |
| `report.anomaly` | `qc.plan` (vs approved KPIs) | No; recommends per-track revise / replan |
| `report.plan_drift` | — | No; read-only; per-track weekly rollup |
Compliance QC may chain to qc.compliance.escalate (Sonnet) only on low-confidence or high-risk vertical — not on every run.
Loop limits (hard caps — no infinite loops)
Policy: agents never loop indefinitely. Every loop type has a hard ceiling enforced by the orchestrator in code — not by prompt instruction alone. When a ceiling is hit, the run stops immediately, state is frozen, and a red flag is raised for Kobi ops (see below). No silent retries, no "try one more time" without a new run_id and human acknowledgment.
| Loop type | Max iterations | On exhaustion |
|---|---|---|
| QC correction loop (main → QC → fix → QC) | 2 correction attempts per `run_id` | Red flag → HITL ticket (A8); block downstream mutations |
| Tool-call loop (single sub-agent, multi-step tools) | 5 tool rounds per invocation | Abort; rollback marker; red flag → HITL (A8) |
| Orchestrator retry (transient API / timeout) | 1 full re-run per `run_id` | Red flag → HITL (A8) |
| Cross-agent re-dispatch (same task, new agent) | 0 without human; must open new `run_id` after A8 resolution | Prevents disguised infinite loops |
| Model tier promotion (QC fail rate spike) | Policy-driven, not a per-request loop | Promote tier for task class for 24h |
| Global ceiling per `run_id` | ≤ 8 total LLM steps (main + QC + corrections + tools combined) | Red flag even if individual sub-limits not hit |
Configurable per tenant in loop_policy (implementation phase); defaults above are maximums — tenants may be stricter, never looser without admin override.
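Enforcing the ceilings "in code" could look like the counter below, which the orchestrator would consult before every loop iteration. It is a sketch, not the implementation: `LoopBudget`, `LoopExhausted`, and the loop-type keys are hypothetical names, and the sketch assumes the caller creates a fresh budget (or resets `tool_round`) per sub-agent invocation, since that limit is per invocation rather than per `run_id`.

```python
from dataclasses import dataclass, field

# Defaults from the loop-limit table; tenants may only tighten them.
DEFAULT_LIMITS = {
    "qc_correction": 2,
    "tool_round": 5,
    "orchestrator_retry": 1,
    "cross_agent_redispatch": 0,
}
GLOBAL_LLM_STEP_CEILING = 8


class LoopExhausted(RuntimeError):
    def __init__(self, run_id, loop_type, attempt_count):
        super().__init__(f"{run_id}: {loop_type} exhausted at attempt {attempt_count}")
        self.run_id = run_id
        self.loop_type = loop_type
        self.attempt_count = attempt_count


@dataclass
class LoopBudget:
    run_id: str
    limits: dict = field(default_factory=lambda: dict(DEFAULT_LIMITS))
    counts: dict = field(default_factory=dict)
    llm_steps: int = 0
    frozen: bool = False

    def _fail(self, loop_type, attempt):
        # Freeze the run: every later call also fails until a human opens a new run_id.
        self.frozen = True
        raise LoopExhausted(self.run_id, loop_type, attempt)

    def consume(self, loop_type):
        """Claim one iteration of the given loop type, or raise LoopExhausted."""
        attempt = self.counts.get(loop_type, 0) + 1
        if self.frozen or attempt > self.limits[loop_type]:
            self._fail(loop_type, attempt)
        self.counts[loop_type] = attempt

    def record_llm_step(self):
        """Count one LLM step against the global per-run ceiling."""
        if self.frozen or self.llm_steps + 1 > GLOBAL_LLM_STEP_CEILING:
            self._fail("global_ceiling", self.llm_steps + 1)
        self.llm_steps += 1
```

The `LoopExhausted` exception carries the fields an `agent.loop.exhausted` event would need (`run_id`, `loop_type`, `attempt_count`).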
Red flag on loop exhaustion
When any loop limit is exceeded or the global run_id ceiling is hit:
- Stop: no further agent or tool calls for that `run_id`; partial platform mutations rolled back or marked `needs_review`
- Emit `agent.loop.exhausted` with `tenant_id`, `track_id`, `agent`, `loop_type`, `attempt_count`, `last_qc_failure`, `run_id`
- HITL ticket (A8): appears in Human Touch inbox with red priority, SLA timer, structured summary, and recommended actions; full step trace in System Ops
- Ops alert: notify on-call / ops channel (email, Slack, or PagerDuty; TBD); include tenant, task, and link to ticket
- Client impact guard: block spend increases, launches, and optimization applies tied to that `run_id` until A8 is resolved or explicitly overridden (A6) by admin
- No auto-restart: same logical task requires human to approve a new `run_id` or take manual action; prevents loop-until-lucky behavior
QC checker receives: main agent output JSON, approved plan_version, relevant playbook slice, and only the fields needed to verify — not the full conversation history.
QC loop telemetry
Deterministic orchestrator logging (not LLM-generated summaries) records every QC gate and correction loop so ops can answer: which agents fail most, on which models, in which tasks, with which inputs/docs?
Emit on every QC invocation (agent.qc.result):
| Field | Example | Notes |
|---|---|---|
| `run_id`, `step_index` | `run_abc`, `3` | Tie to Cost Guard token rows |
| `tenant_id`, `track_id`, `platform` | `t_12`, `branding_q2`, `meta` | Where in the portfolio |
| `main_task_id`, `main_agent` | `opt.cycle`, `optimization.meta` | Failing worker |
| `qc_task_id`, `qc_agent` | `qc.spend`, `qc.spend.checker` | Which checker rejected |
| `attempt_number` | 0 = first QC; 1 = after 1st correction | Loop depth |
| `outcome` | `pass` \| `fail` \| `escalate` | `escalate` → `qc.compliance.escalate` |
| `failure_codes[]` | `budget_over_cap`, `tracking_pixel_missing` | From `qc_result` schema; structured, not prose |
| `model_main`, `model_qc` | `gemini-3.5-flash`, `gemini-3.5-flash` | Models on this step |
| `thinking_level_main`, `thinking_level_qc` | `medium`, `high` | Per thinking matrix |
| `plan_version`, `manifest_slice_id` | `v14`, `mslice_9f2` | Approved plan context |
| `playbook_versions` | `{routing: 3, vertical: 2, platform.meta: 5, qc: 4}` | Which rule packs were active |
| `context_refs[]` | `[{type: brief, id: b_7, v: 2}, {type: opt_log, id: cs_44}]` | Inputs/docs by reference; not full text in BQ |
| `input_slice_hash`, `main_output_hash` | SHA-256 | Dedup / join without PII |
| `input_slice_uri` | `gs://…/runs/run_abc/step_3_input.json` | Full structured input in GCS; BQ holds URI only |
| `qc_feedback_uri` | `gs://…/runs/run_abc/step_3_qc.json` | Checker JSON (`failed_checks`, `suggested_fixes`) |
| `latency_ms`, `input_tokens`, `output_tokens` | from `usage_metadata` | Per-step cost attribution |
Emit on each correction iteration (agent.qc.loop):
| Field | Purpose |
|---|---|
| `correction_number` | 1 or 2 (max per loop policy) |
| `main_task_id`, `qc_task_id` | Same pairing as above |
| `delta_applied` | Structured diff summary from main agent fix (fields changed) |
| `prior_failure_codes[]` | What QC complained about before fix |
| `post_fix_outcome` | `pass` \| `fail` on re-check |
On A8 exhaustion, agent.loop.exhausted includes last_qc_failure plus qc_loop_trace_id → full ordered list of agent.qc.result / agent.qc.loop rows for that run_id.
BigQuery tables (partitioned by event_date, clustered by tenant_id, main_task_id):
| Table | Grain | Use |
|---|---|---|
| `agent_qc_results` | One row per QC invocation | Fail-rate by agent, model, task, platform |
| `agent_qc_loops` | One row per correction attempt | Which failure codes repeat after fix |
| `agent_qc_failures_rollup` | Daily materialized view | Top offenders, playbook version regressions |
Example ops queries (illustrative):
- QC fail rate by `main_agent` + `model_main` (rolling 7d)
- Top `failure_codes` for `qc.compliance` on `vertical.health`
- Runs where `playbook_versions.qc` changed and fail rate spiked same day
- `opt.cycle` loops where `context_refs` include stale `manifest_slice_id`
Dashboard (System Ops) — QC health and statistics: leaderboard, model breakdown, drill-down to input_slice_uri / qc_feedback_uri. Human Touch shows breach alerts only, not the statistics console.
Feeds automatic model promotion — rolling 24h fail rate per main_task_id from agent_qc_results drives model promotion / demotion; no manual spreadsheet.
QC success threshold alerts (80% floor)
A deterministic alert job (not an agent) evaluates rollups from agent_qc_results / agent_qc_failures_rollup and fires when a task or subtask cannot maintain the 80% success floor. Below that floor means the pairing is under-performing — prompts, models, playbooks, or context must be optimized, not ignored.
Scope grains (evaluated independently):
| Grain | Key | Example |
|---|---|---|
| Task | `main_task_id` | `opt.cycle` across all tenants |
| Subtask | `main_task_id` + `qc_task_id` + `main_agent` | `exec.campaign.build` + `qc.compliance` on `execution.meta` |
| Tenant slice (optional) | above + `tenant_id` | One client's `plan.revise` failing |
Metrics (both computed per grain, rolling window default 24h):
| Metric | Formula | Why |
|---|---|---|
| First-pass rate | `count(attempt_number=0 AND outcome=pass) / count(distinct run_id)` | Primary signal: cheap runs stay cheap |
| Eventual success rate | `count(run ended pass within loop budget) / count(distinct run_id)` | Catches fixable vs broken pairs |
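The two metrics can be computed from `agent.qc.result` rows roughly as follows. This is a sketch: it assumes rows are already filtered to one grain and window, and it interprets "run ended pass" as "the highest `attempt_number` for that `run_id` has `outcome = pass`"; `qc_rates` is a hypothetical helper name.

```python
def qc_rates(rows):
    """Compute first-pass and eventual success rates for one grain and window.

    Each row is a dict with run_id, attempt_number, and outcome.
    Returns None when there are no runs.
    """
    runs = {}
    for row in rows:
        runs.setdefault(row["run_id"], []).append(row)
    if not runs:
        return None

    first_pass = 0
    eventual = 0
    for attempts in runs.values():
        attempts.sort(key=lambda r: r["attempt_number"])
        first, last = attempts[0], attempts[-1]
        if first["attempt_number"] == 0 and first["outcome"] == "pass":
            first_pass += 1
        if last["outcome"] == "pass":
            eventual += 1

    n = len(runs)
    return {
        "sample_count": n,
        "first_pass_rate": first_pass / n,
        "eventual_success_rate": eventual / n,
    }
```

A run that fails once and passes after one correction lowers the first-pass rate but not the eventual success rate, which is what separates a fixable pairing from a broken one.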
Alert rule (default):
if sample_count >= min_sample
AND (first_pass_rate < 0.80 OR eventual_success_rate < 0.80):
emit agent.qc.threshold.breached
open optimization ticket (engineering — not client HITL)
run auto-optimization actions (below)
| Parameter | Default | Notes |
|---|---|---|
| `success_floor` | 0.80 (80%) | Tenant may be stricter; not looser without admin |
| `window` | 24h rolling | Also compute 7d trend for dashboard |
| `min_sample` | 20 runs / grain / window | 10 if tenant-only slice; suppress alert if below (noise) |
On breach (agent.qc.threshold.breached) — payload includes grain, main_task_id, qc_task_id, main_agent, first_pass_rate, eventual_success_rate, sample_count, top_failure_codes[], model_breakdown, playbook_versions_mode, sample_run_ids[].
| Action | Automatic? | Purpose |
|---|---|---|
| Ops alert | Yes | Slack / email / PagerDuty — link to System Ops QC panel |
| Dashboard badge | Yes | Red row on System Ops leaderboard until rate recovers ≥80% for 24h |
| Model tier promotion | Yes — immediate | Skip normal 24h wait; promote one tier (see model promotion) |
| Thinking level bump | Yes | e.g. medium → high on model_main for that task class |
| Engineering ticket | Yes | Bundle: top failure codes, playbook versions, 3× input_slice_uri / qc_feedback_uri |
| Playbook review flag | Yes | Pin playbook.qc / playbook.platform.* version for human diff |
| Disable QC gate | Never | Quality floor is non-negotiable |
| Client HITL (A1–A9) | No | Internal ops / engineering only unless client-visible task is blocked fleet-wide |
Recovery: alert clears when the same grain stays ≥80% for a full 24h cooldown window (hysteresis — avoids flap).
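The breach rule and the 24h recovery cooldown combine into a small per-grain state machine. The sketch below is illustrative only: `GrainAlertState` and `evaluate` are hypothetical names, and two behaviours are assumptions not stated above (a breach is emitted once per episode rather than on every evaluation, and evaluations below `min_sample` leave the state untouched).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class GrainAlertState:
    breached: bool = False
    healthy_since: Optional[datetime] = None  # start of the current healthy streak


def evaluate(state, now, sample_count, first_pass_rate, eventual_success_rate,
             floor=0.80, min_sample=20, cooldown=timedelta(hours=24)):
    """Return 'breach', 'recovered', or None for one evaluation of one grain."""
    if sample_count < min_sample:
        return None  # too little data: suppress (noise)

    below = first_pass_rate < floor or eventual_success_rate < floor
    if below:
        state.healthy_since = None  # any dip restarts the cooldown
        if not state.breached:
            state.breached = True
            return "breach"
        return None

    if state.breached:
        if state.healthy_since is None:
            state.healthy_since = now
        if now - state.healthy_since >= cooldown:
            state.breached = False
            state.healthy_since = None
            return "recovered"
    return None
```

Resetting `healthy_since` on every dip is what provides the hysteresis: a grain that oscillates around 80% stays red instead of flapping.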
Optimize what? Use breach payload to pick the lever:
| Dominant signal | Likely fix |
|---|---|
| High `failure_codes` on one check | Update `playbook.qc` checklist or move check to code |
| One `model_main` much worse than others | Promote tier or change default routing |
| Spike after `playbook_versions` bump | Roll back or patch playbook; regression test |
| High correction count, low first-pass | Tighten main-agent prompt / context budget; raise thinking |
| One tenant only | Tenant brief / manifest data issue — ops contacts AM |
# Illustrative — implementation phase
qc_telemetry_policy:
  log_every_qc_invocation: true
  store_full_input_in_gcs: true # BQ = refs + hashes only
  retention_days_bq: 400
  retention_days_gcs: 90 # extend on A8 / compliance hold
  rollup_views: [daily, weekly]
qc_threshold_policy:
  success_floor: 0.80 # alert below 80%
  window_hours: 24
  min_sample_global: 20
  min_sample_tenant: 10
  recovery_cooldown_hours: 24
  auto_promote_on_breach: true
  auto_bump_thinking_on_breach: true
Cost Guard — deterministic spend circuit breaker (not AI)
A separate deterministic service — not an agent, not LLM-mediated — sits in front of every Vertex call and enforces hard spend limits per run_id. It uses the per-task cost catalog and a versioned pricing table (model → $/1M input/output) loaded from config. No model chooses whether to stop; the math does.
Why: loop caps limit iterations but not token blowups within a step (thinking tokens, tool-loop context re-send). Cost Guard catches runaway spend even when loop counts are still legal.
| Component | Role | AI? |
|---|---|---|
| Cost Guard service | Pre-check + post-check on every LLM invocation | No — pure code |
| Pricing table | `model_id` → input/output $/1M (mirrors Vertex list prices) | No |
| Estimate resolver | Maps `task_id` + planned steps → `estimated_cost_usd` for `run_id` | No; reads catalog / composite units |
| Run cost ledger | `run_id` → `{estimated, actual, model_breakdown[]}` in Firestore or Redis | No |
| Kill switch | Blocks new Vertex calls; signals orchestrator to abort agents | No |
Single cost formula (same for catalog planning, run budget, and live metering):
step_cost = (input_tokens / 1e6 × price_in[model])
+ (output_tokens / 1e6 × price_out[model]) # output includes thinking tokens
Only the token source differs between estimate and actual:
| Phase | When | Token source | Purpose |
|---|---|---|---|
| Estimate | Run start, before any Vertex call | Catalog In tok / Out tok per planned `task_id` (see cost catalog) | Budget ceiling for `run_id` |
| Actual | After every Vertex API response | `usage_metadata` on the response; never agent self-report | Running spend tally |
Estimate at run start (when router dispatches a gated task):
# Per planned step — same formula as catalog "Cost / run" column
step_estimate = (catalog[task].in_tok / 1e6 × price_in[model])
+ (catalog[task].out_tok / 1e6 × price_out[model])
estimated_run_cost = Σ step_estimate over planned steps # e.g. opt.cycle + qc.spend
× loop_buffer # default 1.08 for gated tasks
catalog[task].expected_cost in the table is exactly this math pre-computed; Cost Guard may read either the USD column or recompute from In/Out + pricing_table_version — they must match.
Stored on run_id before the first LLM call. Estimate is immutable for that run unless admin resets (A6).
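As a worked instance of the estimate, the sketch below recomputes the gated optimization cycle (`opt.cycle` + `qc.spend`) from the catalog token assumptions and the `gemini-3.5-flash` list prices quoted in this document. The `PRICING` and `CATALOG` dictionaries and both function names are hypothetical stand-ins for the versioned pricing table and cost catalog.

```python
# model -> (USD per 1M input tokens, USD per 1M output tokens)
PRICING = {"gemini-3.5-flash": (1.50, 9.00)}

# task_id -> (model, planned input tokens, planned output tokens incl. thinking)
CATALOG = {
    "opt.cycle": ("gemini-3.5-flash", 22_000, 7_500),
    "qc.spend": ("gemini-3.5-flash", 10_000, 2_500),
}


def step_cost(model, input_tokens, output_tokens, pricing=PRICING):
    """Single cost formula shared by the catalog, run budget, and live metering."""
    price_in, price_out = pricing[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out


def estimate_run_cost(planned_tasks, loop_buffer=1.08, catalog=CATALOG):
    """Sum the per-step estimates, then apply the loop buffer for gated tasks."""
    return sum(step_cost(*catalog[task]) for task in planned_tasks) * loop_buffer


estimate = estimate_run_cost(["opt.cycle", "qc.spend"])
print(f"estimated_run_cost = ${estimate:.2f}")  # prints: estimated_run_cost = $0.15
```

The per-step values are $0.1005 for `opt.cycle` and $0.0375 for `qc.spend`, which round to the catalog's $0.101 and $0.038, and the buffered total matches the ~$0.15 composite unit.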
Actual after each LLM response — Vertex returns token counts on every generate call:
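# Map usage_metadata fields (names vary slightly by SDK; normalize in Cost Guard)
# prompt_token_count already includes cached tokens, so split them out instead of adding them again
cached_tokens = cached_content_token_count # subset of the prompt; billed at cached input rate if applicable
input_tokens = prompt_token_count - cached_tokens # uncached input, billed at the standard input rate
output_tokens = candidates_token_count
+ thoughts_token_count # 3.5 Flash thinking, billed at output rate
step_cost = (input_tokens / 1e6 × price_in[model])
+ (cached_tokens / 1e6 × price_cached[model]) # price_cached = "Cached input / 1M" column; term is zero without caching
+ (output_tokens / 1e6 × price_out[model])
actual_run_cost += step_cost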
Cost Guard logs per step: run_id, step_index, model, input_tokens, output_tokens, step_cost_usd, actual_run_cost_usd, estimated_run_cost_usd, ratio. Ledger and BigQuery use API-reported tokens only — never prompt guesses or agent claims.
Optional pre-call check (still deterministic): before forwarding a request, Cost Guard may count input tokens in the outbound payload (Vertex tokenizer or count_tokens API) to warn if a single call's prompt alone exceeds 2× catalog.in_tok for that step — does not replace the 3× run-level trip; catches context blowups early.
Trip rule (default):
if actual_run_cost >= trip_multiplier × estimated_run_cost:
TERMINATE # default trip_multiplier = 3.0
| Scope | Default `trip_multiplier` | On trip |
|---|---|---|
| Per `run_id` | 3.0× estimated | Kill run; A9 ticket; ops alert |
| Per tenant / calendar day (optional cap) | Admin-set USD ceiling | Pause all agent LLM calls for tenant until next day or override |
| Per environment (dev/staging) | Stricter (e.g. 2.0×) | Kill + alert engineering |
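The per-`run_id` trip rule amounts to a ledger with a pre-check and a post-check around each model call. The sketch below is an in-memory illustration only: `RunLedger` and `CostGuardTripped` are hypothetical names, and a real ledger would live in the shared store rather than in process memory.

```python
from dataclasses import dataclass


class CostGuardTripped(RuntimeError):
    """Raised when a call is attempted on a run that has already tripped."""


@dataclass
class RunLedger:
    run_id: str
    estimated_run_cost: float  # fixed at run start
    trip_multiplier: float = 3.0
    actual_run_cost: float = 0.0
    tripped: bool = False

    def pre_check(self):
        """Call before forwarding any request for this run."""
        if self.tripped:
            raise CostGuardTripped(f"{self.run_id}: cost_guard_tripped")

    def record(self, step_cost_usd):
        """Call after every response with the cost computed from API-reported tokens.

        Returns the actual / estimated ratio for logging.
        """
        self.actual_run_cost += step_cost_usd
        if self.actual_run_cost >= self.trip_multiplier * self.estimated_run_cost:
            self.tripped = True
        return self.actual_run_cost / self.estimated_run_cost
```

The step that crosses the threshold has already been billed, so the guard can only block the next call; that is why the trip is checked after every response rather than once per run.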
On terminate (cost.guard.tripped):
- Hard stop: orchestrator cancels in-flight agent work; Cost Guard rejects subsequent Vertex requests for that `run_id` with `403 cost_guard_tripped`
- Rollback policy: same as A8; roll back or mark `needs_review` for any partial platform mutations
- HITL ticket (A9): distinct from A8 (loop exhaustion); shows estimated vs actual USD, per-step token breakdown, model used, trip multiplier
- Ops alert: red notification to on-call with tenant, `run_id`, `task_id`, `actual / estimated` ratio
- No auto-restart: a new `run_id` requires human acknowledgment; an admin may optionally raise the estimate ceiling (A6) with audit log
- Emit `cost.guard.tripped` on event bus
Cost Guard cannot be bypassed by agents, prompts, or orchestrator retries. Only admin override (A6) with logged reason may raise the multiplier or authorize a new run with a higher estimate.
# Illustrative — implementation phase
cost_guard_policy:
  trip_multiplier: 3.0 # stop at 3× estimate
  loop_buffer_in_estimate: 1.08 # baked into estimated_run_cost
  pricing_table_version: "2026-06" # must match catalog revision
  tenant_daily_cap_usd: null # optional; e.g. 50.00
  block_on_trip: true # always true in prod
What monthly costs include
Per-task prices in the cost catalog are single successful runs. Monthly totals explicitly count main + QC as separate line items (e.g. opt.cycle + qc.spend).
Correction loops are budgeted separately:
| Assumption | Value | Cost impact |
|---|---|---|
| QC fail rate (steady state) | ~5–8% with 3.5 Flash QC (was ~8–12% on Flash-Lite) | — |
| Avg correction loops when failed | 1.1 (most pass on 1st retry) | +~6–8% on gated-task LLM spend |
| Compliance escalation | ~2% of compliance QC runs | Already line-itemed in Profile B |
Composite unit examples (main + QC + expected loop overhead):
| Composite | Formula | Typical cost / unit |
|---|---|---|
| Optimization cycle (gated, per track) | `opt.cycle` + `qc.spend` + 8% loop buffer | ~$0.15 |
| Campaign build (gated) | `exec.campaign.build` + `qc.compliance` + `qc.tracking` + 8% buffer | ~$0.26 |
| Master / full replan | `plan.draft` + `qc.plan` + 8% buffer | ~$0.28 |
| New track (branding, engagement) | `plan.track.draft` + `qc.plan` + 8% buffer | ~$0.20 |
| Special-day event plan | `plan.event.draft` + `qc.plan` + 8% buffer | ~$0.15 |
| Per-track revise | `plan.revise` + `qc.plan` + 8% buffer | ~$0.22 |
Monthly profile totals use separate main/QC counts; add ~6–8% for correction-loop overhead on gated tasks (lower with 3.5 QC), or use composite units above when estimating.
Event bus (conceptual)
Async events (illustrative names):
| Event | Producer | Consumers |
|---|---|---|
| `client.onboarding.completed` | Onboarding Service | Orchestrator, Planning |
| `plan.approved` | Approval Engine | Orchestrator, Execution |
| `optimization.applied` | Optimization Service | Reporting, Plan Revise (drift check) |
| `plan.revise.recommended` | Optimization / Reporting | Plan Revise Agent, HITL dashboard |
| `plan.revise.approved` | Approval Engine | Execution, Reporting (rebaseline) |
| `campaign.live` | Execution Service | Optimization, Reporting |
| `ga4.tracking.degraded` | Tracking Service | Orchestrator (block spend) |
| `crm.conversion.batch` | CRM Connector | Tracking Service (CAPI) |
| `agent.qc.result` | Orchestrator (on every QC pass/fail) | BigQuery `agent_qc_results`, System Ops QC panel |
| `agent.qc.loop` | Orchestrator (on each correction attempt) | BigQuery `agent_qc_loops`, model promotion policy |
| `agent.qc.threshold.breached` | QC threshold alert job (scheduled) | Ops alerting, auto-optimize, engineering ticket |
| `agent.loop.exhausted` | Orchestrator | Human Touch ticket, System Ops traces, ops alerting, audit log |
| `cost.guard.tripped` | Cost Guard service | Orchestrator kill, HITL (A9), ops alerting, audit log |
Implementation may use Pub/Sub, Kafka, or equivalent — TBD with stack.
Model strategy and routing
Principles
- Capable, never cheap-at-all-costs: sub-agents use the cheapest tier that still passes QC for that task class. If QC failure rate rises, auto-promote tier (see playbook §4).
- Gemini-first on Vertex: native tool calling, context caching, Agent Engine integration, single billing/IAM surface.
- `gemini-3.5-flash` as the primary agentic surface: GA model built for agentic execution, tool loops, and sub-agent deployment. Prefer one capable model + `thinking_level` per task over mixing many smaller models for agent work. See 3.5 Flash thinking matrix.
- Compliance QC on Gemini 3.5 Flash: default compliance gate uses `gemini-3.5-flash` at `high` thinking (GA, strong reasoning, structured policy checks, ~50% cheaper than Sonnet on Vertex). Cross-model is optional escalation, not the default (see below).
- Cross-model escalation (optional): `claude-sonnet-4-6` only when compliance QC is ambiguous (low confidence), vertical risk flag is health / education, or a human requests a second opinion. Reduces cost while keeping an independent check for edge cases.
- Opus / Pro escalation only: `claude-opus-4-8` or `gemini-3.1-pro-preview` with extended thinking reserved for human-triggered or router-scored "hard" tasks.
- No autonomous loops on Lite-only models: `gemini-2.5-flash-lite` is allowed only for deterministic structured I/O (JSON transform, field mapping) with schema validation; not for multi-step tool chains.
Model tiers (summary)
| Tier | Models | Use when |
|---|---|---|
| T0: Fast / volume | `gemini-3.1-flash-lite` | Routing, classification, batch report digests only; not gated QC or tool loops |
| T1: Agentic workhorse | `gemini-3.5-flash` (default), `gemini-2.5-flash` (fallback) | All tool loops, sub-agents, gated QC checkers, optimization, execution |
| T2: Planning / orchestration | `gemini-3.1-pro-preview`, `gemini-3.5-flash` (high), `gemini-2.5-pro` (fallback) | Media plans, orchestrator escalations; Pro when 3.5 high still fails QC repeatedly |
| T3: Cross-model escalation | `claude-sonnet-4-6`, `claude-haiku-4-5` | Ambiguous compliance, high-risk verticals, human-requested second opinion |
| T4: Rare escalation | `claude-opus-4-8` | Human-requested deep reasoning only |
| T-batch: Async digest | `gemini-3.1-flash-lite` Batch, `llama-3.3-70b` Batch | Scheduled report summarization, log digests |
3.5 Flash thinking matrix (recommended default)
Gemini 3.5 Flash is more capable than 3.1 Flash-Lite and 3 Flash Preview for agentic work: GA, stronger tool use, thought preservation across turns, and near-Pro quality at Flash-tier cost. Google positions it explicitly for sub-agent deployment and rapid agentic loops.
Does using more 3.5 with different thinking_level per task make sense? Yes — for any task that calls tools, blocks mutations, or runs a QC correction loop. Use one endpoint, many depths instead of jumping to Pro or mixing preview models. Reserve Flash-Lite for high-volume non-agentic routing and batch digests only.
| Task | Model | `thinking_level` | Why |
|---|---|---|---|
| Router / classify | 3.1 Flash-Lite | `minimal` | Pure routing: no tools; millions of calls |
| Onboarding sub-steps | 3.5 Flash | `medium` | Multi-step setup, platform APIs |
| Execution / optimization sub-agents | 3.5 Flash | `medium` | Tool loops; default balances quality + latency |
| Complex campaign build (multi-platform) | 3.5 Flash | `high` | More constraints; fewer correction loops |
| QC: plan validator | 3.5 Flash | `high` | Must catch budget/channel errors before human approval |
| QC: compliance | 3.5 Flash | `high` | Policy + claims reasoning |
| QC: tracking health | 3.5 Flash | `low` | Mostly checklist; fast pass/fail |
| QC: spend guardrails | 3.5 Flash | `medium` | Numeric compare vs plan: code validates math; LLM interprets intent |
| Report anomaly triage | 3.5 Flash | `medium` | Needs reasoning; not mutation-blocking |
| Daily / weekly digest | 3.1 Flash-Lite Batch | `low` | Read-only narrative; cost-sensitive |
| Media plan draft | 3.1 Pro Preview | (Pro thinking) | Long-horizon strategy; 3.5 high is fallback if Pro unavailable |
| Orchestrator escalation | 3.1 Pro Preview | (Pro thinking) | Conflict resolution across agents |
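In code, the matrix reduces to a lookup keyed by task ID, with the "thinking level bump" from the breach actions applied as one step up a fixed ordering. The sketch below is an assumption-laden illustration: `ROUTING`, `bump`, and `resolve` are hypothetical names, the rows are mapped to task IDs from the cost catalog, and `None` stands for "leave the level to the Pro model's own thinking".

```python
THINKING_ORDER = ["minimal", "low", "medium", "high"]

# task_id -> (model, thinking_level)
ROUTING = {
    "route.dispatch": ("gemini-3.1-flash-lite", "minimal"),
    "onboard.step": ("gemini-3.5-flash", "medium"),
    "opt.cycle": ("gemini-3.5-flash", "medium"),
    "qc.plan": ("gemini-3.5-flash", "high"),
    "qc.compliance": ("gemini-3.5-flash", "high"),
    "qc.tracking": ("gemini-3.5-flash", "low"),
    "qc.spend": ("gemini-3.5-flash", "medium"),
    "report.anomaly": ("gemini-3.5-flash", "medium"),
    "plan.draft": ("gemini-3.1-pro-preview", None),
}


def bump(level):
    """Raise the thinking level by one step, capped at 'high'."""
    if level is None:
        return None
    index = THINKING_ORDER.index(level)
    return THINKING_ORDER[min(index + 1, len(THINKING_ORDER) - 1)]


def resolve(task_id, bumped=False):
    """Return (model, thinking_level); unknown task IDs raise KeyError (fail closed)."""
    model, level = ROUTING[task_id]
    return model, (bump(level) if bumped else level)
```

Keeping the mapping as data means a breach can change depth for one task class without touching prompts or endpoints.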
Quality vs cost tradeoff: Putting gated QC on 3.5 Flash (vs a cheaper Flash-Lite QC mix) raises LLM spend by roughly $4–6 / client / month on Standard, but buys fewer correction loops (8% buffer vs ~12%) and a higher first-pass QC rate — usually net-positive once human-review time is counted. Do not run the router or batch digests on 3.5; that would 3–6× those rows for no quality gain. Expected Standard total is **$25.60 / client / month** (see cost catalog and cost band).
Thinking levels (gemini-3.1-flash-lite only)
For T0 tasks on Flash-Lite per Google's guidance:
| Level | Agent tasks |
|---|---|
| `minimal` | Router, intent classification |
| `low` | Batch report KPI bullets, simple field extraction |
Model catalog (Vertex AI, June 2026)
Source: Vertex AI Generative AI pricing — Global endpoint, standard (non-batch) rates per 1M tokens, prompts ≤200K. Cached input shown where available. Verify before implementation; preview models may change.
Primary — Gemini (Google)
| Model | Input / 1M | Output / 1M | Cached input / 1M | Designed for | Kobi use |
|---|---|---|---|---|---|
| gemini-3.1-flash-lite (GA) | $0.25 | $1.50 | $0.025 | High-volume agentic workflows, tool calling, routing (docs) | Router, classification, batch report digests (T0 only; not gated QC or tool loops) |
| gemini-3.5-flash (GA) | $1.50 | $9.00 | $0.15 | Frontier-level Flash; agentic execution, coding, long-horizon tool use (docs) | Compliance QC, Execution, Optimization (preferred GA workhorse) |
| gemini-3-flash-preview | $0.50 | $3.00 | $0.05 | Agentic workhorse, multimodal, computer use (docs) | Fallback if 3.5 unavailable |
| gemini-3.1-pro-preview | $2.00 | $12.00 | $0.20 | Deep agentic reasoning, precise tool use, 1M context (docs) | Orchestrator, Media Plan |
| gemini-2.5-flash (GA) | $0.30 | $2.50 | $0.03 | Stable agentic GA; improved tool use (Google blog) | Production fallback for T1 |
| gemini-2.5-pro (GA) | $1.25 | $10.00 | $0.13 | Complex reasoning, coding | Production fallback for T2 |
| gemini-2.5-flash-lite (GA) | $0.10 | $0.40 | $0.01 | Ultra-cheap classification/summarization | Structured transform only — not autonomous tool loops |
Secondary — Anthropic on Vertex (QC & cross-check)
| Model | Input / 1M | Output / 1M | Designed for | Kobi use |
|---|---|---|---|---|
| claude-haiku-4-5 | $1.00 | $5.00 | Fast, cost-effective classification (Anthropic pricing) | Lightweight policy JSON scans |
| claude-sonnet-4-6 | $3.00 | $15.00 | Agents, coding, enterprise workflows at scale | Compliance escalation: ambiguous QC, health/education high-risk, human-requested |
| claude-opus-4-8 | $5.00 | $25.00 | Hardest agentic + coding tasks | Human-triggered escalation only |
Optional — Meta Llama on Vertex (batch / cost experiments)
| Model | Input / 1M | Output / 1M | Designed for | Kobi use |
|---|---|---|---|---|
| llama-3.3-70b | $0.72 | $0.72 | Efficient text tasks | Batch report narrative (optional) |
| llama-4-maverick | $0.35 | $1.15 | Multimodal reasoning, tool calling | Creative brief / asset review (optional) |
Models we do not default to
| Model | Reason |
|---|---|
| Grok, Nemotron, Mistral OCR, TTS, embedding-only | Wrong task fit or no advantage over Gemini/Claude for media-ops agents |
| `gemini-2.5-flash-lite` for tool loops | Weak agentic benchmarks (SWE-bench ~32%); acceptable only with strict JSON schema + no tools |
| Opus / 3.1 Pro for sub-agents | 10–50× token cost vs Flash-Lite with no QC benefit on simple tasks |
Per-task LLM cost catalog
Vertex global standard pricing. Cost formula:
cost = (input_tokens / 1M × input_rate) + (output_tokens / 1M × output_rate)
Token assumptions are planning defaults for a single successful run of that step. 3.5 Flash output column includes thinking tokens billed at output rate. QC checker runs are separate line items — gated main tasks always incur main + QC (see pairing table). Correction loops add ~6–8% on gated tasks with 3.5 QC (see loop limits). Context caching on system prompts typically reduces input cost 30–50% in steady state — figures below are without cache (conservative).
| Task ID | Task | Agent / layer | Model | Thinking | In tok | Out tok† | Cost / run |
|---|---|---|---|---|---|---|---|
| `route.dispatch` | Route Pub/Sub event to agent + playbook | Router | 3.1 Flash-Lite | minimal | 3K | 0.5K | $0.0015 |
| `route.classify` | Classify complexity + model tier | Router | 3.1 Flash-Lite | low | 5K | 0.8K | $0.0025 |
| `onboard.step` | One onboarding sub-step (account, tag, verify) | Onboarding sub | 3.5 Flash | medium | 12K | 4.5K | $0.059 |
| `plan.draft` | Master plan or full replan (envelope + tracks) | Media Plan | 3.1 Pro Preview | — | 30K | 10K | $0.18 |
| `plan.track.draft` | New track (branding, engagement, always-on split) | Media Plan | 3.1 Pro Preview | — | 20K | 6K | $0.11 |
| `plan.event.draft` | Special-day / event flight (time-boxed) | Media Plan | 3.5 Flash | medium | 14K | 4.5K | $0.062 |
| `plan.revise` | Per-track revise vN+1 from opt log + manifest | Plan Revise | 3.1 Pro Preview | — | 22K | 7K | $0.13 |
| `report.plan_drift` | Per-track drift + changelog (may batch tracks) | Reporting | 3.5 Flash | low | 25K | 4K | $0.074 |
| `qc.plan` | Plan validator vs brief | QC | 3.5 Flash | high | 18K | 5.5K | $0.077 |
| `qc.compliance` | Creative / copy policy gate | QC | 3.5 Flash | high | 12K | 4K | $0.054 |
| `qc.compliance.escalate` | Cross-model second opinion (rare) | QC | Sonnet 4.6 | — | 12K | 2.5K | $0.074 |
| `qc.tracking` | Tracking health sweep | QC | 3.5 Flash | low | 10K | 2K | $0.033 |
| `qc.spend` | Spend guardrail vs approved plan | QC | 3.5 Flash | medium | 10K | 2.5K | $0.038 |
| `exec.campaign.build` | Build one platform campaign slice | Execution sub | 3.5 Flash | medium | 40K | 10K | $0.15 |
| `exec.campaign.mutate` | Pause, budget nudge, status change | Execution sub | 3.5 Flash | medium | 15K | 4K | $0.059 |
| `opt.cycle` | Analyze performance + propose change | Optimization | 3.5 Flash | medium | 22K | 7.5K | $0.101 |
| `feed.validate` | Feed / catalog batch validation | Onboarding / Execution sub | 3.5 Flash | low | 8K | 2K | $0.030 |
| `report.daily` | Daily KPI digest | Reporting | 3.1 Flash-Lite Batch | low | 35K | 4K | $0.007 |
| `report.weekly` | Weekly client narrative | Reporting | 3.1 Flash-Lite | low | 45K | 6K | $0.020 |
| `report.anomaly` | Anomaly triage + recommendation | Reporting | 3.5 Flash | medium | 28K | 9K | $0.123 |
| `orch.escalate` | Orchestrator re-plan / conflict resolution | Orchestrator | 3.1 Pro Preview | — | 20K | 5K | $0.10 |
† For gemini-3.5-flash, Out tok = visible output + thinking tokens (billed at output rate per Vertex pricing).
Batch API for scheduled / non-urgent tasks
Vertex Batch API = 50% discount on eligible Gemini models, in exchange for async completion (up to ~24h). Use it only for scheduled, non-blocking work — never for real-time routing, gated QC, platform mutations, interactive planning, or anomaly alerts (<1h SLA).
| Task | Standard / run | Batch / run (−50%) | Eligible? |
|---|---|---|---|
| report.daily | $0.007 | $0.007 | ✓ already batched |
| report.weekly | $0.020 | $0.010 | ✓ scheduled |
| report.plan_drift | $0.074 | $0.037 | ✓ scheduled (per-track rollup) |
| feed.validate (scheduled catalog sweep) | $0.030 | $0.015 | ✓ when not pre-launch gating |
| report.monthly (exec summary) | ~$0.15 | ~$0.075 | ✓ scheduled |
| route.\*, plan.\*, qc.\*, exec.\*, opt.cycle, report.anomaly, orch.escalate | — | — | ✗ real-time / gated / interactive |
Reality check: batch trims the reporting + feed-sweep slice only (~2–4% of total). The real cost drivers — optimization cycles, execution mutations, QC gates — are real-time and stay at standard rates. Batch is worthwhile (free 50% on eligible tasks) but is not where the savings concentrate; context caching matters far more.
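The eligibility rules in the table above can be encoded as a small lookup so that schedulers do not route real-time work to batch by accident. This is a sketch only: the names `is_batch_eligible`, `batch_cost`, and the `gates_launch` flag are hypothetical, and the rule that any other `report.` task is eligible is an assumption generalized from the listed rows.

```python
# Task families the table marks as real-time / gated / interactive.
REALTIME_PREFIXES = ("route.", "plan.", "qc.", "exec.")
REALTIME_TASKS = {"opt.cycle", "report.anomaly", "orch.escalate"}
BATCH_DISCOUNT = 0.50  # 50% off on eligible Gemini models


def is_batch_eligible(task_id: str, gates_launch: bool = False) -> bool:
    """Return True if the task may run through the Batch API."""
    if task_id in REALTIME_TASKS or task_id.startswith(REALTIME_PREFIXES):
        return False
    if task_id == "feed.validate":
        # Scheduled catalog sweeps only; never when gating a launch.
        return not gates_launch
    # Assumption: remaining scheduled report.* tasks are eligible.
    return task_id.startswith("report.")


def batch_cost(task_id: str, standard_cost: float,
               gates_launch: bool = False) -> float:
    """Apply the discount to a standard-rate cost when eligible."""
    if is_batch_eligible(task_id, gates_launch):
        return standard_cost * (1 - BATCH_DISCOUNT)
    return standard_cost


print(is_batch_eligible("report.weekly"))    # True
print(is_batch_eligible("report.anomaly"))   # False
print(f"{batch_cost('report.weekly', 0.020):.3f}")  # 0.010
```

Note that the catalog price for report.daily is already the batch price, so it should not be passed through `batch_cost` a second time.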
Monthly task volume & LLM cost by client profile
Illustrative steady-state volumes (after onboarding month). Costs assume multiple concurrent plan tracks per client (always-on, branding, engagement, special-day events) — not a single monolithic plan. See plan tracks.
QC pairing: each plan.* row includes + qc.plan ($0.077) unless noted.
Profile A — Starter (Google + Meta, low activity)
~$15–25K/mo media spend, 2 active tracks (always-on + seasonal/event), optimization 2–3×/week.
| Task ID | Runs / month | Unit cost (incl. QC where gated) | Monthly |
|---|---|---|---|
| route.dispatch | 320 | $0.0015 | $0.48 |
| route.classify | 110 | $0.0025 | $0.28 |
| Planning (multi-track) | | | |
| plan.draft master refresh | 0.25 | $0.257 | $0.06 |
| plan.track.draft + qc.plan | 0.5 | $0.187 | $0.09 |
| plan.event.draft + qc.plan | 1 | $0.139 | $0.14 |
| plan.revise + qc.plan | 2 | $0.207 | $0.41 |
| report.plan_drift (2 tracks) | 4 | $0.074 | $0.30 |
| qc.compliance | 5 | $0.054 | $0.27 |
| exec.campaign.build | 6 | $0.15 | $0.90 |
| exec.campaign.mutate | 4 | $0.059 | $0.24 |
| opt.cycle + qc.spend | 24 + 24 | $0.139 | $3.34 |
| qc.tracking | 30 | $0.033 | $0.99 |
| report.daily + report.weekly | 30 + 4 | — | $0.29 |
| report.anomaly | 2 | $0.123 | $0.25 |
| Correction-loop overhead (~8% of gated) | — | — | ~$0.52 |
| LLM subtotal (standard) | — | — | ~$8.55 |
- With context caching (~40% input savings): **$6.40–6.90 / client / month**
- With caching + batch on scheduled reports (saves ~$0.19): **$6.25–6.70 / client / month**
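The Starter subtotal is a plain sum of runs × unit cost plus the correction-loop overhead. The sketch below reproduces it from the table's figures; the list name `PROFILE_A` and the helper `monthly_subtotal` are hypothetical, and report.daily and report.weekly are split into two rows using their catalog unit costs.

```python
# (task, runs per month, unit cost in USD incl. QC where gated)
PROFILE_A = [
    ("route.dispatch", 320, 0.0015),
    ("route.classify", 110, 0.0025),
    ("plan.draft master refresh", 0.25, 0.257),
    ("plan.track.draft + qc.plan", 0.5, 0.187),
    ("plan.event.draft + qc.plan", 1, 0.139),
    ("plan.revise + qc.plan", 2, 0.207),
    ("report.plan_drift", 4, 0.074),
    ("qc.compliance", 5, 0.054),
    ("exec.campaign.build", 6, 0.15),
    ("exec.campaign.mutate", 4, 0.059),
    ("opt.cycle + qc.spend", 24, 0.139),
    ("qc.tracking", 30, 0.033),
    ("report.daily", 30, 0.007),
    ("report.weekly", 4, 0.020),
    ("report.anomaly", 2, 0.123),
]
CORRECTION_OVERHEAD = 0.52  # ~8% of gated tasks, taken from the table


def monthly_subtotal(rows, overhead: float) -> float:
    """Sum runs x unit cost over all rows, then add loop overhead."""
    return sum(runs * unit for _, runs, unit in rows) + overhead


total = monthly_subtotal(PROFILE_A, CORRECTION_OVERHEAD)
print(f"${total:.2f}")  # $8.55
```

Because opt.cycle + qc.spend alone contributes about $3.34 of the total, changing optimization frequency moves this number far more than any reporting row.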
Profile B — Standard (Google + Meta + GA4 + CRM, active optimization)
~$50–150K/mo media spend, 3–4 active tracks (always-on, branding, engagement, seasonal), daily optimization.
| Task ID | Runs / month | Unit cost (incl. QC where gated) | Monthly |
|---|---|---|---|
| route.dispatch | 700 | $0.0015 | $1.05 |
| route.classify | 240 | $0.0025 | $0.60 |
| Planning (multi-track) | | | |
| plan.draft master refresh | 0.25 | $0.257 | $0.06 |
| plan.track.draft + qc.plan | 1 | $0.187 | $0.19 |
| plan.event.draft + qc.plan | 2 | $0.139 | $0.28 |
| plan.revise + qc.plan | 4 | $0.207 | $0.83 |
| report.plan_drift (3–4 tracks) | 8 | $0.074 | $0.59 |
| qc.compliance | 14 | $0.054 | $0.76 |
| exec.campaign.build | 22 | $0.15 | $3.30 |
| exec.campaign.mutate | 14 | $0.059 | $0.83 |
| opt.cycle + qc.spend | 95 + 95 | $0.139 | $13.21 |
| qc.tracking | 30 | $0.033 | $0.99 |
| report.daily + report.weekly | 30 + 4 | — | $0.29 |
| report.anomaly | 6 | $0.123 | $0.74 |
| orch.escalate | 2 | $0.10 | $0.20 |
| qc.compliance.escalate | 1 | $0.074 | $0.07 |
| Correction-loop overhead (~8% of gated) | — | — | ~$1.64 |
| LLM subtotal (standard) | — | — | ~$25.60 |
- With context caching (~40% input savings): **$19–22 / client / month**
- With caching + batch on scheduled reports (saves ~$0.34): **$18.70–21.70 / client / month**
Profile C — Ecommerce (feeds, catalog, high optimization)
~$150K+/mo media spend, 4–6 active tracks (+ flash sales, catalog pushes), hourly pacing.
| Task ID | Runs / month | Unit cost (incl. QC where gated) | Monthly |
|---|---|---|---|
| Profile B planning + ops base | — | — | $25.60 |
| Extra planning (events / tracks) | | | |
| plan.event.draft + qc.plan | +2 | $0.139 | +$0.28 |
| plan.revise + qc.plan | +2 | $0.207 | +$0.41 |
| plan.track.draft + qc.plan | +0.5 | $0.187 | +$0.09 |
| report.plan_drift | +4 | $0.074 | +$0.30 |
| feed.validate | 30 | $0.030 | $0.90 |
| Extra opt.cycle + qc.spend | +45 +45 | $0.139 | +$6.26 |
| Extra exec.campaign.build | +12 | $0.15 | +$1.80 |
| Extra qc.compliance | +6 | $0.054 | +$0.32 |
| Extra route.dispatch | +250 | $0.0015 | +$0.38 |
| Correction-loop overhead (extra gated) | — | — | ~$0.80 |
| LLM subtotal (standard) | — | — | ~$37.15 |
- With context caching (~40% input savings): **$27–31 / client / month**
- With caching + batch on scheduled reports + feed sweeps (saves ~$0.93): **$26.50–30 / client / month**
Onboarding month (one-time, any profile)
| Task ID | Runs (typical) | Unit cost | One-time |
|---|---|---|---|
| onboard.step | 18–25 | $0.059 | $1.05–1.48 |
| route.dispatch + route.classify | 80 + 40 | $0.0015 + $0.0025 | $0.22 |
| qc.tracking (setup) | 10 | $0.033 | $0.33 |
| Onboarding LLM | — | — | ~$1.60–2.00 |
Portfolio-level estimate
| Portfolio size | Profile mix | Standard (no cache) | + caching | + caching & batch |
|---|---|---|---|---|
| 10 clients | 6 Starter + 4 Standard | ~$154 | ~$120 | ~$118 |
| 50 clients | 25 + 20 + 5 Ecommerce | ~$912 | ~$712 | ~$707 |
| 200 clients | 100 + 80 + 20 | ~$3,650 | ~$2,845 | ~$2,830 |
Assumes multi-track planning volumes above; actuals scale with count of active event/sale plans per client. Batch column reflects scheduled reporting + feed sweeps only — the marginal step beyond caching is small because cost concentrates in real-time optimization, execution, and QC.
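The "Standard (no cache)" column is the profile subtotals weighted by client count. A minimal sketch of that arithmetic follows; the dictionary `SUBTOTAL` and the function `portfolio_cost` are hypothetical names, and the per-profile figures are the standard subtotals from the profile tables.

```python
# Standard (no cache) monthly LLM subtotal per client, in USD.
SUBTOTAL = {"starter": 8.55, "standard": 25.60, "ecommerce": 37.15}


def portfolio_cost(mix: dict) -> float:
    """Monthly LLM cost for a mix of {profile: client_count}."""
    return sum(SUBTOTAL[profile] * count for profile, count in mix.items())


portfolios = {
    "10 clients": {"starter": 6, "standard": 4},
    "50 clients": {"starter": 25, "standard": 20, "ecommerce": 5},
    "200 clients": {"starter": 100, "standard": 80, "ecommerce": 20},
}
for name, mix in portfolios.items():
    print(f"{name}: ${portfolio_cost(mix):,.2f}")
```

The results (about $153.70, $911.50, and $3,646) match the table's rounded ~$154, ~$912, and ~$3,650.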
Note: These figures cover Vertex LLM inference only — not media spend, Kobi management fees, Cloud Run, BigQuery, or platform API costs. Log actual run_id token usage to BigQuery per tenant_id for invoice-grade allocation if pass-through is contracted. See Billing & invoicing for client-facing monthly invoices (media spend + fees).
Realistic cost band (treat point estimates as the expected case)
The per-task tokens above are single-shot, expected-case assumptions. Two factors can push real cost up materially; size budgets with a band, not a point:
| Factor | Effect | Multiplier on affected tasks |
|---|---|---|
| Thinking tokens at high | 3.5 Flash high can emit far more reasoning than the 4–5.5K assumed for qc.plan / qc.compliance / multi-constraint builds | output tokens ×1.5–2.5 |
| Tool-loop context re-send | Each tool round re-sends growing context; catalog prices exec.* / opt.cycle as a single shot | input tokens ×2–4 over 3–5 rounds |
| Transient retries | Orchestrator retry (1×) + occasional platform re-fetch | +5–10% on tool tasks |
| Profile | Low (caching + batch) | Expected (standard) | High (thinking + tool-loops) |
|---|---|---|---|
| Starter | ~$6.40 | ~$8.55 | ~$15–18 |
| Standard | ~$19 | ~$25.60 | ~$48–55 |
| Ecommerce | ~$27 | ~$37.15 | ~$72–82 |
Even the high case stays a tiny fraction of media spend and Kobi fees — but design for the high band, then measure real run_id usage in a pilot to replace these assumptions.
Actual cost is dominated by optimization frequency and multi-turn tool loops — the playbook below targets both.
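To see how the multipliers in the factor table widen a single task's cost, the sketch below applies them to opt.cycle (22K in, 7.5K out). Several things here are assumptions: the function name `cost_band` is hypothetical; the rates of $1.50 / $9.00 per 1M are not stated in this section and were chosen only because they reproduce the $0.101 listed for opt.cycle, so confirm them against the model catalog; and stacking all three factors on one task is a deliberately pessimistic simplification.

```python
def cost_band(in_tok, out_tok, in_rate, out_rate,
              in_mult=(2.0, 4.0),    # tool-loop context re-send
              out_mult=(1.5, 2.5),   # thinking tokens at high
              retry=(0.05, 0.10)):   # transient retries
    """Return (expected, stressed_low, stressed_high) cost in USD."""
    def cost(i, o):
        return i / 1e6 * in_rate + o / 1e6 * out_rate

    expected = cost(in_tok, out_tok)
    stressed = [
        cost(in_tok * im, out_tok * om) * (1 + r)
        for im, om, r in zip(in_mult, out_mult, retry)
    ]
    return expected, stressed[0], stressed[1]


# Assumed rates; see the lead-in. Token counts are from the catalog.
expected, low, high = cost_band(22_000, 7_500, in_rate=1.50, out_rate=9.00)
print(f"expected ${expected:.3f}, stressed ${low:.3f} to ${high:.3f}")
```

Under these assumptions the stressed range is roughly 1.7× to 3.3× the expected case for this one task, which is why budgets are better sized as a band than as a point.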
Token-efficiency playbook
Goal: maximum quality per token — structured outputs, cached context, minimal re-prompting.
1. Prompt architecture
| Rule | Implementation |
|---|---|
| Split system vs task | Stable system prompt + tool schemas → Vertex context cache (refresh only on playbook version bump) |
| Structured outputs only | Agents return JSON matching versioned schemas (plan_vN, mutation_manifest, qc_result) — no prose in tool paths |
| Reference by ID | Pass tenant_id, plan_version, entity_id — never re-embed full plan in every sub-agent call; sub-agents fetch slice via tool |
| One job per sub-agent | e.g. meta.campaign.create not meta.everything — smaller context, fewer hallucinated side effects |
2. Context budget
| Context type | Max tokens (target) | Cache? |
|---|---|---|
| System + tool schemas | ≤ 8K | Yes |
| Tenant playbook snippet | ≤ 2K | Yes |
| Task payload (this step only) | ≤ 4K | No |
| Platform API response (trimmed) | ≤ 6K | No |
| Total per sub-agent call | ≤ 20K | — |
Trim platform API responses to fields the schema requires. Store full responses in GCS/BQ; pass URI to reporting agents only.
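A pre-flight check can enforce these targets before a sub-agent call is sent. The sketch below is illustrative: the part names in `BUDGET` and the function `check_context_budget` are hypothetical, and token counts are assumed to be measured by the caller with whatever tokenizer the deployment uses.

```python
# Per-part targets from the context budget table (tokens).
BUDGET = {
    "system_and_tools": 8_000,
    "tenant_playbook": 2_000,
    "task_payload": 4_000,
    "platform_response": 6_000,
}
TOTAL_BUDGET = 20_000


def check_context_budget(usage: dict) -> list:
    """Return a list of human-readable violations (empty if within budget)."""
    violations = [
        f"{part}: {usage.get(part, 0)} > {limit}"
        for part, limit in BUDGET.items()
        if usage.get(part, 0) > limit
    ]
    total = sum(usage.values())
    if total > TOTAL_BUDGET:
        violations.append(f"total: {total} > {TOTAL_BUDGET}")
    return violations


usage = {
    "system_and_tools": 7_000,
    "tenant_playbook": 1_500,
    "task_payload": 3_000,
    "platform_response": 9_000,  # untrimmed API response
}
for violation in check_context_budget(usage):
    print(violation)
```

In this usage example the untrimmed platform response breaks both its own limit and the 20K total, which is the case the trimming rule is meant to prevent.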
3. Tool-call discipline
- Plan tools explicitly — max 3 tools per sub-agent invocation unless router scores complex.
- Idempotent tools — safe to retry; orchestrator dedupes by run_id.
- No LLM in hot path for math — budget splits, bid deltas, % thresholds computed in code; LLM proposes intent, code validates numbers.
- Batch non-urgent work — route all scheduled, non-blocking tasks (daily/weekly/monthly digests, per-track drift reports, scheduled feed sweeps) through Vertex Batch API for a 50% discount on eligible Gemini models. Never batch real-time routing, gated QC, mutations, or anomaly alerts. See Batch API for scheduled tasks.
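The "LLM proposes intent, code validates numbers" rule could look like the following sketch, where a proposed per-track budget split is checked deterministically before any mutation is built. The function name `validate_budget_split`, the track names, and the 20% per-track change cap are all hypothetical; real limits would come from the approved plan and guardrails.

```python
from decimal import Decimal


def validate_budget_split(approved_total, current, proposed,
                          max_change_pct=Decimal("20")):
    """Check an LLM-proposed split in code; return a list of errors."""
    errors = []
    if sum(proposed.values()) != approved_total:
        errors.append("proposed budgets do not sum to the approved total")
    for track, new in proposed.items():
        old = current.get(track)
        if not old:  # missing or zero: no baseline to compare against
            errors.append(f"{track}: no current budget to compare against")
            continue
        change_pct = abs(new - old) / old * 100
        if change_pct > max_change_pct:
            errors.append(
                f"{track}: change of {change_pct:.1f}% exceeds {max_change_pct}%"
            )
    return errors


current = {"always_on": Decimal("600"), "branding": Decimal("400")}
proposed = {"always_on": Decimal("650"), "branding": Decimal("350")}
print(validate_budget_split(Decimal("1000"), current, proposed))  # []
```

`Decimal` is used so that the sum check is exact; with floats, a split that is correct on paper can fail an equality test.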
4. Model promotion / demotion (automatic)
| Signal | Action |
|---|---|
| QC fail rate > 5% for task class (rolling 24h) | Promote task class one tier (e.g. Flash-Lite → 3.5 Flash) |
| QC success rate < 80% (task or subtask grain) | Alert + immediate promote + thinking bump — see QC success threshold alerts |
| QC pass rate > 99% for 7d on task class | Trial demote one tier; revert if fails spike |
| p95 latency > SLA | Lower thinking level or switch to Flash-Lite |
| Human escalation on reasoning | Pin task class to T2+ for 30d |
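The signal table can be read as a small decision function. The sketch below is one possible encoding: the class `TaskClassStats`, the action strings, and especially the order in which the rules are evaluated are assumptions, since the table does not state which signal wins when several fire at once.

```python
from dataclasses import dataclass


@dataclass
class TaskClassStats:
    qc_fail_rate_24h: float   # rolling 24h, as a fraction 0..1
    qc_success_rate: float    # task or subtask grain, 0..1
    qc_pass_rate_7d: float    # 7-day pass rate, 0..1
    p95_latency_s: float
    latency_sla_s: float
    human_escalated: bool = False


def tier_action(s: TaskClassStats) -> str:
    """Map the signals to one action; precedence here is an assumption."""
    if s.human_escalated:
        return "pin_t2_plus_30d"
    if s.qc_success_rate < 0.80:
        return "alert_promote_and_bump_thinking"
    if s.qc_fail_rate_24h > 0.05:
        return "promote_one_tier"
    if s.p95_latency_s > s.latency_sla_s:
        return "lower_thinking_or_switch_flash_lite"
    if s.qc_pass_rate_7d > 0.99:
        return "trial_demote_one_tier"
    return "no_change"


stats = TaskClassStats(qc_fail_rate_24h=0.08, qc_success_rate=0.92,
                       qc_pass_rate_7d=0.93, p95_latency_s=4.0,
                       latency_sla_s=10.0)
print(tier_action(stats))  # promote_one_tier
```

Quality signals are checked before latency and demotion so that a task class is never demoted while it is failing QC.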
5. Playbook registry (per tenant / vertical)
Versioned artifacts cached in Vertex context cache:
| Playbook | Contents | Changes when |
|---|---|---|
| playbook.routing | Task class → model tier + thinking level | Model catalog update |
| playbook.vertical.{health\|school\|…} | Compliance keywords, blocked claims, KPI defaults | Vertical config change |
| playbook.platform.{google\|meta\|…} | Tool allowlist, API field maps, rate-limit hints | Platform spec update |
| playbook.qc | QC agent checklist templates | Policy change |
Router selects playbook IDs; agents never inline full vertical rules in every prompt.
6. Output quality without extra tokens
- Two-pass by design: every gated main task → paired QC checker (see §5). Max 2 correction loops, then human. No third QC pass unless compliance escalates to Sonnet.
- Confidence gate: if router confidence < 0.85 on classification, escalate to medium thinking or T1 model — cheaper than a failed tool loop + retry.
- Log token usage per run_id, agent, model, thinking_level → BigQuery for cost attribution per tenant.
- Log every QC loop — agent.qc.result + agent.qc.loop with agent, model, task, playbook versions, and input/doc refs (see QC loop telemetry).
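The correction-loop limit can be sketched as a bounded loop around the main task and its QC checker. This is an illustration under a stated reading of the rule: one initial attempt plus up to two corrected attempts, each re-checked by QC, then handoff to a human. The function `run_gated_task` and the callables `run_main` and `run_qc` are hypothetical stand-ins for the real agents.

```python
from typing import Callable, Optional, Tuple

MAX_CORRECTION_LOOPS = 2  # after this, a human takes over


def run_gated_task(
    run_main: Callable[[Optional[str]], dict],
    run_qc: Callable[[dict], Tuple[bool, str]],
) -> Tuple[str, dict, int]:
    """Run main task + QC; return (status, last_output, corrections_used)."""
    feedback: Optional[str] = None
    output: dict = {}
    for corrections_used in range(MAX_CORRECTION_LOOPS + 1):
        # First pass has no feedback; later passes get the QC findings.
        output = run_main(feedback)
        passed, feedback = run_qc(output)
        if passed:
            return "approved", output, corrections_used
    return "escalate_to_human", output, MAX_CORRECTION_LOOPS


# Usage with stub agents: QC rejects the first draft, accepts the second.
drafts = iter([{"budget": 1200}, {"budget": 1000}])
status, output, used = run_gated_task(
    run_main=lambda feedback: next(drafts),
    run_qc=lambda out: (out["budget"] <= 1000, "budget exceeds approved plan"),
)
print(status, used)  # approved 1
```

Keeping the limit in one constant makes it straightforward to emit the loop count alongside each QC result in telemetry.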
Failure handling
| Failure | Behavior |
|---|---|
| Platform API rate limit | Exponential backoff; partial apply with rollback marker |
| Agent timeout | Orchestrator retries once; then red flag (A8) if still failing |
| Loop limit exceeded | Red flag (A8) + agent.loop.exhausted; no further auto-retry |
| Cost Guard trip (≥3× estimate) | Terminate run_id; A9 + cost.guard.tripped; all Vertex calls blocked |
| Guardrail violation | Block action; create approval ticket |
| Tracking down | Pause optimization spend increases; alert ops |
| Repeated A8 on same task class (≥3 in 24h) | Escalate to engineering + optional tenant pause |
| Repeated A9 on same task class (≥3 in 24h) | Review pricing table / estimate resolver; tighten trip_multiplier or task catalog |
| QC success < 80% (task/subtask, min sample) | agent.qc.threshold.breached → ops alert + auto-optimize; engineering ticket |
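Two of the rules above are simple threshold checks and can be sketched directly: the Cost Guard trip at 3× the estimate, and the "≥3 in 24h" escalation for repeated A8 or A9 flags. The function names `cost_guard_tripped` and `repeated_flags` are hypothetical, and how flag timestamps are stored is left as an assumption.

```python
from datetime import datetime, timedelta

TRIP_MULTIPLIER = 3.0  # Cost Guard trips at >= 3x the estimate


def cost_guard_tripped(estimate_usd: float, spent_usd: float,
                       trip_multiplier: float = TRIP_MULTIPLIER) -> bool:
    """True when a run has spent at least trip_multiplier x its estimate."""
    return spent_usd >= trip_multiplier * estimate_usd


def repeated_flags(timestamps, now: datetime, threshold: int = 3,
                   window: timedelta = timedelta(hours=24)) -> bool:
    """True when at least `threshold` flags fall inside the window."""
    recent = [t for t in timestamps if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold


now = datetime(2025, 1, 2, 12, 0)
a8_flags = [now - timedelta(hours=h) for h in (1, 5, 20, 30)]
print(cost_guard_tripped(estimate_usd=0.10, spent_usd=0.35))  # True
print(repeated_flags(a8_flags, now))  # True: three flags within 24h
```

On a trip, the table calls for terminating the run_id and blocking further Vertex calls, so a check like this would sit in front of every model invocation rather than run after the fact.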