Architecture · Draft

Agentic Orchestration Model

Created 9 Jun 2026·Updated 11 Jun 2026

Latest change: Publish Dossier site and full doc pack to GitHub

Draft document: the deep-dive spec is incomplete, and content will be updated before and during the build. Do not treat it as signed-off implementation detail.

Overview

Digital media operations decompose into specialized agents supervised by a central orchestrator. Agents call domain services and platform connectors; they do not hold long-lived credentials directly — services inject scoped tokens per request.

Deployment: agents run on Vertex AI Agent Engine; models are served from Vertex AI Model Garden (see GCP deployment topology — AI & agent runtime). Cloud Run hosts the orchestrator control plane, connectors, and HITL API.

Agent hierarchy

Five layers (L0–L4), each with a defined model tier (see Model catalog):

Diagram: agent hierarchy

  • Orchestrator Agent, with the Router / classifier
  • Onboarding Agent. Sub: verification / feed / tracking setup
  • Media Plan Agent. Sub: channel mix / budget split / creative brief
  • Execution Agent. Sub: per-platform builders (Google / Meta / TikTok / DV360)
  • Optimization Agent. Sub: bid rules / audience / creative rotation
  • Reporting Agent. Sub: anomaly detect / narrative / CRM reconcile
  • QC: plan validator, compliance & policy, tracking health, spend guardrail verifier

| Layer | Agents | Responsibility | Default model tier |
|---|---|---|---|
| L0: Router | Model router, intent classifier, playbook selector | Pick model + thinking level per task; never runs tools directly | T0: gemini-3.1-flash-lite (minimal thinking) |
| L1: Orchestrator | Orchestrator | Lifecycle state machine, event routing, escalation, re-plan triggers | T2: gemini-3.1-pro-preview |
| L2: Domain | Onboarding, Media Plan, Execution, Optimization, Reporting | End-to-end domain workflows with tool calling | T1–T2 (see roster below) |
| L3: Sub-agents | Per-platform builders, feed validators, bid calculators, report summarizers | Single-purpose tool loops; narrow context; each paired with a QC sub-agent | T0–T1 |
| L4: QC / checker sub-agents | One checker per high-stakes main task (see pairing table) | Independent Q/A verification; can block, request correction, or escalate | T0–T2 (Sonnet only on escalation) |

Agent roster

| Agent | Triggers | Outputs | Human gate | Suggested model |
|---|---|---|---|---|
| Orchestrator | Events, schedules, human approvals | Task routing, state machine transitions | Escalations only | gemini-3.1-pro-preview |
| Router / classifier | Every inbound agent task | Model tier, thinking level, playbook ID | None | gemini-3.1-flash-lite (minimal) |
| Onboarding | New client intake | Account map, tracking checklist, verification status | BM access grant, agency billing profile | gemini-3.5-flash (medium) |
| Media Plan | Brief, budget, vertical rules | Master plan, track drafts (always-on, branding, engagement), event drafts (special days) | Plan approval per track | gemini-3.1-pro-preview / gemini-3.5-flash (events) |
| Plan Revise | Drift threshold, reporting signal, client request | Per-track vN+1 revise/replan + diff vs manifest slice | Same as plan approval (per track_id) | gemini-3.1-pro-preview |
| Execution | Approved plan version | Campaign structures on platforms | Launch confirmation if policy requires | gemini-3.5-flash (medium; high for multi-platform) |
| Optimization | Performance deltas, rules | Bid/budget/audience/creative changes | Changes above threshold | gemini-3.5-flash (medium) |
| Reporting | Schedule, ad-hoc request | Reports, anomalies, opt changelog, plan drift, revise recommendations | None (read-only) | gemini-3.1-flash-lite (low) + Batch for digests |
| QC: Plan validator | Plan draft ready | Pass/fail + structured diffs vs brief | Blocks approval on fail | gemini-3.5-flash (high) |
| QC: Compliance | Creative / copy before launch | Policy flags (health, education, claims) | Blocks launch on fail | gemini-3.5-flash (high) |
| QC: Tracking health | Pre-launch, daily sweep | Green/amber/red checklist | Blocks spend increases on red | gemini-3.5-flash (low) |
| QC: Spend guardrails | Optimization proposals | Approve / escalate vs approved plan caps | Escalates above threshold | gemini-3.5-flash (medium) |

State machine (client lifecycle)

Diagram: client lifecycle state machine

  • States: Intake, Onboarding, TrackingSetup, Planning, AwaitingApproval, Executing, Live, Optimizing, Replanning, PlanRevise, Paused
  • Transition labels: contract signed; accounts provisioned; GA4 + tags validated; plan generated; human approves; campaigns active; performance data available; changes applied; material strategy shift; drift threshold or reporting signal; revise or replan draft vN+1; new plan version; budget or client hold; resume approved

Orchestration patterns

1. Plan–execute separation

  • Agents never execute spend against an unapproved plan version.
  • Plan document is immutable once approved; changes create plan_vN+1.

2. Guardrails (hard limits)

Examples (exact values configured per tenant):

  • Max daily budget delta without approval: e.g. 20%
  • Blocked actions: delete conversion actions, change billing account
  • Required checks before launch: tracking health = green, feed errors = 0 critical
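
The hard limits above could be evaluated in code before a proposed change reaches a platform connector. The sketch below is one way to do that, not the specified implementation: the `Guardrails` and `ProposedChange` types, their field names, and the action strings are hypothetical, and only the example values (20% delta, the two blocked actions, green tracking, zero critical feed errors) come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    # Example values from the list above; real values are configured per tenant.
    max_daily_budget_delta: float = 0.20
    blocked_actions: frozenset = frozenset(
        {"delete_conversion_action", "change_billing_account"}
    )


@dataclass(frozen=True)
class ProposedChange:
    action: str                      # hypothetical action name
    current_daily_budget: float
    proposed_daily_budget: float
    tracking_health: str = "green"   # green | amber | red
    critical_feed_errors: int = 0
    is_launch: bool = False


def check_guardrails(change: ProposedChange, rails: Guardrails) -> list[str]:
    """Return violated guardrails; an empty list means the change may proceed."""
    violations = []
    if change.action in rails.blocked_actions:
        violations.append("blocked_action")
    if change.current_daily_budget > 0:
        delta = abs(change.proposed_daily_budget - change.current_daily_budget)
        if delta / change.current_daily_budget > rails.max_daily_budget_delta:
            violations.append("budget_delta_needs_approval")
    if change.is_launch:
        if change.tracking_health != "green":
            violations.append("tracking_not_green")
        if change.critical_feed_errors > 0:
            violations.append("critical_feed_errors")
    return violations
```

Returning every violation at once, rather than stopping at the first, gives the human reviewer the full picture in a single escalation.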

3. Tool access

Each agent has an allowlist of tools (service APIs):

Onboarding Agent → onboarding.createAccount, verification.request, bm.linkAsset
Plan Agent → planning.draftMaster, planning.draftTrack, planning.draftEvent, planning.validate
Plan Revise Agent → planning.revise, planning.diffManifest, planning.recommendRevise (per `track_id`)
Execution Agent → execution.applyPlan, execution.pauseCampaign, execution.syncManifest
Optimization Agent → optimization.propose, optimization.apply, optimization.computeDrift
Reporting Agent → reporting.changelog, reporting.planDrift, reporting.recommendRevise
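
A deny-by-default check is one simple way to enforce the allowlist above. In this sketch the tool names are taken from the list, while the agent keys, the `ToolNotAllowed` exception, and the `authorize_tool_call` function are hypothetical names chosen for illustration.

```python
# Tool names come from the allowlist above; the agent keys are hypothetical.
TOOL_ALLOWLIST = {
    "onboarding": {"onboarding.createAccount", "verification.request", "bm.linkAsset"},
    "plan": {"planning.draftMaster", "planning.draftTrack",
             "planning.draftEvent", "planning.validate"},
    "plan_revise": {"planning.revise", "planning.diffManifest",
                    "planning.recommendRevise"},
    "execution": {"execution.applyPlan", "execution.pauseCampaign",
                  "execution.syncManifest"},
    "optimization": {"optimization.propose", "optimization.apply",
                     "optimization.computeDrift"},
    "reporting": {"reporting.changelog", "reporting.planDrift",
                  "reporting.recommendRevise"},
}


class ToolNotAllowed(PermissionError):
    """Raised when an agent requests a tool outside its allowlist."""


def authorize_tool_call(agent: str, tool: str) -> None:
    """Deny by default: unknown agents and unlisted tools are both rejected."""
    allowed = TOOL_ALLOWLIST.get(agent, set())
    if tool not in allowed:
        raise ToolNotAllowed(f"{agent} may not call {tool}")
```

Under this sketch, `authorize_tool_call("reporting", "optimization.apply")` raises, which matches the intent that Reporting stays read-only.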

4. Observability

Every agent run produces:

  • run_id, agent, tenant_id, track_id, plan_version
  • Input context hash (no PII in logs)
  • Tool calls and outcomes
  • Human escalation reason if blocked

QC correction loops are logged on every pass/fail and correction attempt — not only when A8 fires. See QC loop telemetry.

5. Per-task QC gates and correction loops

Every mutable or client-visible main task runs through a checker sub-agent (Q/A gate) before the orchestrator marks it complete or applies platform mutations. Read-only tasks (e.g. daily KPI digest) skip QC unless anomaly is detected.

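Diagram: QC correction loop

  • Main sub-agent draft → QC checker sub-agent
  • Pass → Commit / next step
  • Fail → Correction count < 2?
  • Yes → Main sub-agent corrects → QC re-check (pass → Commit / next step; fail → back to the correction count check)
  • No → Red flag HITL + ops alert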

Main task → QC checker pairing

| Main task (L2 / L3) | QC checker (L4) | Blocks on fail? |
|---|---|---|
| plan.draft / plan.track.draft / plan.event.draft | qc.plan | Yes: no human approval until pass |
| plan.revise (per track) | qc.plan | Yes: per track_id |
| exec.campaign.build (creative included) | qc.compliance | Yes: no launch |
| exec.campaign.build (structure only) | qc.tracking + qc.plan (slice vs approved plan) | Yes |
| exec.campaign.mutate | qc.spend + qc.tracking | Yes if over cap or tracking red |
| opt.cycle (per track) | qc.spend + drift check | Yes if over guardrail; may emit plan.revise.recommended |
| onboard.step (tracking-related) | qc.tracking | Yes before go-live |
| feed.validate | qc.plan (catalog vs brief) | Yes if critical errors |
| report.anomaly | qc.plan (vs approved KPIs) | No: recommends per-track revise / replan |
| report.plan_drift |  | No: read-only; per-track weekly rollup |

Compliance QC may chain to qc.compliance.escalate (Sonnet) only on low-confidence or high-risk vertical — not on every run.

Loop limits (hard caps — no infinite loops)

Policy: agents never loop indefinitely. Every loop type has a hard ceiling enforced by the orchestrator in code — not by prompt instruction alone. When a ceiling is hit, the run stops immediately, state is frozen, and a red flag is raised for Kobi ops (see below). No silent retries, no "try one more time" without a new run_id and human acknowledgment.

| Loop type | Max iterations | On exhaustion |
|---|---|---|
| QC correction loop (main → QC → fix → QC) | 2 correction attempts per run_id | Red flag → HITL ticket (A8); block downstream mutations |
| Tool-call loop (single sub-agent, multi-step tools) | 5 tool rounds per invocation | Abort; rollback marker; red flag → HITL (A8) |
| Orchestrator retry (transient API / timeout) | 1 full re-run per run_id | Red flag → HITL (A8) |
| Cross-agent re-dispatch (same task, new agent) | 0 without human; must open new run_id after A8 resolution | Prevents disguised infinite loops |
| Model tier promotion (QC fail rate spike) | Policy-driven, not a per-request loop | Promote tier for the task class for 24h |
| Global ceiling per run_id | ≤ 8 total LLM steps (main + QC + corrections + tools combined) | Red flag even if individual sub-limits not hit |

Configurable per tenant in loop_policy (implementation phase); defaults above are maximums — tenants may be stricter, never looser without admin override.
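
Because the ceilings are enforced in orchestrator code rather than by prompt, a counter object that raises on exhaustion is a natural shape. The sketch below covers only the QC correction cap (2) and the global LLM-step ceiling (8). `LoopBudget`, `LoopExhausted`, and `run_gated_task` are hypothetical names, and the `draft`, `check`, and `correct` callables stand in for the main agent and QC checker.

```python
from dataclasses import dataclass


class LoopExhausted(Exception):
    """Signals that a hard ceiling was hit; the caller freezes the run and raises A8."""

    def __init__(self, loop_type: str, attempt_count: int):
        super().__init__(f"{loop_type} exhausted after {attempt_count}")
        self.loop_type = loop_type
        self.attempt_count = attempt_count


@dataclass
class LoopBudget:
    # Defaults from the loop-limit table; tenants may only tighten them.
    max_corrections: int = 2
    max_llm_steps: int = 8
    corrections: int = 0
    llm_steps: int = 0

    def record_llm_step(self) -> None:
        self.llm_steps += 1
        if self.llm_steps > self.max_llm_steps:
            raise LoopExhausted("global_ceiling", self.llm_steps)

    def record_correction(self) -> None:
        if self.corrections >= self.max_corrections:
            raise LoopExhausted("qc_correction", self.corrections)
        self.corrections += 1


def run_gated_task(draft, check, correct, budget: LoopBudget):
    """Main -> QC -> fix -> QC, stopping at the hard caps instead of retrying."""
    budget.record_llm_step()
    output = draft()
    while True:
        budget.record_llm_step()
        failures = check(output)          # empty list means QC pass
        if not failures:
            return output
        budget.record_correction()        # raises once two corrections are used
        budget.record_llm_step()
        output = correct(output, failures)
```

Counting steps before each call means the ninth LLM step is never issued, which matches the "≤ 8 total LLM steps" wording.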

Red flag on loop exhaustion

When any loop limit is exceeded or the global run_id ceiling is hit:

  1. Stop — no further agent or tool calls for that run_id; partial platform mutations rolled back or marked needs_review
  2. Emit agent.loop.exhausted with tenant_id, track_id, agent, loop_type, attempt_count, last_qc_failure, run_id
  3. HITL ticket (A8) — appears in Human Touch inbox with red priority, SLA timer, structured summary, and recommended actions; full step trace in System Ops
  4. Ops alert — notify on-call / ops channel (email, Slack, or PagerDuty — TBD); include tenant, task, and link to ticket
  5. Client impact guard — block spend increases, launches, and optimization applies tied to that run_id until A8 is resolved or explicitly overridden (A6) by admin
  6. No auto-restart — same logical task requires human to approve a new run_id or take manual action; prevents loop-until-lucky behavior

Diagram: red-flag flow

  • Loop limit hit → Freeze run_id → agent.loop.exhausted → A8 red flag ticket and Ops notification → Human investigates
  • Fix + new run → New run_id approved
  • Manual override → A6 admin override
  • Abandon → Mark task cancelled

QC checker receives: main agent output JSON, approved plan_version, relevant playbook slice, and only the fields needed to verify — not the full conversation history.

QC loop telemetry

Deterministic orchestrator logging (not LLM-generated summaries) records every QC gate and correction loop so ops can answer: which agents fail most, on which models, in which tasks, with which inputs/docs?

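Diagram: QC telemetry flow

  • Main agent output → QC checker → agent.qc.result
  • On fail: Correction attempt N → agent.qc.loop → Main corrects → QC checker
  • agent.qc.result and agent.qc.loop → BigQuery agent_qc_* → System Ops QC panel, Model promotion policy, QC threshold alert job
  • QC threshold alert job → agent.qc.threshold.breached when the rate is below 80%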

Emit on every QC invocation (agent.qc.result):

| Field | Example | Notes |
|---|---|---|
| run_id, step_index | run_abc, 3 | Tie to Cost Guard token rows |
| tenant_id, track_id, platform | t_12, branding_q2, meta | Where in the portfolio |
| main_task_id, main_agent | opt.cycle, optimization.meta | Failing worker |
| qc_task_id, qc_agent | qc.spend, qc.spend.checker | Which checker rejected |
| attempt_number | 0 = first QC; 1 = after 1st correction | Loop depth |
| outcome | pass / fail / escalate | escalate → qc.compliance.escalate |
| failure_codes[] | budget_over_cap, tracking_pixel_missing | From qc_result schema: structured, not prose |
| model_main, model_qc | gemini-3.5-flash, gemini-3.5-flash | Models on this step |
| thinking_level_main, thinking_level_qc | medium, high | Per thinking matrix |
| plan_version, manifest_slice_id | v14, mslice_9f2 | Approved plan context |
| playbook_versions | {routing: 3, vertical: 2, platform.meta: 5, qc: 4} | Which rule packs were active |
| context_refs[] | [{type: brief, id: b_7, v: 2}, {type: opt_log, id: cs_44}] | Inputs/docs by reference, not full text in BQ |
| input_slice_hash, main_output_hash | SHA-256 | Dedup / join without PII |
| input_slice_uri | gs://…/runs/run_abc/step_3_input.json | Full structured input in GCS; BQ holds URI only |
| qc_feedback_uri | gs://…/runs/run_abc/step_3_qc.json | Checker JSON (failed_checks, suggested_fixes) |
| latency_ms, input_tokens, output_tokens | from usage_metadata | Per-step cost attribution |
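
To illustrate the "refs + hashes only" rule, the sketch below assembles a subset of the agent.qc.result fields and hashes the input and output with SHA-256 over canonical JSON. The function names and the choice of canonical-JSON hashing are assumptions; the spec names the fields and the hash algorithm but not how the hash input is serialized.

```python
import hashlib
import json


def slice_hash(payload: dict) -> str:
    """SHA-256 over canonical JSON, so equal payloads always hash the same."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_qc_result(run_id, step_index, attempt_number, outcome, failure_codes,
                    input_slice, main_output, input_slice_uri):
    """Assemble a subset of agent.qc.result; raw payloads never enter the row."""
    if outcome not in {"pass", "fail", "escalate"}:
        raise ValueError(f"unknown outcome: {outcome}")
    return {
        "run_id": run_id,
        "step_index": step_index,
        "attempt_number": attempt_number,
        "outcome": outcome,
        "failure_codes": list(failure_codes),
        "input_slice_hash": slice_hash(input_slice),
        "main_output_hash": slice_hash(main_output),
        "input_slice_uri": input_slice_uri,  # reference only; payload lives in GCS
    }
```

Sorting keys before hashing keeps the hash stable across producers that serialize the same input in a different key order, which is what makes it usable for dedup and joins.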

Emit on each correction iteration (agent.qc.loop):

| Field | Purpose |
|---|---|
| correction_number | 1 or 2 (max per loop policy) |
| main_task_id, qc_task_id | Same pairing as above |
| delta_applied | Structured diff summary from main agent fix (fields changed) |
| prior_failure_codes[] | What QC complained about before fix |
| post_fix_outcome | pass / fail on re-check |

On A8 exhaustion, agent.loop.exhausted includes last_qc_failure plus qc_loop_trace_id → full ordered list of agent.qc.result / agent.qc.loop rows for that run_id.

BigQuery tables (partitioned by event_date, clustered by tenant_id, main_task_id):

| Table | Grain | Use |
|---|---|---|
| agent_qc_results | One row per QC invocation | Fail-rate by agent, model, task, platform |
| agent_qc_loops | One row per correction attempt | Which failure codes repeat after fix |
| agent_qc_failures_rollup | Daily materialized view | Top offenders, playbook version regressions |

Example ops queries (illustrative):

  • QC fail rate by main_agent + model_main (rolling 7d)
  • Top failure_codes for qc.compliance on vertical.health
  • Runs where playbook_versions.qc changed and fail rate spiked same day
  • opt.cycle loops where context_refs include stale manifest_slice_id

Dashboard (System Ops): QC health and statistics, with a leaderboard, model breakdown, and drill-down to input_slice_uri / qc_feedback_uri. Human Touch shows breach alerts only, not the statistics console.

Feeds automatic model promotion — rolling 24h fail rate per main_task_id from agent_qc_results drives model promotion / demotion; no manual spreadsheet.

QC success threshold alerts (80% floor)

A deterministic alert job (not an agent) evaluates rollups from agent_qc_results / agent_qc_failures_rollup and fires when a task or subtask cannot maintain the 80% success floor. Below that floor means the pairing is under-performing — prompts, models, playbooks, or context must be optimized, not ignored.

Scope grains (evaluated independently):

| Grain | Key | Example |
|---|---|---|
| Task | main_task_id | opt.cycle across all tenants |
| Subtask | main_task_id + qc_task_id + main_agent | exec.campaign.build + qc.compliance on execution.meta |
| Tenant slice (optional) | above + tenant_id | One client’s plan.revise failing |

Metrics (both computed per grain, rolling window default 24h):

| Metric | Formula | Why |
|---|---|---|
| First-pass rate | count(attempt_number=0 AND outcome=pass) / count(distinct run_id) | Primary signal: cheap runs stay cheap |
| Eventual success rate | count(run ended pass within loop budget) / count(distinct run_id) | Catches fixable vs broken pairs |
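
The two metrics can be computed from agent.qc.result rows for one grain and window, as in the sketch below. It assumes rows are plain dicts with `run_id`, `attempt_number`, and `outcome`, and that a run "ended pass" when its highest-numbered attempt passed. The function name `qc_rates` is hypothetical.

```python
from collections import defaultdict


def qc_rates(rows):
    """Return sample count, first-pass rate and eventual success rate, or None if empty."""
    by_run = defaultdict(list)
    for row in rows:
        by_run[row["run_id"]].append(row)
    total = len(by_run)                      # count(distinct run_id)
    if total == 0:
        return None
    first_pass = 0
    eventual = 0
    for attempts in by_run.values():
        if any(r["attempt_number"] == 0 and r["outcome"] == "pass" for r in attempts):
            first_pass += 1
        # Assumption: the run's final state is its highest-numbered QC attempt.
        last = max(attempts, key=lambda r: r["attempt_number"])
        if last["outcome"] == "pass":
            eventual += 1
    return {
        "sample_count": total,
        "first_pass_rate": first_pass / total,
        "eventual_success_rate": eventual / total,
    }
```

For example, four runs where two pass first time, one passes after a correction, and one exhausts its corrections give a first-pass rate of 0.5 and an eventual success rate of 0.75.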

Alert rule (default):

if sample_count >= min_sample
   AND (first_pass_rate < 0.80 OR eventual_success_rate < 0.80):
     emit agent.qc.threshold.breached
     open optimization ticket (engineering — not client HITL)
     run auto-optimization actions (below)

| Parameter | Default | Notes |
|---|---|---|
| success_floor | 0.80 (80%) | Tenant may be stricter; not looser without admin |
| window | 24h rolling | Also compute 7d trend for dashboard |
| min_sample | 20 runs / grain / window | 10 if tenant-only slice; suppress alert if below (noise) |
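
The default alert rule and its parameters reduce to a small pure function. This is a sketch only: `should_alert` and its argument names are hypothetical, while the floor (0.80) and the sample minimums (20 global, 10 for a tenant-only slice) are the defaults stated above.

```python
def should_alert(first_pass_rate, eventual_success_rate, sample_count,
                 tenant_slice=False, success_floor=0.80,
                 min_sample_global=20, min_sample_tenant=10):
    """True when the grain has enough samples and either rate is below the floor."""
    min_sample = min_sample_tenant if tenant_slice else min_sample_global
    if sample_count < min_sample:
        return False  # too few runs in the window: suppress as noise
    return first_pass_rate < success_floor or eventual_success_rate < success_floor
```

Note that a rate of exactly 0.80 does not alert, since the rule fires only below the floor.
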

Diagram: QC threshold alert flow

  • agent_qc_failures_rollup → QC threshold alert job → rate ≥ 80%?
  • Yes → No action
  • No → agent.qc.threshold.breached → Ops Slack / email / dashboard; Auto-optimize task class; Engineering ticket + sample runs
  • Auto-optimize task class → Model / thinking promotion; Flag playbook_versions for review

On breach (agent.qc.threshold.breached) — payload includes grain, main_task_id, qc_task_id, main_agent, first_pass_rate, eventual_success_rate, sample_count, top_failure_codes[], model_breakdown, playbook_versions_mode, sample_run_ids[].

| Action | Automatic? | Purpose |
|---|---|---|
| Ops alert | Yes | Slack / email / PagerDuty, with link to System Ops QC panel |
| Dashboard badge | Yes | Red row on System Ops leaderboard until rate recovers ≥80% for 24h |
| Model tier promotion | Yes, immediate | Skip normal 24h wait; promote one tier (see model promotion) |
| Thinking level bump | Yes | e.g. medium → high on model_main for that task class |
| Engineering ticket | Yes | Bundle: top failure codes, playbook versions, 3× input_slice_uri / qc_feedback_uri |
| Playbook review flag | Yes | Pin playbook.qc / playbook.platform.* version for human diff |
| Disable QC gate | Never | Quality floor is non-negotiable |
| Client HITL (A1–A9) | No | Internal ops / engineering only unless client-visible task is blocked fleet-wide |

Recovery: alert clears when the same grain stays ≥80% for a full 24h cooldown window (hysteresis — avoids flap).

Optimize what? Use breach payload to pick the lever:

| Dominant signal | Likely fix |
|---|---|
| High failure_codes on one check | Update playbook.qc checklist or move check to code |
| One model_main much worse than others | Promote tier or change default routing |
| Spike after playbook_versions bump | Roll back or patch playbook; regression test |
| High correction count, low first-pass | Tighten main-agent prompt / context budget; raise thinking |
| One tenant only | Tenant brief / manifest data issue; ops contacts AM |

# Illustrative — implementation phase
qc_telemetry_policy:
  log_every_qc_invocation: true
  store_full_input_in_gcs: true      # BQ = refs + hashes only
  retention_days_bq: 400
  retention_days_gcs: 90             # extend on A8 / compliance hold
  rollup_views: [daily, weekly]

qc_threshold_policy:
  success_floor: 0.80                # alert below 80%
  window_hours: 24
  min_sample_global: 20
  min_sample_tenant: 10
  recovery_cooldown_hours: 24
  auto_promote_on_breach: true
  auto_bump_thinking_on_breach: true

Cost Guard — deterministic spend circuit breaker (not AI)

A separate deterministic service — not an agent, not LLM-mediated — sits in front of every Vertex call and enforces hard spend limits per run_id. It uses the per-task cost catalog and a versioned pricing table (model → $/1M input/output) loaded from config. No model chooses whether to stop; the math does.

Why: loop caps limit iterations but not token blowups within a step (thinking tokens, tool-loop context re-send). Cost Guard catches runaway spend even when loop counts are still legal.

Diagram: Cost Guard flow

  • Orchestrator → Cost Guard service → Vertex AI (while under 3x estimate)
  • Vertex AI → usage_metadata → run cost ledger (accumulate actual)
  • At ≥ 3x estimate: Terminate run_id → cost.guard.tripped → A9 HITL + ops alert

| Component | Role | AI? |
|---|---|---|
| Cost Guard service | Pre-check + post-check on every LLM invocation | No: pure code |
| Pricing table | model_id → input/output $/1M (mirrors Vertex list prices) | No |
| Estimate resolver | Maps task_id + planned steps → estimated_cost_usd for run_id | No: reads catalog / composite units |
| Run cost ledger | run_id → {estimated, actual, model_breakdown[]} in Firestore or Redis | No |
| Kill switch | Blocks new Vertex calls; signals orchestrator to abort agents | No |

Single cost formula (same for catalog planning, run budget, and live metering):

step_cost = (input_tokens / 1e6 × price_in[model])
          + (output_tokens / 1e6 × price_out[model])   # output includes thinking tokens
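
As a worked instance of the formula, the sketch below prices a single step. The `PRICING` dict is an assumed stand-in for the versioned pricing table, filled with per-1M rates from the model catalog in this document; `step_cost` is a hypothetical function name.

```python
# Assumed subset of the pricing table, USD per 1M tokens (from the model catalog).
PRICING = {
    "gemini-3.5-flash": {"in": 1.50, "out": 9.00},
    "gemini-3.1-flash-lite": {"in": 0.25, "out": 1.50},
    "gemini-3.1-pro-preview": {"in": 2.00, "out": 12.00},
}


def step_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one LLM step; output_tokens must already include thinking tokens."""
    price = PRICING[model]
    return input_tokens / 1e6 * price["in"] + output_tokens / 1e6 * price["out"]


# opt.cycle planning defaults: 22K input, 7.5K output on gemini-3.5-flash
print(round(step_cost("gemini-3.5-flash", 22_000, 7_500), 4))
```

With those inputs the two terms are 0.033 and 0.0675, so the step costs about $0.1005, which the cost catalog lists as $0.101 for opt.cycle.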

Only the token source differs between estimate and actual:

| Phase | When | Token source | Purpose |
|---|---|---|---|
| Estimate | Run start, before any Vertex call | Catalog In tok / Out tok per planned task_id (see cost catalog) | Budget ceiling for run_id |
| Actual | After every Vertex API response | usage_metadata on the response, never agent self-report | Running spend tally |

Estimate at run start (when router dispatches a gated task):

# Per planned step — same formula as catalog "Cost / run" column
step_estimate = (catalog[task].in_tok / 1e6 × price_in[model])
              + (catalog[task].out_tok / 1e6 × price_out[model])

estimated_run_cost = Σ step_estimate over planned steps   # e.g. opt.cycle + qc.spend
                   × loop_buffer                          # default 1.08 for gated tasks

catalog[task].expected_cost in the table is exactly this math pre-computed; Cost Guard may read either the USD column or recompute from In/Out + pricing_table_version — they must match.

Stored on run_id before the first LLM call. Estimate is immutable for that run unless admin resets (A6).
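
The estimate for a gated optimization cycle can be reproduced as follows. This is a sketch under stated assumptions: `CATALOG` and `PRICING` are hypothetical in-memory stand-ins holding the planning token counts and rates published in this document, and `estimate_run_cost` is an illustrative name.

```python
PRICING = {"gemini-3.5-flash": {"in": 1.50, "out": 9.00}}  # USD per 1M tokens

# Planning token counts copied from the cost catalog for two tasks.
CATALOG = {
    "opt.cycle": {"model": "gemini-3.5-flash", "in_tok": 22_000, "out_tok": 7_500},
    "qc.spend": {"model": "gemini-3.5-flash", "in_tok": 10_000, "out_tok": 2_500},
}


def estimate_run_cost(planned_tasks, loop_buffer=1.08):
    """Sum the per-step estimates for the planned tasks, then apply the loop buffer."""
    total = 0.0
    for task in planned_tasks:
        entry = CATALOG[task]
        price = PRICING[entry["model"]]
        total += (entry["in_tok"] / 1e6 * price["in"]
                  + entry["out_tok"] / 1e6 * price["out"])
    return total * loop_buffer


estimate = estimate_run_cost(["opt.cycle", "qc.spend"])
print(round(estimate, 5))
```

The two steps sum to 0.1005 + 0.0375 = 0.138, and the 1.08 buffer gives about $0.149, in line with the ~$0.15 composite quoted for a gated optimization cycle. At the default 3.0× multiplier this run would trip at roughly $0.447.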

Actual after each LLM response — Vertex returns token counts on every generate call:

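# Map usage_metadata fields (names vary slightly by SDK; normalize in Cost Guard)
# Verify per SDK: in the Gemini API, prompt_token_count already includes cached
# tokens, so cached_content_token_count is a subset of it and must not be added again
input_tokens  = prompt_token_count           # cached portion billed at cached input rate if applicable
output_tokens = candidates_token_count
              + thoughts_token_count         # 3.5 Flash thinking, billed at output rate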

step_cost = (input_tokens / 1e6 × price_in[model])
          + (output_tokens / 1e6 × price_out[model])
actual_run_cost += step_cost

Cost Guard logs per step: run_id, step_index, model, input_tokens, output_tokens, step_cost_usd, actual_run_cost_usd, estimated_run_cost_usd, ratio. Ledger and BigQuery use API-reported tokens only — never prompt guesses or agent claims.

Optional pre-call check (still deterministic): before forwarding a request, Cost Guard may count input tokens in the outbound payload (Vertex tokenizer or count_tokens API) to warn if a single call's prompt alone exceeds 2× catalog.in_tok for that step — does not replace the 3× run-level trip; catches context blowups early.

Trip rule (default):

if actual_run_cost >= trip_multiplier × estimated_run_cost:
    TERMINATE  # default trip_multiplier = 3.0

| Scope | Default trip_multiplier | On trip |
|---|---|---|
| Per run_id | 3.0× estimated | Kill run; A9 ticket; ops alert |
| Per tenant / calendar day (optional cap) | Admin-set USD ceiling | Pause all agent LLM calls for tenant until next day or override |
| Per environment (dev/staging) | Stricter (e.g. 2.0×) | Kill + alert engineering |
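
The per-run trip rule can be sketched as a small ledger that accumulates actual cost and latches once the multiplier is reached. `RunCostLedger` and `CostGuardTripped` are hypothetical names, and the in-memory state stands in for the Firestore or Redis ledger mentioned above. Tenant-day and per-environment caps are out of scope here.

```python
class CostGuardTripped(Exception):
    """Raised when a request arrives for a run that has already tripped."""


class RunCostLedger:
    """In-memory stand-in for the run cost ledger of a single run_id."""

    def __init__(self, run_id: str, estimated_run_cost: float,
                 trip_multiplier: float = 3.0):
        if estimated_run_cost <= 0:
            raise ValueError("estimate must be positive")
        self.run_id = run_id
        self.estimated = estimated_run_cost   # fixed for the life of the run
        self.trip_multiplier = trip_multiplier
        self.actual = 0.0
        self.tripped = False

    def pre_check(self) -> None:
        """Call before forwarding a request; a tripped run stays blocked."""
        if self.tripped:
            raise CostGuardTripped(self.run_id)

    def record(self, step_cost_usd: float) -> float:
        """Call after each response with cost derived from API-reported tokens."""
        self.actual += step_cost_usd
        if self.actual >= self.trip_multiplier * self.estimated:
            self.tripped = True               # latch: only a new run_id clears it
        return self.actual / self.estimated   # ratio for logging
```

Latching the tripped flag, instead of re-evaluating on every call, is what keeps orchestrator retries from reopening a terminated run.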

On terminate (cost.guard.tripped):

  1. Hard stop — orchestrator cancels in-flight agent work; Cost Guard rejects subsequent Vertex requests for that run_id with 403 cost_guard_tripped
  2. Rollback policy — same as A8: rollback or mark needs_review for any partial platform mutations
  3. HITL ticket (A9) — distinct from A8 (loop exhaustion): shows estimated vs actual USD, per-step token breakdown, model used, trip multiplier
  4. Ops alert — red notification to on-call with tenant, run_id, task_id, actual / estimated ratio
  5. No auto-restart: a new run_id requires human acknowledgment; optionally, an admin may raise the estimate ceiling (A6), with an audit log entry
  6. Emit cost.guard.tripped on event bus

Cost Guard cannot be bypassed by agents, prompts, or orchestrator retries. Only admin override (A6) with logged reason may raise the multiplier or authorize a new run with a higher estimate.

# Illustrative — implementation phase
cost_guard_policy:
  trip_multiplier: 3.0          # stop at 3× estimate
  loop_buffer_in_estimate: 1.08 # baked into estimated_run_cost
  pricing_table_version: "2026-06"  # must match catalog revision
  tenant_daily_cap_usd: null    # optional; e.g. 50.00
  block_on_trip: true           # always true in prod

What monthly costs include

Per-task prices in the cost catalog are single successful runs. Monthly totals explicitly count main + QC as separate line items (e.g. opt.cycle + qc.spend).

Correction loops are budgeted separately:

| Assumption | Value | Cost impact |
|---|---|---|
| QC fail rate (steady state) | ~5–8% with 3.5 Flash QC (was ~8–12% on Flash-Lite) |  |
| Avg correction loops when failed | 1.1 (most pass on 1st retry) | +~6–8% on gated-task LLM spend |
| Compliance escalation | ~2% of compliance QC runs | Already line-itemed in Profile B |

Composite unit examples (main + QC + expected loop overhead):

| Composite | Formula | Typical cost / unit |
|---|---|---|
| Optimization cycle (gated, per track) | opt.cycle + qc.spend + 8% loop buffer | ~$0.15 |
| Campaign build (gated) | exec.campaign.build + qc.compliance + qc.tracking + 8% buffer | ~$0.26 |
| Master / full replan | plan.draft + qc.plan + 8% buffer | ~$0.28 |
| New track (branding, engagement) | plan.track.draft + qc.plan + 8% buffer | ~$0.20 |
| Special-day event plan | plan.event.draft + qc.plan + 8% buffer | ~$0.15 |
| Per-track revise | plan.revise + qc.plan + 8% buffer | ~$0.22 |

Monthly profile totals use separate main/QC counts; add ~6–8% for correction-loop overhead on gated tasks (lower with 3.5 QC), or use composite units above when estimating.

Event bus (conceptual)

Async events (illustrative names):

| Event | Producer | Consumers |
|---|---|---|
| client.onboarding.completed | Onboarding Service | Orchestrator, Planning |
| plan.approved | Approval Engine | Orchestrator, Execution |
| optimization.applied | Optimization Service | Reporting, Plan Revise (drift check) |
| plan.revise.recommended | Optimization / Reporting | Plan Revise Agent, HITL dashboard |
| plan.revise.approved | Approval Engine | Execution, Reporting (rebaseline) |
| campaign.live | Execution Service | Optimization, Reporting |
| ga4.tracking.degraded | Tracking Service | Orchestrator (block spend) |
| crm.conversion.batch | CRM Connector | Tracking Service (CAPI) |
| agent.qc.result | Orchestrator (on every QC pass/fail) | BigQuery agent_qc_results, System Ops QC panel |
| agent.qc.loop | Orchestrator (on each correction attempt) | BigQuery agent_qc_loops, model promotion policy |
| agent.qc.threshold.breached | QC threshold alert job (scheduled) | Ops alerting, auto-optimize, engineering ticket |
| agent.loop.exhausted | Orchestrator | Human Touch ticket, System Ops traces, ops alerting, audit log |
| cost.guard.tripped | Cost Guard service | Orchestrator kill, HITL (A9), ops alerting, audit log |

Implementation may use Pub/Sub, Kafka, or equivalent — TBD with stack.

Model strategy and routing

Principles

  1. Capable, never cheap-at-all-costs: sub-agents use the cheapest tier that still passes QC for that task class. If the QC failure rate rises, auto-promote the tier (see playbook §4).
  2. Gemini-first on Vertex: native tool calling, context caching, Agent Engine integration, single billing/IAM surface.
  3. gemini-3.5-flash as the primary agentic surface: a GA model built for agentic execution, tool loops, and sub-agent deployment. Prefer one capable model + thinking_level per task over mixing many smaller models for agent work. See 3.5 Flash thinking matrix.
  4. Compliance QC on Gemini 3.5 Flash: the default compliance gate uses gemini-3.5-flash at high thinking (GA, strong reasoning, structured policy checks, 50% cheaper on input and 40% cheaper on output than Sonnet at the catalog rates). Cross-model is an optional escalation, not the default (see below).
  5. Cross-model escalation (optional): claude-sonnet-4-6 only when compliance QC is ambiguous (low confidence), the vertical risk flag is health / education, or a human requests a second opinion. This reduces cost while keeping an independent check for edge cases.
  6. Opus / Pro escalation only: claude-opus-4-8 or gemini-3.1-pro-preview with extended thinking is reserved for human-triggered or router-scored "hard" tasks.
  7. No autonomous loops on Lite-only models: gemini-2.5-flash-lite is allowed only for deterministic structured I/O (JSON transform, field mapping) with schema validation, not for multi-step tool chains.

Model tiers (summary)

| Tier | Models | Use when |
|---|---|---|
| T0: Fast / volume | gemini-3.1-flash-lite | Routing, classification, batch report digests only; not gated QC or tool loops |
| T1: Agentic workhorse | gemini-3.5-flash (default), gemini-2.5-flash (fallback) | All tool loops, sub-agents, gated QC checkers, optimization, execution |
| T2: Planning / orchestration | gemini-3.1-pro-preview, gemini-3.5-flash (high), gemini-2.5-pro (fallback) | Media plans, orchestrator escalations; Pro when 3.5 high still fails QC repeatedly |
| T3: Cross-model escalation | claude-sonnet-4-6, claude-haiku-4-5 | Ambiguous compliance, high-risk verticals, human-requested second opinion |
| T4: Rare escalation | claude-opus-4-8 | Human-requested deep reasoning only |
| T-batch: Async digest | gemini-3.1-flash-lite Batch, llama-3.3-70b Batch | Scheduled report summarization, log digests |

Gemini 3.5 Flash is more capable than 3.1 Flash-Lite and 3 Flash Preview for agentic work: GA, stronger tool use, thought preservation across turns, and near-Pro quality at Flash-tier cost. Google positions it explicitly for sub-agent deployment and rapid agentic loops.

Does using more 3.5 with different thinking_level per task make sense? Yes — for any task that calls tools, blocks mutations, or runs a QC correction loop. Use one endpoint, many depths instead of jumping to Pro or mixing preview models. Reserve Flash-Lite for high-volume non-agentic routing and batch digests only.

| Task | Model | thinking_level | Why |
|---|---|---|---|
| Router / classify | 3.1 Flash-Lite | minimal | Pure routing, no tools; millions of calls |
| Onboarding sub-steps | 3.5 Flash | medium | Multi-step setup, platform APIs |
| Execution / optimization sub-agents | 3.5 Flash | medium | Tool loops; default balances quality + latency |
| Complex campaign build (multi-platform) | 3.5 Flash | high | More constraints; fewer correction loops |
| QC: plan validator | 3.5 Flash | high | Must catch budget/channel errors before human approval |
| QC: compliance | 3.5 Flash | high | Policy + claims reasoning |
| QC: tracking health | 3.5 Flash | low | Mostly checklist; fast pass/fail |
| QC: spend guardrails | 3.5 Flash | medium | Numeric compare vs plan; code validates math, LLM interprets intent |
| Report anomaly triage | 3.5 Flash | medium | Needs reasoning; not mutation-blocking |
| Daily / weekly digest | 3.1 Flash-Lite Batch | low | Read-only narrative; cost-sensitive |
| Media plan draft | 3.1 Pro Preview | (Pro thinking) | Long-horizon strategy; 3.5 high is fallback if Pro unavailable |
| Orchestrator escalation | 3.1 Pro Preview | (Pro thinking) | Conflict resolution across agents |

Quality vs cost tradeoff: Putting gated QC on 3.5 Flash (vs a cheaper Flash-Lite QC mix) raises LLM spend by roughly $4–6 / client / month on Standard, but buys fewer correction loops (8% buffer vs ~12%) and a higher first-pass QC rate, which is usually net-positive once human-review time is counted. Do not run the router or batch digests on 3.5; that would multiply the cost of those rows 3–6× for no quality gain. Expected Standard total is $25.60 / client / month (see cost catalog and cost band).

Thinking levels (gemini-3.1-flash-lite only)

For T0 tasks on Flash-Lite per Google's guidance:

| Level | Agent tasks |
|---|---|
| minimal | Router, intent classification |
| low | Batch report KPI bullets, simple field extraction |

Model catalog (Vertex AI, June 2026)

Source: Vertex AI Generative AI pricing. Global endpoint, standard (non-batch) rates per 1M tokens, prompts ≤200K. Cached input shown where available. Verify before implementation; preview models may change.

Primary — Gemini (Google)

| Model | Input / 1M | Output / 1M | Cached input / 1M | Designed for | Kobi use |
|---|---|---|---|---|---|
| gemini-3.1-flash-lite (GA) | $0.25 | $1.50 | $0.025 | High-volume agentic workflows, tool calling, routing (docs) | Router, classifier, batch report digests |
| gemini-3.5-flash (GA) | $1.50 | $9.00 | $0.15 | Frontier-level Flash; agentic execution, coding, long-horizon tool use (docs) | Compliance QC, Execution, Optimization (preferred GA workhorse) |
| gemini-3-flash-preview | $0.50 | $3.00 | $0.05 | Agentic workhorse, multimodal, computer use (docs) | Fallback if 3.5 unavailable; Onboarding |
| gemini-3.1-pro-preview | $2.00 | $12.00 | $0.20 | Deep agentic reasoning, precise tool use, 1M context (docs) | Orchestrator, Media Plan |
| gemini-2.5-flash (GA) | $0.30 | $2.50 | $0.03 | Stable agentic GA; improved tool use (Google blog) | Production fallback for T1 |
| gemini-2.5-pro (GA) | $1.25 | $10.00 | $0.13 | Complex reasoning, coding | Production fallback for T2 |
| gemini-2.5-flash-lite (GA) | $0.10 | $0.40 | $0.01 | Ultra-cheap classification/summarization | Structured transform only, not autonomous tool loops |

Secondary — Anthropic on Vertex (QC & cross-check)

| Model | Input / 1M | Output / 1M | Designed for | Kobi use |
|---|---|---|---|---|
| claude-haiku-4-5 | $1.00 | $5.00 | Fast, cost-effective classification (Anthropic pricing) | Lightweight policy JSON scans |
| claude-sonnet-4-6 | $3.00 | $15.00 | Agents, coding, enterprise workflows at scale | Compliance escalation: ambiguous QC, health/education high-risk, human-requested |
| claude-opus-4-8 | $5.00 | $25.00 | Hardest agentic + coding tasks | Human-triggered escalation only |

Optional — Meta Llama on Vertex (batch / cost experiments)

| Model | Input / 1M | Output / 1M | Designed for | Kobi use |
|---|---|---|---|---|
| llama-3.3-70b | $0.72 | $0.72 | Efficient text tasks | Batch report narrative (optional) |
| llama-4-maverick | $0.35 | $1.15 | Multimodal reasoning, tool calling | Creative brief / asset review (optional) |

Models we do not default to

| Model | Reason |
|---|---|
| Grok, Nemotron, Mistral OCR, TTS, embedding-only | Wrong task fit or no advantage over Gemini/Claude for media-ops agents |
| gemini-2.5-flash-lite for tool loops | Weak agentic benchmarks (SWE-bench ~32%); acceptable only with strict JSON schema + no tools |
| Opus / 3.1 Pro for sub-agents | 10–50× token cost vs Flash-Lite with no QC benefit on simple tasks |

Per-task LLM cost catalog

Vertex global standard pricing. Cost formula:

cost = (input_tokens / 1M × input_rate) + (output_tokens / 1M × output_rate)

Token assumptions are planning defaults for a single successful run of that step. 3.5 Flash output column includes thinking tokens billed at output rate. QC checker runs are separate line items — gated main tasks always incur main + QC (see pairing table). Correction loops add ~6–8% on gated tasks with 3.5 QC (see loop limits). Context caching on system prompts typically reduces input cost 30–50% in steady state — figures below are without cache (conservative).

| Task ID | Task | Agent / layer | Model | Thinking | In tok | Out tok† | Cost / run |
|---|---|---|---|---|---|---|---|
| route.dispatch | Route Pub/Sub event to agent + playbook | Router | 3.1 Flash-Lite | minimal | 3K | 0.5K | $0.0015 |
| route.classify | Classify complexity + model tier | Router | 3.1 Flash-Lite | low | 5K | 0.8K | $0.0025 |
| onboard.step | One onboarding sub-step (account, tag, verify) | Onboarding sub | 3.5 Flash | medium | 12K | 4.5K | $0.059 |
| plan.draft | Master plan or full replan (envelope + tracks) | Media Plan | 3.1 Pro Preview | | 30K | 10K | $0.18 |
| plan.track.draft | New track (branding, engagement, always-on split) | Media Plan | 3.1 Pro Preview | | 20K | 6K | $0.11 |
| plan.event.draft | Special-day / event flight (time-boxed) | Media Plan | 3.5 Flash | medium | 14K | 4.5K | $0.062 |
| plan.revise | Per-track revise vN+1 from opt log + manifest | Plan Revise | 3.1 Pro Preview | | 22K | 7K | $0.13 |
| report.plan_drift | Per-track drift + changelog (may batch tracks) | Reporting | 3.5 Flash | low | 25K | 4K | $0.074 |
| qc.plan | Plan validator vs brief | QC | 3.5 Flash | high | 18K | 5.5K | $0.077 |
| qc.compliance | Creative / copy policy gate | QC | 3.5 Flash | high | 12K | 4K | $0.054 |
| qc.compliance.escalate | Cross-model second opinion (rare) | QC | Sonnet 4.6 | | 12K | 2.5K | $0.074 |
| qc.tracking | Tracking health sweep | QC | 3.5 Flash | low | 10K | 2K | $0.033 |
| qc.spend | Spend guardrail vs approved plan | QC | 3.5 Flash | medium | 10K | 2.5K | $0.038 |
| exec.campaign.build | Build one platform campaign slice | Execution sub | 3.5 Flash | medium | 40K | 10K | $0.15 |
| exec.campaign.mutate | Pause, budget nudge, status change | Execution sub | 3.5 Flash | medium | 15K | 4K | $0.059 |
| opt.cycle | Analyze performance + propose change | Optimization | 3.5 Flash | medium | 22K | 7.5K | $0.101 |
| feed.validate | Feed / catalog batch validation | Onboarding / Execution sub | 3.5 Flash | low | 8K | 2K | $0.030 |
| report.daily | Daily KPI digest | Reporting | 3.1 Flash-Lite Batch | low | 35K | 4K | $0.007 |
| report.weekly | Weekly client narrative | Reporting | 3.1 Flash-Lite | low | 45K | 6K | $0.020 |
| report.anomaly | Anomaly triage + recommendation | Reporting | 3.5 Flash | medium | 28K | 9K | $0.123 |
| orch.escalate | Orchestrator re-plan / conflict resolution | Orchestrator | 3.1 Pro Preview | | 20K | 5K | $0.10 |

† For gemini-3.5-flash, Out tok = visible output + thinking tokens (billed at output rate per Vertex pricing).

Batch API for scheduled / non-urgent tasks

Vertex Batch API = 50% discount on eligible Gemini models, in exchange for async completion (up to ~24h). Use it only for scheduled, non-blocking work — never for real-time routing, gated QC, platform mutations, interactive planning, or anomaly alerts (<1h SLA).

| Task | Standard / run | Batch / run (−50%) | Eligible? |
|---|---|---|---|
| report.daily | $0.007 | $0.007 | ✓ already batched |
| report.weekly | $0.020 | $0.010 | ✓ scheduled |
| report.plan_drift | $0.074 | $0.037 | ✓ scheduled (per-track rollup) |
| feed.validate (scheduled catalog sweep) | $0.030 | $0.015 | ✓ when not pre-launch gating |
| report.monthly (exec summary) | ~$0.15 | ~$0.075 | ✓ scheduled |
| route.*, plan.*, qc.*, exec.*, opt.cycle, report.anomaly, orch.escalate | | | ✗ real-time / gated / interactive |

Reality check: batch trims the reporting + feed-sweep slice only (~2–4% of total). The real cost drivers — optimization cycles, execution mutations, QC gates — are real-time and stay at standard rates. Batch is worthwhile (free 50% on eligible tasks) but is not where the savings concentrate; context caching matters far more.
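
The eligibility rule can be expressed as a simple allowlist. This is a minimal sketch: `BATCH_ELIGIBLE`, `effective_cost`, and the task ID `feed.validate.scheduled` (used here to separate scheduled sweeps from pre-launch gating) are hypothetical names, and the function expects the un-batched price as input.

```python
BATCH_DISCOUNT = 0.5  # 50% off eligible Gemini models, per the table above

# Task IDs marked eligible above; anything else stays on the real-time path.
# "feed.validate.scheduled" is a hypothetical ID for the scheduled catalog sweep.
BATCH_ELIGIBLE = {
    "report.daily", "report.weekly", "report.plan_drift",
    "report.monthly", "feed.validate.scheduled",
}


def effective_cost(task_id: str, standard_cost: float) -> float:
    """Per-run cost after the batch discount, where the task is allowed to batch.

    Pass the un-batched price: the catalog's report.daily figure is already discounted.
    """
    if task_id in BATCH_ELIGIBLE:
        return standard_cost * (1 - BATCH_DISCOUNT)
    return standard_cost


print(effective_cost("report.plan_drift", 0.074))  # halved
print(effective_cost("opt.cycle", 0.101))          # unchanged, real-time
```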

Monthly task volume & LLM cost by client profile

Illustrative steady-state volumes (after onboarding month). Costs assume multiple concurrent plan tracks per client (always-on, branding, engagement, special-day events) — not a single monolithic plan. See plan tracks.

QC pairing: each plan.* row includes the paired qc.plan run (+$0.077) unless noted.

Profile A — Starter (Google + Meta, low activity)

~$15–25K/mo media spend, 2 active tracks (always-on + seasonal/event), optimization 2–3×/week.

| Task ID | Runs / month | Unit cost (incl. QC where gated) | Monthly |
|---|---|---|---|
| route.dispatch | 320 | $0.0015 | $0.48 |
| route.classify | 110 | $0.0025 | $0.28 |
| Planning (multi-track) | | | |
| plan.draft master refresh | 0.25 | $0.257 | $0.06 |
| plan.track.draft + qc.plan | 0.5 | $0.187 | $0.09 |
| plan.event.draft + qc.plan | 1 | $0.139 | $0.14 |
| plan.revise + qc.plan | 2 | $0.207 | $0.41 |
| report.plan_drift (2 tracks) | 4 | $0.074 | $0.30 |
| qc.compliance | 5 | $0.054 | $0.27 |
| exec.campaign.build | 6 | $0.15 | $0.90 |
| exec.campaign.mutate | 4 | $0.059 | $0.24 |
| opt.cycle + qc.spend | 24 + 24 | $0.139 | $3.34 |
| qc.tracking | 30 | $0.033 | $0.99 |
| report.daily + report.weekly | 30 + 4 | $0.007 + $0.020 | $0.29 |
| report.anomaly | 2 | $0.123 | $0.25 |
| Correction-loop overhead (~8% of gated) | | | ~$0.52 |
| LLM subtotal (standard) | | | ~$8.55 |
  • With context caching (40% input savings): **$6.40–6.90 / client / month**
  • With caching + batch on scheduled reports (saves $0.19): **$6.25–6.70 / client / month**
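
The Profile A subtotal can be reproduced by multiplying runs by unit cost and adding the correction-loop overhead. The sketch below copies the figures from the table above; the dictionary keys and the helper `monthly_subtotal` are illustrative only.

```python
# (runs per month, unit cost in USD) copied from the Profile A table
PROFILE_A = {
    "route.dispatch": (320, 0.0015),
    "route.classify": (110, 0.0025),
    "plan.draft master refresh": (0.25, 0.257),
    "plan.track.draft + qc.plan": (0.5, 0.187),
    "plan.event.draft + qc.plan": (1, 0.139),
    "plan.revise + qc.plan": (2, 0.207),
    "report.plan_drift": (4, 0.074),
    "qc.compliance": (5, 0.054),
    "exec.campaign.build": (6, 0.15),
    "exec.campaign.mutate": (4, 0.059),
    "opt.cycle + qc.spend": (24, 0.139),
    "qc.tracking": (30, 0.033),
    "report.daily": (30, 0.007),
    "report.weekly": (4, 0.020),
    "report.anomaly": (2, 0.123),
}
CORRECTION_OVERHEAD = 0.52  # the table's ~8% of gated tasks, taken as given


def monthly_subtotal(rows: dict, overhead: float) -> float:
    """Sum runs x unit cost over all rows, then add the correction-loop overhead."""
    return sum(runs * unit for runs, unit in rows.values()) + overhead


subtotal = monthly_subtotal(PROFILE_A, CORRECTION_OVERHEAD)
print(f"${subtotal:.2f}")  # about $8.55, matching the LLM subtotal row
```

Swapping in the Profile B or Profile C rows gives the other subtotals the same way.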

Profile B — Standard (Google + Meta + GA4 + CRM, active optimization)

~$50–150K/mo media spend, 3–4 active tracks (always-on, branding, engagement, seasonal), daily optimization.

| Task ID | Runs / month | Unit cost (incl. QC where gated) | Monthly |
|---|---|---|---|
| route.dispatch | 700 | $0.0015 | $1.05 |
| route.classify | 240 | $0.0025 | $0.60 |
| Planning (multi-track) | | | |
| plan.draft master refresh | 0.25 | $0.257 | $0.06 |
| plan.track.draft + qc.plan | 1 | $0.187 | $0.19 |
| plan.event.draft + qc.plan | 2 | $0.139 | $0.28 |
| plan.revise + qc.plan | 4 | $0.207 | $0.83 |
| report.plan_drift (3–4 tracks) | 8 | $0.074 | $0.59 |
| qc.compliance | 14 | $0.054 | $0.76 |
| exec.campaign.build | 22 | $0.15 | $3.30 |
| exec.campaign.mutate | 14 | $0.059 | $0.83 |
| opt.cycle + qc.spend | 95 + 95 | $0.139 | $13.21 |
| qc.tracking | 30 | $0.033 | $0.99 |
| report.daily + report.weekly | 30 + 4 | $0.007 + $0.020 | $0.29 |
| report.anomaly | 6 | $0.123 | $0.74 |
| orch.escalate | 2 | $0.10 | $0.20 |
| qc.compliance.escalate | 1 | $0.074 | $0.07 |
| Correction-loop overhead (~8% of gated) | | | ~$1.64 |
| LLM subtotal (standard) | | | ~$25.60 |
  • With context caching (40% input savings): **$19–22 / client / month**
  • With caching + batch on scheduled reports (saves $0.34): **$18.70–21.70 / client / month**

Profile C — Ecommerce (feeds, catalog, high optimization)

~$150K+/mo media spend, 4–6 active tracks (+ flash sales, catalog pushes), hourly pacing.

| Task ID | Runs / month | Unit cost (incl. QC where gated) | Monthly |
|---|---|---|---|
| Profile B planning + ops base | | | $25.60 |
| Extra planning (events / tracks) | | | |
| plan.event.draft + qc.plan | +2 | $0.139 | +$0.28 |
| plan.revise + qc.plan | +2 | $0.207 | +$0.41 |
| plan.track.draft + qc.plan | +0.5 | $0.187 | +$0.09 |
| report.plan_drift | +4 | $0.074 | +$0.30 |
| feed.validate | 30 | $0.030 | $0.90 |
| Extra opt.cycle + qc.spend | +45 + 45 | $0.139 | +$6.26 |
| Extra exec.campaign.build | +12 | $0.15 | +$1.80 |
| Extra qc.compliance | +6 | $0.054 | +$0.32 |
| Extra route.dispatch | +250 | $0.0015 | +$0.38 |
| Correction-loop overhead (extra gated) | | | ~$0.80 |
| LLM subtotal (standard) | | | ~$37.15 |
  • With context caching (40% input savings): **$27–31 / client / month**
  • With caching + batch on scheduled reports + feed sweeps (saves $0.93): **$26.50–30 / client / month**

Onboarding month (one-time, any profile)

| Task ID | Runs (typical) | Unit cost | One-time |
|---|---|---|---|
| onboard.step | 18–25 | $0.059 | $1.05–1.48 |
| route.dispatch + route.classify | 80 + 40 | $0.0015 + $0.0025 | $0.22 |
| qc.tracking (setup) | 10 | $0.033 | $0.33 |
| Onboarding LLM | | | ~$1.60–2.00 |

Portfolio-level estimate

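| Portfolio size | Profile mix | Standard (no cache) | + caching | + caching & batch |
|---|---|---|---|---|
| 10 clients | 6 Starter + 4 Standard | ~$154 | ~$120 | ~$118 |
| 50 clients | 25 Starter + 20 Standard + 5 Ecommerce | ~$912 | ~$712 | ~$707 |
| 200 clients | 100 Starter + 80 Standard + 20 Ecommerce | ~$3,650 | ~$2,845 | ~$2,830 |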

Assumes multi-track planning volumes above; actuals scale with count of active event/sale plans per client. Batch column reflects scheduled reporting + feed sweeps only — the marginal step beyond caching is small because cost concentrates in real-time optimization, execution, and QC.
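
The "Standard (no cache)" column follows directly from the per-profile subtotals. A minimal sketch, with `SUBTOTAL` and `portfolio_cost` as illustrative names:

```python
# Per-client monthly LLM subtotals (standard, no cache) from the profile tables
SUBTOTAL = {"starter": 8.55, "standard": 25.60, "ecommerce": 37.15}


def portfolio_cost(mix: dict) -> float:
    """Monthly LLM cost in USD for a mix of {profile: client_count}."""
    return sum(SUBTOTAL[profile] * count for profile, count in mix.items())


small = portfolio_cost({"starter": 6, "standard": 4})
medium = portfolio_cost({"starter": 25, "standard": 20, "ecommerce": 5})
large = portfolio_cost({"starter": 100, "standard": 80, "ecommerce": 20})
print(f"{small:.1f} {medium:.1f} {large:.1f}")  # roughly 153.7, 911.5, 3646.0
```

These line up with the table's rounded ~$154, ~$912, and ~$3,650.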

Note: These are Vertex LLM inference only — not media spend, Kobi management fees, Cloud Run, BigQuery, or platform API costs. Log actual run_id token usage to BigQuery per tenant_id for invoice-grade allocation if pass-through is contracted. See Billing & invoicing for client-facing monthly invoices (media spend + fees).

Realistic cost band (treat point estimates as the expected case)

The per-task tokens above are single-shot, expected-case assumptions. Two factors (thinking tokens and tool-loop context re-send) can push real cost up materially, and transient retries add a smaller increment; size budgets with a band, not a point:

| Factor | Effect | Multiplier on affected tasks |
|---|---|---|
| Thinking tokens at high | 3.5 Flash high can emit far more reasoning than the 4–5.5K assumed for qc.plan / qc.compliance / multi-constraint builds | output tokens ×1.5–2.5 |
| Tool-loop context re-send | Each tool round re-sends growing context; catalog prices exec.* / opt.cycle as a single shot | input tokens ×2–4 over 3–5 rounds |
| Transient retries | Orchestrator retry (1×) + occasional platform re-fetch | +5–10% on tool tasks |

| Profile | Low (caching + batch) | Expected (standard) | High (thinking + tool-loops) |
|---|---|---|---|
| Starter | ~$6.40 | ~$8.55 | ~$15–18 |
| Standard | ~$19 | ~$25.60 | ~$48–55 |
| Ecommerce | ~$27 | ~$37.15 | ~$72–82 |

Even the high case stays a tiny fraction of media spend and Kobi fees — but design for the high band, then measure real run_id usage in a pilot to replace these assumptions.
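
To see how the multipliers act on a single task, the sketch below applies them to exec.campaign.build (40K in, 10K out). The 3.5 Flash rates used here ($1.50 / $9.00 per 1M) are an assumption: they are back-solved from the catalog's $0.15 figure rather than quoted from the model catalog, and `banded_cost` is an illustrative helper.

```python
def banded_cost(in_tok, out_tok, in_rate, out_rate,
                in_mult=1.0, out_mult=1.0, retry_overhead=0.0):
    """Per-run USD cost with band multipliers applied; rates are USD per 1M tokens."""
    base = (in_tok * in_mult / 1e6) * in_rate + (out_tok * out_mult / 1e6) * out_rate
    return base * (1 + retry_overhead)


# ASSUMED rates, back-solved from the catalog row; confirm against the model catalog.
IN_RATE, OUT_RATE = 1.50, 9.00

expected = banded_cost(40_000, 10_000, IN_RATE, OUT_RATE)
stress_min = banded_cost(40_000, 10_000, IN_RATE, OUT_RATE,
                         in_mult=2, out_mult=1.5, retry_overhead=0.05)
stress_max = banded_cost(40_000, 10_000, IN_RATE, OUT_RATE,
                         in_mult=4, out_mult=2.5, retry_overhead=0.10)
print(f"{expected:.3f} {stress_min:.3f} {stress_max:.3f}")
```

This stacks all three factors on one task, so the per-task stress range sits above the blended profile-level High column, which mixes affected and unaffected tasks.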

Actual cost is dominated by optimization frequency and multi-turn tool loops — the playbook below targets both.

Token-efficiency playbook

Goal: maximum quality per token — structured outputs, cached context, minimal re-prompting.

1. Prompt architecture

| Rule | Implementation |
|---|---|
| Split system vs task | Stable system prompt + tool schemas → Vertex context cache (refresh only on playbook version bump) |
| Structured outputs only | Agents return JSON matching versioned schemas (plan_vN, mutation_manifest, qc_result); no prose in tool paths |
| Reference by ID | Pass tenant_id, plan_version, entity_id; never re-embed the full plan in every sub-agent call; sub-agents fetch their slice via tool |
| One job per sub-agent | e.g. meta.campaign.create, not meta.everything: smaller context, fewer hallucinated side effects |

2. Context budget

| Context type | Max tokens (target) | Cache? |
|---|---|---|
| System + tool schemas | ≤ 8K | Yes |
| Tenant playbook snippet | ≤ 2K | Yes |
| Task payload (this step only) | ≤ 4K | No |
| Platform API response (trimmed) | ≤ 6K | No |
| Total per sub-agent call | ≤ 20K | |

Trim platform API responses to fields the schema requires. Store full responses in GCS/BQ; pass URI to reporting agents only.
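
A budget check and a response trimmer along these lines could run before each sub-agent call. This is a sketch under stated assumptions: token counts are measured upstream, and the key names, `over_budget`, and `trim_response` are hypothetical.

```python
# Max tokens per context type, from the table above
CONTEXT_BUDGET = {
    "system_and_tools": 8_000,
    "tenant_playbook": 2_000,
    "task_payload": 4_000,
    "platform_response": 6_000,
}
TOTAL_BUDGET = 20_000


def over_budget(token_counts: dict) -> list:
    """Return the context types (plus 'total') that exceed their target."""
    violations = [name for name, limit in CONTEXT_BUDGET.items()
                  if token_counts.get(name, 0) > limit]
    if sum(token_counts.values()) > TOTAL_BUDGET:
        violations.append("total")
    return violations


def trim_response(response: dict, required_fields: set) -> dict:
    """Keep only the fields the output schema requires; the full response is stored elsewhere."""
    return {key: value for key, value in response.items() if key in required_fields}


print(over_budget({"system_and_tools": 7_000, "tenant_playbook": 1_500,
                   "task_payload": 5_000, "platform_response": 3_000}))
print(trim_response({"id": "c1", "status": "PAUSED", "raw_html": "..."}, {"id", "status"}))
```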

3. Tool-call discipline

  • Plan tools explicitly: max 3 tools per sub-agent invocation unless the router scores the task as complex.
  • Idempotent tools — safe to retry; orchestrator dedupes by run_id.
  • No LLM in hot path for math — budget splits, bid deltas, % thresholds computed in code; LLM proposes intent, code validates numbers.
  • Batch non-urgent work — route all scheduled, non-blocking tasks (daily/weekly/monthly digests, per-track drift reports, scheduled feed sweeps) through Vertex Batch API for a 50% discount on eligible Gemini models. Never batch real-time routing, gated QC, mutations, or anomaly alerts. See Batch API for scheduled tasks.
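
The "LLM proposes intent, code validates numbers" rule might look like the following check on a proposed channel split. It is only a sketch: the 60% per-channel cap and the function name `validate_budget_split` are hypothetical, and `Decimal` is used so currency sums compare exactly.

```python
from decimal import Decimal


def validate_budget_split(proposed: dict, total: str, max_share: str = "0.60") -> list:
    """Check an LLM-proposed split (channel -> amount string) against hard numeric rules."""
    amounts = {channel: Decimal(value) for channel, value in proposed.items()}
    budget = Decimal(total)
    errors = []
    if any(amount < 0 for amount in amounts.values()):
        errors.append("negative allocation")
    if sum(amounts.values()) != budget:
        errors.append("allocations do not sum to approved budget")
    for channel, amount in amounts.items():
        # max_share is a hypothetical guardrail, not a value from this spec
        if budget > 0 and amount / budget > Decimal(max_share):
            errors.append(f"{channel} exceeds max share")
    return errors


print(validate_budget_split({"google": "6000", "meta": "4000"}, "10000"))  # no errors
print(validate_budget_split({"google": "7000", "meta": "4000"}, "10000"))  # two errors
```

An empty list means the proposal can proceed to the QC gate; any error sends it back without spending another model call on arithmetic.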

4. Model promotion / demotion (automatic)

| Signal | Action |
|---|---|
| QC fail rate > 5% for task class (rolling 24h) | Promote task class one tier (e.g. Flash-Lite → 3 Flash) |
| QC success rate < 80% (task or subtask grain) | Alert + immediate promote + thinking bump (see QC success threshold alerts) |
| QC pass rate > 99% for 7d on task class | Trial demote one tier; revert if fails spike |
| p95 latency > SLA | Lower thinking level or switch to Flash-Lite |
| Human escalation on reasoning | Pin task class to T2+ for 30d |
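
The signal table can be read as a small decision function. The sketch below is one possible encoding: the precedence order (threshold breach first, then fail rate, latency, and trial demotion) is an assumption, the human-escalation pin is left out, and `TaskClassStats` and `next_action` are hypothetical names.

```python
from dataclasses import dataclass

TIERS = ["T0", "T1", "T2"]  # tier labels used in the model catalog


@dataclass
class TaskClassStats:
    tier: str
    qc_fail_rate_24h: float  # rolling 24h, 0..1
    qc_success_rate: float   # task or subtask grain, 0..1
    qc_pass_rate_7d: float   # 0..1
    p95_latency_s: float
    sla_s: float


def next_action(stats: TaskClassStats) -> str:
    """Map the signals to one action; precedence here is an assumption."""
    index = TIERS.index(stats.tier)
    if stats.qc_success_rate < 0.80:
        return "alert+promote+thinking_bump"
    if stats.qc_fail_rate_24h > 0.05:
        return f"promote:{TIERS[min(index + 1, len(TIERS) - 1)]}"
    if stats.p95_latency_s > stats.sla_s:
        return "lower_thinking"
    if stats.qc_pass_rate_7d > 0.99:
        return f"trial_demote:{TIERS[max(index - 1, 0)]}"
    return "hold"


print(next_action(TaskClassStats("T0", 0.08, 0.92, 0.92, 2.0, 5.0)))  # promote:T1
```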

5. Playbook registry (per tenant / vertical)

Versioned artifacts cached in Vertex context cache:

| Playbook | Contents | Changes when |
|---|---|---|
| playbook.routing | Task class → model tier + thinking level | Model catalog update |
| playbook.vertical.{health\|school\|…} | Compliance keywords, blocked claims, KPI defaults | Vertical config change |
| playbook.platform.{google\|meta\|…} | Tool allowlist, API field maps, rate-limit hints | Platform spec update |
| playbook.qc | QC agent checklist templates | Policy change |

Router selects playbook IDs; agents never inline full vertical rules in every prompt.
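
Selecting playbooks by ID could be as simple as composing names from the tenant's vertical and target platforms. In this sketch the `name@version` reference format and the function `select_playbooks` are hypothetical; only the playbook names come from the table above.

```python
def select_playbooks(vertical: str, platforms: list, versions: dict) -> list:
    """Return versioned playbook references for one task; the content itself stays cached."""
    names = ["playbook.routing", f"playbook.vertical.{vertical}", "playbook.qc"]
    names += [f"playbook.platform.{platform}" for platform in platforms]
    # "name@version" is an illustrative reference format, not a defined convention
    return [f"{name}@{versions[name]}" for name in names]


versions = {
    "playbook.routing": "v3",
    "playbook.vertical.health": "v7",
    "playbook.qc": "v2",
    "playbook.platform.google": "v5",
    "playbook.platform.meta": "v4",
}
refs = select_playbooks("health", ["google", "meta"], versions)
print(refs)
```

Agents then receive a handful of short references instead of the full vertical rules in every prompt.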

6. Output quality without extra tokens

  • Two-pass by design: every gated main task → paired QC checker (see §5). Max 2 correction loops, then human. No third QC pass unless compliance escalates to Sonnet.
  • Confidence gate: if router confidence < 0.85 on classification, escalate to medium thinking or T1 model — cheaper than a failed tool loop + retry.
  • Log token usage per run_id, agent, model, thinking_level → BigQuery for cost attribution per tenant.
  • Log every QC loop: agent.qc.result + agent.qc.loop with agent, model, task, playbook versions, and input/doc refs (see QC loop telemetry).
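
The two-pass rule and the confidence gate above can be sketched as follows. `run_gated` and `needs_upgrade` are hypothetical helpers, and the task and QC callables stand in for real agent invocations.

```python
from typing import Callable, Optional, Tuple

MAX_CORRECTION_LOOPS = 2      # then hand off to a human
ROUTER_CONFIDENCE_GATE = 0.85


def needs_upgrade(router_confidence: float) -> bool:
    """True when the router should escalate to medium thinking or a T1 model."""
    return router_confidence < ROUTER_CONFIDENCE_GATE


def run_gated(task: Callable[[Optional[str]], dict],
              qc: Callable[[dict], Tuple[bool, str]]) -> Tuple[str, dict, int]:
    """Run a main task with its paired QC checker; allow at most two corrections."""
    feedback = None
    output = {}
    for loop in range(MAX_CORRECTION_LOOPS + 1):  # first attempt + 2 corrections
        output = task(feedback)
        passed, feedback = qc(output)
        if passed:
            return "approved", output, loop
    return "human_review", output, MAX_CORRECTION_LOOPS


# Stub task that only passes QC on its second attempt
attempts = []

def draft(feedback):
    attempts.append(feedback)
    return {"attempt": len(attempts)}

status, result, loops_used = run_gated(draft, lambda out: (out["attempt"] >= 2, "fix budget"))
print(status, loops_used)  # approved 1
```

Logging each iteration with its run_id would slot into the loop body next to the QC call.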

Failure handling

| Failure | Behavior |
|---|---|
| Platform API rate limit | Exponential backoff; partial apply with rollback marker |
| Agent timeout | Orchestrator retries once; then red flag (A8) if still failing |
| Loop limit exceeded | Red flag (A8) + agent.loop.exhausted; no further auto-retry |
| Cost Guard trip (≥3× estimate) | Terminate run_id; A9 + cost.guard.tripped; all Vertex calls blocked |
| Guardrail violation | Block action; create approval ticket |
| Tracking down | Pause optimization spend increases; alert ops |
| Repeated A8 on same task class (≥3 in 24h) | Escalate to engineering + optional tenant pause |
| Repeated A9 on same task class (≥3 in 24h) | Review pricing table / estimate resolver; tighten trip_multiplier or task catalog |
| QC success < 80% (task/subtask, min sample) | agent.qc.threshold.breached → ops alert + auto-optimize; engineering ticket |
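
Two of these behaviors reduce to small pure functions. The sketch below shows a Cost Guard check at the 3× trip multiplier and an exponential backoff schedule; the base delay, growth factor, cap, and attempt count are hypothetical defaults, and jitter is omitted for brevity.

```python
TRIP_MULTIPLIER = 3.0  # Cost Guard trips at >= 3x the estimate


def cost_guard_tripped(actual_usd: float, estimate_usd: float,
                       trip_multiplier: float = TRIP_MULTIPLIER) -> bool:
    """True when the run should be terminated and further Vertex calls blocked."""
    return actual_usd >= trip_multiplier * estimate_usd


def backoff_delays(base_s: float = 1.0, factor: float = 2.0,
                   max_s: float = 60.0, attempts: int = 5) -> list:
    """Delay schedule in seconds for platform rate limits (defaults are illustrative)."""
    return [min(base_s * factor ** attempt, max_s) for attempt in range(attempts)]


print(cost_guard_tripped(0.50, 0.15))  # True: well past 3x the $0.15 estimate
print(backoff_delays())                # [1.0, 2.0, 4.0, 8.0, 16.0]
```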