Agentic Orchestration Model

Overview

Digital media operations decompose into specialized agents supervised by a central orchestrator. Agents call domain services and platform connectors; they do not hold long-lived credentials directly — services inject scoped tokens per request.

Deployment: agents run on Vertex AI Agent Engine; models are served from Vertex AI Model Garden (see GCP deployment topology — AI & agent runtime). Cloud Run hosts the orchestrator control plane, connectors, and HITL API.

Agent hierarchy

Five layers (L0–L4), each with a defined model tier (see Model catalog):

Layer	Agents	Responsibility	Default model tier
L0 — Router	Model router, intent classifier, playbook selector	Pick model + thinking level per task; never runs tools directly	T0 — `gemini-3.1-flash-lite` (minimal thinking)
L1 — Orchestrator	Orchestrator	Lifecycle state machine, event routing, escalation, re-plan triggers	T2 — `gemini-3.1-pro-preview`
L2 — Domain	Onboarding, Media Plan, Execution, Optimization, Reporting	End-to-end domain workflows with tool calling	T1–T2 (see roster below)
L3 — Sub-agents	Per-platform builders, feed validators, bid calculators, report summarizers	Single-purpose tool loops; narrow context; each paired with a QC sub-agent	T0–T1
L4 — QC / checker sub-agents	One checker per high-stakes main task (see pairing table)	Independent Q/A verification; can block, request correction, or escalate	T0–T2 (Sonnet only on escalation)

Agent roster

Agent	Triggers	Outputs	Human gate	Suggested model
Orchestrator	Events, schedules, human approvals	Task routing, state machine transitions	Escalations only	`gemini-3.1-pro-preview`
Router / classifier	Every inbound agent task	Model tier, thinking level, playbook ID	None	`gemini-3.1-flash-lite` (minimal)
Onboarding	New client intake	Account map, tracking checklist, verification status	BM access grant, agency billing profile	`gemini-3.5-flash` (medium)
Media Plan	Brief, budget, vertical rules	Master plan, track drafts (always-on, branding, engagement), event drafts (special days)	Plan approval per track	`gemini-3.1-pro-preview` / `gemini-3.5-flash` (events)
Plan Revise	Drift threshold, reporting signal, client request	Per-track `vN+1` revise/replan + diff vs manifest slice	Same as plan approval (per `track_id`)	`gemini-3.1-pro-preview`
Execution	Approved plan version	Campaign structures on platforms	Launch confirmation if policy requires	`gemini-3.5-flash` (medium; high for multi-platform)
Optimization	Performance deltas, rules	Bid/budget/audience/creative changes	Changes above threshold	`gemini-3.5-flash` (medium)
Reporting	Schedule, ad-hoc request	Reports, anomalies, opt changelog, plan drift, revise recommendations	None (read-only)	`gemini-3.1-flash-lite` (low) + Batch for digests
QC — Plan validator	Plan draft ready	Pass/fail + structured diffs vs brief	Blocks approval on fail	`gemini-3.5-flash` (high)
QC — Compliance	Creative / copy before launch	Policy flags (health, education, claims)	Blocks launch on fail	`gemini-3.5-flash` (high)
QC — Tracking health	Pre-launch, daily sweep	Green/amber/red checklist	Blocks spend increases on red	`gemini-3.5-flash` (low)
QC — Spend guardrails	Optimization proposals	Approve / escalate vs approved plan caps	Escalates above threshold	`gemini-3.5-flash` (medium)

State machine (client lifecycle)

Orchestration patterns

1. Plan–execute separation

Agents never execute spend against an unapproved plan version.
Plan document is immutable once approved; changes create plan_vN+1.
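
The two rules above can be sketched as a small in-memory store. This is an illustration only: `PlanStore`, `PlanVersion`, and `assert_executable` are hypothetical names, not service APIs defined in this document, and a real implementation would persist versions and record who approved them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PlanVersion:
    """One immutable plan version; frozen so its content cannot be edited in place."""
    track_id: str
    version: int
    body: str
    approved: bool = False


@dataclass
class PlanStore:
    """Hypothetical in-memory store keyed by (track_id, version)."""
    plans: dict = field(default_factory=dict)

    def draft(self, track_id: str, body: str) -> PlanVersion:
        # A change never edits an existing version; it creates vN+1.
        latest = max((v for (t, v) in self.plans if t == track_id), default=0)
        plan = PlanVersion(track_id, latest + 1, body)
        self.plans[(track_id, plan.version)] = plan
        return plan

    def approve(self, track_id: str, version: int) -> PlanVersion:
        old = self.plans[(track_id, version)]
        approved = PlanVersion(old.track_id, old.version, old.body, approved=True)
        self.plans[(track_id, version)] = approved
        return approved

    def assert_executable(self, track_id: str, version: int) -> None:
        # Execution must refuse any plan version that is missing or unapproved.
        plan = self.plans.get((track_id, version))
        if plan is None or not plan.approved:
            raise PermissionError(f"plan {track_id} v{version} is not approved")


store = PlanStore()
v1 = store.draft("branding_q2", "initial budget split")
store.approve("branding_q2", v1.version)
v2 = store.draft("branding_q2", "revised budget split")  # becomes v2, unapproved
store.assert_executable("branding_q2", 1)  # approved, so this returns silently
```

Calling `store.assert_executable("branding_q2", 2)` at this point raises `PermissionError`, because v2 has not been approved yet.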

2. Guardrails (hard limits)

Examples (exact values configured per tenant):

Max daily budget delta without approval: e.g. 20%
Blocked actions: delete conversion actions, change billing account
Required checks before launch: tracking health = green, feed errors = 0 critical
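
A minimal sketch of how such limits might be checked in code rather than by prompt. The constants mirror the examples above, but the action names, the `Proposal` shape, and the three return values are assumptions made for illustration.

```python
from dataclasses import dataclass

# Illustrative values only; the text says exact values are configured per tenant.
MAX_DAILY_BUDGET_DELTA = 0.20
BLOCKED_ACTIONS = {"delete_conversion_action", "change_billing_account"}


@dataclass
class Proposal:
    action: str
    current_daily_budget: float
    proposed_daily_budget: float


def check_guardrails(p: Proposal) -> str:
    """Return 'blocked', 'needs_approval', or 'allowed' for a proposed change."""
    if p.action in BLOCKED_ACTIONS:
        return "blocked"
    if p.current_daily_budget <= 0:
        # No baseline to compare against, so a human has to decide.
        return "needs_approval"
    delta = abs(p.proposed_daily_budget - p.current_daily_budget) / p.current_daily_budget
    if delta > MAX_DAILY_BUDGET_DELTA:
        return "needs_approval"
    return "allowed"


def launch_ready(tracking_health: str, critical_feed_errors: int) -> bool:
    """Pre-launch checks: tracking must be green and there must be no critical feed errors."""
    return tracking_health == "green" and critical_feed_errors == 0
```

For example, a budget change from 100 to 125 per day is a 25% delta and returns `needs_approval`, while 100 to 110 returns `allowed`.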

3. Tool access

Each agent has an allowlist of tools (service APIs):

Onboarding Agent → onboarding.createAccount, verification.request, bm.linkAsset
Plan Agent → planning.draftMaster, planning.draftTrack, planning.draftEvent, planning.validate
Plan Revise Agent → planning.revise, planning.diffManifest, planning.recommendRevise (per `track_id`)
Execution Agent → execution.applyPlan, execution.pauseCampaign, execution.syncManifest
Optimization Agent → optimization.propose, optimization.apply, optimization.computeDrift
Reporting Agent → reporting.changelog, reporting.planDrift, reporting.recommendRevise
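
The allowlist above could be enforced by a check in front of every tool call. The sketch below copies the tool names from the list; the agent keys and the `authorize_tool` helper are hypothetical.

```python
# Allowlists copied from the list above; the agent keys are hypothetical identifiers.
TOOL_ALLOWLIST = {
    "onboarding": {"onboarding.createAccount", "verification.request", "bm.linkAsset"},
    "plan": {"planning.draftMaster", "planning.draftTrack", "planning.draftEvent",
             "planning.validate"},
    "plan_revise": {"planning.revise", "planning.diffManifest", "planning.recommendRevise"},
    "execution": {"execution.applyPlan", "execution.pauseCampaign", "execution.syncManifest"},
    "optimization": {"optimization.propose", "optimization.apply",
                     "optimization.computeDrift"},
    "reporting": {"reporting.changelog", "reporting.planDrift", "reporting.recommendRevise"},
}


def authorize_tool(agent: str, tool: str) -> None:
    """Raise PermissionError unless the tool is on the agent's allowlist."""
    allowed = TOOL_ALLOWLIST.get(agent, set())  # unknown agents get no tools
    if tool not in allowed:
        raise PermissionError(f"{agent} may not call {tool}")


authorize_tool("execution", "execution.applyPlan")  # on the list, returns None
```

Denying by default (an unknown agent maps to an empty set) keeps a misnamed or newly added agent from gaining tool access by accident.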

4. Observability

Every agent run produces:

run_id, agent, tenant_id, track_id, plan_version
Input context hash (no PII in logs)
Tool calls and outcomes
Human escalation reason if blocked

QC correction loops are logged on every pass/fail and correction attempt — not only when A8 fires. See QC loop telemetry.
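
One way to build the per-run record is sketched below, assuming the input context is a JSON-serializable dict. The `run_id` format and the `build_run_record` helper are hypothetical; the field names follow the list above.

```python
import hashlib
import json
import uuid


def context_hash(context: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_run_record(agent, tenant_id, track_id, plan_version, context, tool_calls,
                     escalation_reason=None):
    """Assemble the per-run log record; the raw context is hashed, never stored."""
    return {
        "run_id": f"run_{uuid.uuid4().hex[:12]}",  # hypothetical ID format
        "agent": agent,
        "tenant_id": tenant_id,
        "track_id": track_id,
        "plan_version": plan_version,
        "input_context_hash": context_hash(context),
        "tool_calls": tool_calls,  # list of {"tool": ..., "outcome": ...}
        "escalation_reason": escalation_reason,
    }


record = build_run_record(
    agent="optimization", tenant_id="t_12", track_id="branding_q2", plan_version="v14",
    context={"brief_id": "b_7", "contact_email": "someone@example.com"},
    tool_calls=[{"tool": "optimization.propose", "outcome": "ok"}],
)
```

The record carries only the digest of the context, so the email address in the example never reaches the log line.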

5. Per-task QC gates and correction loops

Every mutable or client-visible main task runs through a checker sub-agent (Q/A gate) before the orchestrator marks it complete or applies platform mutations. Read-only tasks (e.g. daily KPI digest) skip QC unless an anomaly is detected.

Main task → QC checker pairing

Main task (L2 / L3)	QC checker (L4)	Blocks on fail?
`plan.draft` / `plan.track.draft` / `plan.event.draft`	`qc.plan`	Yes — no human approval until pass
`plan.revise` (per track)	`qc.plan`	Yes — per `track_id`
`exec.campaign.build` (creative included)	`qc.compliance`	Yes — no launch
`exec.campaign.build` (structure only)	`qc.tracking` + `qc.plan` (slice vs approved plan)	Yes
`exec.campaign.mutate`	`qc.spend` + `qc.tracking`	Yes if over cap or tracking red
`opt.cycle` (per track)	`qc.spend` + drift check	Yes if over guardrail; may emit `plan.revise.recommended`
`onboard.step` (tracking-related)	`qc.tracking`	Yes before go-live
`feed.validate`	`qc.plan` (catalog vs brief)	Yes if critical errors
`report.anomaly`	`qc.plan` (vs approved KPIs)	No — recommends per-track revise / replan
`report.plan_drift`	—	No — read-only; per-track weekly rollup

Compliance QC may chain to qc.compliance.escalate (Sonnet) only on low-confidence or high-risk vertical — not on every run.
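
A minimal sketch of one gated task running through its checker, assuming the document's default of 2 correction attempts. The callables and the toy spend check are stand-ins; in practice the main agent, the checker, and the fix step would each be an LLM or service call.

```python
from typing import Callable


class QcLoopExhausted(Exception):
    """Raised when QC still fails after the allowed correction attempts."""


def run_gated(main: Callable[[], dict],
              qc: Callable[[dict], dict],
              fix: Callable[[dict, list], dict],
              max_corrections: int = 2) -> tuple:
    """Run main -> QC -> fix -> QC; return (output, attempt_number) on pass."""
    output = main()
    for attempt_number in range(max_corrections + 1):  # 0 = first QC
        result = qc(output)
        if result["outcome"] == "pass":
            return output, attempt_number
        if attempt_number < max_corrections:
            output = fix(output, result["failure_codes"])
    raise QcLoopExhausted(f"still failing after {max_corrections} corrections")


# Toy stand-ins: an optimization proposal checked against a spend cap.
CAP = 120.0


def propose():
    return {"daily_budget": 150.0}


def check_spend(out):
    ok = out["daily_budget"] <= CAP
    return {"outcome": "pass" if ok else "fail",
            "failure_codes": [] if ok else ["budget_over_cap"]}


def clamp(out, failure_codes):
    return {"daily_budget": min(out["daily_budget"], CAP)}


final, attempts = run_gated(propose, check_spend, clamp)
```

Here the first QC fails, the fix clamps the budget, and the re-check passes with `attempts == 1`. A fix that never resolves the failure ends in `QcLoopExhausted` instead of looping.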

Loop limits (hard caps — no infinite loops)

Policy: agents never loop indefinitely. Every loop type has a hard ceiling enforced by the orchestrator in code — not by prompt instruction alone. When a ceiling is hit, the run stops immediately, state is frozen, and a red flag is raised for Kobi ops (see below). No silent retries, no "try one more time" without a new run_id and human acknowledgment.

Loop type	Max iterations	On exhaustion
QC correction loop (main → QC → fix → QC)	2 correction attempts per `run_id`	Red flag → HITL ticket (A8); block downstream mutations
Tool-call loop (single sub-agent, multi-step tools)	5 tool rounds per invocation	Abort; rollback marker; red flag → HITL (A8)
Orchestrator retry (transient API / timeout)	1 full re-run per `run_id`	Red flag → HITL (A8)
Cross-agent re-dispatch (same task, new agent)	0 without human — must open new `run_id` after A8 resolution	Prevents disguised infinite loops
Model tier promotion (QC fail rate spike)	Policy-driven — not per-request loop	Promote tier for task class 24h
Global ceiling per `run_id`	≤ 8 total LLM steps (main + QC + corrections + tools combined)	Red flag even if individual sub-limits not hit

Configurable per tenant in loop_policy (implementation phase); defaults above are maximums — tenants may be stricter, never looser without admin override.
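
The ceilings in the table could be enforced by a small counter object owned by the orchestrator. The sketch below uses the default values from the table; the class and method names are hypothetical, and raising an exception stands in for the stop-and-red-flag behavior.

```python
from dataclasses import dataclass, field

# Defaults copied from the loop-limit table; tenants may only tighten them.
LIMITS = {
    "qc_correction": 2,
    "tool_round": 5,
    "orchestrator_retry": 1,
    "cross_agent_redispatch": 0,
}
GLOBAL_LLM_STEP_CEILING = 8
PER_INVOCATION = {"tool_round"}  # reset for each sub-agent invocation


class LoopExhausted(Exception):
    def __init__(self, loop_type: str, attempt_count: int):
        super().__init__(f"{loop_type} exhausted after {attempt_count} attempts")
        self.loop_type = loop_type
        self.attempt_count = attempt_count


@dataclass
class LoopBudget:
    run_id: str
    counts: dict = field(default_factory=dict)
    llm_steps: int = 0

    def start_invocation(self) -> None:
        # Tool rounds are capped per invocation, not per run.
        for loop_type in PER_INVOCATION:
            self.counts.pop(loop_type, None)

    def record_llm_step(self) -> None:
        if self.llm_steps + 1 > GLOBAL_LLM_STEP_CEILING:
            raise LoopExhausted("global_ceiling", self.llm_steps)
        self.llm_steps += 1

    def record_loop(self, loop_type: str) -> None:
        used = self.counts.get(loop_type, 0)
        if used + 1 > LIMITS[loop_type]:
            raise LoopExhausted(loop_type, used)
        self.counts[loop_type] = used + 1
```

The orchestrator would call `record_loop` before each correction, retry, or tool round and `record_llm_step` before each model call, so the limit is checked in code before the work happens.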

Red flag on loop exhaustion

When any loop limit is exceeded or the global run_id ceiling is hit:

Stop — no further agent or tool calls for that run_id; partial platform mutations rolled back or marked needs_review
Emit agent.loop.exhausted with tenant_id, track_id, agent, loop_type, attempt_count, last_qc_failure, run_id
HITL ticket (A8) — appears in Human Touch inbox with red priority, SLA timer, structured summary, and recommended actions; full step trace in System Ops
Ops alert — notify on-call / ops channel (email, Slack, or PagerDuty — TBD); include tenant, task, and link to ticket
Client impact guard — block spend increases, launches, and optimization applies tied to that run_id until A8 is resolved or explicitly overridden (A6) by admin
No auto-restart — same logical task requires human to approve a new run_id or take manual action; prevents loop-until-lucky behavior

QC checker receives: main agent output JSON, approved plan_version, relevant playbook slice, and only the fields needed to verify — not the full conversation history.

QC loop telemetry

Deterministic orchestrator logging (not LLM-generated summaries) records every QC gate and correction loop so ops can answer: which agents fail most, on which models, in which tasks, with which inputs/docs?

Emit on every QC invocation (agent.qc.result):

Field	Example	Notes
`run_id`, `step_index`	`run_abc`, `3`	Tie to Cost Guard token rows
`tenant_id`, `track_id`, `platform`	`t_12`, `branding_q2`, `meta`	Where in the portfolio
`main_task_id`, `main_agent`	`opt.cycle`, `optimization.meta`	Failing worker
`qc_task_id`, `qc_agent`	`qc.spend`, `qc.spend.checker`	Which checker rejected
`attempt_number`	`0` = first QC; `1` = after 1st correction	Loop depth
`outcome`	`pass` \| `fail` \| `escalate`	`escalate` → `qc.compliance.escalate`
`failure_codes[]`	`budget_over_cap`, `tracking_pixel_missing`	From `qc_result` schema — structured, not prose
`model_main`, `model_qc`	`gemini-3.5-flash`, `gemini-3.5-flash`	Models on this step
`thinking_level_main`, `thinking_level_qc`	`medium`, `high`	Per thinking matrix
`plan_version`, `manifest_slice_id`	`v14`, `mslice_9f2`	Approved plan context
`playbook_versions`	`{routing: 3, vertical: 2, platform.meta: 5, qc: 4}`	Which rule packs were active
`context_refs[]`	`[{type: brief, id: b_7, v: 2}, {type: opt_log, id: cs_44}]`	Inputs/docs by reference — not full text in BQ
`input_slice_hash`, `main_output_hash`	SHA-256	Dedup / join without PII
`input_slice_uri`	`gs://…/runs/run_abc/step_3_input.json`	Full structured input in GCS; BQ holds URI only
`qc_feedback_uri`	`gs://…/runs/run_abc/step_3_qc.json`	Checker JSON (`failed_checks`, `suggested_fixes`)
`latency_ms`, `input_tokens`, `output_tokens`	from `usage_metadata`	Per-step cost attribution

Emit on each correction iteration (agent.qc.loop):

Field	Purpose
`correction_number`	`1` or `2` (max per loop policy)
`main_task_id`, `qc_task_id`	Same pairing as above
`delta_applied`	Structured diff summary from main agent fix (fields changed)
`prior_failure_codes[]`	What QC complained about before fix
`post_fix_outcome`	`pass` \| `fail` on re-check

On A8 exhaustion, agent.loop.exhausted includes last_qc_failure plus qc_loop_trace_id → full ordered list of agent.qc.result / agent.qc.loop rows for that run_id.

BigQuery tables (partitioned by event_date, clustered by tenant_id, main_task_id):

Table	Grain	Use
`agent_qc_results`	One row per QC invocation	Fail-rate by agent, model, task, platform
`agent_qc_loops`	One row per correction attempt	Which failure codes repeat after fix
`agent_qc_failures_rollup`	Daily materialized view	Top offenders, playbook version regressions

Example ops queries (illustrative):

QC fail rate by main_agent + model_main (rolling 7d)
Top failure_codes for qc.compliance on vertical.health
Runs where playbook_versions.qc changed and fail rate spiked same day
opt.cycle loops where context_refs include stale manifest_slice_id

Dashboard (System Ops) — QC health and statistics: leaderboard, model breakdown, drill-down to input_slice_uri / qc_feedback_uri. Human Touch shows breach alerts only, not the statistics console.

Feeds automatic model promotion — rolling 24h fail rate per main_task_id from agent_qc_results drives model promotion / demotion; no manual spreadsheet.

QC success threshold alerts (80% floor)

A deterministic alert job (not an agent) evaluates rollups from agent_qc_results / agent_qc_failures_rollup and fires when a task or subtask cannot maintain the 80% success floor. A rate below that floor means the pairing is under-performing: prompts, models, playbooks, or context must be optimized, not ignored.

Scope grains (evaluated independently):

Grain	Key	Example
Task	`main_task_id`	`opt.cycle` across all tenants
Subtask	`main_task_id` + `qc_task_id` + `main_agent`	`exec.campaign.build` + `qc.compliance` on `execution.meta`
Tenant slice (optional)	above + `tenant_id`	One client’s `plan.revise` failing

Metrics (both computed per grain, rolling window default 24h):

Metric	Formula	Why
First-pass rate	`count(attempt_number=0 AND outcome=pass)` / `count(distinct run_id)`	Primary signal — cheap runs stay cheap
Eventual success rate	`count(run ended pass within loop budget)` / `count(distinct run_id)`	Catches fixable vs broken pairs
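
Both metrics can be computed from `agent.qc.result` rows for one grain and window. The sketch below assumes each row is a dict with `run_id`, `attempt_number`, and `outcome`, and that a run stops at its first pass, so "any pass" is equivalent to "ended in pass within the loop budget".

```python
def qc_rates(rows):
    """Return sample_count, first_pass_rate, and eventual_success_rate for one grain."""
    runs = {r["run_id"] for r in rows}
    if not runs:
        return None  # nothing to evaluate in this window
    first_pass = {r["run_id"] for r in rows
                  if r["attempt_number"] == 0 and r["outcome"] == "pass"}
    eventual = {r["run_id"] for r in rows if r["outcome"] == "pass"}
    return {
        "sample_count": len(runs),
        "first_pass_rate": len(first_pass) / len(runs),
        "eventual_success_rate": len(eventual) / len(runs),
    }


rows = [
    {"run_id": "a", "attempt_number": 0, "outcome": "pass"},
    {"run_id": "b", "attempt_number": 0, "outcome": "fail"},
    {"run_id": "b", "attempt_number": 1, "outcome": "pass"},
    {"run_id": "c", "attempt_number": 0, "outcome": "fail"},
    {"run_id": "c", "attempt_number": 1, "outcome": "fail"},
    {"run_id": "c", "attempt_number": 2, "outcome": "fail"},
    {"run_id": "d", "attempt_number": 0, "outcome": "pass"},
]
rates = qc_rates(rows)
```

In this sample, two of four runs pass on the first check (first-pass rate 0.5) and three of four pass eventually (0.75). Run `c` exhausts its corrections and counts against both metrics.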

Alert rule (default):

if sample_count >= min_sample
   AND (first_pass_rate < 0.80 OR eventual_success_rate < 0.80):
     emit agent.qc.threshold.breached
     open optimization ticket (engineering — not client HITL)
     run auto-optimization actions (below)

Parameter	Default	Notes
`success_floor`	0.80 (80%)	Tenant may be stricter; not looser without admin
`window`	24h rolling	Also compute 7d trend for dashboard
`min_sample`	20 runs / grain / window	10 if tenant-only slice; suppress alert if below (noise)
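
The alert rule and its parameters reduce to one pure function. This is a sketch using the defaults above; the function name and the `tenant_slice` flag are assumptions.

```python
def threshold_breached(sample_count: int,
                       first_pass_rate: float,
                       eventual_success_rate: float,
                       *,
                       tenant_slice: bool = False,
                       success_floor: float = 0.80,
                       min_sample_global: int = 20,
                       min_sample_tenant: int = 10) -> bool:
    """Return True when the grain should emit agent.qc.threshold.breached."""
    min_sample = min_sample_tenant if tenant_slice else min_sample_global
    if sample_count < min_sample:
        return False  # too few runs in the window; suppress as noise
    return first_pass_rate < success_floor or eventual_success_rate < success_floor
```

A grain with 25 runs, a first-pass rate of 0.76, and an eventual success rate of 0.92 breaches. The same rates on 12 runs breach only when evaluated as a tenant slice.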

On breach (agent.qc.threshold.breached) — payload includes grain, main_task_id, qc_task_id, main_agent, first_pass_rate, eventual_success_rate, sample_count, top_failure_codes[], model_breakdown, playbook_versions_mode, sample_run_ids[].

Action	Automatic?	Purpose
Ops alert	Yes	Slack / email / PagerDuty — link to System Ops QC panel
Dashboard badge	Yes	Red row on System Ops leaderboard until rate recovers ≥80% for 24h
Model tier promotion	Yes — immediate	Skip normal 24h wait; promote one tier (see model promotion)
Thinking level bump	Yes	e.g. `medium` → `high` on `model_main` for that task class
Engineering ticket	Yes	Bundle: top failure codes, playbook versions, 3× `input_slice_uri` / `qc_feedback_uri`
Playbook review flag	Yes	Pin `playbook.qc` / `playbook.platform.*` version for human diff
Disable QC gate	Never	Quality floor is non-negotiable
Client HITL (A1–A9)	No	Internal ops / engineering only unless client-visible task is blocked fleet-wide

Recovery: alert clears when the same grain stays ≥80% for a full 24h cooldown window (hysteresis — avoids flap).
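
The cooldown can be sketched as a small state holder per grain. As a simplification, the tracker takes a single rate per observation (for example, the lower of the two metrics); the class name and that choice are assumptions.

```python
from datetime import datetime, timedelta


class BreachTracker:
    """Keeps an alert active until the rate stays at or above the floor for a full cooldown."""

    def __init__(self, floor: float = 0.80, cooldown: timedelta = timedelta(hours=24)):
        self.floor = floor
        self.cooldown = cooldown
        self.active = False
        self.healthy_since = None

    def observe(self, now: datetime, rate: float) -> bool:
        if rate < self.floor:
            self.active = True
            self.healthy_since = None  # any dip restarts the cooldown
        elif self.active:
            if self.healthy_since is None:
                self.healthy_since = now
            if now - self.healthy_since >= self.cooldown:
                self.active = False
                self.healthy_since = None
        return self.active


t0 = datetime(2026, 6, 1)
tracker = BreachTracker()
states = [
    tracker.observe(t0, 0.70),                        # breach opens
    tracker.observe(t0 + timedelta(hours=6), 0.85),   # healthy, cooldown starts
    tracker.observe(t0 + timedelta(hours=12), 0.75),  # dips again, cooldown resets
    tracker.observe(t0 + timedelta(hours=13), 0.90),  # healthy, cooldown restarts
    tracker.observe(t0 + timedelta(hours=36), 0.90),  # 23h healthy, still active
    tracker.observe(t0 + timedelta(hours=37), 0.90),  # 24h healthy, alert clears
]
```

The alert stays active through the brief recovery at hour 6 and clears only once the grain has been healthy for a continuous 24 hours.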

Optimize what? Use breach payload to pick the lever:

Dominant signal	Likely fix
High `failure_codes` on one check	Update `playbook.qc` checklist or move check to code
One `model_main` much worse than others	Promote tier or change default routing
Spike after `playbook_versions` bump	Roll back or patch playbook; regression test
High correction count, low first-pass	Tighten main-agent prompt / context budget; raise thinking
One tenant only	Tenant brief / manifest data issue — ops contacts AM

# Illustrative — implementation phase
qc_telemetry_policy:
  log_every_qc_invocation: true
  store_full_input_in_gcs: true      # BQ = refs + hashes only
  retention_days_bq: 400
  retention_days_gcs: 90             # extend on A8 / compliance hold
  rollup_views: [daily, weekly]

qc_threshold_policy:
  success_floor: 0.80                # alert below 80%
  window_hours: 24
  min_sample_global: 20
  min_sample_tenant: 10
  recovery_cooldown_hours: 24
  auto_promote_on_breach: true
  auto_bump_thinking_on_breach: true

Cost Guard — deterministic spend circuit breaker (not AI)

A separate deterministic service — not an agent, not LLM-mediated — sits in front of every Vertex call and enforces hard spend limits per run_id. It uses the per-task cost catalog and a versioned pricing table (model → $/1M input/output) loaded from config. No model chooses whether to stop; the math does.

Why: loop caps limit iterations but not token blowups within a step (thinking tokens, tool-loop context re-send). Cost Guard catches runaway spend even when loop counts are still legal.

Component	Role	AI?
Cost Guard service	Pre-check + post-check on every LLM invocation	No — pure code
Pricing table	`model_id` → input/output $/1M (mirrors Vertex list prices)	No
Estimate resolver	Maps `task_id` + planned steps → `estimated_cost_usd` for `run_id`	No — reads catalog / composite units
Run cost ledger	`run_id` → `{estimated, actual, model_breakdown[]}` in Firestore or Redis	No
Kill switch	Blocks new Vertex calls; signals orchestrator to abort agents	No

Single cost formula (same for catalog planning, run budget, and live metering):

step_cost = (input_tokens / 1e6 × price_in[model])
          + (output_tokens / 1e6 × price_out[model])   # output includes thinking tokens

Only the token source differs between estimate and actual:

Phase	When	Token source	Purpose
Estimate	Run start, before any Vertex call	Catalog In tok / Out tok per planned `task_id` (see cost catalog)	Budget ceiling for `run_id`
Actual	After every Vertex API response	`usage_metadata` on the response — never agent self-report	Running spend tally

Estimate at run start (when router dispatches a gated task):

# Per planned step — same formula as catalog "Cost / run" column
step_estimate = (catalog[task].in_tok / 1e6 × price_in[model])
              + (catalog[task].out_tok / 1e6 × price_out[model])

estimated_run_cost = Σ step_estimate over planned steps   # e.g. opt.cycle + qc.spend
                   × loop_buffer                          # default 1.08 for gated tasks

catalog[task].expected_cost in the table is exactly this math pre-computed; Cost Guard may read either the USD column or recompute from In/Out + pricing_table_version — they must match.
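
As a worked example of the estimate, the sketch below recomputes the gated optimization cycle (`opt.cycle` + `qc.spend`) from the token defaults and list prices given in this document's catalog tables. The dict layout and function names are hypothetical.

```python
# (input $/1M, output $/1M) from the model catalog in this document.
PRICING = {"gemini-3.5-flash": (1.50, 9.00)}

# (model, in_tok, out_tok) planning defaults from the per-task cost catalog.
CATALOG = {
    "opt.cycle": ("gemini-3.5-flash", 22_000, 7_500),
    "qc.spend": ("gemini-3.5-flash", 10_000, 2_500),
}


def step_estimate(task_id: str) -> float:
    """Cost of one planned step, using the single cost formula."""
    model, in_tok, out_tok = CATALOG[task_id]
    price_in, price_out = PRICING[model]
    return in_tok / 1e6 * price_in + out_tok / 1e6 * price_out


def estimated_run_cost(task_ids, loop_buffer: float = 1.08) -> float:
    """Sum of planned step estimates, scaled by the loop buffer."""
    return sum(step_estimate(t) for t in task_ids) * loop_buffer


estimate = estimated_run_cost(["opt.cycle", "qc.spend"])
print(f"{estimate:.3f}")  # 0.149, which rounds to the ~$0.15 composite unit
```

The unrounded steps are $0.1005 and $0.0375, so the buffered total is $0.14904. With the default 3.0 trip multiplier, this run would be terminated at roughly $0.45 of actual spend.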

Stored on run_id before the first LLM call. Estimate is immutable for that run unless admin resets (A6).

Actual after each LLM response — Vertex returns token counts on every generate call:

# Map usage_metadata fields (names vary slightly by SDK; normalize in Cost Guard)
input_tokens  = prompt_token_count           # already includes cached_content_token_count;
                                             # the cached part is billed at the cached input rate if applicable
output_tokens = candidates_token_count
              + thoughts_token_count         # 3.5 Flash thinking, billed at output rate

step_cost = (input_tokens / 1e6 × price_in[model])
          + (output_tokens / 1e6 × price_out[model])
actual_run_cost += step_cost

Cost Guard logs per step: run_id, step_index, model, input_tokens, output_tokens, step_cost_usd, actual_run_cost_usd, estimated_run_cost_usd, ratio. Ledger and BigQuery use API-reported tokens only — never prompt guesses or agent claims.

Optional pre-call check (still deterministic): before forwarding a request, Cost Guard may count input tokens in the outbound payload (Vertex tokenizer or count_tokens API) and warn if a single call's prompt alone exceeds 2× catalog.in_tok for that step. This check does not replace the 3× run-level trip; it catches context blowups early.

Trip rule (default):

if actual_run_cost >= trip_multiplier × estimated_run_cost:
    TERMINATE  # default trip_multiplier = 3.0

Scope	Default `trip_multiplier`	On trip
Per `run_id`	3.0× estimated	Kill run; A9 ticket; ops alert
Per tenant / calendar day (optional cap)	Admin-set USD ceiling	Pause all agent LLM calls for tenant until next day or override
Per environment (dev/staging)	Stricter (e.g. 2.0×)	Kill + alert engineering
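
A minimal sketch of the per-`run_id` trip rule, assuming token counts have already been normalized from the API response. `RunLedger`, `authorize_call`, and the exception are hypothetical names; the exception stands in for the `403 cost_guard_tripped` rejection described below.

```python
from dataclasses import dataclass, field


class CostGuardTripped(Exception):
    """Stands in for rejecting further model calls on a tripped run."""


@dataclass
class RunLedger:
    run_id: str
    estimated_run_cost: float
    trip_multiplier: float = 3.0
    actual_run_cost: float = 0.0
    tripped: bool = False
    steps: list = field(default_factory=list)

    def authorize_call(self) -> None:
        # Pre-check: refuse any new model call once the run has tripped.
        if self.tripped:
            raise CostGuardTripped(f"{self.run_id}: cost_guard_tripped")

    def record_step(self, model: str, input_tokens: int, output_tokens: int,
                    price_in: float, price_out: float) -> float:
        # Post-check: meter API-reported tokens with the single cost formula.
        step_cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
        self.actual_run_cost += step_cost
        self.steps.append({"step_index": len(self.steps), "model": model,
                           "step_cost_usd": step_cost,
                           "ratio": self.actual_run_cost / self.estimated_run_cost})
        if self.actual_run_cost >= self.trip_multiplier * self.estimated_run_cost:
            self.tripped = True
        return step_cost


ledger = RunLedger("run_abc", estimated_run_cost=0.10)
ledger.authorize_call()
ledger.record_step("gemini-3.5-flash", 100_000, 10_000, 1.50, 9.00)  # $0.24, under 3x
ledger.authorize_call()
ledger.record_step("gemini-3.5-flash", 50_000, 0, 1.50, 9.00)        # total $0.315, trips
```

After the second step the ledger is tripped, so the next `authorize_call()` raises. The decision depends only on the arithmetic, never on anything the agent reports.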

On terminate (cost.guard.tripped):

Hard stop — orchestrator cancels in-flight agent work; Cost Guard rejects subsequent Vertex requests for that run_id with 403 cost_guard_tripped
Rollback policy — same as A8: rollback or mark needs_review for any partial platform mutations
HITL ticket (A9) — distinct from A8 (loop exhaustion): shows estimated vs actual USD, per-step token breakdown, model used, trip multiplier
Ops alert — red notification to on-call with tenant, run_id, task_id, actual / estimated ratio
No auto-restart — new run_id requires human acknowledgment; optional admin may raise estimate ceiling (A6) with audit log
Emit cost.guard.tripped on event bus

Cost Guard cannot be bypassed by agents, prompts, or orchestrator retries. Only admin override (A6) with logged reason may raise the multiplier or authorize a new run with a higher estimate.

# Illustrative — implementation phase
cost_guard_policy:
  trip_multiplier: 3.0          # stop at 3× estimate
  loop_buffer_in_estimate: 1.08 # baked into estimated_run_cost
  pricing_table_version: "2026-06"  # must match catalog revision
  tenant_daily_cap_usd: null    # optional; e.g. 50.00
  block_on_trip: true           # always true in prod

What monthly costs include

Per-task prices in the cost catalog are for single successful runs. Monthly totals explicitly count main + QC as separate line items (e.g. opt.cycle + qc.spend).

Correction loops are budgeted separately:

Assumption	Value	Cost impact
QC fail rate (steady state)	~5–8% with 3.5 Flash QC (was ~8–12% on Flash-Lite)	—
Avg correction loops when failed	1.1 (most pass on 1st retry)	+~6–8% on gated-task LLM spend
Compliance escalation	~2% of compliance QC runs	Already line-itemed in Profile B

Composite unit examples (main + QC + expected loop overhead):

Composite	Formula	Typical cost / unit
Optimization cycle (gated, per track)	`opt.cycle` + `qc.spend` + 8% loop buffer	~$0.15
Campaign build (gated)	`exec.campaign.build` + `qc.compliance` + `qc.tracking` + 8% buffer	~$0.26
Master / full replan	`plan.draft` + `qc.plan` + 8% buffer	~$0.28
New track (branding, engagement)	`plan.track.draft` + `qc.plan` + 8% buffer	~$0.20
Special-day event plan	`plan.event.draft` + `qc.plan` + 8% buffer	~$0.15
Per-track revise	`plan.revise` + `qc.plan` + 8% buffer	~$0.22

Monthly profile totals use separate main/QC counts; add ~6–8% for correction-loop overhead on gated tasks (lower with 3.5 QC), or use composite units above when estimating.

Event bus (conceptual)

Async events (illustrative names):

Event	Producer	Consumers
`client.onboarding.completed`	Onboarding Service	Orchestrator, Planning
`plan.approved`	Approval Engine	Orchestrator, Execution
`optimization.applied`	Optimization Service	Reporting, Plan Revise (drift check)
`plan.revise.recommended`	Optimization / Reporting	Plan Revise Agent, HITL dashboard
`plan.revise.approved`	Approval Engine	Execution, Reporting (rebaseline)
`campaign.live`	Execution Service	Optimization, Reporting
`ga4.tracking.degraded`	Tracking Service	Orchestrator (block spend)
`crm.conversion.batch`	CRM Connector	Tracking Service (CAPI)
`agent.qc.result`	Orchestrator (on every QC pass/fail)	BigQuery `agent_qc_results`, System Ops QC panel
`agent.qc.loop`	Orchestrator (on each correction attempt)	BigQuery `agent_qc_loops`, model promotion policy
`agent.qc.threshold.breached`	QC threshold alert job (scheduled)	Ops alerting, auto-optimize, engineering ticket
`agent.loop.exhausted`	Orchestrator	Human Touch ticket, System Ops traces, ops alerting, audit log
`cost.guard.tripped`	Cost Guard service	Orchestrator kill, HITL (A9), ops alerting, audit log

Implementation may use Pub/Sub, Kafka, or equivalent — TBD with stack.

Model strategy and routing

Principles

Capable, never cheap-at-all-costs — sub-agents use the cheapest tier that still passes QC for that task class. If QC failure rate rises, auto-promote tier (see playbook §4).
Gemini-first on Vertex — native tool calling, context caching, Agent Engine integration, single billing/IAM surface.
gemini-3.5-flash as the primary agentic surface — GA model built for agentic execution, tool loops, and sub-agent deployment. Prefer one capable model + thinking_level per task over mixing many smaller models for agent work. See 3.5 Flash thinking matrix.
Compliance QC on Gemini 3.5 Flash — default compliance gate uses gemini-3.5-flash at high thinking: GA, strong reasoning, structured policy checks, ~50% cheaper than Sonnet on Vertex. Cross-model is optional escalation, not the default (see below).
Cross-model escalation (optional) — claude-sonnet-4-6 only when compliance QC is ambiguous (low confidence), vertical risk flag is health / education, or a human requests a second opinion. Reduces cost while keeping an independent check for edge cases.
Opus / Pro escalation only — claude-opus-4-8 or gemini-3.1-pro-preview with extended thinking reserved for human-triggered or router-scored "hard" tasks.
No autonomous loops on Lite-only models — gemini-2.5-flash-lite is allowed only for deterministic structured I/O (JSON transform, field mapping) with schema validation; not for multi-step tool chains.

Model tiers (summary)

Tier	Models	Use when
T0 — Fast / volume	`gemini-3.1-flash-lite`	Routing, classification, batch report digests only — not gated QC or tool loops
T1 — Agentic workhorse	`gemini-3.5-flash` (default), `gemini-2.5-flash` (fallback)	All tool loops, sub-agents, gated QC checkers, optimization, execution
T2 — Planning / orchestration	`gemini-3.1-pro-preview`, `gemini-3.5-flash` (high), `gemini-2.5-pro` (fallback)	Media plans, orchestrator escalations — Pro when 3.5 high still fails QC repeatedly
T3 — Cross-model escalation	`claude-sonnet-4-6`, `claude-haiku-4-5`	Ambiguous compliance, high-risk verticals, human-requested second opinion
T4 — Rare escalation	`claude-opus-4-8`	Human-requested deep reasoning only
T-batch — Async digest	`gemini-3.1-flash-lite` Batch, `llama-3.3-70b` Batch	Scheduled report summarization, log digests

3.5 Flash thinking matrix (recommended default)

Gemini 3.5 Flash is more capable than 3.1 Flash-Lite and 3 Flash Preview for agentic work: GA, stronger tool use, thought preservation across turns, and near-Pro quality at Flash-tier cost. Google positions it explicitly for sub-agent deployment and rapid agentic loops.

Does it make sense to use 3.5 Flash more widely, with a different thinking_level per task? Yes, for any task that calls tools, blocks mutations, or runs a QC correction loop. Use one endpoint at many depths instead of jumping to Pro or mixing preview models. Reserve Flash-Lite for high-volume non-agentic routing and batch digests only.

Task	Model	`thinking_level`	Why
Router / classify	3.1 Flash-Lite	`minimal`	Pure routing — no tools; millions of calls
Onboarding sub-steps	3.5 Flash	`medium`	Multi-step setup, platform APIs
Execution / optimization sub-agents	3.5 Flash	`medium`	Tool loops; default balances quality + latency
Complex campaign build (multi-platform)	3.5 Flash	`high`	More constraints; fewer correction loops
QC — plan validator	3.5 Flash	`high`	Must catch budget/channel errors before human approval
QC — compliance	3.5 Flash	`high`	Policy + claims reasoning
QC — tracking health	3.5 Flash	`low`	Mostly checklist; fast pass/fail
QC — spend guardrails	3.5 Flash	`medium`	Numeric compare vs plan — code validates math; LLM interprets intent
Report anomaly triage	3.5 Flash	`medium`	Needs reasoning; not mutation-blocking
Daily / weekly digest	3.1 Flash-Lite Batch	`low`	Read-only narrative; cost-sensitive
Media plan draft	3.1 Pro Preview	(Pro thinking)	Long-horizon strategy — 3.5 high is fallback if Pro unavailable
Orchestrator escalation	3.1 Pro Preview	(Pro thinking)	Conflict resolution across agents
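
The matrix above maps naturally onto a lookup table that the router consults. In this sketch the task-class keys are hypothetical labels, the model IDs and thinking levels come from the matrix, and `None` means the Pro model's own thinking setting is left at its default.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    model: str
    thinking_level: Optional[str]  # None = leave Pro thinking at its default


FLASH_LITE = "gemini-3.1-flash-lite"
FLASH = "gemini-3.5-flash"
PRO = "gemini-3.1-pro-preview"

# Keys are hypothetical task-class labels; values follow the matrix above.
THINKING_MATRIX = {
    "router.classify": Route(FLASH_LITE, "minimal"),
    "onboarding.substep": Route(FLASH, "medium"),
    "execution.subagent": Route(FLASH, "medium"),
    "campaign.build.multi_platform": Route(FLASH, "high"),
    "qc.plan": Route(FLASH, "high"),
    "qc.compliance": Route(FLASH, "high"),
    "qc.tracking": Route(FLASH, "low"),
    "qc.spend": Route(FLASH, "medium"),
    "report.anomaly": Route(FLASH, "medium"),
    "report.digest": Route(FLASH_LITE, "low"),  # served via Batch per the matrix
    "plan.draft": Route(PRO, None),
    "orch.escalate": Route(PRO, None),
}


def select_route(task_class: str, pro_available: bool = True) -> Route:
    """Look up model and thinking level; unlisted classes raise KeyError by design."""
    route = THINKING_MATRIX[task_class]
    if task_class == "plan.draft" and not pro_available:
        return Route(FLASH, "high")  # documented fallback for media plan drafts
    return route
```

Failing loudly on an unlisted task class is a deliberate choice in this sketch: a silent default would hide routing gaps that the QC telemetry is meant to expose.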

Quality vs cost tradeoff: Putting gated QC on 3.5 Flash (vs a cheaper Flash-Lite QC mix) raises LLM spend by roughly $4–6 / client / month on Standard, but buys fewer correction loops (8% buffer vs ~12%) and a higher first-pass QC rate, which is usually net-positive once human-review time is counted. Do not run the router or batch digests on 3.5; that would multiply the cost of those rows 3–6× for no quality gain. Expected Standard total is **$25.60 / client / month** (see cost catalog and cost band).

Thinking levels (`gemini-3.1-flash-lite` only)

For T0 tasks on Flash-Lite per Google's guidance:

Level	Agent tasks
`minimal`	Router, intent classification
`low`	Batch report KPI bullets, simple field extraction

Model catalog (Vertex AI, June 2026)

Source: Vertex AI Generative AI pricing — Global endpoint, standard (non-batch) rates per 1M tokens, prompts ≤200K. Cached input shown where available. Verify before implementation; preview models may change.

Primary — Gemini (Google)

Model	Input / 1M	Output / 1M	Cached input / 1M	Designed for	Kobi use
gemini-3.1-flash-lite (GA)	$0.25	$1.50	$0.025	High-volume agentic workflows, tool calling, routing (docs)	Router, intent classification, batch report digests (not gated QC or tool loops)
gemini-3.5-flash (GA)	$1.50	$9.00	$0.15	Frontier-level Flash; agentic execution, coding, long-horizon tool use (docs)	Compliance QC, Execution, Optimization (preferred GA workhorse)
gemini-3-flash-preview	$0.50	$3.00	$0.05	Agentic workhorse, multimodal, computer use (docs)	Fallback if 3.5 unavailable; Onboarding
gemini-3.1-pro-preview	$2.00	$12.00	$0.20	Deep agentic reasoning, precise tool use, 1M context (docs)	Orchestrator, Media Plan
gemini-2.5-flash (GA)	$0.30	$2.50	$0.03	Stable agentic GA; improved tool use (Google blog)	Production fallback for T1
gemini-2.5-pro (GA)	$1.25	$10.00	$0.13	Complex reasoning, coding	Production fallback for T2
gemini-2.5-flash-lite (GA)	$0.10	$0.40	$0.01	Ultra-cheap classification/summarization	Structured transform only — not autonomous tool loops

Secondary — Anthropic on Vertex (QC & cross-check)

Model	Input / 1M	Output / 1M	Designed for	Kobi use
claude-haiku-4-5	$1.00	$5.00	Fast, cost-effective classification (Anthropic pricing)	Lightweight policy JSON scans
claude-sonnet-4-6	$3.00	$15.00	Agents, coding, enterprise workflows at scale	Compliance escalation: ambiguous QC, health/education high-risk, human-requested
claude-opus-4-8	$5.00	$25.00	Hardest agentic + coding tasks	Human-triggered escalation only

Optional — Meta Llama on Vertex (batch / cost experiments)

Model	Input / 1M	Output / 1M	Designed for	Kobi use
llama-3.3-70b	$0.72	$0.72	Efficient text tasks	Batch report narrative (optional)
llama-4-maverick	$0.35	$1.15	Multimodal reasoning, tool calling	Creative brief / asset review (optional)

Models we do not default to

Model	Reason
Grok, Nemotron, Mistral OCR, TTS, embedding-only	Wrong task fit or no advantage over Gemini/Claude for media-ops agents
`gemini-2.5-flash-lite` for tool loops	Weak agentic benchmarks (SWE-bench ~32%); acceptable only with strict JSON schema + no tools
Opus / 3.1 Pro for sub-agents	10–50× token cost vs Flash-Lite with no QC benefit on simple tasks

Per-task LLM cost catalog

Vertex global standard pricing. Cost formula:

cost = (input_tokens / 1M × input_rate) + (output_tokens / 1M × output_rate)

Token assumptions are planning defaults for a single successful run of that step. 3.5 Flash output column includes thinking tokens billed at output rate. QC checker runs are separate line items — gated main tasks always incur main + QC (see pairing table). Correction loops add ~6–8% on gated tasks with 3.5 QC (see loop limits). Context caching on system prompts typically reduces input cost 30–50% in steady state — figures below are without cache (conservative).

Task ID	Task	Agent / layer	Model	Thinking	In tok	Out tok†	Cost / run
`route.dispatch`	Route Pub/Sub event to agent + playbook	Router	3.1 Flash-Lite	minimal	3K	0.5K	$0.0015
`route.classify`	Classify complexity + model tier	Router	3.1 Flash-Lite	low	5K	0.8K	$0.0025
`onboard.step`	One onboarding sub-step (account, tag, verify)	Onboarding sub	3.5 Flash	medium	12K	4.5K	$0.059
`plan.draft`	Master plan or full replan (envelope + tracks)	Media Plan	3.1 Pro Preview	—	30K	10K	$0.18
`plan.track.draft`	New track (branding, engagement, always-on split)	Media Plan	3.1 Pro Preview	—	20K	6K	$0.11
`plan.event.draft`	Special-day / event flight (time-boxed)	Media Plan	3.5 Flash	medium	14K	4.5K	$0.062
`plan.revise`	Per-track revise vN+1 from opt log + manifest	Plan Revise	3.1 Pro Preview	—	22K	7K	$0.13
`report.plan_drift`	Per-track drift + changelog (may batch tracks)	Reporting	3.5 Flash	low	25K	4K	$0.074
`qc.plan`	Plan validator vs brief	QC	3.5 Flash	high	18K	5.5K	$0.077
`qc.compliance`	Creative / copy policy gate	QC	3.5 Flash	high	12K	4K	$0.054
`qc.compliance.escalate`	Cross-model second opinion (rare)	QC	Sonnet 4.6	—	12K	2.5K	$0.074
`qc.tracking`	Tracking health sweep	QC	3.5 Flash	low	10K	2K	$0.033
`qc.spend`	Spend guardrail vs approved plan	QC	3.5 Flash	medium	10K	2.5K	$0.038
`exec.campaign.build`	Build one platform campaign slice	Execution sub	3.5 Flash	medium	40K	10K	$0.15
`exec.campaign.mutate`	Pause, budget nudge, status change	Execution sub	3.5 Flash	medium	15K	4K	$0.059
`opt.cycle`	Analyze performance + propose change	Optimization	3.5 Flash	medium	22K	7.5K	$0.101
`feed.validate`	Feed / catalog batch validation	Onboarding / Execution sub	3.5 Flash	low	8K	2K	$0.030
`report.daily`	Daily KPI digest	Reporting	3.1 Flash-Lite Batch	low	35K	4K	$0.007
`report.weekly`	Weekly client narrative	Reporting	3.1 Flash-Lite	low	45K	6K	$0.020
`report.anomaly`	Anomaly triage + recommendation	Reporting	3.5 Flash	medium	28K	9K	$0.123
`orch.escalate`	Orchestrator re-plan / conflict resolution	Orchestrator	3.1 Pro Preview	—	20K	5K	$0.10

† For gemini-3.5-flash, Out tok = visible output + thinking tokens (billed at output rate per Vertex pricing).
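
As a minimal sketch, the cost formula at the top of this catalog can be written as a small function. The two rates below are placeholder assumptions, chosen only so that the example lands on the `qc.plan` row; they are not quoted from a price list, so take real rates from the model catalog or the Vertex AI pricing page.

```python
def cost_per_run(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Return the USD cost of one run; rates are USD per 1M tokens."""
    return (input_tokens / 1_000_000) * input_rate + (output_tokens / 1_000_000) * output_rate


# Assumed placeholder rates (USD per 1M tokens), not authoritative.
ASSUMED_INPUT_RATE = 1.50
ASSUMED_OUTPUT_RATE = 9.00

# qc.plan row: 18K input tokens, 5.5K output tokens (output includes thinking tokens).
qc_plan_cost = cost_per_run(18_000, 5_500, ASSUMED_INPUT_RATE, ASSUMED_OUTPUT_RATE)
print(f"${qc_plan_cost:.4f}")  # the catalog rounds this to three decimals
```

Keeping rates as arguments rather than constants inside the function makes it straightforward to re-price the whole catalog when the pricing table changes.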

Batch API for scheduled / non-urgent tasks

The Vertex Batch API gives a 50% discount on eligible Gemini models in exchange for asynchronous completion (up to ~24h). Use it only for scheduled, non-blocking work; never for real-time routing, gated QC, platform mutations, interactive planning, or anomaly alerts (SLA under 1h).

Task	Standard / run	Batch / run (−50%)	Eligible?
`report.daily`	$0.007	$0.007	✓ already batched
`report.weekly`	$0.020	$0.010	✓ scheduled
`report.plan_drift`	$0.074	$0.037	✓ scheduled (per-track rollup)
`feed.validate` (scheduled catalog sweep)	$0.030	$0.015	✓ when not pre-launch gating
`report.monthly` (exec summary)	~$0.15	~$0.075	✓ scheduled
`route.*`, `plan.*`, `qc.*`, `exec.*`, `opt.cycle`, `report.anomaly`, `orch.escalate`	—	—	✗ real-time / gated / interactive

Reality check: batch trims the reporting + feed-sweep slice only (~2–4% of total). The real cost drivers — optimization cycles, execution mutations, QC gates — are real-time and stay at standard rates. Batch is worthwhile (free 50% on eligible tasks) but is not where the savings concentrate; context caching matters far more.
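
The eligibility rules in the table above could be enforced in code before a task is queued. The sketch below is one possible shape, not the actual router logic; the function names and the `scheduled` / `gates_launch` flags are hypothetical.

```python
# Task IDs the eligibility table marks as batch-eligible.
BATCH_ELIGIBLE = {
    "report.daily",
    "report.weekly",
    "report.plan_drift",
    "report.monthly",
    "feed.validate",
}


def use_batch(task_id: str, scheduled: bool, gates_launch: bool = False) -> bool:
    """Return True only for scheduled, non-blocking work on an eligible task."""
    if task_id not in BATCH_ELIGIBLE:
        # route.*, plan.*, qc.*, exec.*, opt.cycle, report.anomaly, orch.escalate
        return False
    if task_id == "feed.validate" and gates_launch:
        # A pre-launch feed check blocks a launch, so it stays real-time.
        return False
    return scheduled


def batch_cost(standard_cost: float) -> float:
    """Apply the 50% batch discount to a standard per-run cost."""
    return standard_cost * 0.5
```

For example, `use_batch("report.weekly", scheduled=True)` is true, while `use_batch("report.anomaly", scheduled=True)` is false because anomaly alerts are never batched.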

Monthly task volume & LLM cost by client profile

Illustrative steady-state volumes (after onboarding month). Costs assume multiple concurrent plan tracks per client (always-on, branding, engagement, special-day events) — not a single monolithic plan. See plan tracks.

QC pairing: the unit cost of each plan.* row includes the paired qc.plan run ($0.077) unless noted.

Profile A — Starter (Google + Meta, low activity)

~$15–25K/mo media spend, 2 active tracks (always-on + seasonal/event), optimization 2–3×/week.

Task ID	Runs / month	Unit cost (incl. QC where gated)	Monthly
`route.dispatch`	320	$0.0015	$0.48
`route.classify`	110	$0.0025	$0.28
Planning (multi-track)
`plan.draft` master refresh	0.25	$0.257	$0.06
`plan.track.draft` + `qc.plan`	0.5	$0.187	$0.09
`plan.event.draft` + `qc.plan`	1	$0.139	$0.14
`plan.revise` + `qc.plan`	2	$0.207	$0.41
`report.plan_drift` (2 tracks)	4	$0.074	$0.30
`qc.compliance`	5	$0.054	$0.27
`exec.campaign.build`	6	$0.15	$0.90
`exec.campaign.mutate`	4	$0.059	$0.24
`opt.cycle` + `qc.spend`	24 + 24	$0.139	$3.34
`qc.tracking`	30	$0.033	$0.99
`report.daily` + `report.weekly`	30 + 4	—	$0.29
`report.anomaly`	2	$0.123	$0.25
Correction-loop overhead (~8% of gated)	—	—	~$0.52
		LLM subtotal (standard)	~$8.55

With context caching (40% input savings): **$6.40–6.90 / client / month**
With caching + batch on scheduled reports (saves $0.19): **$6.25–6.70 / client / month**
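
To make the Profile A arithmetic reproducible, the rows above can be summed in a few lines. This is a sketch that copies runs and unit costs from the table; the correction-loop overhead is taken as the flat figure shown rather than recomputed, and `report.daily` uses its already-batched unit price.

```python
# (label, runs per month, unit cost in USD) copied from the Profile A table.
PROFILE_A_ROWS = [
    ("route.dispatch", 320, 0.0015),
    ("route.classify", 110, 0.0025),
    ("plan.draft master refresh", 0.25, 0.257),
    ("plan.track.draft + qc.plan", 0.5, 0.187),
    ("plan.event.draft + qc.plan", 1, 0.139),
    ("plan.revise + qc.plan", 2, 0.207),
    ("report.plan_drift", 4, 0.074),
    ("qc.compliance", 5, 0.054),
    ("exec.campaign.build", 6, 0.15),
    ("exec.campaign.mutate", 4, 0.059),
    ("opt.cycle + qc.spend", 24, 0.139),
    ("qc.tracking", 30, 0.033),
    ("report.daily", 30, 0.007),
    ("report.weekly", 4, 0.020),
    ("report.anomaly", 2, 0.123),
]
CORRECTION_LOOP_OVERHEAD = 0.52  # flat figure from the table


def monthly_subtotal(rows, overhead: float) -> float:
    """Sum runs x unit cost over all rows, then add the overhead line."""
    return sum(runs * unit for _, runs, unit in rows) + overhead


subtotal = monthly_subtotal(PROFILE_A_ROWS, CORRECTION_LOOP_OVERHEAD)
print(f"${subtotal:.2f}")  # matches the ~$8.55 subtotal
```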

Profile B — Standard (Google + Meta + GA4 + CRM, active optimization)

~$50–150K/mo media spend, 3–4 active tracks (always-on, branding, engagement, seasonal), daily optimization.

Task ID	Runs / month	Unit cost (incl. QC where gated)	Monthly
`route.dispatch`	700	$0.0015	$1.05
`route.classify`	240	$0.0025	$0.60
Planning (multi-track)
`plan.draft` master refresh	0.25	$0.257	$0.06
`plan.track.draft` + `qc.plan`	1	$0.187	$0.19
`plan.event.draft` + `qc.plan`	2	$0.139	$0.28
`plan.revise` + `qc.plan`	4	$0.207	$0.83
`report.plan_drift` (3–4 tracks)	8	$0.074	$0.59
`qc.compliance`	14	$0.054	$0.76
`exec.campaign.build`	22	$0.15	$3.30
`exec.campaign.mutate`	14	$0.059	$0.83
`opt.cycle` + `qc.spend`	95 + 95	$0.139	$13.21
`qc.tracking`	30	$0.033	$0.99
`report.daily` + `report.weekly`	30 + 4	—	$0.29
`report.anomaly`	6	$0.123	$0.74
`orch.escalate`	2	$0.10	$0.20
`qc.compliance.escalate`	1	$0.074	$0.07
Correction-loop overhead (~8% of gated)	—	—	~$1.64
		LLM subtotal (standard)	~$25.60

With context caching (40% input savings): **$19–22 / client / month**
With caching + batch on scheduled reports (saves $0.34): **$18.70–21.70 / client / month**

Profile C — Ecommerce (feeds, catalog, high optimization)

~$150K+/mo media spend, 4–6 active tracks (+ flash sales, catalog pushes), hourly pacing.

Task ID	Runs / month	Unit cost (incl. QC where gated)	Monthly
Profile B planning + ops base	—	—	$25.60
Extra planning (events / tracks)
`plan.event.draft` + `qc.plan`	+2	$0.139	+$0.28
`plan.revise` + `qc.plan`	+2	$0.207	+$0.41
`plan.track.draft` + `qc.plan`	+0.5	$0.187	+$0.09
`report.plan_drift`	+4	$0.074	+$0.30
`feed.validate`	30	$0.030	$0.90
Extra `opt.cycle` + `qc.spend`	+45 +45	$0.139	+$6.26
Extra `exec.campaign.build`	+12	$0.15	+$1.80
Extra `qc.compliance`	+6	$0.054	+$0.32
Extra `route.dispatch`	+250	$0.0015	+$0.38
Correction-loop overhead (extra gated)	—	—	~$0.80
		LLM subtotal (standard)	~$37.15

With context caching (40% input savings): **$27–31 / client / month**
With caching + batch on scheduled reports + feed sweeps (saves $0.93): **$26.50–30 / client / month**

Onboarding month (one-time, any profile)

Task ID	Runs (typical)	Unit cost	One-time
`onboard.step`	18–25	$0.059	$1.06–1.48
`route.dispatch` + `route.classify`	80 + 40	$0.0015 + $0.0025	$0.22
`qc.tracking` (setup)	10	$0.033	$0.33
		Onboarding LLM	~$1.60–2.00

Portfolio-level estimate

Portfolio size	Profile mix	Standard (no cache)	+ caching	+ caching & batch
10 clients	6 Starter + 4 Standard	~$154	~$120	~$118
50 clients	25 + 20 + 5 Ecommerce	~$912	~$712	~$707
200 clients	100 + 80 + 20	~$3,650	~$2,845	~$2,830

Assumes multi-track planning volumes above; actuals scale with count of active event/sale plans per client. Batch column reflects scheduled reporting + feed sweeps only — the marginal step beyond caching is small because cost concentrates in real-time optimization, execution, and QC.
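
The "Standard (no cache)" column can be reproduced from the per-profile subtotals. The sketch below assumes only those three subtotals; the dictionary keys and function name are illustrative.

```python
# Standard (no cache) monthly LLM subtotals per client, from the profile tables.
PROFILE_MONTHLY = {"starter": 8.55, "standard": 25.60, "ecommerce": 37.15}


def portfolio_cost(mix: dict) -> float:
    """Return the monthly LLM cost for a mix of {profile: client_count}."""
    return sum(PROFILE_MONTHLY[profile] * count for profile, count in mix.items())


for mix in (
    {"starter": 6, "standard": 4},
    {"starter": 25, "standard": 20, "ecommerce": 5},
    {"starter": 100, "standard": 80, "ecommerce": 20},
):
    print(sum(mix.values()), "clients:", f"${portfolio_cost(mix):,.2f}")
```

The printed values are unrounded; the table rounds them to the nearest few dollars.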

Note: These are Vertex LLM inference only — not media spend, Kobi management fees, Cloud Run, BigQuery, or platform API costs. Log actual run_id token usage to BigQuery per tenant_id for invoice-grade allocation if pass-through is contracted. See Billing & invoicing for client-facing monthly invoices (media spend + fees).

Realistic cost band (treat point estimates as the expected case)

The per-task tokens above are single-shot, expected-case assumptions. Two factors can push real cost up materially; size budgets with a band, not a point:

Factor	Effect	Multiplier on affected tasks
Thinking tokens at `high`	3.5 Flash `high` can emit far more reasoning than the 4–5.5K assumed for `qc.plan` / `qc.compliance` / multi-constraint builds	output tokens ×1.5–2.5
Tool-loop context re-send	Each tool round re-sends growing context; catalog prices `exec.*` / `opt.cycle` as a single shot	input tokens ×2–4 over 3–5 rounds
Transient retries	Orchestrator retry (1×) + occasional platform re-fetch	+5–10% on tool tasks

Profile	Low (caching + batch)	Expected (standard)	High (thinking + tool-loops)
Starter	~$6.25	~$8.55	~$15–18
Standard	~$18.70	~$25.60	~$48–55
Ecommerce	~$26.50	~$37.15	~$72–82

Even the high case stays a tiny fraction of media spend and Kobi fees — but design for the high band, then measure real run_id usage in a pilot to replace these assumptions.

Actual cost is dominated by optimization frequency and multi-turn tool loops — the playbook below targets both.
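
One way to turn the multipliers above into a per-task band is sketched below. The rates are placeholder assumptions (USD per 1M tokens) that reproduce the `exec.campaign.build` catalog row; the function names are hypothetical, and the default multipliers are the ranges from the factor table.

```python
def cost_per_run(input_tokens, output_tokens, input_rate, output_rate):
    """USD cost of one run; rates are USD per 1M tokens."""
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate


def cost_band(input_tokens, output_tokens, input_rate, output_rate,
              input_mult=(2.0, 4.0),     # tool-loop context re-send
              output_mult=(1.0, 1.0),    # set to (1.5, 2.5) for `high` thinking tasks
              retry_uplift=(0.05, 0.10)):
    """Return (low, high) stressed cost for a single task."""
    low = cost_per_run(input_tokens * input_mult[0], output_tokens * output_mult[0],
                       input_rate, output_rate) * (1 + retry_uplift[0])
    high = cost_per_run(input_tokens * input_mult[1], output_tokens * output_mult[1],
                        input_rate, output_rate) * (1 + retry_uplift[1])
    return low, high


# exec.campaign.build: 40K in, 10K out; assumed placeholder rates 1.50 / 9.00.
low, high = cost_band(40_000, 10_000, 1.50, 9.00)
print(f"${low:.3f} to ${high:.3f} versus $0.15 single-shot")
```

Under these assumptions a single build costs roughly 1.5 to 2.4 times its catalog price, which is the kind of spread the High column is meant to absorb.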

Token-efficiency playbook

Goal: maximum quality per token — structured outputs, cached context, minimal re-prompting.

1. Prompt architecture

Rule	Implementation
Split system vs task	Stable system prompt + tool schemas → Vertex context cache (refresh only on playbook version bump)
Structured outputs only	Agents return JSON matching versioned schemas (`plan_vN`, `mutation_manifest`, `qc_result`) — no prose in tool paths
Reference by ID	Pass `tenant_id`, `plan_version`, `entity_id` — never re-embed full plan in every sub-agent call; sub-agents fetch slice via tool
One job per sub-agent	e.g. `meta.campaign.create` not `meta.everything` — smaller context, fewer hallucinated side effects

2. Context budget

Context type	Max tokens (target)	Cache?
System + tool schemas	≤ 8K	Yes
Tenant playbook snippet	≤ 2K	Yes
Task payload (this step only)	≤ 4K	No
Platform API response (trimmed)	≤ 6K	No
Total per sub-agent call	≤ 20K	—

Trim platform API responses to fields the schema requires. Store full responses in GCS/BQ; pass URI to reporting agents only.
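
A minimal sketch of how the budget and the trimming rule might be checked before a sub-agent call. Token counting is assumed to happen elsewhere; the segment keys and function names are hypothetical.

```python
# Target maximum tokens per context segment, from the table above.
CONTEXT_BUDGET = {
    "system_and_tools": 8_000,
    "tenant_playbook": 2_000,
    "task_payload": 4_000,
    "platform_response": 6_000,
}
TOTAL_BUDGET = 20_000


def budget_violations(token_counts: dict) -> list:
    """Return the segments (and 'total') whose token count exceeds the target."""
    over = [name for name, limit in CONTEXT_BUDGET.items()
            if token_counts.get(name, 0) > limit]
    if sum(token_counts.values()) > TOTAL_BUDGET:
        over.append("total")
    return over


def trim_response(response: dict, required_fields: set) -> dict:
    """Keep only the fields the output schema requires."""
    return {key: value for key, value in response.items() if key in required_fields}


raw = {"id": "123", "status": "ACTIVE", "daily_budget": 5000, "debug_blob": "..."}
print(trim_response(raw, {"id", "status", "daily_budget"}))
print(budget_violations({"system_and_tools": 7_000, "platform_response": 9_000}))
```

The full `raw` response would still be written to GCS/BQ; only the trimmed dictionary goes into the prompt.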

3. Tool-call discipline

Plan tools explicitly — max 3 tools per sub-agent invocation unless router scores complex.
Idempotent tools — safe to retry; orchestrator dedupes by run_id.
No LLM in hot path for math — budget splits, bid deltas, % thresholds computed in code; LLM proposes intent, code validates numbers.
Batch non-urgent work — route all scheduled, non-blocking tasks (daily/weekly/monthly digests, per-track drift reports, scheduled feed sweeps) through Vertex Batch API for a 50% discount on eligible Gemini models. Never batch real-time routing, gated QC, mutations, or anomaly alerts. See Batch API for scheduled tasks.
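
The "LLM proposes intent, code validates numbers" rule can be illustrated with a budget split. In this sketch the model is assumed to return shares per track, and code checks them and computes the amounts; the function name and the `max_share` cap are hypothetical, not documented guardrails.

```python
def validate_budget_split(total_budget: float, proposed_shares: dict,
                          max_share: float = 0.6) -> dict:
    """Check an LLM-proposed split in code and return per-track amounts.

    Raises ValueError so that a bad number never reaches a platform API.
    """
    if any(share < 0 or share > max_share for share in proposed_shares.values()):
        raise ValueError("share outside allowed range")
    if abs(sum(proposed_shares.values()) - 1.0) > 1e-6:
        raise ValueError("shares must sum to 1")
    # Amounts are computed here, never taken from model output.
    return {track: round(total_budget * share, 2)
            for track, share in proposed_shares.items()}


amounts = validate_budget_split(
    10_000, {"always_on": 0.5, "branding": 0.3, "engagement": 0.2}
)
print(amounts)
```

A rejected proposal can be returned to the agent as a structured error, which is usually cheaper than letting a platform API reject the mutation later.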

4. Model promotion / demotion (automatic)

Signal	Action
QC fail rate > 5% for task class (rolling 24h)	Promote task class one tier (e.g. Flash-Lite → 3.5 Flash)
QC success rate < 80% (task or subtask grain)	Alert + immediate promote + thinking bump — see QC success threshold alerts
QC pass rate > 99% for 7d on task class	Trial demote one tier; revert if fails spike
p95 latency > SLA	Lower thinking level or switch to Flash-Lite
Human escalation on reasoning	Pin task class to T2+ for 30d
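
The signal table could be evaluated by a small deterministic function. The sketch below is illustrative only: the precedence between signals, the three-tier list, and all field and action names are assumptions that the table does not specify.

```python
from dataclasses import dataclass

TIERS = ["T0", "T1", "T2"]


@dataclass
class TaskClassStats:
    qc_fail_rate_24h: float   # rolling 24h fail rate for the task class
    qc_success_rate: float    # task or subtask grain
    qc_pass_rate_7d: float
    p95_latency_s: float
    sla_s: float
    human_escalation: bool = False


def next_action(tier: str, s: TaskClassStats) -> tuple:
    """Return (action, target_tier); the check order is an assumed precedence."""
    i = TIERS.index(tier)
    up = TIERS[min(i + 1, len(TIERS) - 1)]
    down = TIERS[max(i - 1, 0)]
    if s.human_escalation:
        return ("pin_30d", "T2")
    if s.qc_success_rate < 0.80:
        return ("alert_promote_bump_thinking", up)
    if s.qc_fail_rate_24h > 0.05:
        return ("promote", up)
    if s.p95_latency_s > s.sla_s:
        return ("lower_thinking", tier)
    if s.qc_pass_rate_7d > 0.99:
        return ("trial_demote", down)
    return ("hold", tier)


stats = TaskClassStats(qc_fail_rate_24h=0.08, qc_success_rate=0.92,
                       qc_pass_rate_7d=0.93, p95_latency_s=4.0, sla_s=10.0)
print(next_action("T0", stats))
```

Quality signals are checked before latency and demotion here so that a struggling task class is never demoted for speed; that ordering is a design choice of the sketch.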

5. Playbook registry (per tenant / vertical)

Versioned artifacts cached in Vertex context cache:

Playbook	Contents	Changes when
`playbook.routing`	Task class → model tier + thinking level	Model catalog update
`playbook.vertical.{health|school|…}`	Compliance keywords, blocked claims, KPI defaults	Vertical config change
`playbook.platform.{google|meta|…}`	Tool allowlist, API field maps, rate-limit hints	Platform spec update
`playbook.qc`	QC agent checklist templates	Policy change

Router selects playbook IDs; agents never inline full vertical rules in every prompt.

6. Output quality without extra tokens

Two-pass by design: every gated main task → paired QC checker (see §5). Max 2 correction loops, then human. No third QC pass unless compliance escalates to Sonnet.
Confidence gate: if router confidence < 0.85 on classification, escalate to medium thinking or T1 model — cheaper than a failed tool loop + retry.
Log token usage per run_id, agent, model, thinking_level → BigQuery for cost attribution per tenant.
Log every QC loop — agent.qc.result + agent.qc.loop with agent, model, task, playbook versions, and input/doc refs (see QC loop telemetry).
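
A minimal sketch of the confidence gate and of a token-usage log row, assuming the thinking levels are ordered minimal, low, medium, high. The class and function names are hypothetical, and writing the row to BigQuery is out of scope here; the sketch only produces a JSON line.

```python
import json
from dataclasses import dataclass, asdict

THINKING_ORDER = ["minimal", "low", "medium", "high"]


def gate_thinking(confidence: float, thinking: str, floor: float = 0.85) -> str:
    """Raise thinking to at least 'medium' when router confidence is below the floor."""
    if confidence >= floor:
        return thinking
    return max(thinking, "medium", key=THINKING_ORDER.index)


@dataclass
class TokenUsageRow:
    run_id: str
    tenant_id: str
    agent: str
    model: str
    thinking_level: str
    input_tokens: int
    output_tokens: int


def to_log_line(row: TokenUsageRow) -> str:
    """Serialize one usage row as a JSON line for a downstream loader."""
    return json.dumps(asdict(row))


level = gate_thinking(confidence=0.72, thinking="low")
row = TokenUsageRow("run-001", "tenant-a", "router", "gemini-3.1-flash-lite",
                    level, 5_000, 800)
print(to_log_line(row))
```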

Failure handling

Failure	Behavior
Platform API rate limit	Exponential backoff; partial apply with rollback marker
Agent timeout	Orchestrator retries once; then red flag (A8) if still failing
Loop limit exceeded	Red flag (A8) + `agent.loop.exhausted`; no further auto-retry
Cost Guard trip (≥3× estimate)	Terminate run_id; A9 + `cost.guard.tripped`; all Vertex calls blocked
Guardrail violation	Block action; create approval ticket
Tracking down	Pause optimization spend increases; alert ops
Repeated A8 on same task class (≥3 in 24h)	Escalate to engineering + optional tenant pause
Repeated A9 on same task class (≥3 in 24h)	Review pricing table / estimate resolver; tighten `trip_multiplier` or task catalog
QC success < 80% (task/subtask, min sample)	`agent.qc.threshold.breached` → ops alert + auto-optimize; engineering ticket
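
Three of the rows above are deterministic checks that need no model call. The sketch below shows one possible shape for them; the backoff parameters and all function names are hypothetical, while the 3× trip multiplier and the "≥3 in 24h" rule come from the table.

```python
TRIP_MULTIPLIER = 3.0  # Cost Guard trips at >= 3x the estimate


def cost_guard_tripped(actual_cost: float, estimated_cost: float,
                       trip_multiplier: float = TRIP_MULTIPLIER) -> bool:
    """True when a run has spent at least trip_multiplier times its estimate."""
    return actual_cost >= trip_multiplier * estimated_cost


def backoff_delays(base_s: float = 1.0, factor: float = 2.0,
                   max_retries: int = 5, cap_s: float = 60.0) -> list:
    """Exponential backoff schedule (seconds) for platform rate limits."""
    return [min(cap_s, base_s * factor ** attempt) for attempt in range(max_retries)]


def repeated_flag(flag_times_s: list, now_s: float,
                  window_s: float = 24 * 3600, threshold: int = 3) -> bool:
    """True when a task class has raised >= threshold flags inside the window."""
    recent = [t for t in flag_times_s if 0 <= now_s - t <= window_s]
    return len(recent) >= threshold


print(cost_guard_tripped(actual_cost=0.50, estimated_cost=0.15))
print(backoff_delays())
print(repeated_flag([1_000, 40_000, 80_000], now_s=86_000))
```

In practice a production backoff would usually add jitter; it is left out here to keep the schedule easy to read.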

Related documents

GCP deployment topology — AI & agent runtime
System overview
Human control plane
System Ops Dashboard
Lifecycle: optimization
Vertex AI pricing (verify at implementation)
