Execution & Cost
Cost Model & Estimates (Total Cost to Operate the Module)
Purpose: A single, auditable view of what it costs Kobi to run the Digital Ads module — LLM inference + Vertex Agent Engine + core GCP infra + data warehouse + (optional) first-party relay + one-time build. This complements the detailed LLM cost catalog (kept as the source for token-level numbers) by consolidating every cost layer into per-tenant and portfolio totals.
Audience: Parent-project / VC finance and engineering. Commercial figures (pricing, media take-rate, revenue) are owned at the parent/VC level — this doc is the cost side only, so those numbers can be slotted against it.
Confidence: Planning bands. Unit prices verified June 2026 against the sources in §1; re-verify before any commitment — cloud and model prices drift, and several Gemini/Agent-Engine SKUs are new in 2026.
Related: Agentic orchestration — LLM cost catalog · GCP topology · First-party tag relay — cost · Execution gameplan
0. Scope — what is and isn't counted
| Counted (Kobi-incurred operating cost) | Excluded (and why) |
|---|---|
| LLM inference (Vertex Model Garden tokens) | Media spend pass-through — this is client budget Kobi fronts/invoices; it is working capital, not opex (see §7) |
| Vertex AI Agent Engine runtime (compute/memory/sessions) | Parent Kobi platform shared services (core CRM product, identity, billing ERP) — separate budget |
| Cloud Run (orchestrator, connectors, BFFs, Cost Guard, jobs) | Client GA4 → BigQuery / MMP export — client-operated, out of scope |
| BigQuery (Google Ads export + agent/QC telemetry) | Human ops/eng salaries — headcount, not infra |
| Pub/Sub, Firestore/Cloud SQL, Secret Manager, GCS, Logging | DV360 minimum-spend commitment — a commercial contract term, not infra |
| First-party relay (Phase 2+, optional SKU) | Platform API fees — Google/Meta/TikTok ad APIs are free to call (rate-limited, not metered) |
| One-time engineering build (§6) | Third-party CMP licenses (only if Kobi resells one) |
Tenant profiles (from media-planning) and portfolio sizes (Pilot 5 / Growth 50 / Scale 200) match the existing docs so numbers reconcile.
1. Unit prices (verified — source table)
Vertex global standard (non-batch) rates; GCP Tier-1 region (e.g. europe-west1). First-tier free allowances noted.
| Resource | Unit price | Free tier / month | Source |
|---|---|---|---|
| Gemini 3.1 Flash-Lite | $0.25 in / $1.50 out per 1M tok | — | Vertex pricing |
| Gemini 3.5 Flash | $1.50 in / $9.00 out per 1M tok | — | Vertex pricing |
| Gemini 3.1 Pro Preview | $2.00 in / $12.00 out per 1M tok | — | Vertex pricing |
| Claude Sonnet 4.x / Opus 4.x (escalation) | $3/$15 · $5/$25 per 1M tok | — | Anthropic on Vertex |
| Vertex Batch API | −50% on eligible Gemini | — | Vertex pricing |
| Context caching | ~−30–50% on cached input | — | Vertex pricing |
| Agent Engine — compute | $0.0864 / vCPU-hour | 50 vCPU-hours | Agent Engine pricing |
| Agent Engine — memory | $0.0090 / GiB-hour | 100 GiB-hours | Agent Engine pricing |
| Agent Engine — sessions | $0.25 / 1,000 events | — | Agent Engine pricing (billing since Feb 11 2026) |
| Cloud Run — vCPU | $0.000024 / vCPU-second (~$0.0864/hr) | 180,000 vCPU-s | Cloud Run pricing |
| Cloud Run — memory | $0.0000025 / GiB-second (~$0.009/hr) | 360,000 GiB-s | Cloud Run pricing |
| Cloud Run — requests | $0.40 / million | 2,000,000 | Cloud Run pricing |
| BigQuery — query (on-demand) | $6.25 / TiB scanned | 1 TiB | BigQuery pricing |
| BigQuery — active storage | $0.02 / GiB-month (long-term $0.01) | 10 GiB | BigQuery pricing |
| Pub/Sub | $40 / TiB throughput | 10 GiB | Pub/Sub pricing |
| Secret Manager | $0.06 / active version-month; $0.03 / 10k access | 6 versions | Secret Manager pricing (verify) |
| Cloud Logging | $0.50 / GiB ingested | 50 GiB | Logging pricing (verify) |
| Cloud Storage (standard) | ~$0.020 / GB-month | 5 GB | GCS pricing (verify) |
| Network egress | ~$0.12 / GB (internet) | 1 GB | (verify) |
| Platform ad APIs (Google/Meta/TikTok/DV360) | $0 (free, rate-limited) | — | per platform docs |
2. Recurring cost by layer
2.1 LLM inference (the dominant variable cost)
From the per-task catalog. Expected (standard, no cache), per tenant / month:
| Profile | Expected | + caching (~−25–30%) | High band (thinking + tool-loops) |
|---|---|---|---|
| Starter | $8.55 | ~$6.40 | ~$15–18 |
| Standard | $25.60 | ~$19–22 | ~$48–55 |
| Ecommerce | $37.15 | ~$27–31 | ~$72–82 |
| Onboarding (one-time) | $1.60–2.00 | — | — |
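As a minimal sketch of how a task's token counts turn into a dollar figure, the function below applies the §1 list prices. The model keys, the token counts in the usage lines, and the cached share are hypothetical; caching is modeled as a flat discount on the cached part of the input, which simplifies how cached input is actually billed.

```python
# List prices from §1, USD per 1M tokens: (input, output).
# The dictionary keys are illustrative labels, not official model IDs.
PRICES = {
    "gemini-3.1-flash-lite": (0.25, 1.50),
    "gemini-3.5-flash": (1.50, 9.00),
    "gemini-3.1-pro-preview": (2.00, 12.00),
}


def llm_cost(model, input_tokens, output_tokens, cached_share=0.0, cache_discount=0.0):
    """USD for one call; `cached_share` of the input is discounted by `cache_discount`."""
    price_in, price_out = PRICES[model]
    effective_input = input_tokens * (1 - cached_share * cache_discount)
    return (effective_input * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical call: 20k input tokens and 2k output tokens on Gemini 3.5 Flash.
base = llm_cost("gemini-3.5-flash", 20_000, 2_000)
# Same call with 80% of the input cached at an assumed 50% discount.
cached = llm_cost("gemini-3.5-flash", 20_000, 2_000, cached_share=0.8, cache_discount=0.5)
print(f"base=${base:.3f} cached=${cached:.3f}")
```

Multiplying a per-call figure like this by the invocation counts in the task catalog is what produces the per-profile monthly numbers; the catalog remains the source for the real token counts.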
2.2 Vertex Agent Engine runtime (agent compute, separate from tokens)
Agent Engine bills the container wall-time of agent invocations (model tokens billed separately in §2.1). Idle is not billed; first 50 vCPU-h + 100 GiB-h/month are free (absorbs the pilot almost entirely).
Assumptions: ~1 vCPU + 2 GiB per invocation; avg ~15 s wall-time/invocation; invocation counts from the task catalog; ~3 session events/invocation.
| Profile | Invocations/mo | vCPU-h | Compute+mem | Sessions | Agent Engine /tenant |
|---|---|---|---|---|---|
| Starter | ~500 | ~2.1 | ~$0.20 | ~$0.38 | ~$0.6 |
| Standard | ~1,400 | ~5.8 | ~$0.61 | ~$1.05 | ~$1.7 |
| Ecommerce | ~2,300 | ~9.6 | ~$1.00 | ~$1.70 | ~$2.7 |
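The free tier (50 vCPU-h + 100 GiB-h per month, §1) covers the compute and memory of the whole Pilot mix (3 Starter + 2 Standard ≈ 18 vCPU-h and ≈ 36 GiB-h), so at that size only session events are billed; the per-tenant numbers above apply at steady-state scale.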
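The per-tenant figures in the table can be reproduced from the stated assumptions. The sketch below uses the §1 unit prices and the defaults of 1 vCPU, 2 GiB, 15 s wall-time and 3 session events per invocation; it ignores the monthly free tier, so it reflects steady-state scale rather than the pilot.

```python
VCPU_HOUR = 0.0864            # USD per vCPU-hour (§1)
GIB_HOUR = 0.0090             # USD per GiB-hour (§1)
SESSION_EVENT = 0.25 / 1000   # USD per session event (§1)


def agent_engine_cost(invocations, wall_seconds=15, vcpu=1, gib=2, events_per_invocation=3):
    """Monthly Agent Engine cost for one tenant, before any free-tier allowance."""
    hours = invocations * wall_seconds / 3600
    compute_mem = hours * (vcpu * VCPU_HOUR + gib * GIB_HOUR)
    sessions = invocations * events_per_invocation * SESSION_EVENT
    return {
        "vcpu_hours": hours * vcpu,
        "compute_mem": compute_mem,
        "sessions": sessions,
        "total": compute_mem + sessions,
    }


starter = agent_engine_cost(500)
standard = agent_engine_cost(1_400)
ecommerce = agent_engine_cost(2_300)
for name, row in [("Starter", starter), ("Standard", standard), ("Ecommerce", ecommerce)]:
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Session events account for more than half of each total under these assumptions, so the events-per-invocation figure is worth checking against pilot logs.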
2.3 Cloud Run (orchestrator control plane, connectors, BFFs, Cost Guard, jobs)
Two parts: fixed warm services (see §4) and per-tenant request/compute (connector mutations, polling, reporting). Requests are negligible at $0.40/M; compute is short bursts.
| Profile | Per-tenant Cloud Run (variable) |
|---|---|
| Starter | ~$0.20 |
| Standard | ~$0.40 |
| Ecommerce | ~$0.60 |
2.4 Data warehouse (BigQuery: Google Ads export + telemetry)
Per tenant: Google Ads export storage is tiny (tens of MB/mo for an SMB); reporting/optimization joins scan a few GB/mo; QC/Cost-Guard telemetry adds rows. Partitioning by tenant_id + clustering keeps scans small (the docs already specify this).
| Profile | Per-tenant BigQuery (storage + query) |
|---|---|
| Starter | ~$0.15 |
| Standard | ~$0.30 |
| Ecommerce | ~$0.50 |
Risk lever: un-partitioned queries can 10–100× this. Enforce partition/cluster + SELECT only needed columns (already in the playbook).
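To illustrate the risk lever, the sketch below prices monthly on-demand scans at the §1 rate with the 1 TiB free allowance. The 5 GiB per tenant and the 100× blow-up factor are hypothetical inputs chosen to match the "few GB/mo" and "10–100×" ranges above; storage cost is left out.

```python
ON_DEMAND_PER_TIB = 6.25    # USD per TiB scanned (§1)
FREE_TIB_PER_MONTH = 1.0    # monthly free query allowance (§1)


def query_cost(gib_scanned_per_month):
    """Monthly on-demand query cost in USD after the free allowance."""
    tib = gib_scanned_per_month / 1024
    billable_tib = max(0.0, tib - FREE_TIB_PER_MONTH)
    return billable_tib * ON_DEMAND_PER_TIB


tenants = 50
pruned = query_cost(tenants * 5)          # partitioned and clustered scans
unpruned = query_cost(tenants * 5 * 100)  # same workload with full-table scans
print(f"pruned=${pruned:.2f} unpruned=${unpruned:.2f}")
```

Under these inputs the pruned workload stays inside the free allowance, while the un-pruned one moves BigQuery from a rounding error to a visible line item.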
2.5 Other GCP (Pub/Sub, Secret Manager, GCS, Logging, egress) — per tenant
| Item | Per-tenant /mo | Note |
|---|---|---|
| Secret Manager | ~$0.25 | ~4 platform-token versions/tenant |
| Pub/Sub | <$0.05 | small events; $40/TiB |
| GCS (artifacts/feeds) | <$0.10 | $0.02/GB |
| Logging | ~$0.05–0.20 | grows at scale; 50 GiB free |
| Egress (fan-out) | <$0.10 | small payloads |
| Subtotal "other" | ~$0.5–0.7 | |
2.6 First-party relay (Phase 2+, optional SKU)
From relay cost: ~$1–3/tenant (low traffic) to $6–12 (high traffic); ~$80–150/mo total at 50 tenants (hybrid Cloudflare + Cloud Run). Deferred / not in base run-cost below unless the relay SKU is sold.
2.7 Platform API costs
$0. Google Ads, Meta Marketing, TikTok Marketing, DV360, GA4 Data/Admin APIs are free to call — the constraint is rate limits and approval tiers, not metered fees (see platform access).
3. Per-tenant total cost of ownership (blended, excl. relay & media)
Sum of §2.1 (expected) + §2.2 + §2.3 + §2.4 + §2.5:
| Profile | LLM | Agent Eng. | Cloud Run | BQ | Other | Per-tenant /mo (expected) | With caching | High band |
|---|---|---|---|---|---|---|---|---|
| Starter | $8.55 | $0.6 | $0.2 | $0.15 | $0.6 | ~$10.1 | ~$8 | ~$20 |
| Standard | $25.60 | $1.7 | $0.4 | $0.30 | $0.6 | ~$28.6 | ~$22–25 | ~$58 |
| Ecommerce | $37.15 | $2.7 | $0.6 | $0.50 | $0.7 | ~$41.7 | ~$31–35 | ~$87 |
Non-LLM infra is ~10–15% of per-tenant cost; LLM dominates. Optimization frequency and tool-loop depth are the real levers (§5). Add +$1–12/tenant if the relay SKU is enabled.
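The table is a straight sum of the layer figures, which makes it easy to re-check when any layer changes. A minimal sketch, using the expected-case values from §2.1 to §2.5 (the dictionary layout is illustrative only):

```python
# USD per tenant per month, expected case.
LAYERS = {
    "Starter":   {"llm": 8.55,  "agent_engine": 0.6, "cloud_run": 0.2, "bigquery": 0.15, "other": 0.6},
    "Standard":  {"llm": 25.60, "agent_engine": 1.7, "cloud_run": 0.4, "bigquery": 0.30, "other": 0.6},
    "Ecommerce": {"llm": 37.15, "agent_engine": 2.7, "cloud_run": 0.6, "bigquery": 0.50, "other": 0.7},
}


def per_tenant_total(profile):
    """Sum of all cost layers for one profile."""
    return sum(LAYERS[profile].values())


def non_llm_share(profile):
    """Fraction of the per-tenant total that is not LLM inference."""
    total = per_tenant_total(profile)
    return (total - LAYERS[profile]["llm"]) / total


for profile in LAYERS:
    print(f"{profile}: ${per_tenant_total(profile):.2f}/mo, non-LLM {non_llm_share(profile):.0%}")
```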
4. Fixed baseline infra (independent of tenant count)
Costs incurred even at zero/low tenants — mostly warm prod services + non-prod environments.
| Item | Monthly band |
|---|---|
| Cloud Run warm services in prod (dashboard BFF warm; rest scale-to-zero — see §4.1) | $30–70 |
| dev + staging environments (min-instances 0; light use) | $30–60 |
| Tenant registry / run-cost ledger (Cloud SQL small or Firestore) | $30–80 |
| Logging + Monitoring baseline | $20–50 |
| BigQuery baseline storage + scheduled jobs | $5–20 |
| Secret Manager base, Scheduler, misc | $10–20 |
| Fixed baseline total | ~$130–300 / mo (plan ~$220) |
Vertex Agent Engine + Cloud Run + BigQuery free tiers absorb most pilot-scale usage, so at 1–5 tenants the bill is dominated by this fixed baseline, not per-tenant cost.
4.1 Cutting the warm-service baseline
Warm (min-instance) services are the largest fixed cost, so they're worth optimizing. Conclusion: cut warm cost aggressively, but stay on Cloud Run — do not move the control plane to VMs.
Why not VMs (the serverless control plane wins here):
- The control plane is mostly async/event-driven — orchestrator, connectors, Cost Guard, and jobs are Pub/Sub-triggered, so they don't need to be warm at all; they should scale to zero. Only the human-dashboard BFF is latency-sensitive enough to keep warm.
- "Scale resources by client amount" is already what Cloud Run does — per-request, concurrency-based autoscaling with per-tenant isolation. A hand-resized VM is worse at this: noisy-neighbor across tenants, manual vertical scaling (downtime), and no scale-to-zero.
- A VM saves only ~$50–150/mo but adds OS patching, container orchestration, health checks, capacity planning — ops/eng time that costs far more than the saving and breaks the serverless-first principle (system overview). Startup credits (§9.1) zero this line in Year 1 anyway.
The real levers (all within Cloud Run) — together ~−60–70% on the warm line:
| Lever | How | Effect |
|---|---|---|
| Scale-to-zero for async services | min-instances=0 on orchestrator workers, connectors, Cost Guard, jobs (cold start fine for Pub/Sub work) | removes most warm floors |
| Warm only the dashboard BFF | min-instances=1 only on the synchronous user-facing surface | one floor, not four |
| Schedule warmth to business hours | Cloud Scheduler sets BFF min-instances 1 in work hours, 0 nights/weekends | |
| Request-based billing (CPU throttled idle) | warm idle instances bill mostly memory ($0.009/GiB-h), not full vCPU ($0.0864/vCPU-h) | ~−60–80% on idle cost |
| Consolidate services | each warm service has its own min-instance floor — fewer services = less baseline | linear on floor count |
| Cloud Run CUD on the residual warm floor | instance-based services 28%/46% (1/3-yr) — §9.2 | −28–46% on what's left, post-credits |
Where a VM does pay: long-running batch (e.g. heavy continuous optimization sweeps) on Spot VMs (~−60–91%). But Cloud Run Jobs already scale to zero, so a Spot VM only wins if the batch runs continuously for hours — not the case at pilot/growth scale. Revisit only if a sustained batch workload emerges.
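For a rough sense of what one warm floor costs and what business-hours scheduling could remove, the sketch below prices a single min-instance at the §1 Cloud Run rates. The instance size (1 vCPU, 0.5 GiB) and the 50 warm hours per week are hypothetical, and the instance is billed at the full rate while warm, so the request-based idle discount from the table is not modeled.

```python
VCPU_HOUR = 0.0864      # USD per vCPU-hour (§1)
GIB_HOUR = 0.0090       # USD per GiB-hour (§1)
HOURS_PER_MONTH = 730


def warm_floor_cost(vcpu=1, gib=0.5, warm_hours=HOURS_PER_MONTH):
    """Monthly USD for one min-instance kept warm for `warm_hours`."""
    return warm_hours * (vcpu * VCPU_HOUR + gib * GIB_HOUR)


always_on = warm_floor_cost()
# Assumed schedule: warm 50 hours per week, scaled to zero otherwise.
business_hours = warm_floor_cost(warm_hours=50 * 52 / 12)
saving = 1 - business_hours / always_on
print(f"always-on=${always_on:.2f} scheduled=${business_hours:.2f} saving={saving:.0%}")
```

The saving depends only on the fraction of hours kept warm, so the same function can be rerun with the real working-hours window once it is agreed.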
5. Portfolio totals (all-in monthly, expected, excl. relay & media)
Variable = Σ per-tenant (§3); plus fixed baseline (§4). Mix mirrors the existing docs.
| Portfolio | Mix | LLM | Non-LLM infra (variable) | Fixed | Total /mo (no cache) | With caching |
|---|---|---|---|---|---|---|
| Pilot — 5 | 3 Starter + 2 Standard | ~$77 | ~$11 | ~$150 | ~$240 | ~$220 |
| Growth — 50 | 25 + 20 + 5 Ecom | ~$912 | ~$120 | ~$250 | ~$1,280 | ~$1,080 |
| Scale — 200 | 100 + 80 + 20 Ecom | ~$3,650 | ~$480 | ~$350 | ~$4,480 | ~$3,675 |
Fixed column reflects the trimmed warm baseline from §4.1 (dashboard BFF warm, rest scale-to-zero); it grows modestly with logging/registry as tenants scale.
Blended all-in per tenant (incl. amortized fixed): Pilot ~$48 (fixed-heavy), Growth ~$26, Scale ~$22 (no cache) / ~$18 (cached).
Add relay if sold: +~$80–150/mo at Growth, +$350–650/mo at Scale.
Reference point: at Scale, total infra+LLM ≈ $4–5K/mo for 200 clients, a rounding error against the media budgets being managed (see §7).
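The portfolio rows follow from the per-tenant figures in §3 plus the fixed column. A minimal sketch of that roll-up (no caching, relay excluded; the data layout is illustrative):

```python
# (LLM, non-LLM variable infra) in USD per tenant per month, from §3.
PER_TENANT = {
    "Starter": (8.55, 1.55),
    "Standard": (25.60, 3.00),
    "Ecommerce": (37.15, 4.50),
}

# Tenant mix and fixed baseline (USD/mo) per portfolio, from the table above.
PORTFOLIOS = {
    "Pilot": ({"Starter": 3, "Standard": 2}, 150),
    "Growth": ({"Starter": 25, "Standard": 20, "Ecommerce": 5}, 250),
    "Scale": ({"Starter": 100, "Standard": 80, "Ecommerce": 20}, 350),
}


def portfolio_total(name):
    """Return (llm, variable_infra, total, blended_per_tenant) in USD per month."""
    mix, fixed = PORTFOLIOS[name]
    llm = sum(count * PER_TENANT[profile][0] for profile, count in mix.items())
    infra = sum(count * PER_TENANT[profile][1] for profile, count in mix.items())
    total = llm + infra + fixed
    return llm, infra, total, total / sum(mix.values())


for name in PORTFOLIOS:
    llm, infra, total, blended = portfolio_total(name)
    print(f"{name}: LLM ${llm:.0f}, infra ${infra:.0f}, total ${total:.0f}, blended ${blended:.1f}/tenant")
```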
6. One-time build cost (engineer-weeks)
Derived from the roadmap workstreams (summed effort, not calendar). Expressed in engineer-weeks so the parent/VC can apply their own loaded rate.
| Block | Engineer-weeks |
|---|---|
| Phase 1 — foundations + Google + GA4 + onboarding + planning/exec + reporting | ~15 |
| Cross-cutting — orchestrator + agents + HITL + optimization agent | ~17 |
| Phase 2 — Meta + TikTok + onboarding extend + feed + optimization v1 | ~11 |
| Phase 4 — CRM loop + CAPI + match monitoring + closed-loop reporting | ~8 |
| Core subtotal | ~51 |
| Phase 3 — DV360 (only if contract) | +6 |
| First-party relay (optional SKU) | +10–12 |
| With DV360 + relay | ~67–69 |
Calendar: ~18–26 weeks with 2–4 engineers (matches roadmap). Illustrative cost = engineer-weeks × your loaded weekly rate (rate is parent/VC-owned). The MVP cut (gameplan §4) defers ~15–20 of these weeks (Cost-Guard/QC-telemetry/model-promotion/relay/DV360) to protect the pilot date.
Build-phase cloud cost is negligible (free tiers + dev min-instances 0): plan ~$100–300/mo during build for dev/staging + test inference.
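The engineer-week totals can be recombined for any scope before a loaded rate is applied. The sketch below uses the block estimates from the table; the loaded weekly rate is deliberately left as a parameter because it is parent/VC-owned.

```python
# Engineer-week estimates from the build table.
CORE_BLOCKS = {"phase_1": 15, "cross_cutting": 17, "phase_2": 11, "phase_4": 8}
DV360_WEEKS = 6
RELAY_WEEKS = (10, 12)  # (low, high)


def build_weeks(include_dv360=False, include_relay=False):
    """Return (low, high) engineer-weeks for the chosen scope."""
    low = high = sum(CORE_BLOCKS.values())
    if include_dv360:
        low += DV360_WEEKS
        high += DV360_WEEKS
    if include_relay:
        low += RELAY_WEEKS[0]
        high += RELAY_WEEKS[1]
    return low, high


def build_cost(weeks, loaded_weekly_rate):
    """Illustrative cost: engineer-weeks times a caller-supplied loaded weekly rate."""
    return weeks * loaded_weekly_rate


core = build_weeks()
full = build_weeks(include_dv360=True, include_relay=True)
print("core:", core, "with DV360 + relay:", full)
```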
7. Media float / working capital (not opex)
The single largest cash item is not in the tables above because it is not Kobi's cost — it is client media budget Kobi fronts and recovers under the agency-billing model:
- Kobi pays platforms (often on extended credit) and invoices clients monthly → Kobi carries ~1 month of every client's media spend as a receivable.
- At Growth (50 clients × e.g. ₺50–150K/mo media), the float can be ₺millions — orders of magnitude larger than the ~$1.2K/mo infra cost.
- This is a balance-sheet/financing question, not an operating cost, but it must be planned. Engineering lock (ADR 0004): every tenant has a per-tenant spend limit (credit_sub_limit + platform caps). If clients prepay, the limit tracks prepaid balance from the parent billing module's internal API; if payment works differently, finance / parent billing must define the limit source before prod.
- Flagged in gameplan B3.
- Tax on media spend (VAT/KDV + foreign-platform digital-services/withholding tax) further grosses up the invoiced amount and the float; tax treatment is parent/VC finance-owned — see vision & scope — billing. Not in the infra cost tables.
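The size of the float follows directly from the billing cycle. A minimal sketch, assuming every client carries about one month of media spend as a receivable and using the illustrative ₺50–150K range above (tax gross-up is not modeled):

```python
def media_float(clients, monthly_media_per_client, months_outstanding=1.0):
    """Receivable carried by Kobi, in the same currency as the media budget."""
    return clients * monthly_media_per_client * months_outstanding


# Growth portfolio, 50 clients, at the low and high ends of the example range (TRY).
low = media_float(50, 50_000)
high = media_float(50, 150_000)
print(f"float: ₺{low:,.0f} to ₺{high:,.0f}")
```

Longer payment terms scale the figure linearly through `months_outstanding`, which is why the per-tenant spend limit matters more to cash planning than any infra line.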
8. Sensitivity & levers
| Lever | Effect | Where |
|---|---|---|
| Context caching on system prompts/playbooks | −25–50% LLM input cost | biggest single saving |
| Optimization frequency (cycles/mo per track) | ~linear on the largest LLM line | tune per plan tier |
| Model tier discipline (Flash vs Pro/Opus) | 3–50× per call | router + promotion policy |
| Batch API for scheduled reports/feed sweeps | −50% on ~2–4% of total | small but free |
| BigQuery partition/cluster | 10–100× on query scans | enforce in code |
| Cloud Run min-instances | each warm vCPU ≈ $63/mo | keep 0 where latency allows |
| Thinking level at high | output tokens ×1.5–2.5 | reserve for gated QC |
| Tool-loop depth | input tokens ×2–4 over rounds | cap loops (already 5/8) |
| Startup credits (Google for Startups) | ~100% of GCP+Gemini, Year 1 | biggest lever — see §9.1 |
| Cloud Run CUD (commit baseline) | −17% to −46% on Cloud Run only | steady-state, post-credits (§9.2) |
Three-point band (per tenant, Standard, incl. infra): Low ~$22 · Expected ~$28 · High ~$60. Design budgets to the High band, then replace assumptions with real run_id token logs from the pilot (Cost Guard ledger).
Optimize cost per accepted output, not $/1M tokens. When tuning model tiers, the objective is the lowest cost per QC-passed, human-accepted task — a cheaper model that fails QC triggers correction loops/retries/escalations that can cost more end-to-end. Start cheap and escalate on QC failure, demote on sustained success (model promotion/demotion); never demote safety/compliance-critical tasks on price (Special Ad Category, policy QC, health/education stay pinned to the higher tier).
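One way to make "cost per accepted output" concrete is an expected-value calculation for a start-cheap, escalate-once policy. The sketch below is an assumption-laden illustration, not the router's actual logic: the per-call costs and pass rates are hypothetical, and a failed cheap attempt is assumed to trigger exactly one escalated attempt.

```python
def cost_per_accepted(cheap_cost, cheap_pass_rate, escalated_cost, escalated_pass_rate=1.0):
    """Expected USD per QC-passed task when QC failures escalate once to a stronger model."""
    expected_spend = cheap_cost + (1 - cheap_pass_rate) * escalated_cost
    accept_rate = cheap_pass_rate + (1 - cheap_pass_rate) * escalated_pass_rate
    return expected_spend / accept_rate


# Hypothetical figures: cheap call $0.01, escalated call $0.10.
good_fit = cost_per_accepted(0.01, cheap_pass_rate=0.90, escalated_cost=0.10)
poor_fit = cost_per_accepted(0.01, cheap_pass_rate=0.05, escalated_cost=0.10)
print(f"good fit=${good_fit:.3f} poor fit=${poor_fit:.3f} strong-only=$0.100")
```

With a 90% pass rate the cheap-first policy costs a fifth of going straight to the stronger model; at a 5% pass rate it costs more, which is the case the promotion policy is meant to catch.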
9. Commitment & credit savings (GCP)
Three distinct mechanisms — only two are worth taking at this scale. Verify all rates/eligibility before committing.
9.1 Optional — Google for Startups Cloud Program (credits)
Not assumed in the run-cost tables (§3, §5). Leadership decides whether to apply; confirm with VC/board before pursuing.
Because this module is part of a VC/equity-backed project, it may qualify for Google's startup credits — which would dwarf run cost if granted:
| Tier | Coverage | Notes |
|---|---|---|
| AI-first (Scale/AI tier) | Up to $350K over 2 yrs — Yr1: 100% of eligible usage up to $250K; Yr2: 20% up to $100K | Best fit (this is an AI-agent product) |
| Equity-backed (standard) | Up to $200K — Yr1 up to $100K; Yr2 20% up to $100K | Fallback if AI tier not granted |
| Pre-funding | Up to $2K / yr | Not relevant once VC-backed |
- Covers: Gemini models + GCP services — Vertex AI, Agent Engine, BigQuery, Cloud Run, Pub/Sub, etc. (i.e. essentially the entire model in §3/§5).
- Excludes: third-party models (Claude Sonnet/Opus escalation, Llama) — billed directly — and Marketplace. Escalation is ~2% of LLM spend, so coverage is ~98%+ of the bill.
- Effect: at the §5 list totals, run cost is ~$3K/yr (pilot) → ~$15K/yr (growth) → ~$54K/yr (scale), all far under the $250K Year-1 cap. So Year 1 GCP+Gemini cost is effectively ~$0; only the ~2% third-party escalation (tens of $/mo) is billed. Year 2 covers 20%.
- Caveat: credits are runway, not a permanent discount; they expire at end of term. Use them to absorb the build + early-scale window, and budget steady-state to the §9.2/§8 numbers for after credits lapse.
- Action: apply via the Google for Startups portal (no traditional grant application); confirm AI-tier eligibility with the account team while opening the Meta/DV360 relationships (gameplan PF track).
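The credit effect can be sketched as a coverage rate with a cap, applied to the eligible part of the bill. The figures below are the AI-first tier terms from the table; treating the ~2% third-party escalation as a share of the whole bill is a simplifying assumption that slightly overstates the uncovered amount.

```python
# AI-first tier terms from the table: year -> (coverage rate, cap in USD).
AI_TIER = {1: (1.00, 250_000), 2: (0.20, 100_000)}


def billed_after_credits(annual_list_cost, year, third_party_share=0.02):
    """Annual USD still billed after credits; third-party model spend is not eligible."""
    eligible = annual_list_cost * (1 - third_party_share)
    rate, cap = AI_TIER[year]
    credit = min(eligible * rate, cap)
    return annual_list_cost - credit


scale_annual = 4_480 * 12  # Scale portfolio list price from §5
year_1 = billed_after_credits(scale_annual, year=1)
year_2 = billed_after_credits(scale_annual, year=2)
print(f"year 1: ${year_1:,.0f}  year 2: ${year_2:,.0f}")
```

The Year 1 result works out to roughly $90/mo at Scale, and the Year 2 result shows why steady-state budgets should not lean on credits.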
9.2 Committed Use Discounts (CUDs) — steady-state, after credits
Spend-based Compute Flexible CUD: commit a minimum hourly spend; overage bills on-demand.
| Eligible spend | 1-year | 3-year | Relevance here |
|---|---|---|---|
| Cloud Run — request-based services / functions | 17% | 17% | Connectors, BFFs, jobs on request billing |
| Cloud Run — instance-based services, jobs, worker pools | 28% | 46% | Best on the always-on warm baseline (§4) — run those instance-based |
| Compute Engine / GKE (N/C/E families) | 28% | 46% | n/a — serverless-first |
- Net effect: applies only to the Cloud Run slice ($150–340/mo at scale), so absolute saving is **$40–90/mo at Scale**: real but small, since compute isn't the cost driver.
- Low risk if you size the commitment to the always-on baseline only (not per-tenant burst).
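The sizing advice can be checked with a small model of a spend-based commitment: the committed amount is paid at the discounted rate whether or not it is used, and anything above it bills on-demand. The sketch works in monthly dollars for readability (real commitments are hourly), and the usage figures are hypothetical.

```python
# Discount rates from the table: billing mode -> {term in years: rate}.
CUD_RATES = {
    "request_based": {1: 0.17, 3: 0.17},
    "instance_based": {1: 0.28, 3: 0.46},
}


def cost_with_cud(usage, commitment, billing_mode, term_years):
    """Monthly USD paid given on-demand-equivalent `usage` and a committed amount."""
    rate = CUD_RATES[billing_mode][term_years]
    committed_fee = commitment * (1 - rate)   # paid even if usage falls short
    overage = max(0.0, usage - commitment)    # billed at on-demand rates
    return committed_fee + overage


# Commitment sized to a $150/mo always-on baseline, 1-year instance-based rate.
sized_right = cost_with_cud(usage=200, commitment=150, billing_mode="instance_based", term_years=1)
over_committed = cost_with_cud(usage=100, commitment=150, billing_mode="instance_based", term_years=1)
print(f"usage $200 -> ${sized_right:.2f}; usage $100 -> ${over_committed:.2f}")
```

When usage drops below the commitment the bill ends up above plain on-demand, which is why the commitment should track only the always-on floor.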
9.3 Where commitment does NOT pay at this scale (avoid the trap)
| Mechanism | Headline discount | Why to skip (for now) |
|---|---|---|
| BigQuery Editions slot commitment | 20% (1yr) / ~37–40% (3yr) | Requires switching from on-demand to slot/capacity (≥50-slot min). Our query scans are tiny and largely inside the 1-TiB free tier — on-demand ($6.25/TiB) is cheaper. Committing here would raise cost. Revisit only if telemetry query volume grows large. |
| Vertex AI Provisioned Throughput (PT) | Reserved GSU capacity | Pays off for high, steady low-latency inference. Agent load here is spiky/low → reserved capacity sits idle; on-demand + caching + batch is cheaper. Revisit at high Scale with steady volume. |
| Enterprise Discount Program (EDP) | Negotiated custom | Absolute spend (~$4–5K/mo) is too small to negotiate a meaningful committed-spend contract. |
9.4 Stacked effect (illustrative)
| Portfolio | List price /mo | Year 1 (startup credits) | Steady-state (caching + Cloud Run CUD) |
|---|---|---|---|
| Pilot (5) | ~$240 | ~$0–20 (only 3rd-party escalation) | ~$210 |
| Growth (50) | ~$1,280 | ~$30–60 | ~$1,030 |
| Scale (200) | ~$4,480 | ~$60–120 | ~$3,540 |
Order of magnitude of the levers: startup credits ≫ context caching > batch ≈ Cloud Run CUD ≫ BigQuery/PT commitments (negative here). Keep the design Gemini-first (already the policy) to maximize credit coverage, since third-party models are the only uncovered slice.
10. Bottom line
- Per active client: ~$8–42/mo (profile-dependent, expected) all-in infra+LLM; ~$18–26 blended at Growth/Scale including amortized fixed cost (§5).
- Portfolio: ~$240/mo (pilot) → ~$1.1–1.3K/mo (50) → ~$3.7–4.5K/mo (200), excluding optional relay.
- Fixed baseline ~$220/mo (trimmed warm services, §4.1) dominates until ~12–18 tenants.
- Build: assumed go-live deadline (core, solo sprint) is ~13 weeks calendar, with Google + Meta + TikTok (G+M+T) at ~W9. Full-module accounting: ~51 engineer-weeks core (~67–69 with DV360 + relay); this figure is for capacity planning only, not the MVP calendar.
- The dominant cash item is media float (working capital), not infra — plan it separately.
- All platform ad APIs are free; cost risk is concentrated in LLM optimization frequency — directly governed by Cost Guard + model routing already in the design.
- GCP commitment/credits (§9): the Google for Startups Cloud Program (up to $350K / 2 yrs) effectively zeros the GCP+Gemini bill in Year 1 (only ~2% third-party escalation billed). Cloud Run CUDs then save ~17–46% on the small compute slice. Do not commit BigQuery slots or Vertex Provisioned Throughput at this scale; on-demand is cheaper.