Architecture · Draft

System Ops Dashboard

Created 11 Jun 2026 · Updated 11 Jun 2026

Latest change: Publish Dossier site and full doc pack to GitHub

Draft document: the deep-dive spec is incomplete, and content will be updated before and during build. Do not treat as signed-off implementation detail.

Purpose

The System Ops Dashboard is the governed developer / system-user console for Kobi Digital Ads infrastructure and agentic runtime. It holds logs, system status, and statistics — material that must not clutter the operator-facing Human Touch Dashboard.

Access (required)

Control | Requirement | Notes
Identity-Aware Proxy (IAP) | Required on all System Ops routes | Google identity or workforce SSO; no anonymous access
VPN | Recommended for production | Private connectivity to GCP (Cloud VPN / BeyondCorp); defense in depth with IAP
Service account keys | Forbidden for UI access | Humans use IAP identity only
Audit | All page views and exports logged | actor_id, resource, timestamp → append-only audit
Access path (diagram): System user / engineer → optional VPN → IAP (required) → System Ops BFF (Cloud Run) → Cloud Logging, BigQuery telemetry, GCS run artifacts, GCP health APIs

Not on this surface: client portal, plan approvals, or routine operator workflows — those stay on Human Touch with business RBAC only.
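The Audit control above names three fields (actor_id, resource, timestamp) and an append-only sink. A minimal sketch of what one audit record could look like follows. The `action` field, the JSON-lines encoding, and the local file sink are assumptions made for illustration; the spec does not define the storage format.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_entry(actor_id: str, resource: str, action: str,
                now: Optional[datetime] = None) -> str:
    """Serialize one audit event as a JSON line (hypothetical schema)."""
    if not actor_id:
        # Every request is expected to carry an IAP identity; no anonymous access.
        raise ValueError("anonymous access is not allowed")
    ts = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "actor_id": actor_id,
        "resource": resource,
        "action": action,  # assumed values: "view" or "export"
        "timestamp": ts.isoformat(),
    }
    return json.dumps(record, sort_keys=True)


def append_audit(path: str, line: str) -> None:
    """Append one line to a local file, standing in for the real audit sink."""
    # Mode "a" never rewrites earlier entries; a production sink would have to
    # enforce append-only behaviour itself.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

Rejecting an empty actor_id at record-build time keeps the "no anonymous access" rule visible in the audit path as well as at the proxy.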

Design goals

  1. Well governed — least privilege; separate IAM from operator roles; export controls on PII-bearing artifacts.
  2. Complete observability — every deterministic telemetry stream (QC, Cost Guard, loops, events) queryable here.
  3. Actionable statistics — rollups drive alerts and auto-optimization; dashboard is the inspection layer.
  4. No approval authority — System Ops can diagnose and recommend; A1–A9 decisions remain on Human Touch (unless user holds both roles).

Dashboard views

1. System health

  • Cloud Run services: orchestrator, connectors, HITL BFF, Cost Guard, QC threshold job — up/down, error rate, p95 latency
  • Vertex Agent Engine session health, Model Garden quota headroom
  • Pub/Sub backlog depth, dead-letter counts
  • Secret Manager rotation due dates
  • Per-environment banner (dev / staging / prod)

2. Logs and traces

  • Cloud Logging — structured filters: run_id, tenant_id, agent, severity, trace_id
  • Agent traces — Vertex Agent Engine / OpenTelemetry spans per run_id
  • Event bus — recent agent.qc.*, agent.loop.exhausted, cost.guard.tripped, agent.qc.threshold.breached
  • Export — JSON/CSV with audit trail; no bulk PII export without Admin + legal flag
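The structured filters listed above map naturally onto a Cloud Logging query string. The sketch below assembles one, assuming the services write run_id, tenant_id, agent, and trace_id into `jsonPayload`; that field placement is an assumption, not something this spec states. `build_log_filter` is a hypothetical helper name.

```python
from typing import Optional

ALLOWED_FIELDS = {"run_id", "tenant_id", "agent", "trace_id"}
SEVERITIES = ("DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
              "ERROR", "CRITICAL", "ALERT", "EMERGENCY")


def build_log_filter(min_severity: Optional[str] = None, **fields: str) -> str:
    """Build a Cloud Logging filter from the dashboard's structured fields."""
    unknown = set(fields) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unsupported filter fields: {sorted(unknown)}")
    clauses = []
    for name in sorted(fields):
        # Escape backslashes and quotes so user input cannot break out of the string.
        escaped = fields[name].replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'jsonPayload.{name}="{escaped}"')
    if min_severity is not None:
        if min_severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {min_severity}")
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)
```

For example, `build_log_filter("ERROR", run_id="r-1", tenant_id="t-9")` yields a filter restricted to one run and tenant at ERROR severity or above.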

3. Cost Guard monitor

  • Live run_id ledger: estimated / actual / ratio (deterministic, not AI)
  • Tripped runs (A9) with per-step token breakdown (BigQuery join)
  • Tenant daily LLM spend vs optional cap
  • Pricing table version in use, reconciled against the Vertex billing export

4. QC health and statistics

  • Fail-rate leaderboard: main_agent × qc_task_id, 24h / 7d
  • 80% floor status — grains below success floor (agent.qc.threshold.breached)
  • Model breakdown — fail rate by model_main + thinking_level_main
  • Failure code heatmap — by platform, vertical, tenant
  • Playbook regression — fail spikes vs playbook_versions bumps
  • Loop depth — % runs needing 1 vs 2 corrections before pass or A8
  • Drill-down: run_id → ordered QC steps → input_slice_uri, qc_feedback_uri in GCS
  • Optimization queue — engineering tickets from threshold breaches

See QC loop telemetry and QC success threshold alerts.
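The 80% floor check above can be sketched as a pass-rate computation per grain. The example assumes a grain is the pair (main_agent, qc_task_id), matching the leaderboard, and that input rows carry a boolean pass flag; `grains_below_floor` is a hypothetical name, and the production rollup presumably runs in BigQuery rather than in Python.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Grain = Tuple[str, str]  # (main_agent, qc_task_id)


def grains_below_floor(results: Iterable[Tuple[str, str, bool]],
                       floor: float = 0.80) -> Dict[Grain, float]:
    """Return the pass rate of every grain that sits below the success floor."""
    totals = defaultdict(lambda: [0, 0])  # grain -> [passes, runs]
    for main_agent, qc_task_id, passed in results:
        cell = totals[(main_agent, qc_task_id)]
        cell[0] += int(passed)
        cell[1] += 1
    breached = {}
    for grain, (passes, runs) in totals.items():
        rate = passes / runs
        if rate < floor:  # exactly 80% still counts as meeting the floor
            breached[grain] = rate
    return breached
```

Each key in the returned mapping corresponds to a grain that would raise agent.qc.threshold.breached under these assumptions.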

5. Agent run explorer

  • Search by run_id, task_id, tenant_id, time range
  • Step timeline: main → QC → correction loops → tool rounds
  • Token and cost per step (join Cost Guard ledger)
  • Link to Human Touch ticket if A8/A9 open

6. Statistics and rollups

  • BigQuery materialized views: agent_qc_results, agent_qc_loops, agent_qc_failures_rollup
  • Model promotion / demotion history
  • QC first-pass rate trends by task class
  • A8 / A9 rates correlated with QC patterns
  • Vertex LLM spend by tenant, model, task (invoice-grade allocation)

7. Playbook and routing registry

  • Versioned playbook.routing, playbook.qc, playbook.platform.*, playbook.vertical.*
  • Diff between versions; which tenants pinned to which version
  • Context cache hit rates per playbook revision
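A version diff like the one listed above can be reduced to three sets of keys: added, removed, and changed. The sketch assumes each playbook version can be flattened into a dictionary of dotted keys to values; the real playbook schema is not defined in this document, and `diff_playbook` is a hypothetical name.

```python
from typing import Any, Dict


def diff_playbook(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Compare two flattened playbook versions key by key."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])  # (previous value, new value)
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}
```

Keeping the old and new values side by side in `changed` makes it easier to line a fail-rate spike up against a specific playbook bump.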

8. Infrastructure and alerts

  • GCP budget alerts, LLM budget alerts (billing export)
  • PagerDuty / Slack alert delivery status
  • Repeat-offender tenants (≥3 A8 in 24h on same task class)
  • Open agent.qc.threshold.breached incidents
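The repeat-offender rule above (at least 3 A8 escalations in 24 hours on the same task class) is a windowed count. A minimal sketch follows, assuming each A8 event is available as a (tenant_id, task_class, opened_at) tuple; that event shape and the `repeat_offenders` name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def repeat_offenders(events: Iterable[Tuple[str, str, datetime]],
                     now: datetime,
                     window: timedelta = timedelta(hours=24),
                     threshold: int = 3) -> List[Tuple[str, str]]:
    """Return (tenant_id, task_class) pairs with >= threshold A8s in the window."""
    cutoff = now - window
    counts = defaultdict(int)
    for tenant_id, task_class, opened_at in events:
        # Count only events inside the trailing window ending at `now`.
        if cutoff < opened_at <= now:
            counts[(tenant_id, task_class)] += 1
    return sorted(key for key, n in counts.items() if n >= threshold)
```

Counting per (tenant, task class) pair rather than per tenant keeps unrelated failures from adding up to a false repeat-offender flag.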

Roles (RBAC — System Ops)

Role | Permissions
system_viewer | Read health, logs (redacted), statistics
system_developer | Full log/trace drill-down, GCS artifact read, playbook read
sre | system_developer + infra actions (restart, scale, pause tenant automation)
system_admin | sre + IAM binding changes, export approvals, pricing table publish

IAM groups are disjoint from the Human Touch Operator / Planner groups by default. Overlap is granted explicitly for senior staff.
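The role table is cumulative for sre and system_admin, so effective permissions can be resolved by walking a parent chain. The sketch below uses hypothetical permission strings. It does not assume that system_developer includes system_viewer, because the table does not say so.

```python
from typing import Dict, Optional, Set

# Direct grants per role; the permission strings are illustrative only.
ROLE_GRANTS: Dict[str, Set[str]] = {
    "system_viewer": {"health:read", "logs:read_redacted", "stats:read"},
    "system_developer": {"logs:read_full", "traces:read", "artifacts:read", "playbook:read"},
    "sre": {"infra:restart", "infra:scale", "tenant:pause_automation"},
    "system_admin": {"iam:bind", "export:approve", "pricing:publish"},
}

# "sre = system_developer + ..." and "system_admin = sre + ..." from the table.
ROLE_PARENT: Dict[str, str] = {"sre": "system_developer", "system_admin": "sre"}


def effective_permissions(role: str) -> Set[str]:
    """Union a role's own grants with everything inherited from its parents."""
    perms: Set[str] = set()
    current: Optional[str] = role
    while current is not None:
        perms |= ROLE_GRANTS[current]
        current = ROLE_PARENT.get(current)
    return perms
```

None of these permissions touch A1–A9 approvals, which stay with the separate Human Touch roles.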

Data handling

Data Human Touch System Ops
Approval diff preview ✓ (read-only)
A8/A9 ticket summary
Full QC JSON / GCS artifacts Link only
Cloud Logging raw stream
BigQuery statistical rollups
Infrastructure metrics

PII: System Ops stores hashes and URIs in BigQuery; full structured inputs in GCS with bucket IAM limited to system roles. Operator ticket views show human summaries only.
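The split described above (hashes and URIs in BigQuery, full inputs in GCS) can be sketched as follows. The keyed SHA-256 hash, the object path layout, and the `rollup_row` name are assumptions for illustration; the spec says only that hashes and URIs are stored.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def rollup_row(run_id: str, structured_input: Dict[str, Any],
               bucket: str, secret: bytes) -> Dict[str, str]:
    """Build the PII-free row destined for BigQuery (hypothetical columns)."""
    # Canonical JSON so the same input always hashes to the same value.
    payload = json.dumps(structured_input, sort_keys=True,
                         separators=(",", ":")).encode("utf-8")
    # Keyed hash (an assumption): harder to reverse by guessing common inputs.
    digest = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return {
        "run_id": run_id,
        "input_hash": digest,
        # Assumed layout; the full input lives only behind bucket IAM.
        "input_slice_uri": f"gs://{bucket}/runs/{run_id}/input.json",
    }
```

Only the hash and the URI leave the function, so anyone querying the rollup still needs bucket-level access to read the underlying input.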

Deployment

Component | GCP
System Ops UI | Cloud Run (or internal static + BFF)
BFF API | Cloud Run (server-side IAP JWT validation)
IAP | HTTPS load balancer backend service
VPN | Cloud VPN or BeyondCorp connector to VPC
Data | BigQuery, Cloud Logging, GCS run artifacts

Human Touch BFF and System Ops BFF are separate services — different IAP audiences, different Cloud Run services, no shared admin routes.
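Separate IAP audiences mean a token minted for one BFF should be rejected by the other. The sketch below shows the claim checks only. It assumes the JWT signature has already been verified against Google's published IAP keys by a JWT library, and that `claims` is the decoded payload. The audience strings in the usage note are placeholders.

```python
import time
from typing import Any, Dict, Optional

IAP_ISSUER = "https://cloud.google.com/iap"


def check_iap_claims(claims: Dict[str, Any], expected_audience: str,
                     now: Optional[float] = None) -> str:
    """Validate decoded IAP JWT claims and return the caller's email."""
    now = time.time() if now is None else now
    if claims.get("iss") != IAP_ISSUER:
        raise PermissionError("unexpected issuer")
    if claims.get("aud") != expected_audience:
        # A Human Touch token must not be accepted by the System Ops BFF.
        raise PermissionError("token was minted for a different backend")
    if claims.get("exp", 0) <= now:
        raise PermissionError("token expired")
    email = claims.get("email")
    if not email:
        raise PermissionError("no identity in token")
    return email
```

In use, the System Ops BFF would pass its own backend-service audience, for example `/projects/PROJECT_NUMBER/global/backendServices/SERVICE_ID` with the real values substituted, and would return an authorization error to the client whenever `PermissionError` is raised.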