System Ops Dashboard

Purpose

The System Ops Dashboard is the governed console for developers and system users of the Kobi Digital Ads infrastructure and agentic runtime. It holds logs, system status, and statistics: material that must not clutter the operator-facing Human Touch Dashboard.

Access (required)

Control	Requirement	Notes
Identity-Aware Proxy (IAP)	Required on all System Ops routes	Google identity or workforce SSO; no anonymous access
VPN	Recommended for production	Private connectivity to GCP (Cloud VPN / BeyondCorp); defense in depth with IAP
Service account keys	Forbidden for UI access	Humans use IAP identity only
Audit	All page views and exports logged	`actor_id`, resource, timestamp → append-only audit

Not on this surface: client portal, plan approvals, or routine operator workflows — those stay on Human Touch with business RBAC only.
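The audit requirement in the access table can be illustrated with a minimal sketch. It assumes an audit entry carries the three fields named there (`actor_id`, resource, timestamp) plus a hypothetical `action` field, and it models the append-only store as an in-memory list. The names `AuditRecord`, `make_audit_record`, and `append_audit` are illustrative and not part of the system.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    actor_id: str   # identity asserted by IAP
    resource: str   # page or export that was accessed
    action: str     # hypothetical values: "page_view" or "export"
    timestamp: str  # UTC, ISO 8601


def make_audit_record(actor_id, resource, action, now=None):
    # No anonymous access: every record must name an actor.
    if not actor_id:
        raise ValueError("audit record requires an actor_id")
    now = now or datetime.now(timezone.utc)
    return AuditRecord(actor_id, resource, action, now.isoformat())


def append_audit(log, record):
    # Append-only: serialise once and never rewrite earlier entries.
    log.append(json.dumps(asdict(record), sort_keys=True))
```

In a real deployment the list would be replaced by a write-once sink; which sink the system uses is not stated here.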

Design goals

Well governed — least privilege; separate IAM from operator roles; export controls on PII-bearing artifacts.
Complete observability — every deterministic telemetry stream (QC, Cost Guard, loops, events) queryable here.
Actionable statistics — rollups drive alerts and auto-optimization; dashboard is the inspection layer.
No approval authority — System Ops can diagnose and recommend; A1–A9 decisions remain on Human Touch (unless user holds both roles).

Dashboard views

1. System health

Cloud Run services: orchestrator, connectors, HITL BFF, Cost Guard, QC threshold job — up/down, error rate, p95 latency
Vertex Agent Engine session health, Model Garden quota headroom
Pub/Sub backlog depth, dead-letter counts
Secret Manager rotation due dates
Per-environment banner (dev / staging / prod)

2. Logs and traces

Cloud Logging — structured filters: run_id, tenant_id, agent, severity, trace_id
Agent traces — Vertex Agent Engine / OpenTelemetry spans per run_id
Event bus — recent agent.qc.*, agent.loop.exhausted, cost.guard.tripped, agent.qc.threshold.breached
Export — JSON/CSV with audit trail; no bulk PII export without Admin + legal flag
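The structured filters listed above map naturally onto the Cloud Logging query language, where comparisons are joined with `AND`. The sketch below builds such a filter string. It assumes the services emit `run_id`, `tenant_id`, `agent`, and `trace_id` as fields of `jsonPayload`; that field placement is an assumption, and `build_log_filter` is a hypothetical helper.

```python
# Severity names defined by Cloud Logging, lowest to highest.
_SEVERITIES = ("DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
               "ERROR", "CRITICAL", "ALERT", "EMERGENCY")

# Structured fields named in the dashboard spec (assumed to live in jsonPayload).
_ALLOWED_FIELDS = {"run_id", "tenant_id", "agent", "trace_id"}


def _quote(value):
    # Escape backslashes and double quotes so the value stays one string literal.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_log_filter(min_severity=None, **fields):
    clauses = []
    for name in sorted(fields):
        if name not in _ALLOWED_FIELDS:
            raise ValueError(f"unsupported filter field: {name}")
        if fields[name] is not None:
            clauses.append(f"jsonPayload.{name}={_quote(fields[name])}")
    if min_severity is not None:
        if min_severity not in _SEVERITIES:
            raise ValueError(f"unknown severity: {min_severity}")
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)
```

For example, `build_log_filter(run_id="r-1", min_severity="ERROR")` yields `jsonPayload.run_id="r-1" AND severity>=ERROR`.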

3. Cost Guard monitor

Live run_id ledger: estimated / actual / ratio (deterministic, not AI)
Tripped runs (A9) with per-step token breakdown (BigQuery join)
Tenant daily LLM spend vs optional cap
Pricing table version in use vs Vertex billing export reconcile
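The ledger's ratio column is plain arithmetic, which is what makes it deterministic. A minimal sketch follows, assuming the ratio is actual cost divided by estimated cost and that a run trips when the ratio exceeds a configured limit. The limit of 1.5, the USD units, and the names `LedgerRow` and `tripped_runs` are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class LedgerRow:
    run_id: str
    estimated_usd: float
    actual_usd: float

    @property
    def ratio(self):
        # actual / estimated; guard against a zero or missing estimate.
        if self.estimated_usd <= 0:
            return float("inf") if self.actual_usd > 0 else 0.0
        return self.actual_usd / self.estimated_usd


def tripped_runs(rows, max_ratio=1.5):
    # max_ratio is a hypothetical threshold, not a value from the spec.
    return [row.run_id for row in rows if row.ratio > max_ratio]
```

With rows `("a", 1.0, 1.2)` and `("b", 2.0, 4.0)`, only run `b` exceeds the assumed limit.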

4. QC health and statistics

Fail-rate leaderboard — main_agent × qc_task_id, 24h / 7d
80% floor status — grains below success floor (agent.qc.threshold.breached)
Model breakdown — fail rate by model_main + thinking_level_main
Failure code heatmap — by platform, vertical, tenant
Playbook regression — fail spikes vs playbook_versions bumps
Loop depth — % runs needing 1 vs 2 corrections before pass or A8
Drill-down — run_id → ordered QC steps → input_slice_uri, qc_feedback_uri in GCS
Optimization queue — engineering tickets from threshold breaches

See QC loop telemetry and QC success threshold alerts.
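As a rough illustration of the 80% floor check, the sketch below computes a success rate per grain (`main_agent` × `qc_task_id`) and lists the grains that fall below the floor. The input shape, the function names, and the absence of a minimum sample size are assumptions; the production rollups are described as BigQuery views, not Python.

```python
from collections import defaultdict

SUCCESS_FLOOR = 0.80  # the "80% floor" named above


def success_rates(results):
    """results: iterable of (main_agent, qc_task_id, passed) tuples."""
    totals = defaultdict(lambda: [0, 0])  # grain -> [passes, total]
    for main_agent, qc_task_id, passed in results:
        counts = totals[(main_agent, qc_task_id)]
        counts[0] += 1 if passed else 0
        counts[1] += 1
    return {grain: p / n for grain, (p, n) in totals.items()}


def breached_grains(results, floor=SUCCESS_FLOOR):
    # Grains strictly below the floor, worst success rate first.
    rates = success_rates(results)
    return sorted((g for g, r in rates.items() if r < floor),
                  key=lambda g: rates[g])
```

Each returned grain would correspond to one `agent.qc.threshold.breached` condition, assuming the event is raised per grain.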

5. Agent run explorer

Search by run_id, task_id, tenant_id, time range
Step timeline: main → QC → correction loops → tool rounds
Token and cost per step (join Cost Guard ledger)
Link to Human Touch ticket if A8/A9 open

6. Statistics and rollups

BigQuery materialized views: agent_qc_results, agent_qc_loops, agent_qc_failures_rollup
Model promotion / demotion history
QC first-pass rate trends by task class
A8 / A9 rates correlated with QC patterns
Vertex LLM spend by tenant, model, task (invoice-grade allocation)

7. Playbook and routing registry

Versioned playbook.routing, playbook.qc, playbook.platform.*, playbook.vertical.*
Diff between versions; which tenants pinned to which version
Context cache hit rates per playbook revision

8. Infrastructure and alerts

GCP budget alerts, LLM budget alerts (billing export)
PagerDuty / Slack alert delivery status
Repeat-offender tenants (≥3 A8 in 24h on same task class)
Open agent.qc.threshold.breached incidents
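The repeat-offender rule (at least 3 A8 escalations in 24h on the same task class) can be sketched as a windowed count. This version assumes a trailing 24-hour window ending at `now` instead of a calendar day, and an input of `(tenant_id, task_class, occurred_at)` tuples; both are assumptions, as is the name `repeat_offenders`.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def repeat_offenders(a8_events, now, window=timedelta(hours=24), min_count=3):
    """Return (tenant_id, task_class) pairs with >= min_count A8 events in the window."""
    counts = defaultdict(int)
    for tenant_id, task_class, occurred_at in a8_events:
        # Count only events inside the trailing window.
        if now - window <= occurred_at <= now:
            counts[(tenant_id, task_class)] += 1
    return sorted(key for key, n in counts.items() if n >= min_count)
```

Events on different task classes for the same tenant are counted separately, matching the "same task class" wording.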

Roles (RBAC — System Ops)

Role	Permissions
system_viewer	Read health, logs (redacted), statistics
system_developer	Full log/trace drill-down, GCS artifact read, playbook read
sre	system_developer + infra actions (restart, scale, pause tenant automation)
system_admin	sre + IAM binding changes, export approvals, pricing table publish

IAM groups are disjoint from the Human Touch Operator and Planner groups by default. Overlap is granted explicitly for senior staff.
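The role table reads as an inheritance chain, which the following sketch makes explicit. The permission strings are hypothetical names invented for illustration, and treating `system_developer` as a superset of `system_viewer` is an assumption: the table states inheritance only for `sre` and `system_admin`.

```python
# role -> (parent role, permissions granted directly by this role)
ROLES = {
    "system_viewer": (None, {"health.read", "logs.read_redacted", "stats.read"}),
    "system_developer": ("system_viewer", {"logs.read_full", "traces.read",
                                           "gcs_artifacts.read", "playbook.read"}),
    "sre": ("system_developer", {"infra.restart", "infra.scale",
                                 "tenant_automation.pause"}),
    "system_admin": ("sre", {"iam.bind", "export.approve",
                             "pricing_table.publish"}),
}


def effective_permissions(role):
    # Walk up the chain, collecting permissions from each ancestor.
    perms = set()
    while role is not None:
        parent, own = ROLES[role]
        perms |= own
        role = parent
    return perms


def can(role, permission):
    return permission in effective_permissions(role)
```

No role in this sketch carries an approval permission, consistent with System Ops having no approval authority.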

Data handling

Data	Human Touch	System Ops
Approval diff preview	✓	✓ (read-only)
A8/A9 ticket summary	✓	✓
Full QC JSON / GCS artifacts	Link only	✓
Cloud Logging raw stream	✗	✓
BigQuery statistical rollups	✗	✓
Infrastructure metrics	✗	✓

PII: System Ops stores only hashes and URIs in BigQuery. Full structured inputs live in GCS, with bucket IAM limited to system roles. Operator ticket views show human summaries only.
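A minimal sketch of the hash-and-URI split, assuming the BigQuery row keeps a keyed digest of the structured input plus a pointer to the GCS object. The text says only "hashes"; HMAC-SHA256, the `input_hash` column name, the object path layout, and the function `to_rollup_row` are all assumptions made for illustration.

```python
import hashlib
import hmac
import json


def to_rollup_row(run_id, structured_input, bucket, key):
    # Canonical JSON so that key order does not change the digest.
    canonical = json.dumps(structured_input, sort_keys=True,
                           separators=(",", ":")).encode("utf-8")
    digest = hmac.new(key, canonical, hashlib.sha256).hexdigest()
    return {
        "run_id": run_id,
        "input_hash": digest,  # no raw input leaves GCS
        # Hypothetical object layout; the real path scheme is not specified.
        "input_slice_uri": f"gs://{bucket}/runs/{run_id}/input.json",
    }
```

A keyed hash is used here because unkeyed hashes of low-entropy fields can be guessed by brute force; whether the system does the same is not stated.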

Deployment

Component	GCP
System Ops UI	Cloud Run (or internal static + BFF)
BFF API	Cloud Run — server-side IAP JWT validation
IAP	HTTPS load balancer backend service
VPN	Cloud VPN or BeyondCorp connector to VPC
Data	BigQuery, Cloud Logging, GCS run artifacts

Human Touch BFF and System Ops BFF are separate services — different IAP audiences, different Cloud Run services, no shared admin routes.
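The audience separation can be illustrated with a small check that runs after the IAP JWT has been cryptographically verified. The sketch assumes the signed header (`x-goog-iap-jwt-assertion`) was already validated by a JWT library, for example google-auth's `id_token.verify_token`, and that the resulting claims arrive as a dict. The project number, backend service IDs, and the function `authorize_claims` are placeholders.

```python
# Expected IAP audience per BFF. Values are placeholders in the documented
# /projects/PROJECT_NUMBER/global/backendServices/SERVICE_ID format.
EXPECTED_AUDIENCE = {
    "system_ops_bff": "/projects/111111111111/global/backendServices/2222222222",
    "human_touch_bff": "/projects/111111111111/global/backendServices/3333333333",
}


def authorize_claims(service, claims):
    """Check already-verified IAP claims against this service's audience."""
    expected = EXPECTED_AUDIENCE[service]
    if claims.get("aud") != expected:
        # A token minted for the other BFF must not be accepted here.
        raise PermissionError("IAP audience mismatch")
    email = claims.get("email")
    if not email:
        raise PermissionError("IAP token carries no identity")
    return email
```

Because each BFF pins its own audience, a token issued for Human Touch is rejected by System Ops and vice versa.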

Related documents

Human control plane: operator approvals (no system statistics)
Agentic orchestration: telemetry and alert sources
GCP deployment topology: runtime placement
07-security-access-governance.md: credential and network policy