Architecture · Draft
System Ops Dashboard
Purpose
The System Ops Dashboard is the governed developer / system-user console for Kobi Digital Ads infrastructure and agentic runtime. It holds logs, system status, and statistics — material that must not clutter the operator-facing Human Touch Dashboard.
Access (required)
| Control | Requirement | Notes |
|---|---|---|
| Identity-Aware Proxy (IAP) | Required on all System Ops routes | Google identity or workforce SSO; no anonymous access |
| VPN | Recommended for production | Private connectivity to GCP (Cloud VPN / BeyondCorp); defense in depth with IAP |
| Service account keys | Forbidden for UI access | Humans use IAP identity only |
| Audit | All page views and exports logged | actor_id, resource, timestamp → append-only audit |
Not on this surface: client portal, plan approvals, or routine operator workflows — those stay on Human Touch with business RBAC only.
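The audit row in the table above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `AuditEvent`, `make_event`, `append_audit`, and the `action` field are hypothetical names, and a JSON Lines text stream stands in for whatever append-only store is really used.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import TextIO


@dataclass(frozen=True)
class AuditEvent:
    actor_id: str   # identity asserted by IAP (assumed here to be an email or subject ID)
    resource: str   # page path or export name
    action: str     # hypothetical field: "view" or "export"
    timestamp: str  # ISO 8601, UTC


def make_event(actor_id: str, resource: str, action: str) -> AuditEvent:
    """Stamp the event with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    return AuditEvent(actor_id, resource, action, now)


def append_audit(sink: TextIO, event: AuditEvent) -> None:
    """Write one JSON line; this function never rewrites earlier records."""
    sink.write(json.dumps(asdict(event), sort_keys=True) + "\n")
```

One record per page view or export keeps the trail easy to replay; enforcing immutability would be the job of the real sink, not of this sketch.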
Design goals
- Well governed — least privilege; separate IAM from operator roles; export controls on PII-bearing artifacts.
- Complete observability — every deterministic telemetry stream (QC, Cost Guard, loops, events) queryable here.
- Actionable statistics — rollups drive alerts and auto-optimization; dashboard is the inspection layer.
- No approval authority — System Ops can diagnose and recommend; A1–A9 decisions remain on Human Touch (unless user holds both roles).
Dashboard views
1. System health
- Cloud Run services: orchestrator, connectors, HITL BFF, Cost Guard, QC threshold job — up/down, error rate, p95 latency
- Vertex Agent Engine session health, Model Garden quota headroom
- Pub/Sub backlog depth, dead-letter counts
- Secret Manager rotation due dates
- Per-environment banner (dev / staging / prod)
2. Logs and traces
- Cloud Logging — structured filters: run_id, tenant_id, agent, severity, trace_id
- Agent traces — Vertex Agent Engine / OpenTelemetry spans per run_id
- Event bus — recent agent.qc.*, agent.loop.exhausted, cost.guard.tripped, agent.qc.threshold.breached
- Export — JSON/CSV with audit trail; no bulk PII export without Admin + legal flag
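As an illustration of the structured filters listed above, the helper below composes a Cloud Logging query string. It is a sketch under assumptions: `build_log_filter` is a hypothetical helper, and it assumes the structured fields (including `trace_id`) are written under `jsonPayload`, which this document does not state.

```python
from typing import Optional

# Severity names defined by Cloud Logging's LogSeverity enum.
_SEVERITIES = {
    "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
    "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
}


def _quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_log_filter(
    *,
    run_id: Optional[str] = None,
    tenant_id: Optional[str] = None,
    agent: Optional[str] = None,
    trace_id: Optional[str] = None,
    min_severity: Optional[str] = None,
) -> str:
    """Join the supplied criteria with AND; omitted criteria are skipped."""
    clauses = []
    fields = (
        ("run_id", run_id),
        ("tenant_id", tenant_id),
        ("agent", agent),
        ("trace_id", trace_id),
    )
    for name, value in fields:
        if value is not None:
            # Assumption: structured fields live under jsonPayload.
            clauses.append(f"jsonPayload.{name}={_quote(value)}")
    if min_severity is not None:
        if min_severity not in _SEVERITIES:
            raise ValueError(f"unknown severity: {min_severity}")
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)
```

For example, `build_log_filter(run_id="r-1", min_severity="ERROR")` yields `jsonPayload.run_id="r-1" AND severity>=ERROR`.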
3. Cost Guard monitor
- Live run_id ledger: estimated / actual / ratio (deterministic, not AI)
- Tripped runs (A9) with per-step token breakdown (BigQuery join)
- Tenant daily LLM spend vs optional cap
- Pricing table version in use vs Vertex billing export reconcile
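The ledger columns above can be illustrated with a small deterministic calculation. This sketch assumes the ratio is actual cost divided by estimated cost and that a run trips when the ratio exceeds a configured ceiling; the `LedgerRow` type, the `is_tripped` helper, and the 1.5 default are hypothetical, not values taken from this document.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LedgerRow:
    run_id: str
    estimated_usd: Decimal
    actual_usd: Decimal

    @property
    def ratio(self) -> Decimal:
        """Actual spend relative to the estimate (assumed definition)."""
        if self.estimated_usd == 0:
            raise ValueError("estimated cost must be non-zero")
        return self.actual_usd / self.estimated_usd


def is_tripped(row: LedgerRow, max_ratio: Decimal = Decimal("1.5")) -> bool:
    """True when the run has overspent its estimate by more than max_ratio."""
    return row.ratio > max_ratio
```

`Decimal` is used instead of `float` so that the same inputs always produce the same ledger values, in keeping with the "deterministic, not AI" note.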
4. QC health and statistics
- Fail-rate leaderboard — main_agent × qc_task_id, 24h / 7d
- 80% floor status — grains below success floor (agent.qc.threshold.breached)
- Model breakdown — fail rate by model_main + thinking_level_main
- Failure code heatmap — by platform, vertical, tenant
- Playbook regression — fail spikes vs playbook_versions bumps
- Loop depth — % runs needing 1 vs 2 corrections before pass or A8
- Drill-down — run_id → ordered QC steps → input_slice_uri, qc_feedback_uri in GCS
- Optimization queue — engineering tickets from threshold breaches
See QC loop telemetry and QC success threshold alerts.
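The 80% floor check can be sketched as a per-grain success-rate computation, where a grain is a (main_agent, qc_task_id) pair. This is an illustration only: `QcResult`, `success_rates`, and `breached_grains` are hypothetical names, and the input is assumed to be already restricted to the window of interest (24h or 7d).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

Grain = Tuple[str, str]  # (main_agent, qc_task_id)


class QcResult(NamedTuple):
    main_agent: str
    qc_task_id: str
    passed: bool


def success_rates(results: Iterable[QcResult]) -> Dict[Grain, float]:
    """Fraction of passing QC results per grain."""
    counts = defaultdict(lambda: [0, 0])  # grain -> [passed, total]
    for r in results:
        entry = counts[(r.main_agent, r.qc_task_id)]
        entry[0] += int(r.passed)
        entry[1] += 1
    return {grain: p / t for grain, (p, t) in counts.items()}


def breached_grains(results: Iterable[QcResult], floor: float = 0.80) -> List[Grain]:
    """Grains whose success rate is strictly below the floor, sorted."""
    rates = success_rates(results)
    return sorted(g for g, rate in rates.items() if rate < floor)
```

A grain sitting exactly at the floor is treated as healthy here; whether the real alert uses a strict or inclusive comparison is not specified in this document.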
5. Agent run explorer
- Search by run_id, task_id, tenant_id, time range
- Step timeline: main → QC → correction loops → tool rounds
- Token and cost per step (join Cost Guard ledger)
- Link to Human Touch ticket if A8/A9 open
6. Statistics and rollups
- BigQuery materialized views: agent_qc_results, agent_qc_loops, agent_qc_failures_rollup
- Model promotion / demotion history
- QC first-pass rate trends by task class
- A8 / A9 rates correlated with QC patterns
- Vertex LLM spend by tenant, model, task (invoice-grade allocation)
7. Playbook and routing registry
- Versioned playbook.routing, playbook.qc, playbook.platform.*, playbook.vertical.*
- Diff between versions; which tenants pinned to which version
- Context cache hit rates per playbook revision
8. Infrastructure and alerts
- GCP budget alerts, LLM budget alerts (billing export)
- PagerDuty / Slack alert delivery status
- Repeat-offender tenants (≥3 A8 in 24h on same task class)
- Open agent.qc.threshold.breached incidents
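The repeat-offender rule in the list above (at least 3 A8 escalations in 24h on the same task class) might be evaluated as follows. The `A8Event` type and `repeat_offenders` function are hypothetical, and the sketch assumes a trailing window that ends at the evaluation time.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple, Tuple


class A8Event(NamedTuple):
    tenant_id: str
    task_class: str
    at: datetime


def repeat_offenders(
    events: Iterable[A8Event],
    now: datetime,
    window: timedelta = timedelta(hours=24),
    min_count: int = 3,
) -> List[Tuple[str, str]]:
    """(tenant_id, task_class) pairs with min_count or more A8s in the window."""
    cutoff = now - window
    counts = defaultdict(int)
    for e in events:
        # Count only events inside the trailing window (cutoff, now].
        if cutoff < e.at <= now:
            counts[(e.tenant_id, e.task_class)] += 1
    return sorted(key for key, n in counts.items() if n >= min_count)
```

Grouping on the pair rather than on the tenant alone means three A8s spread across different task classes do not flag the tenant.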
Roles (RBAC — System Ops)
| Role | Permissions |
|---|---|
| system_viewer | Read health, logs (redacted), statistics |
| system_developer | Full log/trace drill-down, GCS artifact read, playbook read |
| sre | system_developer + infra actions (restart, scale, pause tenant automation) |
| system_admin | sre + IAM binding changes, export approvals, pricing table publish |
IAM groups are disjoint from the Human Touch Operator / Planner groups by default. Any overlap must be granted explicitly and is intended for senior staff only.
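The cumulative structure of the roles table can be sketched as a simple inheritance chain. The permission strings below are hypothetical labels for the table entries, and treating system_developer as a superset of system_viewer is an assumption; the table states inheritance only for sre and system_admin.

```python
from typing import Dict, FrozenSet, Optional, Set

# Permissions each role adds on top of its parent (labels are illustrative).
ROLE_GRANTS: Dict[str, Set[str]] = {
    "system_viewer": {"health:read", "logs:read_redacted", "stats:read"},
    "system_developer": {"logs:read_full", "traces:read", "gcs_artifacts:read", "playbook:read"},
    "sre": {"infra:restart", "infra:scale", "tenant_automation:pause"},
    "system_admin": {"iam:bind", "export:approve", "pricing_table:publish"},
}

ROLE_PARENT: Dict[str, str] = {
    "system_developer": "system_viewer",  # assumption, not stated in the table
    "sre": "system_developer",
    "system_admin": "sre",
}


def permissions_for(role: str) -> FrozenSet[str]:
    """Union of the role's own grants and everything it inherits."""
    if role not in ROLE_GRANTS:
        raise KeyError(f"unknown role: {role}")
    perms: Set[str] = set()
    current: Optional[str] = role
    while current is not None:
        perms |= ROLE_GRANTS[current]
        current = ROLE_PARENT.get(current)
    return frozenset(perms)


def is_allowed(role: str, permission: str) -> bool:
    return permission in permissions_for(role)
```

None of these roles carries an approval permission, which matches the design goal that A1–A9 decisions stay on Human Touch.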
Data handling
| Data | Human Touch | System Ops |
|---|---|---|
| Approval diff preview | ✓ | ✓ (read-only) |
| A8/A9 ticket summary | ✓ | ✓ |
| Full QC JSON / GCS artifacts | Link only | ✓ |
| Cloud Logging raw stream | ✗ | ✓ |
| BigQuery statistical rollups | ✗ | ✓ |
| Infrastructure metrics | ✗ | ✓ |
PII: System Ops stores hashes and URIs in BigQuery; full structured inputs in GCS with bucket IAM limited to system roles. Operator ticket views show human summaries only.
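One way to picture the hash-and-URI split is a helper that derives the BigQuery-side row from a structured input without copying the input itself. This is a sketch under assumptions: `rollup_row`, the `input_sha256` column, and the `runs/<run_id>/input.json` object layout are hypothetical; only `input_slice_uri` appears elsewhere in this document.

```python
import hashlib
import json
from typing import Any, Dict


def rollup_row(run_id: str, structured_input: Dict[str, Any], bucket: str) -> Dict[str, str]:
    """Build the row kept in BigQuery: a content hash plus a GCS pointer."""
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(
        structured_input, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    return {
        "run_id": run_id,
        "input_sha256": hashlib.sha256(canonical).hexdigest(),
        # Hypothetical object layout; the full input would live at this URI.
        "input_slice_uri": f"gs://{bucket}/runs/{run_id}/input.json",
    }
```

A plain SHA-256 of a low-entropy value can be reversed by brute force, so a keyed hash (for example HMAC) may be preferable for fields such as email addresses.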
Deployment
| Component | GCP |
|---|---|
| System Ops UI | Cloud Run (or internal static + BFF) |
| BFF API | Cloud Run — server-side IAP JWT validation |
| IAP | HTTPS load balancer backend service |
| VPN | Cloud VPN or BeyondCorp connector to VPC |
| Data | BigQuery, Cloud Logging, GCS run artifacts |
Human Touch BFF and System Ops BFF are separate services — different IAP audiences, different Cloud Run services, no shared admin routes.
Related documents
- Human control plane — operator approvals (no system statistics)
- Agentic orchestration — telemetry and alert sources
- GCP deployment topology — runtime placement
- 07-security-access-governance.md — credential and network policy