Initialize repository snapshot

2025-12-27 00:10:32 +00:00
commit 110d644e10
281 changed files with 40331 additions and 0 deletions
--- a/docs/VAULTMESH-OBSERVABILITY-ENGINE.md
+++ b/docs/VAULTMESH-OBSERVABILITY-ENGINE.md
@@ -0,0 +1,742 @@
+# VAULTMESH-OBSERVABILITY-ENGINE.md
+
+**Civilization Ledger Telemetry Primitive**
+
+> *Every metric tells a story. Every trace has a receipt.*
+
+Observability is VaultMesh's nervous system — capturing metrics, logs, and traces across all nodes and services, with cryptographic attestation that the telemetry itself hasn't been tampered with.
+
+---
+
+## 1. Scroll Definition
+
+| Property              | Value                                                                                                                                     |
+| --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
+| **Scroll Name**       | `Observability`                                                                                                                           |
+| **JSONL Path**        | `receipts/observability/observability_events.jsonl`                                                                                       |
+| **Root File**         | `ROOT.observability.txt`                                                                                                                  |
+| **Receipt Types**     | `obs_metric_snapshot`, `obs_log_batch`, `obs_trace_complete`, `obs_alert_fired`, `obs_alert_resolved`, `obs_slo_report`, `obs_anomaly_detected` |
+
+---
+
+## 2. Core Concepts
+
+### 2.1 Metrics
+
+**Metrics** are time-series numerical measurements from nodes and services.
+
+```json
+{
+  "metric_id": "metric:brick-01:cpu:2025-12-06T14:30:00Z",
+  "node": "did:vm:node:brick-01",
+  "timestamp": "2025-12-06T14:30:00Z",
+  "metrics": {
+    "cpu_percent": 23.5,
+    "memory_percent": 67.2,
+    "disk_percent": 45.8,
+    "network_rx_bytes": 1234567890,
+    "network_tx_bytes": 987654321,
+    "open_file_descriptors": 342,
+    "goroutines": 156
+  },
+  "labels": {
+    "environment": "production",
+    "region": "eu-west",
+    "service": "guardian"
+  },
+  "collection_method": "prometheus_scrape",
+  "scrape_duration_ms": 45
+}
+```
+
+**Metric categories**:
+- `system` — CPU, memory, disk, network
+- `application` — request rates, latencies, error rates
+- `business` — receipts/hour, anchors/day, oracle queries
+- `security` — auth attempts, failed logins, blocked IPs
+- `mesh` — route latencies, node health, capability usage
+
+### 2.2 Logs
+
+**Logs** are structured event records from all system components.
+
+```json
+{
+  "log_id": "log:guardian:2025-12-06T14:30:15.123Z",
+  "timestamp": "2025-12-06T14:30:15.123Z",
+  "level": "info",
+  "service": "guardian",
+  "node": "did:vm:node:brick-01",
+  "message": "Anchor cycle completed successfully",
+  "attributes": {
+    "cycle_id": "anchor-cycle-2025-12-06-001",
+    "receipts_anchored": 47,
+    "scrolls_included": ["treasury", "mesh", "identity"],
+    "duration_ms": 1234,
+    "backend": "bitcoin"
+  },
+  "trace_id": "trace-abc123...",
+  "span_id": "span-def456...",
+  "caller": "guardian/anchor.go:234"
+}
+```
+
+**Log levels**:
+- `trace` — verbose debugging (not retained long-term)
+- `debug` — debugging information
+- `info` — normal operations
+- `warn` — unexpected but handled conditions
+- `error` — errors requiring attention
+- `fatal` — system failures
+
+### 2.3 Traces
+
+**Traces** track request flows across distributed components.
+
+```json
+{
+  "trace_id": "trace-abc123...",
+  "name": "treasury_settlement",
+  "start_time": "2025-12-06T14:30:00.000Z",
+  "end_time": "2025-12-06T14:30:02.345Z",
+  "duration_ms": 2345,
+  "status": "ok",
+  "spans": [
+    {
+      "span_id": "span-001",
+      "parent_span_id": null,
+      "name": "http_request",
+      "service": "portal",
+      "node": "did:vm:node:portal-01",
+      "start_time": "2025-12-06T14:30:00.000Z",
+      "duration_ms": 2340,
+      "attributes": {
+        "http.method": "POST",
+        "http.url": "/treasury/settle",
+        "http.status_code": 200
+      }
+    },
+    {
+      "span_id": "span-002",
+      "parent_span_id": "span-001",
+      "name": "validate_settlement",
+      "service": "treasury-engine",
+      "node": "did:vm:node:brick-01",
+      "start_time": "2025-12-06T14:30:00.100Z",
+      "duration_ms": 150,
+      "attributes": {
+        "settlement_id": "settle-2025-12-06-001",
+        "accounts_involved": 3
+      }
+    },
+    {
+      "span_id": "span-003",
+      "parent_span_id": "span-001",
+      "name": "emit_receipt",
+      "service": "ledger",
+      "node": "did:vm:node:brick-01",
+      "start_time": "2025-12-06T14:30:00.250Z",
+      "duration_ms": 50,
+      "attributes": {
+        "receipt_type": "treasury_settlement",
+        "scroll": "treasury"
+      }
+    },
+    {
+      "span_id": "span-004",
+      "parent_span_id": "span-001",
+      "name": "anchor_request",
+      "service": "guardian",
+      "node": "did:vm:node:brick-01",
+      "start_time": "2025-12-06T14:30:00.300Z",
+      "duration_ms": 2000,
+      "attributes": {
+        "backend": "bitcoin",
+        "txid": "btc:abc123..."
+      }
+    }
+  ],
+  "tags": ["treasury", "settlement", "anchor"]
+}
+```
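+
+A trace in this shape can be scanned for its slowest child span, which is roughly the question `vm-obs trace analyze --find-bottleneck` poses. The following is a minimal sketch under stated assumptions, not the tool's actual logic: `find_bottleneck` is a hypothetical helper, and the span dictionaries reuse only field names from the example above.
+
+```python
+from typing import Optional
+
+
+def find_bottleneck(trace: dict) -> Optional[dict]:
+    """Return the non-root span with the largest duration_ms, or None."""
+    # Root spans (parent_span_id is None) wrap the whole request, so skip them.
+    children = [s for s in trace["spans"] if s.get("parent_span_id") is not None]
+    if not children:
+        return None
+    return max(children, key=lambda s: s["duration_ms"])
+
+
+# Trimmed copy of the example trace (only the fields the helper needs).
+trace = {
+    "trace_id": "trace-abc123...",
+    "duration_ms": 2345,
+    "spans": [
+        {"span_id": "span-001", "parent_span_id": None, "name": "http_request", "duration_ms": 2340},
+        {"span_id": "span-002", "parent_span_id": "span-001", "name": "validate_settlement", "duration_ms": 150},
+        {"span_id": "span-003", "parent_span_id": "span-001", "name": "emit_receipt", "duration_ms": 50},
+        {"span_id": "span-004", "parent_span_id": "span-001", "name": "anchor_request", "duration_ms": 2000},
+    ],
+}
+
+slowest = find_bottleneck(trace)
+share = slowest["duration_ms"] / trace["duration_ms"]
+print(f"{slowest['name']}: {slowest['duration_ms']} ms ({share:.0%} of trace)")
+```
+
+For the example trace this points at `anchor_request`, the Bitcoin anchoring span, which accounts for most of the 2345 ms.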
+
+### 2.4 Alerts
+
+**Alerts** fire when a rule's condition holds for its configured duration, signalling that an operator needs to act.
+
+```json
+{
+  "alert_id": "alert-2025-12-06-001",
+  "name": "HighCPUUsage",
+  "severity": "warning",
+  "status": "firing",
+  "fired_at": "2025-12-06T14:35:00Z",
+  "node": "did:vm:node:brick-02",
+  "rule": {
+    "expression": "cpu_percent > 80 for 5m",
+    "threshold": 80,
+    "duration": "5m"
+  },
+  "current_value": 87.3,
+  "labels": {
+    "environment": "production",
+    "region": "eu-west"
+  },
+  "annotations": {
+    "summary": "CPU usage above 80% for 5 minutes",
+    "runbook": "https://docs.vaultmesh.io/runbooks/high-cpu"
+  },
+  "notified": ["slack:ops-channel", "pagerduty:on-call"]
+}
+```
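+
+The rule `cpu_percent > 80 for 5m` only fires once the condition has held continuously for the full duration. The sketch below illustrates that hold logic in a simplified form. It assumes evenly spaced samples and a hypothetical `rule_fires` helper; the sample values are invented for illustration, and a real rule engine evaluates on its own schedule.
+
+```python
+from datetime import datetime, timedelta
+
+
+def rule_fires(samples, threshold, hold):
+    """Return the timestamp at which the rule fires, or None.
+
+    samples: list of (datetime, value) pairs sorted by time.
+    The rule fires once the value has stayed above threshold for `hold`.
+    """
+    breach_start = None
+    for ts, value in samples:
+        if value > threshold:
+            if breach_start is None:
+                breach_start = ts  # first sample of the current breach
+            if ts - breach_start >= hold:
+                return ts
+        else:
+            breach_start = None  # condition broke, restart the timer
+    return None
+
+
+# Hypothetical one-minute samples for brick-02.
+start = datetime(2025, 12, 6, 14, 28)
+values = [75.0, 78.2, 81.0, 83.5, 85.1, 86.0, 86.9, 87.3]
+samples = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]
+
+fired_at = rule_fires(samples, threshold=80, hold=timedelta(minutes=5))
+print(fired_at)  # 2025-12-06 14:35:00
+```
+
+With these samples the threshold is first exceeded at 14:30, so the alert fires five minutes later at 14:35.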
+
+### 2.5 SLO Reports
+
+**SLO (Service Level Objective) Reports** compare achieved reliability against a target over a fixed window and track how much error budget remains.
+
+```json
+{
+  "slo_id": "slo:anchor-latency-p99",
+  "name": "Anchor Latency P99",
+  "description": "99th percentile anchor latency under 30 seconds",
+  "target": 0.999,
+  "window": "30d",
+  "report_period": {
+    "start": "2025-11-06T00:00:00Z",
+    "end": "2025-12-06T00:00:00Z"
+  },
+  "achieved": 0.9995,
+  "status": "met",
+  "error_budget": {
+    "total_minutes": 43.2,
+    "consumed_minutes": 21.6,
+    "remaining_percent": 50.0
+  },
+  "breakdown": {
+    "total_requests": 125000,
+    "good_requests": 124937,
+    "bad_requests": 63
+  },
+  "trend": "stable"
+}
+```
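+
+The `error_budget` and `achieved` figures in the report follow from simple arithmetic. As a hedged illustration (the `error_budget` function name and the rounding are assumptions, not part of the documented schema), the numbers above can be reproduced like this:
+
+```python
+def error_budget(target, window_days, consumed_minutes):
+    """Return total and remaining error budget for a time-based SLO window."""
+    window_minutes = window_days * 24 * 60
+    total = window_minutes * (1 - target)  # minutes allowed to be "bad"
+    remaining_percent = 100 * (1 - consumed_minutes / total)
+    return {
+        "total_minutes": round(total, 1),
+        "consumed_minutes": consumed_minutes,
+        "remaining_percent": round(remaining_percent, 1),
+    }
+
+
+budget = error_budget(target=0.999, window_days=30, consumed_minutes=21.6)
+print(budget)
+# {'total_minutes': 43.2, 'consumed_minutes': 21.6, 'remaining_percent': 50.0}
+
+# Request-based view of the same report: good requests over total requests.
+achieved = round(124937 / 125000, 4)
+print(achieved)  # 0.9995
+```
+
+A 30-day window contains 43,200 minutes, so a 0.999 target leaves 43.2 minutes of budget; consuming 21.6 of them leaves half.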
+
+---
+
+## 3. Mapping to Eternal Pattern
+
+### 3.1 Experience Layer (L1)
+
+**CLI** (`vm-obs`):
+```bash
+# Metrics
+vm-obs metrics query --node brick-01 --metric cpu_percent --last 1h
+vm-obs metrics list --node brick-01
+vm-obs metrics export --from 2025-12-01 --to 2025-12-06 --format prometheus
+
+# Logs
+vm-obs logs query --service guardian --level error --last 24h
+vm-obs logs tail --node brick-01 --follow
+vm-obs logs search "anchor failed" --from 2025-12-01
+
+# Traces
+vm-obs trace show trace-abc123
+vm-obs trace search --service treasury --duration ">1s" --last 24h
+vm-obs trace analyze trace-abc123 --find-bottleneck
+
+# Alerts
+vm-obs alert list --status firing
+vm-obs alert show alert-2025-12-06-001
+vm-obs alert ack alert-2025-12-06-001 --comment "investigating"
+vm-obs alert silence --node brick-02 --duration 1h --reason "maintenance"
+
+# SLOs
+vm-obs slo list
+vm-obs slo show slo:anchor-latency-p99
+vm-obs slo report --period 30d --format markdown
+
+# Dashboards
+vm-obs dashboard list
+vm-obs dashboard show system-overview
+vm-obs dashboard export system-overview --format grafana
+```
+
+**MCP Tools**:
+- `obs_metrics_query` — query metrics for a node/service
+- `obs_logs_search` — search logs with filters
+- `obs_trace_get` — retrieve trace details
+- `obs_alert_status` — current alert status
+- `obs_slo_summary` — SLO compliance summary
+- `obs_health_check` — overall system health
+
+**Portal HTTP**:
+- `GET /obs/metrics` — query metrics
+- `GET /obs/logs` — search logs
+- `GET /obs/traces` — list traces
+- `GET /obs/traces/{trace_id}` — trace details
+- `GET /obs/alerts` — list alerts
+- `POST /obs/alerts/{id}/ack` — acknowledge alert
+- `POST /obs/alerts/silence` — create silence
+- `GET /obs/slos` — list SLOs
+- `GET /obs/slos/{id}/report` — SLO report
+- `GET /obs/health` — system health
+
+---
+
+### 3.2 Engine Layer (L2)
+
+#### Step 1 — Plan → Implicit (Continuous Collection)
+
+Unlike discrete operations, observability collection is continuous. However, certain operations have explicit contracts:
+
+**Alert Acknowledgment Contract**:
+```json
+{
+  "operation_id": "obs-op-2025-12-06-001",
+  "operation_type": "alert_acknowledge",
+  "alert_id": "alert-2025-12-06-001",
+  "acknowledged_by": "did:vm:user:sovereign",
+  "acknowledged_at": "2025-12-06T14:40:00Z",
+  "comment": "Investigating high CPU on brick-02, likely due to anchor backlog",
+  "escalation_suppressed": true,
+  "follow_up_required": true,
+  "follow_up_deadline": "2025-12-06T16:00:00Z"
+}
+```
+
+**SLO Definition Contract**:
+```json
+{
+  "operation_id": "obs-op-2025-12-06-002",
+  "operation_type": "slo_create",
+  "initiated_by": "did:vm:user:sovereign",
+  "slo": {
+    "id": "slo:oracle-availability",
+    "name": "Oracle Availability",
+    "description": "Oracle service uptime",
+    "indicator": {
+      "type": "availability",
+      "good_query": "oracle_up == 1",
+      "total_query": "count(oracle_requests)"
+    },
+    "target": 0.999,
+    "window": "30d"
+  }
+}
+```
+
+#### Step 2 — Execute → Continuous Collection
+
+Metrics, logs, and traces are collected continuously via:
+- Prometheus scraping (metrics)
+- Fluent Bit/Vector (logs)
+- OpenTelemetry SDK (traces)
+
+State is maintained in time-series databases and search indices, not as discrete state files.
+
+#### Step 3 — Seal → Receipts
+
+**Metric Snapshot Receipt** (hourly):
+```json
+{
+  "type": "obs_metric_snapshot",
+  "snapshot_id": "metrics-2025-12-06-14",
+  "timestamp": "2025-12-06T14:00:00Z",
+  "period": {
+    "start": "2025-12-06T13:00:00Z",
+    "end": "2025-12-06T14:00:00Z"
+  },
+  "nodes_reporting": 5,
+  "metrics_collected": 15000,
+  "aggregates": {
+    "avg_cpu_percent": 34.5,
+    "max_cpu_percent": 87.3,
+    "avg_memory_percent": 62.1,
+    "total_receipts_emitted": 1247,
+    "total_anchors_completed": 12
+  },
+  "storage_path": "telemetry/metrics/2025-12-06/hour-14.parquet",
+  "content_hash": "blake3:aaa111...",
+  "tags": ["observability", "metrics", "hourly"],
+  "root_hash": "blake3:bbb222..."
+}
+```
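+
+The `content_hash` field is what ties a receipt to the telemetry file it describes. The sketch below shows that idea under loud assumptions: it uses BLAKE2b from the Python standard library as a stand-in for the BLAKE3 hashes shown in the receipts, and `seal_metric_snapshot` and `verify` are hypothetical helpers rather than VaultMesh APIs.
+
+```python
+import hashlib
+import json
+
+
+def content_hash(data: bytes) -> str:
+    # Stand-in: stdlib BLAKE2b, not the BLAKE3 named in the receipts above.
+    return "blake2b:" + hashlib.blake2b(data, digest_size=32).hexdigest()
+
+
+def seal_metric_snapshot(snapshot_id: str, storage_path: str, payload: bytes) -> dict:
+    """Build a receipt body that commits to the exact bytes written to storage."""
+    return {
+        "type": "obs_metric_snapshot",
+        "snapshot_id": snapshot_id,
+        "storage_path": storage_path,
+        "content_hash": content_hash(payload),
+        "tags": ["observability", "metrics", "hourly"],
+    }
+
+
+def verify(receipt: dict, payload: bytes) -> bool:
+    """Re-hash the stored bytes and compare against the receipt."""
+    return receipt["content_hash"] == content_hash(payload)
+
+
+payload = b"placeholder for the hour-14 parquet bytes"
+receipt = seal_metric_snapshot(
+    "metrics-2025-12-06-14",
+    "telemetry/metrics/2025-12-06/hour-14.parquet",
+    payload,
+)
+print(json.dumps(receipt, indent=2))
+print(verify(receipt, payload), verify(receipt, payload + b"tampered"))  # True False
+```
+
+Any change to the stored bytes changes the hash, so a later verification against the receipt fails.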
+
+**Log Batch Receipt** (hourly):
+```json
+{
+  "type": "obs_log_batch",
+  "batch_id": "logs-2025-12-06-14",
+  "timestamp": "2025-12-06T14:00:00Z",
+  "period": {
+    "start": "2025-12-06T13:00:00Z",
+    "end": "2025-12-06T14:00:00Z"
+  },
+  "log_counts": {
+    "trace": 0,
+    "debug": 12456,
+    "info": 45678,
+    "warn": 234,
+    "error": 12,
+    "fatal": 0
+  },
+  "services_reporting": ["guardian", "treasury", "portal", "oracle", "mesh"],
+  "storage_path": "telemetry/logs/2025-12-06/hour-14.jsonl.gz",
+  "content_hash": "blake3:ccc333...",
+  "tags": ["observability", "logs", "hourly"],
+  "root_hash": "blake3:ddd444..."
+}
+```
+
+**Trace Complete Receipt** (for significant traces):
+```json
+{
+  "type": "obs_trace_complete",
+  "trace_id": "trace-abc123...",
+  "timestamp": "2025-12-06T14:30:02.345Z",
+  "name": "treasury_settlement",
+  "duration_ms": 2345,
+  "status": "ok",
+  "span_count": 4,
+  "services_involved": ["portal", "treasury-engine", "ledger", "guardian"],
+  "nodes_involved": ["portal-01", "brick-01"],
+  "triggered_by": "did:vm:user:sovereign",
+  "business_context": {
+    "settlement_id": "settle-2025-12-06-001",
+    "amount": "1000.00 USD"
+  },
+  "tags": ["observability", "trace", "treasury", "settlement"],
+  "root_hash": "blake3:eee555..."
+}
+```
+
+**Alert Fired Receipt**:
+```json
+{
+  "type": "obs_alert_fired",
+  "alert_id": "alert-2025-12-06-001",
+  "timestamp": "2025-12-06T14:35:00Z",
+  "name": "HighCPUUsage",
+  "severity": "warning",
+  "node": "did:vm:node:brick-02",
+  "rule_expression": "cpu_percent > 80 for 5m",
+  "current_value": 87.3,
+  "threshold": 80,
+  "notifications_sent": ["slack:ops-channel", "pagerduty:on-call"],
+  "tags": ["observability", "alert", "fired", "cpu"],
+  "root_hash": "blake3:fff666..."
+}
+```
+
+**Alert Resolved Receipt**:
+```json
+{
+  "type": "obs_alert_resolved",
+  "alert_id": "alert-2025-12-06-001",
+  "timestamp": "2025-12-06T15:10:00Z",
+  "name": "HighCPUUsage",
+  "fired_at": "2025-12-06T14:35:00Z",
+  "duration_minutes": 35,
+  "resolved_by": "automatic",
+  "resolution_value": 42.1,
+  "acknowledged": true,
+  "acknowledged_by": "did:vm:user:sovereign",
+  "root_cause": "anchor backlog cleared",
+  "tags": ["observability", "alert", "resolved"],
+  "root_hash": "blake3:ggg777..."
+}
+```
+
+**SLO Report Receipt** (daily):
+```json
+{
+  "type": "obs_slo_report",
+  "report_id": "slo-report-2025-12-06",
+  "timestamp": "2025-12-06T00:00:00Z",
+  "period": {
+    "start": "2025-11-06T00:00:00Z",
+    "end": "2025-12-06T00:00:00Z"
+  },
+  "slos": [
+    {
+      "slo_id": "slo:anchor-latency-p99",
+      "target": 0.999,
+      "achieved": 0.9995,
+      "status": "met"
+    },
+    {
+      "slo_id": "slo:oracle-availability",
+      "target": 0.999,
+      "achieved": 0.9992,
+      "status": "at_risk"
+    }
+  ],
+  "overall_status": "healthy",
+  "error_budget_status": "sufficient",
+  "report_path": "reports/slo/2025-12-06.json",
+  "tags": ["observability", "slo", "daily-report"],
+  "root_hash": "blake3:hhh888..."
+}
+```
+
+**Anomaly Detection Receipt**:
+```json
+{
+  "type": "obs_anomaly_detected",
+  "anomaly_id": "anomaly-2025-12-06-001",
+  "timestamp": "2025-12-06T14:45:00Z",
+  "detection_method": "statistical",
+  "metric": "treasury.receipts_per_minute",
+  "node": "did:vm:node:brick-01",
+  "expected_range": {"min": 10, "max": 50},
+  "observed_value": 2,
+  "deviation_sigma": 4.2,
+  "confidence": 0.98,
+  "possible_causes": [
+    "upstream service degradation",
+    "network partition",
+    "configuration change"
+  ],
+  "correlated_events": ["alert-2025-12-06-001"],
+  "tags": ["observability", "anomaly", "treasury"],
+  "root_hash": "blake3:iii999..."
+}
+```
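+
+For a `statistical` detection method, `deviation_sigma` can be read as the distance between the observed value and the historical mean, measured in standard deviations. The sketch below is one plausible way to compute it, not the documented detector: the history values, the 3-sigma threshold, and the `deviation_sigma` helper are all assumptions made for illustration.
+
+```python
+from statistics import mean, stdev
+
+
+def deviation_sigma(history, observed):
+    """Distance of `observed` from the historical mean, in sample standard deviations."""
+    mu = mean(history)
+    sigma = stdev(history)
+    if sigma == 0:
+        # A flat history has no spread; any different value is infinitely far away.
+        return 0.0 if observed == mu else float("inf")
+    return abs(observed - mu) / sigma
+
+
+# Hypothetical receipts-per-minute history for the treasury scroll.
+history = [28, 31, 35, 27, 33, 30, 26, 34, 29, 32]
+score = deviation_sigma(history, observed=2)
+is_anomaly = score >= 3.0  # hypothetical threshold
+print(f"sigma={score:.1f} anomaly={is_anomaly}")
+```
+
+A production detector would likely account for seasonality and trend; a plain z-score is only the simplest baseline.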
+
+---
+
+### 3.3 Ledger Layer (L3)
+
+**Receipt Types**:
+
+| Type                   | When Emitted                          |
+| ---------------------- | ------------------------------------- |
+| `obs_metric_snapshot`  | Hourly metric aggregation             |
+| `obs_log_batch`        | Hourly log batch sealed               |
+| `obs_trace_complete`   | Significant trace completed           |
+| `obs_alert_fired`      | Alert triggered                       |
+| `obs_alert_resolved`   | Alert resolved                        |
+| `obs_slo_report`       | Daily SLO report                      |
+| `obs_anomaly_detected` | Statistical anomaly detected          |
+
+**Merkle Coverage**:
+- All receipts append to `receipts/observability/observability_events.jsonl`
+- `ROOT.observability.txt` is updated after each append
+- Guardian anchors Observability root in anchor cycles
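+
+The append-then-update-root flow can be sketched as follows. This is an illustration only: the document does not specify the Merkle construction, so the pairwise tree, the leaf and node prefixes, the use of stdlib BLAKE2b in place of BLAKE3, and the `append_receipt` helper are all assumptions.
+
+```python
+import hashlib
+import json
+import os
+import tempfile
+
+
+def _h(data: bytes) -> bytes:
+    # Stand-in: stdlib BLAKE2b. The receipts in this document use BLAKE3.
+    return hashlib.blake2b(data, digest_size=32).digest()
+
+
+def merkle_root(leaves):
+    """Hash leaves pairwise up to one root; an unpaired node is carried up as-is."""
+    if not leaves:
+        return _h(b"")
+    level = list(leaves)
+    while len(level) > 1:
+        pairs = [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level) - 1, 2)]
+        if len(level) % 2:
+            pairs.append(level[-1])
+        level = pairs
+    return level[0]
+
+
+def append_receipt(jsonl_path, root_path, receipt):
+    """Append one receipt line, then recompute and store the scroll root."""
+    line = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
+    with open(jsonl_path, "a", encoding="utf-8") as f:
+        f.write(line + "\n")
+    with open(jsonl_path, "rb") as f:
+        leaves = [_h(b"\x00" + raw.rstrip(b"\n")) for raw in f if raw.strip()]
+    root = merkle_root(leaves).hex()
+    with open(root_path, "w", encoding="utf-8") as f:
+        f.write(root + "\n")
+    return root
+
+
+with tempfile.TemporaryDirectory() as tmp:
+    jsonl = os.path.join(tmp, "observability_events.jsonl")
+    root_file = os.path.join(tmp, "ROOT.observability.txt")
+    root_1 = append_receipt(jsonl, root_file, {"type": "obs_alert_fired", "alert_id": "alert-2025-12-06-001"})
+    root_2 = append_receipt(jsonl, root_file, {"type": "obs_alert_resolved", "alert_id": "alert-2025-12-06-001"})
+    with open(root_file, encoding="utf-8") as f:
+        stored_root = f.read().strip()
+
+print(root_1 != root_2, stored_root == root_2)  # True True
+```
+
+Recomputing the root from every line on each append is linear in the scroll size; it keeps the sketch short, and an incremental tree would be the natural optimization.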
+
+---
+
+## 4. Query Interface
+
+`observability_query_events.py`, shown here through the `vm-obs query` subcommand:
+
+```bash
+# Metric snapshots
+vm-obs query --type metric_snapshot --from 2025-12-01
+
+# Log batches with errors
+vm-obs query --type log_batch --filter "log_counts.error > 0"
+
+# Traces over 5 seconds
+vm-obs query --type trace_complete --filter "duration_ms > 5000"
+
+# All alerts for a node
+vm-obs query --type alert_fired,alert_resolved --node brick-02
+
+# SLO reports with missed targets
+vm-obs query --type slo_report --filter "overall_status != 'healthy'"
+
+# Anomalies in last 7 days
+vm-obs query --type anomaly_detected --last 7d
+
+# Export for analysis
+vm-obs query --from 2025-12-01 --format parquet > observability_dec.parquet
+```
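+
+Because receipts live in an append-only JSONL file, a query reduces to a filtered scan over lines. The sketch below shows the idea and is not the actual contents of `observability_query_events.py`: `query_events` is a hypothetical function, filters are plain Python callables rather than the `--filter` expression syntax, and the second sample batch ID is invented.
+
+```python
+import json
+
+
+def query_events(lines, types=None, predicate=None):
+    """Yield receipts whose type is in `types` and that satisfy `predicate`."""
+    for line in lines:
+        line = line.strip()
+        if not line:
+            continue  # tolerate blank lines in the scroll
+        event = json.loads(line)
+        if types and event.get("type") not in types:
+            continue
+        if predicate and not predicate(event):
+            continue
+        yield event
+
+
+# Inline sample standing in for receipts/observability/observability_events.jsonl.
+sample = [
+    '{"type": "obs_log_batch", "batch_id": "logs-2025-12-06-14", "log_counts": {"error": 12}}',
+    '{"type": "obs_log_batch", "batch_id": "logs-2025-12-06-15", "log_counts": {"error": 0}}',
+    '{"type": "obs_trace_complete", "trace_id": "trace-abc123...", "duration_ms": 2345}',
+]
+
+with_errors = list(
+    query_events(
+        sample,
+        types={"obs_log_batch"},
+        predicate=lambda e: e["log_counts"]["error"] > 0,
+    )
+)
+print([e["batch_id"] for e in with_errors])  # ['logs-2025-12-06-14']
+```
+
+Passing an open file handle instead of the inline list gives the same behavior against a real scroll.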
+
+**Correlation Tool**:
+```bash
+# Correlate events around a timestamp
+vm-obs correlate --timestamp "2025-12-06T14:35:00Z" --window 35m
+
+# Output:
+# Timeline around 2025-12-06T14:35:00Z (±35m):
+#
+# 14:20:00 [metric] brick-02 cpu_percent starts rising
+# 14:25:00 [log] guardian: "anchor queue depth increasing"
+# 14:30:00 [trace] trace-abc123 completed (2345ms, normal)
+# 14:30:00 [metric] brick-02 cpu_percent crosses 80%
+# 14:35:00 [alert] HighCPUUsage fired on brick-02
+# 14:40:00 [log] guardian: "processing backlog"
+# 14:45:00 [anomaly] treasury.receipts_per_minute low
+# 14:50:00 [log] guardian: "backlog cleared"
+# 15:10:00 [alert] HighCPUUsage resolved on brick-02
+```
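+
+At its core, correlation is a time-window filter across event kinds followed by a sort. A minimal sketch of that step follows; the `correlate` function and the in-memory event shape (`ts`, `kind`, `text`) are assumptions for illustration, not the tool's data model.
+
+```python
+from datetime import datetime, timedelta
+
+
+def correlate(events, center, window):
+    """Return events within +/- window of center, ordered by timestamp."""
+    lo, hi = center - window, center + window
+    hits = [e for e in events if lo <= e["ts"] <= hi]
+    return sorted(hits, key=lambda e: e["ts"])
+
+
+def at(hour, minute):
+    # Helper for building timestamps on the example day.
+    return datetime(2025, 12, 6, hour, minute)
+
+
+events = [
+    {"ts": at(14, 45), "kind": "anomaly", "text": "treasury.receipts_per_minute low"},
+    {"ts": at(14, 35), "kind": "alert", "text": "HighCPUUsage fired on brick-02"},
+    {"ts": at(15, 10), "kind": "alert", "text": "HighCPUUsage resolved on brick-02"},
+    {"ts": at(14, 25), "kind": "log", "text": "guardian: anchor queue depth increasing"},
+]
+
+timeline = correlate(events, center=at(14, 35), window=timedelta(minutes=15))
+for e in timeline:
+    print(f"{e['ts']:%H:%M:%S} [{e['kind']}] {e['text']}")
+```
+
+The window is a parameter, so widening it pulls later events such as the alert resolution into the same timeline.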
+
+---
+
+## 5. Design Gate Checklist
+
+| Question              | Observability Answer                                               |
+| --------------------- | ------------------------------------------------------------------ |
+| Clear entrypoint?     | ✅ CLI (`vm-obs`), MCP tools, Portal HTTP                          |
+| Contract produced?    | ✅ Implicit (continuous) + explicit for alert acks, SLO definitions |
+| State object?         | ✅ Time-series DBs, search indices (continuous state)              |
+| Receipts emitted?     | ✅ Seven receipt types covering all observability events           |
+| Append-only JSONL?    | ✅ `receipts/observability/observability_events.jsonl`             |
+| Merkle root?          | ✅ `ROOT.observability.txt`                                        |
+| Guardian anchor path? | ✅ Observability root included in ProofChain                       |
+| Query tool?           | ✅ `observability_query_events.py` + correlation tool              |
+
+---
+
+## 6. Data Pipeline
+
+### 6.1 Collection Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                         BRICK Nodes                              │
+│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐            │
+│  │ brick-01│  │ brick-02│  │ brick-03│  │portal-01│            │
+│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘            │
+│       │            │            │            │                   │
+│       ▼            ▼            ▼            ▼                   │
+│  ┌─────────────────────────────────────────────────────────┐    │
+│  │                    Collection Layer                      │    │
+│  │  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐  │    │
+│  │  │Prometheus│  │Fluent Bit│  │    OTel Collector    │  │    │
+│  │  │ (metrics)│  │  (logs)  │  │       (traces)       │  │    │
+│  │  └────┬─────┘  └────┬─────┘  └──────────┬───────────┘  │    │
+│  └───────┼─────────────┼───────────────────┼──────────────┘    │
+│          │             │                   │                     │
+│          ▼             ▼                   ▼                     │
+│  ┌─────────────────────────────────────────────────────────┐    │
+│  │                    Storage Layer                         │    │
+│  │  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐  │    │
+│  │  │VictoriaM │  │ Loki/    │  │    Tempo/Jaeger      │  │    │
+│  │  │(metrics) │  │OpenSearch│  │      (traces)        │  │    │
+│  │  └────┬─────┘  └────┬─────┘  └──────────┬───────────┘  │    │
+│  └───────┼─────────────┼───────────────────┼──────────────┘    │
+│          │             │                   │                     │
+│          ▼             ▼                   ▼                     │
+│  ┌─────────────────────────────────────────────────────────┐    │
+│  │                   Receipt Layer                          │    │
+│  │  ┌──────────────────────────────────────────────────┐  │    │
+│  │  │           Observability Receipt Emitter           │  │    │
+│  │  │   (hourly snapshots, alerts, SLOs, anomalies)     │  │    │
+│  │  └──────────────────────────────────────────────────┘  │    │
+│  └─────────────────────────────────────────────────────────┘    │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### 6.2 Retention Policies
+
+| Data Type          | Hot Storage    | Warm Storage   | Cold/Archive   | Receipt   |
+| ------------------ | -------------- | -------------- | -------------- | --------- |
+| Metrics (raw)      | 7 days         | 30 days        | 1 year         | Hourly    |
+| Metrics (1h agg)   | 30 days        | 1 year         | 5 years        | Hourly    |
+| Logs (all)         | 7 days         | 30 days        | 1 year         | Hourly    |
+| Logs (error+)      | 30 days        | 1 year         | 5 years        | Hourly    |
+| Traces (sampled)   | 7 days         | 30 days        | None           | Per-trace |
+| Traces (errors)    | 30 days        | 1 year         | 5 years        | Per-trace |
+| Alerts             | Indefinite     | Indefinite     | Indefinite     | Per-event |
+| SLO Reports        | Indefinite     | Indefinite     | Indefinite     | Daily     |
+
+### 6.3 Sampling Strategy
+
+```json
+{
+  "sampling_rules": [
+    {
+      "name": "always_sample_errors",
+      "condition": "status == 'error' OR level >= 'error'",
+      "rate": 1.0
+    },
+    {
+      "name": "always_sample_slow",
+      "condition": "duration_ms > 5000",
+      "rate": 1.0
+    },
+    {
+      "name": "always_sample_sensitive",
+      "condition": "service IN ['treasury', 'identity', 'offsec']",
+      "rate": 1.0
+    },
+    {
+      "name": "default_traces",
+      "condition": "true",
+      "rate": 0.1
+    }
+  ]
+}
+```
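+
+One way to read these rules is as an ordered list where the first matching condition decides the sampling rate. That first-match interpretation is an assumption, as is the translation of the string conditions into Python callables below; the sketch also models only the `status` half of the first rule and leaves out the log-level comparison.
+
+```python
+import random
+
+# (name, condition, rate) in priority order; conditions mirror the JSON above.
+RULES = [
+    ("always_sample_errors", lambda t: t.get("status") == "error", 1.0),
+    ("always_sample_slow", lambda t: t.get("duration_ms", 0) > 5000, 1.0),
+    ("always_sample_sensitive", lambda t: t.get("service") in {"treasury", "identity", "offsec"}, 1.0),
+    ("default_traces", lambda t: True, 0.1),
+]
+
+
+def sampling_decision(trace, rng=random.random):
+    """Return (rule_name, keep) using the first rule whose condition matches."""
+    for name, condition, rate in RULES:
+        if condition(trace):
+            # rng() is in [0, 1), so a rate of 1.0 always keeps the trace.
+            return name, rng() < rate
+    return None, False
+
+
+print(sampling_decision({"status": "error", "duration_ms": 120}))
+# ('always_sample_errors', True)
+print(sampling_decision({"status": "ok", "duration_ms": 120, "service": "portal"})[0])
+# default_traces
+```
+
+Injecting `rng` keeps the decision testable; in production the default `random.random` gives the 10% baseline rate.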
+
+---
+
+## 7. Alerting Framework
+
+### 7.1 Alert Rules
+
+```yaml
+groups:
+  - name: vaultmesh-critical
+    rules:
+      - alert: NodeDown
+        expr: up == 0
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Node {{ $labels.node }} is down"
+          runbook: https://docs.vaultmesh.io/runbooks/node-down
+
+      - alert: AnchorBacklogHigh
+        expr: guardian_anchor_queue_depth > 100
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Anchor queue depth is {{ $value }}"
+
+      - alert: SLOBudgetBurning
+        expr: slo_error_budget_remaining_percent < 25
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "SLO {{ $labels.slo }} error budget at {{ $value }}%"
+```
+
+### 7.2 Notification Channels
+
+| Severity    | Channels                              | Response Time |
+| ----------- | ------------------------------------- | ------------- |
+| `critical`  | PagerDuty, SMS, Slack #critical       | Immediate     |
+| `high`      | PagerDuty, Slack #alerts              | 15 minutes    |
+| `warning`   | Slack #alerts, Email                  | 1 hour        |
+| `info`      | Slack #ops                            | Best effort   |
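+
+The table above maps directly onto a lookup keyed by severity. The sketch below is illustrative only: `ROUTES` and `channels_for` are hypothetical names, and the channel strings are copied from the table rather than taken from any real notifier configuration.
+
+```python
+# Severity-to-channel routing, transcribed from the table above.
+ROUTES = {
+    "critical": ["PagerDuty", "SMS", "Slack #critical"],
+    "high": ["PagerDuty", "Slack #alerts"],
+    "warning": ["Slack #alerts", "Email"],
+    "info": ["Slack #ops"],
+}
+
+
+def channels_for(severity: str) -> list:
+    """Return the notification channels for a severity, rejecting unknown values."""
+    try:
+        return list(ROUTES[severity])  # copy so callers cannot mutate the table
+    except KeyError:
+        raise ValueError(f"unknown severity: {severity!r}") from None
+
+
+print(channels_for("warning"))  # ['Slack #alerts', 'Email']
+```
+
+Failing loudly on an unknown severity avoids silently dropping an alert whose label was mistyped.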
+
+---
+
+## 8. Integration Points
+
+| System           | Integration                                                              |
+| ---------------- | ------------------------------------------------------------------------ |
+| **Guardian**     | Emits anchor metrics/traces; alerts on anchor failures                   |
+| **Treasury**     | Transaction metrics; latency SLOs; receipt throughput                    |
+| **Identity**     | Auth event logs; failed login alerts; session metrics                    |
+| **Mesh**         | Node health metrics; route latency; topology change logs                 |
+| **OffSec**       | Security event correlation; incident timeline enrichment                 |
+| **Oracle**       | Query latency metrics; confidence score distributions                    |
+| **Automation**   | Workflow execution traces; n8n performance metrics                       |
+
+---
+
+## 9. Future Extensions
+
+- **AI-powered anomaly detection**: ML models for predictive alerting
+- **Distributed tracing visualization**: Real-time trace graphs in Portal
+- **Log pattern mining**: Automatic extraction of error patterns
+- **Chaos engineering integration**: Correlate chaos experiments with the telemetry they produce
+- **Cost attribution**: Resource usage per scroll/service for Treasury billing
+- **Compliance dashboards**: Real-time compliance posture visualization