# VAULTMESH-OBSERVABILITY-ENGINE.md

**Civilization Ledger Telemetry Primitive**

> *Every metric tells a story. Every trace has a receipt.*

Observability is VaultMesh's nervous system — capturing metrics, logs, and traces across all nodes and services, with cryptographic attestation that the telemetry itself hasn't been tampered with.

---

## 1. Scroll Definition

| Property          | Value |
| ----------------- | ----- |
| **Scroll Name**   | `Observability` |
| **JSONL Path**    | `receipts/observability/observability_events.jsonl` |
| **Root File**     | `ROOT.observability.txt` |
| **Receipt Types** | `obs_metric_snapshot`, `obs_log_batch`, `obs_trace_complete`, `obs_alert_fired`, `obs_alert_resolved`, `obs_slo_report`, `obs_anomaly_detected` |

---

## 2. Core Concepts

### 2.1 Metrics

**Metrics** are time-series numerical measurements from nodes and services.

```json
{
  "metric_id": "metric:brick-01:cpu:2025-12-06T14:30:00Z",
  "node": "did:vm:node:brick-01",
  "timestamp": "2025-12-06T14:30:00Z",
  "metrics": {
    "cpu_percent": 23.5,
    "memory_percent": 67.2,
    "disk_percent": 45.8,
    "network_rx_bytes": 1234567890,
    "network_tx_bytes": 987654321,
    "open_file_descriptors": 342,
    "goroutines": 156
  },
  "labels": {
    "environment": "production",
    "region": "eu-west",
    "service": "guardian"
  },
  "collection_method": "prometheus_scrape",
  "scrape_duration_ms": 45
}
```

**Metric categories**:

- `system` — CPU, memory, disk, network
- `application` — request rates, latencies, error rates
- `business` — receipts/hour, anchors/day, oracle queries
- `security` — auth attempts, failed logins, blocked IPs
- `mesh` — route latencies, node health, capability usage

### 2.2 Logs

**Logs** are structured event records from all system components.

```json
{
  "log_id": "log:guardian:2025-12-06T14:30:15.123Z",
  "timestamp": "2025-12-06T14:30:15.123Z",
  "level": "info",
  "service": "guardian",
  "node": "did:vm:node:brick-01",
  "message": "Anchor cycle completed successfully",
  "attributes": {
    "cycle_id": "anchor-cycle-2025-12-06-001",
    "receipts_anchored": 47,
    "scrolls_included": ["treasury", "mesh", "identity"],
    "duration_ms": 1234,
    "backend": "bitcoin"
  },
  "trace_id": "trace-abc123...",
  "span_id": "span-def456...",
  "caller": "guardian/anchor.go:234"
}
```

**Log levels**:

- `trace` — verbose debugging (not retained long-term)
- `debug` — debugging information
- `info` — normal operations
- `warn` — unexpected but handled conditions
- `error` — errors requiring attention
- `fatal` — system failures

### 2.3 Traces

**Traces** track request flows across distributed components.

```json
{
  "trace_id": "trace-abc123...",
  "name": "treasury_settlement",
  "start_time": "2025-12-06T14:30:00.000Z",
  "end_time": "2025-12-06T14:30:02.345Z",
  "duration_ms": 2345,
  "status": "ok",
  "spans": [
    {
      "span_id": "span-001",
      "parent_span_id": null,
      "name": "http_request",
      "service": "portal",
      "node": "did:vm:node:portal-01",
      "start_time": "2025-12-06T14:30:00.000Z",
      "duration_ms": 2340,
      "attributes": {
        "http.method": "POST",
        "http.url": "/treasury/settle",
        "http.status_code": 200
      }
    },
    {
      "span_id": "span-002",
      "parent_span_id": "span-001",
      "name": "validate_settlement",
      "service": "treasury-engine",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.100Z",
      "duration_ms": 150,
      "attributes": {
        "settlement_id": "settle-2025-12-06-001",
        "accounts_involved": 3
      }
    },
    {
      "span_id": "span-003",
      "parent_span_id": "span-001",
      "name": "emit_receipt",
      "service": "ledger",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.250Z",
      "duration_ms": 50,
      "attributes": {
        "receipt_type": "treasury_settlement",
        "scroll": "treasury"
      }
    },
    {
      "span_id": "span-004",
      "parent_span_id": "span-001",
      "name": "anchor_request",
      "service": "guardian",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.300Z",
      "duration_ms": 2000,
      "attributes": {
        "backend": "bitcoin",
        "txid": "btc:abc123..."
      }
    }
  ],
  "tags": ["treasury", "settlement", "anchor"]
}
```

### 2.4 Alerts

**Alerts** are triggered conditions requiring attention.

```json
{
  "alert_id": "alert-2025-12-06-001",
  "name": "HighCPUUsage",
  "severity": "warning",
  "status": "firing",
  "fired_at": "2025-12-06T14:35:00Z",
  "node": "did:vm:node:brick-02",
  "rule": {
    "expression": "cpu_percent > 80 for 5m",
    "threshold": 80,
    "duration": "5m"
  },
  "current_value": 87.3,
  "labels": {
    "environment": "production",
    "region": "eu-west"
  },
  "annotations": {
    "summary": "CPU usage above 80% for 5 minutes",
    "runbook": "https://docs.vaultmesh.io/runbooks/high-cpu"
  },
  "notified": ["slack:ops-channel", "pagerduty:on-call"]
}
```

### 2.5 SLO Reports

**SLO (Service Level Objective) Reports** track reliability targets.

```json
{
  "slo_id": "slo:anchor-latency-p99",
  "name": "Anchor Latency P99",
  "description": "99th percentile anchor latency under 30 seconds",
  "target": 0.999,
  "window": "30d",
  "report_period": {
    "start": "2025-11-06T00:00:00Z",
    "end": "2025-12-06T00:00:00Z"
  },
  "achieved": 0.9995,
  "status": "met",
  "error_budget": {
    "total_minutes": 43.2,
    "consumed_minutes": 21.6,
    "remaining_percent": 50.0
  },
  "breakdown": {
    "total_requests": 125000,
    "good_requests": 124937,
    "bad_requests": 63
  },
  "trend": "stable"
}
```

---

## 3. Mapping to Eternal Pattern

### 3.1 Experience Layer (L1)

**CLI** (`vm-obs`):

```bash
# Metrics
vm-obs metrics query --node brick-01 --metric cpu_percent --last 1h
vm-obs metrics list --node brick-01
vm-obs metrics export --from 2025-12-01 --to 2025-12-06 --format prometheus

# Logs
vm-obs logs query --service guardian --level error --last 24h
vm-obs logs tail --node brick-01 --follow
vm-obs logs search "anchor failed" --from 2025-12-01

# Traces
vm-obs trace show trace-abc123
vm-obs trace search --service treasury --duration ">1s" --last 24h
vm-obs trace analyze trace-abc123 --find-bottleneck

# Alerts
vm-obs alert list --status firing
vm-obs alert show alert-2025-12-06-001
vm-obs alert ack alert-2025-12-06-001 --comment "investigating"
vm-obs alert silence --node brick-02 --duration 1h --reason "maintenance"

# SLOs
vm-obs slo list
vm-obs slo show slo:anchor-latency-p99
vm-obs slo report --period 30d --format markdown

# Dashboards
vm-obs dashboard list
vm-obs dashboard show system-overview
vm-obs dashboard export system-overview --format grafana
```

**MCP Tools**:

- `obs_metrics_query` — query metrics for a node/service
- `obs_logs_search` — search logs with filters
- `obs_trace_get` — retrieve trace details
- `obs_alert_status` — current alert status
- `obs_slo_summary` — SLO compliance summary
- `obs_health_check` — overall system health

**Portal HTTP**:

- `GET /obs/metrics` — query metrics
- `GET /obs/logs` — search logs
- `GET /obs/traces` — list traces
- `GET /obs/traces/{trace_id}` — trace details
- `GET /obs/alerts` — list alerts
- `POST /obs/alerts/{id}/ack` — acknowledge alert
- `POST /obs/alerts/silence` — create silence
- `GET /obs/slos` — list SLOs
- `GET /obs/slos/{id}/report` — SLO report
- `GET /obs/health` — system health

---

### 3.2 Engine Layer (L2)

#### Step 1 — Plan → Implicit (Continuous Collection)

Unlike discrete operations, observability collection is continuous.
However, certain operations have explicit contracts:

**Alert Acknowledgment Contract**:

```json
{
  "operation_id": "obs-op-2025-12-06-001",
  "operation_type": "alert_acknowledge",
  "alert_id": "alert-2025-12-06-001",
  "acknowledged_by": "did:vm:user:sovereign",
  "acknowledged_at": "2025-12-06T14:40:00Z",
  "comment": "Investigating high CPU on brick-02, likely due to anchor backlog",
  "escalation_suppressed": true,
  "follow_up_required": true,
  "follow_up_deadline": "2025-12-06T16:00:00Z"
}
```

**SLO Definition Contract**:

```json
{
  "operation_id": "obs-op-2025-12-06-002",
  "operation_type": "slo_create",
  "initiated_by": "did:vm:user:sovereign",
  "slo": {
    "id": "slo:oracle-availability",
    "name": "Oracle Availability",
    "description": "Oracle service uptime",
    "indicator": {
      "type": "availability",
      "good_query": "oracle_up == 1",
      "total_query": "count(oracle_requests)"
    },
    "target": 0.999,
    "window": "30d"
  }
}
```

#### Step 2 — Execute → Continuous Collection

Metrics, logs, and traces are collected continuously via:

- Prometheus scraping (metrics)
- Fluent Bit/Vector (logs)
- OpenTelemetry SDK (traces)

State is maintained in time-series databases and search indices, not as discrete state files.

#### Step 3 — Seal → Receipts

**Metric Snapshot Receipt** (hourly):

```json
{
  "type": "obs_metric_snapshot",
  "snapshot_id": "metrics-2025-12-06-14",
  "timestamp": "2025-12-06T14:00:00Z",
  "period": {
    "start": "2025-12-06T13:00:00Z",
    "end": "2025-12-06T14:00:00Z"
  },
  "nodes_reporting": 5,
  "metrics_collected": 15000,
  "aggregates": {
    "avg_cpu_percent": 34.5,
    "max_cpu_percent": 87.3,
    "avg_memory_percent": 62.1,
    "total_receipts_emitted": 1247,
    "total_anchors_completed": 12
  },
  "storage_path": "telemetry/metrics/2025-12-06/hour-14.parquet",
  "content_hash": "blake3:aaa111...",
  "tags": ["observability", "metrics", "hourly"],
  "root_hash": "blake3:bbb222..."
}
```

**Log Batch Receipt** (hourly):

```json
{
  "type": "obs_log_batch",
  "batch_id": "logs-2025-12-06-14",
  "timestamp": "2025-12-06T14:00:00Z",
  "period": {
    "start": "2025-12-06T13:00:00Z",
    "end": "2025-12-06T14:00:00Z"
  },
  "log_counts": {
    "trace": 0,
    "debug": 12456,
    "info": 45678,
    "warn": 234,
    "error": 12,
    "fatal": 0
  },
  "services_reporting": ["guardian", "treasury", "portal", "oracle", "mesh"],
  "storage_path": "telemetry/logs/2025-12-06/hour-14.jsonl.gz",
  "content_hash": "blake3:ccc333...",
  "tags": ["observability", "logs", "hourly"],
  "root_hash": "blake3:ddd444..."
}
```

**Trace Complete Receipt** (for significant traces):

```json
{
  "type": "obs_trace_complete",
  "trace_id": "trace-abc123...",
  "timestamp": "2025-12-06T14:30:02.345Z",
  "name": "treasury_settlement",
  "duration_ms": 2345,
  "status": "ok",
  "span_count": 4,
  "services_involved": ["portal", "treasury-engine", "ledger", "guardian"],
  "nodes_involved": ["portal-01", "brick-01"],
  "triggered_by": "did:vm:user:sovereign",
  "business_context": {
    "settlement_id": "settle-2025-12-06-001",
    "amount": "1000.00 USD"
  },
  "tags": ["observability", "trace", "treasury", "settlement"],
  "root_hash": "blake3:eee555..."
}
```

**Alert Fired Receipt**:

```json
{
  "type": "obs_alert_fired",
  "alert_id": "alert-2025-12-06-001",
  "timestamp": "2025-12-06T14:35:00Z",
  "name": "HighCPUUsage",
  "severity": "warning",
  "node": "did:vm:node:brick-02",
  "rule_expression": "cpu_percent > 80 for 5m",
  "current_value": 87.3,
  "threshold": 80,
  "notifications_sent": ["slack:ops-channel", "pagerduty:on-call"],
  "tags": ["observability", "alert", "fired", "cpu"],
  "root_hash": "blake3:fff666..."
}
```

**Alert Resolved Receipt**:

```json
{
  "type": "obs_alert_resolved",
  "alert_id": "alert-2025-12-06-001",
  "timestamp": "2025-12-06T15:10:00Z",
  "name": "HighCPUUsage",
  "fired_at": "2025-12-06T14:35:00Z",
  "duration_minutes": 35,
  "resolved_by": "automatic",
  "resolution_value": 42.1,
  "acknowledged": true,
  "acknowledged_by": "did:vm:user:sovereign",
  "root_cause": "anchor backlog cleared",
  "tags": ["observability", "alert", "resolved"],
  "root_hash": "blake3:ggg777..."
}
```

**SLO Report Receipt** (daily):

```json
{
  "type": "obs_slo_report",
  "report_id": "slo-report-2025-12-06",
  "timestamp": "2025-12-06T00:00:00Z",
  "period": {
    "start": "2025-11-06T00:00:00Z",
    "end": "2025-12-06T00:00:00Z"
  },
  "slos": [
    {
      "slo_id": "slo:anchor-latency-p99",
      "target": 0.999,
      "achieved": 0.9995,
      "status": "met"
    },
    {
      "slo_id": "slo:oracle-availability",
      "target": 0.999,
      "achieved": 0.9987,
      "status": "at_risk"
    }
  ],
  "overall_status": "healthy",
  "error_budget_status": "sufficient",
  "report_path": "reports/slo/2025-12-06.json",
  "tags": ["observability", "slo", "daily-report"],
  "root_hash": "blake3:hhh888..."
}
```

**Anomaly Detection Receipt**:

```json
{
  "type": "obs_anomaly_detected",
  "anomaly_id": "anomaly-2025-12-06-001",
  "timestamp": "2025-12-06T14:45:00Z",
  "detection_method": "statistical",
  "metric": "treasury.receipts_per_minute",
  "node": "did:vm:node:brick-01",
  "expected_range": {"min": 10, "max": 50},
  "observed_value": 2,
  "deviation_sigma": 4.2,
  "confidence": 0.98,
  "possible_causes": [
    "upstream service degradation",
    "network partition",
    "configuration change"
  ],
  "correlated_events": ["alert-2025-12-06-001"],
  "tags": ["observability", "anomaly", "treasury"],
  "root_hash": "blake3:iii999..."
}
```

---

### 3.3 Ledger Layer (L3)

**Receipt Types**:

| Type                   | When Emitted                 |
| ---------------------- | ---------------------------- |
| `obs_metric_snapshot`  | Hourly metric aggregation    |
| `obs_log_batch`        | Hourly log batch sealed      |
| `obs_trace_complete`   | Significant trace completed  |
| `obs_alert_fired`      | Alert triggered              |
| `obs_alert_resolved`   | Alert resolved               |
| `obs_slo_report`       | Daily SLO report             |
| `obs_anomaly_detected` | Statistical anomaly detected |

**Merkle Coverage**:

- All receipts append to `receipts/observability/observability_events.jsonl`
- `ROOT.observability.txt` updated after each append
- Guardian anchors Observability root in anchor cycles

---

## 4. Query Interface

`observability_query_events.py`:

```bash
# Metric snapshots
vm-obs query --type metric_snapshot --from 2025-12-01

# Log batches with errors
vm-obs query --type log_batch --filter "log_counts.error > 0"

# Traces over 5 seconds
vm-obs query --type trace_complete --filter "duration_ms > 5000"

# All alerts for a node
vm-obs query --type alert_fired,alert_resolved --node brick-02

# SLO reports with missed targets
vm-obs query --type slo_report --filter "overall_status != 'healthy'"

# Anomalies in last 7 days
vm-obs query --type anomaly_detected --last 7d

# Export for analysis
vm-obs query --from 2025-12-01 --format parquet > observability_dec.parquet
```

**Correlation Tool**:

```bash
# Correlate events around a timestamp
vm-obs correlate --timestamp "2025-12-06T14:35:00Z" --window 15m

# Output:
# Timeline around 2025-12-06T14:35:00Z (±15m):
#
# 14:20:00 [metric] brick-02 cpu_percent starts rising
# 14:25:00 [log] guardian: "anchor queue depth increasing"
# 14:30:00 [trace] trace-abc123 completed (2345ms, normal)
# 14:32:00 [metric] brick-02 cpu_percent crosses 80%
# 14:35:00 [alert] HighCPUUsage fired on brick-02
# 14:40:00 [log] guardian: "processing backlog"
# 14:45:00 [anomaly] treasury.receipts_per_minute low
# 14:50:00 [log] guardian: "backlog cleared"
# 15:10:00 [alert] HighCPUUsage resolved on brick-02
```

---

## 5. Design Gate Checklist

| Question              | Observability Answer                                               |
| --------------------- | ------------------------------------------------------------------ |
| Clear entrypoint?     | ✅ CLI (`vm-obs`), MCP tools, Portal HTTP                           |
| Contract produced?    | ✅ Implicit (continuous) + explicit for alert acks, SLO definitions |
| State object?         | ✅ Time-series DBs, search indices (continuous state)               |
| Receipts emitted?     | ✅ Seven receipt types covering all observability events            |
| Append-only JSONL?    | ✅ `receipts/observability/observability_events.jsonl`              |
| Merkle root?          | ✅ `ROOT.observability.txt`                                         |
| Guardian anchor path? | ✅ Observability root included in ProofChain                        |
| Query tool?           | ✅ `observability_query_events.py` + correlation tool               |

---

## 6. Data Pipeline

### 6.1 Collection Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                           BRICK Nodes                           │
│      ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│      │ brick-01│   │ brick-02│   │ brick-03│   │portal-01│      │
│      └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘      │
│           │             │             │             │           │
│           ▼             ▼             ▼             ▼           │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    Collection Layer                     │   │
│   │  ┌──────────┐  ┌──────────┐  ┌───────────────────────┐  │   │
│   │  │Prometheus│  │Fluent Bit│  │OpenTelemetry Collector│  │   │
│   │  │ (metrics)│  │  (logs)  │  │       (traces)        │  │   │
│   │  └────┬─────┘  └────┬─────┘  └───────────┬───────────┘  │   │
│   └───────┼─────────────┼────────────────────┼──────────────┘   │
│           │             │                    │                  │
│           ▼             ▼                    ▼                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                      Storage Layer                      │   │
│   │  ┌──────────┐  ┌──────────┐  ┌───────────────────────┐  │   │
│   │  │VictoriaM │  │  Loki/   │  │     Tempo/Jaeger      │  │   │
│   │  │(metrics) │  │OpenSearch│  │       (traces)        │  │   │
│   │  └────┬─────┘  └────┬─────┘  └───────────┬───────────┘  │   │
│   └───────┼─────────────┼────────────────────┼──────────────┘   │
│           │             │                    │                  │
│           ▼             ▼                    ▼                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                      Receipt Layer                      │   │
│   │  ┌───────────────────────────────────────────────────┐  │   │
│   │  │           Observability Receipt Emitter           │  │   │
│   │  │    (hourly snapshots, alerts, SLOs, anomalies)    │  │   │
│   │  └───────────────────────────────────────────────────┘  │   │
│   └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
```

### 6.2 Retention Policies

| Data Type        | Hot Storage | Warm Storage | Cold/Archive | Receipt   |
| ---------------- | ----------- | ------------ | ------------ | --------- |
| Metrics (raw)    | 7 days      | 30 days      | 1 year       | Hourly    |
| Metrics (1h agg) | 30 days     | 1 year       | 5 years      | Hourly    |
| Logs (all)       | 7 days      | 30 days      | 1 year       | Hourly    |
| Logs (error+)    | 30 days     | 1 year       | 5 years      | Hourly    |
| Traces (sampled) | 7 days      | 30 days      | —            | Per-trace |
| Traces (errors)  | 30 days     | 1 year       | 5 years      | Per-trace |
| Alerts           | Indefinite  | Indefinite   | Indefinite   | Per-event |
| SLO Reports      | Indefinite  | Indefinite   | Indefinite   | Daily     |

### 6.3 Sampling Strategy

```json
{
  "sampling_rules": [
    {
      "name": "always_sample_errors",
      "condition": "status == 'error' OR level >= 'error'",
      "rate": 1.0
    },
    {
      "name": "always_sample_slow",
      "condition": "duration_ms > 5000",
      "rate": 1.0
    },
    {
      "name": "always_sample_sensitive",
      "condition": "service IN ['treasury', 'identity', 'offsec']",
      "rate": 1.0
    },
    {
      "name": "default_traces",
      "condition": "true",
      "rate": 0.1
    }
  ]
}
```

---

## 7. Alerting Framework

### 7.1 Alert Rules

```yaml
groups:
  - name: vaultmesh-critical
    rules:
      - alert: NodeDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is down"
          runbook: https://docs.vaultmesh.io/runbooks/node-down

      - alert: AnchorBacklogHigh
        expr: guardian_anchor_queue_depth > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Anchor queue depth is {{ $value }}"

      - alert: SLOBudgetBurning
        expr: slo_error_budget_remaining_percent < 25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SLO {{ $labels.slo }} error budget at {{ $value }}%"
```

### 7.2 Notification Channels

| Severity   | Channels                        | Response Time |
| ---------- | ------------------------------- | ------------- |
| `critical` | PagerDuty, SMS, Slack #critical | Immediate     |
| `high`     | PagerDuty, Slack #alerts        | 15 minutes    |
| `warning`  | Slack #alerts, Email            | 1 hour        |
| `info`     | Slack #ops                      | Best effort   |

---

## 8. Integration Points

| System         | Integration                                              |
| -------------- | -------------------------------------------------------- |
| **Guardian**   | Emits anchor metrics/traces; alerts on anchor failures   |
| **Treasury**   | Transaction metrics; latency SLOs; receipt throughput    |
| **Identity**   | Auth event logs; failed login alerts; session metrics    |
| **Mesh**       | Node health metrics; route latency; topology change logs |
| **OffSec**     | Security event correlation; incident timeline enrichment |
| **Oracle**     | Query latency metrics; confidence score distributions    |
| **Automation** | Workflow execution traces; n8n performance metrics       |

---

## 9. Future Extensions

- **AI-powered anomaly detection**: ML models for predictive alerting
- **Distributed tracing visualization**: Real-time trace graphs in Portal
- **Log pattern mining**: Automatic extraction of error patterns
- **Chaos engineering integration**: Correlate chaos experiments with observability
- **Cost attribution**: Resource usage per scroll/service for Treasury billing
- **Compliance dashboards**: Real-time compliance posture visualization
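
---

The `error_budget` figures in the SLO report of Section 2.5 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only. It assumes the budget is time-based (window length multiplied by `1 - target`) and that consumption scales with `1 - achieved`; this matches the sample numbers but is not confirmed as the engine's actual method. The names `ErrorBudget`, `error_budget`, and `achieved_ratio` are hypothetical.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class ErrorBudget:
    total_minutes: float
    consumed_minutes: float
    remaining_percent: float


def achieved_ratio(good_requests: int, total_requests: int) -> float:
    """Fraction of good requests, rounded to four decimals as in the sample report."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return round(good_requests / total_requests, 4)


def error_budget(target: float, achieved: float, window_days: int) -> ErrorBudget:
    """Time-based error budget (an assumption, not the documented method).

    total    = window minutes * (1 - target)
    consumed = window minutes * (1 - achieved), floored at zero
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    window_minutes = window_days * MINUTES_PER_DAY
    total = window_minutes * (1.0 - target)
    consumed = window_minutes * max(0.0, 1.0 - achieved)
    # Clamp at zero so an exhausted budget reports 0% rather than a negative value.
    remaining = max(0.0, 100.0 * (1.0 - consumed / total))
    return ErrorBudget(round(total, 2), round(consumed, 2), round(remaining, 1))


# Values taken from the sample report for slo:anchor-latency-p99.
achieved = achieved_ratio(good_requests=124937, total_requests=125000)
budget = error_budget(target=0.999, achieved=achieved, window_days=30)
print(achieved, budget)
```

With the sample values this gives an achieved ratio of 0.9995, 43.2 total budget minutes, 21.6 consumed, and 50.0% remaining, which agrees with the report. The 25% threshold used by the `SLOBudgetBurning` rule in Section 7.1 could be compared against `remaining_percent` in the same way.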