vm-core/docs/VAULTMESH-OBSERVABILITY-ENGINE.md
2025-12-27 00:10:32 +00:00

VAULTMESH-OBSERVABILITY-ENGINE.md

Civilization Ledger Telemetry Primitive

Every metric tells a story. Every trace has a receipt.

Observability is VaultMesh's nervous system: it captures metrics, logs, and traces across all nodes and services, and cryptographically attests that the telemetry itself has not been tampered with.


1. Scroll Definition

| Property | Value |
|---|---|
| Scroll Name | Observability |
| JSONL Path | receipts/observability/observability_events.jsonl |
| Root File | ROOT.observability.txt |
| Receipt Types | obs_metric_snapshot, obs_log_batch, obs_trace_complete, obs_alert_fired, obs_alert_resolved, obs_slo_report, obs_anomaly_detected |

2. Core Concepts

2.1 Metrics

Metrics are numerical time-series measurements collected from nodes and services.

{
  "metric_id": "metric:brick-01:cpu:2025-12-06T14:30:00Z",
  "node": "did:vm:node:brick-01",
  "timestamp": "2025-12-06T14:30:00Z",
  "metrics": {
    "cpu_percent": 23.5,
    "memory_percent": 67.2,
    "disk_percent": 45.8,
    "network_rx_bytes": 1234567890,
    "network_tx_bytes": 987654321,
    "open_file_descriptors": 342,
    "goroutines": 156
  },
  "labels": {
    "environment": "production",
    "region": "eu-west",
    "service": "guardian"
  },
  "collection_method": "prometheus_scrape",
  "scrape_duration_ms": 45
}

Metric categories:

  • system — CPU, memory, disk, network
  • application — request rates, latencies, error rates
  • business — receipts/hour, anchors/day, oracle queries
  • security — auth attempts, failed logins, blocked IPs
  • mesh — route latencies, node health, capability usage

2.2 Logs

Logs are structured event records from all system components.

{
  "log_id": "log:guardian:2025-12-06T14:30:15.123Z",
  "timestamp": "2025-12-06T14:30:15.123Z",
  "level": "info",
  "service": "guardian",
  "node": "did:vm:node:brick-01",
  "message": "Anchor cycle completed successfully",
  "attributes": {
    "cycle_id": "anchor-cycle-2025-12-06-001",
    "receipts_anchored": 47,
    "scrolls_included": ["treasury", "mesh", "identity"],
    "duration_ms": 1234,
    "backend": "bitcoin"
  },
  "trace_id": "trace-abc123...",
  "span_id": "span-def456...",
  "caller": "guardian/anchor.go:234"
}

Log levels:

  • trace — verbose debugging (not retained long-term)
  • debug — debugging information
  • info — normal operations
  • warn — unexpected but handled conditions
  • error — errors requiring attention
  • fatal — system failures
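
Threshold expressions such as level >= 'error' in the sampling rules, and the "error+" retention class, only make sense if the levels are ordered. A minimal Python sketch, assuming the severity order is the order listed above; the LogLevel enum and at_least helper are illustrative names, not part of VaultMesh:

```python
from enum import IntEnum


class LogLevel(IntEnum):
    """Severity order assumed from the list above (lowest to highest)."""
    TRACE = 0
    DEBUG = 1
    INFO = 2
    WARN = 3
    ERROR = 4
    FATAL = 5


def at_least(level: str, threshold: str) -> bool:
    """Return True when `level` is at or above `threshold` in severity."""
    return LogLevel[level.upper()] >= LogLevel[threshold.upper()]


print(at_least("fatal", "error"))  # True
print(at_least("warn", "error"))   # False
```

Using an integer enum keeps comparisons explicit instead of relying on string ordering, which would sort "debug" above "error" alphabetically in the wrong direction for some pairs.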

2.3 Traces

Traces track request flows across distributed components.

{
  "trace_id": "trace-abc123...",
  "name": "treasury_settlement",
  "start_time": "2025-12-06T14:30:00.000Z",
  "end_time": "2025-12-06T14:30:02.345Z",
  "duration_ms": 2345,
  "status": "ok",
  "spans": [
    {
      "span_id": "span-001",
      "parent_span_id": null,
      "name": "http_request",
      "service": "portal",
      "node": "did:vm:node:portal-01",
      "start_time": "2025-12-06T14:30:00.000Z",
      "duration_ms": 2340,
      "attributes": {
        "http.method": "POST",
        "http.url": "/treasury/settle",
        "http.status_code": 200
      }
    },
    {
      "span_id": "span-002",
      "parent_span_id": "span-001",
      "name": "validate_settlement",
      "service": "treasury-engine",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.100Z",
      "duration_ms": 150,
      "attributes": {
        "settlement_id": "settle-2025-12-06-001",
        "accounts_involved": 3
      }
    },
    {
      "span_id": "span-003",
      "parent_span_id": "span-001",
      "name": "emit_receipt",
      "service": "ledger",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.250Z",
      "duration_ms": 50,
      "attributes": {
        "receipt_type": "treasury_settlement",
        "scroll": "treasury"
      }
    },
    {
      "span_id": "span-004",
      "parent_span_id": "span-001",
      "name": "anchor_request",
      "service": "guardian",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.300Z",
      "duration_ms": 2000,
      "attributes": {
        "backend": "bitcoin",
        "txid": "btc:abc123..."
      }
    }
  ],
  "tags": ["treasury", "settlement", "anchor"]
}
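
The parent_span_id links make the trace a tree, so a slow request can be narrowed down by comparing the children of a span. The sketch below is one plausible reading of what a bottleneck search could do, not the behavior of vm-obs trace analyze; find_bottleneck is a hypothetical helper and the spans are reduced to the fields it needs:

```python
from typing import Optional


def find_bottleneck(spans: list[dict], parent_id: Optional[str]) -> Optional[dict]:
    """Return the longest direct child span of `parent_id`, or None if it has no children."""
    children = [s for s in spans if s["parent_span_id"] == parent_id]
    return max(children, key=lambda s: s["duration_ms"], default=None)


# Values copied from the trace above; other span fields omitted for brevity.
spans = [
    {"span_id": "span-001", "parent_span_id": None, "name": "http_request", "duration_ms": 2340},
    {"span_id": "span-002", "parent_span_id": "span-001", "name": "validate_settlement", "duration_ms": 150},
    {"span_id": "span-003", "parent_span_id": "span-001", "name": "emit_receipt", "duration_ms": 50},
    {"span_id": "span-004", "parent_span_id": "span-001", "name": "anchor_request", "duration_ms": 2000},
]

slowest = find_bottleneck(spans, "span-001")
print(slowest["name"], slowest["duration_ms"])  # anchor_request 2000
```

In this trace the anchor request accounts for 2000 of the root span's 2340 ms.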

2.4 Alerts

Alerts record rule conditions that have fired and require attention.

{
  "alert_id": "alert-2025-12-06-001",
  "name": "HighCPUUsage",
  "severity": "warning",
  "status": "firing",
  "fired_at": "2025-12-06T14:35:00Z",
  "node": "did:vm:node:brick-02",
  "rule": {
    "expression": "cpu_percent > 80 for 5m",
    "threshold": 80,
    "duration": "5m"
  },
  "current_value": 87.3,
  "labels": {
    "environment": "production",
    "region": "eu-west"
  },
  "annotations": {
    "summary": "CPU usage above 80% for 5 minutes",
    "runbook": "https://docs.vaultmesh.io/runbooks/high-cpu"
  },
  "notified": ["slack:ops-channel", "pagerduty:on-call"]
}
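
The rule expression combines a threshold with a hold duration: the alert fires only once the condition has been true for the whole "for" period. A rough Python sketch of that check follows. It assumes samples arrive as (timestamp, value) pairs sorted by time; held_for is a hypothetical helper, and every sample value except the final 87.3 is invented for illustration:

```python
from datetime import datetime, timedelta


def held_for(samples, threshold, duration):
    """Return True if the samples cover the trailing `duration` window
    and every sample inside that window exceeds `threshold`."""
    if not samples:
        return False
    window_start = samples[-1][0] - duration
    # If the data starts after the window does, the full duration is not covered.
    if samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return all(value > threshold for value in recent)


t0 = datetime(2025, 12, 6, 14, 30)
values = [81.0, 83.5, 85.2, 86.0, 86.9, 87.3]  # one hypothetical sample per minute
samples = [(t0 + timedelta(minutes=i), v) for i, v in enumerate(values)]

print(held_for(samples, 80, timedelta(minutes=5)))      # True
print(held_for(samples[2:], 80, timedelta(minutes=5)))  # False: only 3 minutes of data
```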

2.5 SLO Reports

SLO (Service Level Objective) reports track achieved reliability against a target over a fixed window.

{
  "slo_id": "slo:anchor-latency-p99",
  "name": "Anchor Latency P99",
  "description": "99th percentile anchor latency under 30 seconds",
  "target": 0.999,
  "window": "30d",
  "report_period": {
    "start": "2025-11-06T00:00:00Z",
    "end": "2025-12-06T00:00:00Z"
  },
  "achieved": 0.9995,
  "status": "met",
  "error_budget": {
    "total_minutes": 43.2,
    "consumed_minutes": 21.6,
    "remaining_percent": 50.0
  },
  "breakdown": {
    "total_requests": 125000,
    "good_requests": 124937,
    "bad_requests": 63
  },
  "trend": "stable"
}
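
The error_budget figures follow from the target, the achieved ratio, and the window length: a 30-day window is 43,200 minutes, so a 0.999 target allows 43.2 minutes of failure. The sketch below reproduces the numbers in the report, assuming the budget is expressed in minutes of the window; error_budget is an illustrative function name:

```python
def error_budget(target: float, achieved: float, window_days: int) -> dict:
    """Derive error-budget figures from an SLO target and the achieved ratio."""
    window_minutes = window_days * 24 * 60
    total = (1.0 - target) * window_minutes       # minutes of allowed failure
    consumed = (1.0 - achieved) * window_minutes  # minutes of budget already spent
    remaining_percent = 100.0 * (1.0 - consumed / total)
    return {
        "total_minutes": round(total, 1),
        "consumed_minutes": round(consumed, 1),
        "remaining_percent": round(remaining_percent, 1),
    }


print(error_budget(target=0.999, achieved=0.9995, window_days=30))
# {'total_minutes': 43.2, 'consumed_minutes': 21.6, 'remaining_percent': 50.0}
```

The achieved ratio itself matches the breakdown: 124937 / 125000 is approximately 0.9995.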

3. Mapping to Eternal Pattern

3.1 Experience Layer (L1)

CLI (vm-obs):

# Metrics
vm-obs metrics query --node brick-01 --metric cpu_percent --last 1h
vm-obs metrics list --node brick-01
vm-obs metrics export --from 2025-12-01 --to 2025-12-06 --format prometheus

# Logs
vm-obs logs query --service guardian --level error --last 24h
vm-obs logs tail --node brick-01 --follow
vm-obs logs search "anchor failed" --from 2025-12-01

# Traces
vm-obs trace show trace-abc123
vm-obs trace search --service treasury --duration ">1s" --last 24h
vm-obs trace analyze trace-abc123 --find-bottleneck

# Alerts
vm-obs alert list --status firing
vm-obs alert show alert-2025-12-06-001
vm-obs alert ack alert-2025-12-06-001 --comment "investigating"
vm-obs alert silence --node brick-02 --duration 1h --reason "maintenance"

# SLOs
vm-obs slo list
vm-obs slo show slo:anchor-latency-p99
vm-obs slo report --period 30d --format markdown

# Dashboards
vm-obs dashboard list
vm-obs dashboard show system-overview
vm-obs dashboard export system-overview --format grafana

MCP Tools:

  • obs_metrics_query — query metrics for a node/service
  • obs_logs_search — search logs with filters
  • obs_trace_get — retrieve trace details
  • obs_alert_status — current alert status
  • obs_slo_summary — SLO compliance summary
  • obs_health_check — overall system health

Portal HTTP:

  • GET /obs/metrics — query metrics
  • GET /obs/logs — search logs
  • GET /obs/traces — list traces
  • GET /obs/traces/{trace_id} — trace details
  • GET /obs/alerts — list alerts
  • POST /obs/alerts/{id}/ack — acknowledge alert
  • POST /obs/alerts/silence — create silence
  • GET /obs/slos — list SLOs
  • GET /obs/slos/{id}/report — SLO report
  • GET /obs/health — system health

3.2 Engine Layer (L2)

Step 1 — Plan → Implicit (Continuous Collection)

Unlike discrete operations, observability collection is continuous. However, certain operations have explicit contracts:

Alert Acknowledgment Contract:

{
  "operation_id": "obs-op-2025-12-06-001",
  "operation_type": "alert_acknowledge",
  "alert_id": "alert-2025-12-06-001",
  "acknowledged_by": "did:vm:user:sovereign",
  "acknowledged_at": "2025-12-06T14:40:00Z",
  "comment": "Investigating high CPU on brick-02, likely due to anchor backlog",
  "escalation_suppressed": true,
  "follow_up_required": true,
  "follow_up_deadline": "2025-12-06T16:00:00Z"
}

SLO Definition Contract:

{
  "operation_id": "obs-op-2025-12-06-002",
  "operation_type": "slo_create",
  "initiated_by": "did:vm:user:sovereign",
  "slo": {
    "id": "slo:oracle-availability",
    "name": "Oracle Availability",
    "description": "Oracle service uptime",
    "indicator": {
      "type": "availability",
      "good_query": "oracle_up == 1",
      "total_query": "count(oracle_requests)"
    },
    "target": 0.999,
    "window": "30d"
  }
}

Step 2 — Execute → Continuous Collection

Metrics, logs, and traces are collected continuously via:

  • Prometheus scraping (metrics)
  • Fluent Bit/Vector (logs)
  • OpenTelemetry SDK (traces)

State is maintained in time-series databases and search indices, not as discrete state files.

Step 3 — Seal → Receipts

Metric Snapshot Receipt (hourly):

{
  "type": "obs_metric_snapshot",
  "snapshot_id": "metrics-2025-12-06-14",
  "timestamp": "2025-12-06T14:00:00Z",
  "period": {
    "start": "2025-12-06T13:00:00Z",
    "end": "2025-12-06T14:00:00Z"
  },
  "nodes_reporting": 5,
  "metrics_collected": 15000,
  "aggregates": {
    "avg_cpu_percent": 34.5,
    "max_cpu_percent": 87.3,
    "avg_memory_percent": 62.1,
    "total_receipts_emitted": 1247,
    "total_anchors_completed": 12
  },
  "storage_path": "telemetry/metrics/2025-12-06/hour-14.parquet",
  "content_hash": "blake3:aaa111...",
  "tags": ["observability", "metrics", "hourly"],
  "root_hash": "blake3:bbb222..."
}

Log Batch Receipt (hourly):

{
  "type": "obs_log_batch",
  "batch_id": "logs-2025-12-06-14",
  "timestamp": "2025-12-06T14:00:00Z",
  "period": {
    "start": "2025-12-06T13:00:00Z",
    "end": "2025-12-06T14:00:00Z"
  },
  "log_counts": {
    "trace": 0,
    "debug": 12456,
    "info": 45678,
    "warn": 234,
    "error": 12,
    "fatal": 0
  },
  "services_reporting": ["guardian", "treasury", "portal", "oracle", "mesh"],
  "storage_path": "telemetry/logs/2025-12-06/hour-14.jsonl.gz",
  "content_hash": "blake3:ccc333...",
  "tags": ["observability", "logs", "hourly"],
  "root_hash": "blake3:ddd444..."
}
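
The log_counts object is a per-level tally of the records in the batch. A small sketch of how such a tally could be built, assuming each record carries a "level" field as in section 2.2; count_levels and the sample records are hypothetical:

```python
from collections import Counter

LEVELS = ("trace", "debug", "info", "warn", "error", "fatal")


def count_levels(records: list[dict]) -> dict:
    """Tally records per level, reporting zero for levels that did not occur."""
    counts = Counter(record["level"] for record in records)
    return {level: counts.get(level, 0) for level in LEVELS}


records = [{"level": "info"}, {"level": "info"}, {"level": "error"}, {"level": "warn"}]
print(count_levels(records))
# {'trace': 0, 'debug': 0, 'info': 2, 'warn': 1, 'error': 1, 'fatal': 0}
```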

Trace Complete Receipt (for significant traces):

{
  "type": "obs_trace_complete",
  "trace_id": "trace-abc123...",
  "timestamp": "2025-12-06T14:30:02.345Z",
  "name": "treasury_settlement",
  "duration_ms": 2345,
  "status": "ok",
  "span_count": 4,
  "services_involved": ["portal", "treasury-engine", "ledger", "guardian"],
  "nodes_involved": ["portal-01", "brick-01"],
  "triggered_by": "did:vm:user:sovereign",
  "business_context": {
    "settlement_id": "settle-2025-12-06-001",
    "amount": "1000.00 USD"
  },
  "tags": ["observability", "trace", "treasury", "settlement"],
  "root_hash": "blake3:eee555..."
}

Alert Fired Receipt:

{
  "type": "obs_alert_fired",
  "alert_id": "alert-2025-12-06-001",
  "timestamp": "2025-12-06T14:35:00Z",
  "name": "HighCPUUsage",
  "severity": "warning",
  "node": "did:vm:node:brick-02",
  "rule_expression": "cpu_percent > 80 for 5m",
  "current_value": 87.3,
  "threshold": 80,
  "notifications_sent": ["slack:ops-channel", "pagerduty:on-call"],
  "tags": ["observability", "alert", "fired", "cpu"],
  "root_hash": "blake3:fff666..."
}

Alert Resolved Receipt:

{
  "type": "obs_alert_resolved",
  "alert_id": "alert-2025-12-06-001",
  "timestamp": "2025-12-06T15:10:00Z",
  "name": "HighCPUUsage",
  "fired_at": "2025-12-06T14:35:00Z",
  "duration_minutes": 35,
  "resolved_by": "automatic",
  "resolution_value": 42.1,
  "acknowledged": true,
  "acknowledged_by": "did:vm:user:sovereign",
  "root_cause": "anchor backlog cleared",
  "tags": ["observability", "alert", "resolved"],
  "root_hash": "blake3:ggg777..."
}

SLO Report Receipt (daily):

{
  "type": "obs_slo_report",
  "report_id": "slo-report-2025-12-06",
  "timestamp": "2025-12-06T00:00:00Z",
  "period": {
    "start": "2025-11-06T00:00:00Z",
    "end": "2025-12-06T00:00:00Z"
  },
  "slos": [
    {
      "slo_id": "slo:anchor-latency-p99",
      "target": 0.999,
      "achieved": 0.9995,
      "status": "met"
    },
    {
      "slo_id": "slo:oracle-availability",
      "target": 0.999,
      "achieved": 0.9987,
      "status": "at_risk"
    }
  ],
  "overall_status": "healthy",
  "error_budget_status": "sufficient",
  "report_path": "reports/slo/2025-12-06.json",
  "tags": ["observability", "slo", "daily-report"],
  "root_hash": "blake3:hhh888..."
}

Anomaly Detection Receipt:

{
  "type": "obs_anomaly_detected",
  "anomaly_id": "anomaly-2025-12-06-001",
  "timestamp": "2025-12-06T14:45:00Z",
  "detection_method": "statistical",
  "metric": "treasury.receipts_per_minute",
  "node": "did:vm:node:brick-01",
  "expected_range": {"min": 10, "max": 50},
  "observed_value": 2,
  "deviation_sigma": 4.2,
  "confidence": 0.98,
  "possible_causes": [
    "upstream service degradation",
    "network partition",
    "configuration change"
  ],
  "correlated_events": ["alert-2025-12-06-001"],
  "tags": ["observability", "anomaly", "treasury"],
  "root_hash": "blake3:iii999..."
}
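
The deviation_sigma field suggests a z-score style test for the "statistical" detection method. The sketch below shows that idea only; the document does not specify the actual model, so the 3-sigma default, the function names, and the history values are all assumptions (only the observed value 2 comes from the receipt):

```python
from statistics import mean, stdev


def deviation_sigma(history: list[float], observed: float) -> float:
    """Number of sample standard deviations `observed` lies from the mean of `history`."""
    return abs(observed - mean(history)) / stdev(history)


def is_anomalous(history: list[float], observed: float, max_sigma: float = 3.0) -> bool:
    """Flag the observation when it deviates by more than `max_sigma`."""
    return deviation_sigma(history, observed) > max_sigma


# Hypothetical receipts-per-minute history within the expected 10 to 50 range.
history = [28, 31, 35, 26, 33, 30, 29, 36, 27, 32]
print(is_anomalous(history, observed=2))   # True
print(is_anomalous(history, observed=31))  # False
```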

3.3 Ledger Layer (L3)

Receipt Types:

| Type | When Emitted |
|---|---|
| obs_metric_snapshot | Hourly metric aggregation |
| obs_log_batch | Hourly log batch sealed |
| obs_trace_complete | Significant trace completed |
| obs_alert_fired | Alert triggered |
| obs_alert_resolved | Alert resolved |
| obs_slo_report | Daily SLO report |
| obs_anomaly_detected | Statistical anomaly detected |

Merkle Coverage:

  • All receipts append to receipts/observability/observability_events.jsonl
  • ROOT.observability.txt updated after each append
  • Guardian anchors Observability root in anchor cycles
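
The append-then-update-root cycle can be sketched as follows. This is a sketch under stated assumptions, not the VaultMesh implementation: the tree shape (pairwise hashing, odd node paired with itself) is assumed, and BLAKE2b from the standard library stands in for BLAKE3, which Python does not ship. The helper names are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def leaf_hash(data: bytes) -> bytes:
    # Stand-in hash: receipts use BLAKE3, which is not in the Python standard library.
    return hashlib.blake2b(data, digest_size=32).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Pairwise-hash leaves up to a single root; an odd node is paired with itself."""
    if not leaves:
        return leaf_hash(b"")
    level = leaves
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [leaf_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def append_receipt(jsonl_path: Path, root_path: Path, receipt: dict) -> str:
    """Append one receipt line, then recompute and store the scroll root."""
    with jsonl_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(receipt, sort_keys=True) + "\n")
    lines = jsonl_path.read_bytes().splitlines()
    root = merkle_root([leaf_hash(line) for line in lines]).hex()
    root_path.write_text(root + "\n", encoding="utf-8")
    return root


with tempfile.TemporaryDirectory() as tmp:
    events = Path(tmp) / "observability_events.jsonl"
    root_file = Path(tmp) / "ROOT.observability.txt"
    r1 = append_receipt(events, root_file, {"type": "obs_alert_fired", "alert_id": "alert-2025-12-06-001"})
    r2 = append_receipt(events, root_file, {"type": "obs_alert_resolved", "alert_id": "alert-2025-12-06-001"})
    print(r1 != r2)  # True: the root changes with every append
```

Recomputing the root from the whole file is the simplest option; an incremental tree would avoid rereading the scroll on every append.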

4. Query Interface

observability_query_events.py:

# Metric snapshots
vm-obs query --type metric_snapshot --from 2025-12-01

# Log batches with errors
vm-obs query --type log_batch --filter "log_counts.error > 0"

# Traces over 5 seconds
vm-obs query --type trace_complete --filter "duration_ms > 5000"

# All alerts for a node
vm-obs query --type alert_fired,alert_resolved --node brick-02

# SLO reports with missed targets
vm-obs query --type slo_report --filter "overall_status != 'healthy'"

# Anomalies in last 7 days
vm-obs query --type anomaly_detected --last 7d

# Export for analysis
vm-obs query --from 2025-12-01 --format parquet > observability_dec.parquet
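
Because the scroll is plain JSONL, the same kind of filtering can be approximated in a few lines of Python. The sketch below is not the source of observability_query_events.py; it assumes one receipt per line and replaces the --filter expression language with an ordinary Python predicate. The second log batch line is invented to show a non-match.

```python
import json
from functools import reduce
from typing import Any, Callable, Iterable, Iterator


def get_path(event: dict, dotted: str) -> Any:
    """Resolve a dotted path such as 'log_counts.error'; missing keys yield None."""
    return reduce(lambda node, key: node.get(key) if isinstance(node, dict) else None,
                  dotted.split("."), event)


def query(lines: Iterable[str], types: set[str],
          predicate: Callable[[dict], bool] = lambda e: True) -> Iterator[dict]:
    """Yield receipts whose type is in `types` and that satisfy `predicate`."""
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") in types and predicate(event):
            yield event


lines = [
    '{"type": "obs_log_batch", "batch_id": "logs-2025-12-06-14", "log_counts": {"error": 12}}',
    '{"type": "obs_log_batch", "batch_id": "logs-2025-12-06-15", "log_counts": {"error": 0}}',
    '{"type": "obs_trace_complete", "trace_id": "trace-abc123...", "duration_ms": 2345}',
]
hits = list(query(lines, {"obs_log_batch"}, lambda e: (get_path(e, "log_counts.error") or 0) > 0))
print([e["batch_id"] for e in hits])  # ['logs-2025-12-06-14']
```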

Correlation Tool:

# Correlate events around a timestamp
vm-obs correlate --timestamp "2025-12-06T14:35:00Z" --window 35m

# Output:
# Timeline around 2025-12-06T14:35:00Z (±35m):
#
# 14:20:00 [metric] brick-02 cpu_percent starts rising
# 14:25:00 [log] guardian: "anchor queue depth increasing"
# 14:30:00 [trace] trace-abc123 completed (2345ms, normal)
# 14:30:00 [metric] brick-02 cpu_percent crosses 80%
# 14:35:00 [alert] HighCPUUsage fired on brick-02
# 14:40:00 [log] guardian: "processing backlog"
# 14:45:00 [anomaly] treasury.receipts_per_minute low
# 14:50:00 [log] guardian: "backlog cleared"
# 15:10:00 [alert] HighCPUUsage resolved on brick-02
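
At its core, correlation is a time-window selection over receipts. A minimal sketch, assuming every receipt has an ISO 8601 "timestamp" field with a trailing Z as in the examples in section 3.2; correlate and parse_ts are hypothetical names, and the real tool also merges raw metrics and logs, which this sketch does not:

```python
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' (UTC)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def correlate(events: list[dict], center: str, window: timedelta) -> list[dict]:
    """Return events within ±window of `center`, ordered by timestamp."""
    mid = parse_ts(center)
    nearby = [e for e in events if abs(parse_ts(e["timestamp"]) - mid) <= window]
    return sorted(nearby, key=lambda e: parse_ts(e["timestamp"]))


# Types and timestamps taken from the receipt examples in section 3.2.
events = [
    {"type": "obs_alert_resolved", "timestamp": "2025-12-06T15:10:00Z"},
    {"type": "obs_trace_complete", "timestamp": "2025-12-06T14:30:02.345Z"},
    {"type": "obs_anomaly_detected", "timestamp": "2025-12-06T14:45:00Z"},
    {"type": "obs_alert_fired", "timestamp": "2025-12-06T14:35:00Z"},
]

near = correlate(events, "2025-12-06T14:35:00Z", timedelta(minutes=15))
print([e["type"] for e in near])
# ['obs_trace_complete', 'obs_alert_fired', 'obs_anomaly_detected']

wide = correlate(events, "2025-12-06T14:35:00Z", timedelta(minutes=35))
print(len(wide))  # 4: the 15:10 resolution is exactly 35 minutes after the center
```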

5. Design Gate Checklist

| Question | Observability Answer |
|---|---|
| Clear entrypoint? | CLI (vm-obs), MCP tools, Portal HTTP |
| Contract produced? | Implicit (continuous) + explicit for alert acks, SLO definitions |
| State object? | Time-series DBs, search indices (continuous state) |
| Receipts emitted? | Seven receipt types covering all observability events |
| Append-only JSONL? | receipts/observability/observability_events.jsonl |
| Merkle root? | ROOT.observability.txt |
| Guardian anchor path? | Observability root included in ProofChain |
| Query tool? | observability_query_events.py + correlation tool |

6. Data Pipeline

6.1 Collection Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         BRICK Nodes                              │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐            │
│  │ brick-01│  │ brick-02│  │ brick-03│  │portal-01│            │
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘            │
│       │            │            │            │                   │
│       ▼            ▼            ▼            ▼                   │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Collection Layer                      │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐  │    │
│  │  │Prometheus│  │Fluent Bit│  │OpenTelemetry Collector│  │    │
│  │  │ (metrics)│  │  (logs)  │  │       (traces)        │  │    │
│  │  └────┬─────┘  └────┬─────┘  └──────────┬───────────┘  │    │
│  └───────┼─────────────┼───────────────────┼──────────────┘    │
│          │             │                   │                     │
│          ▼             ▼                   ▼                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Storage Layer                         │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐  │    │
│  │  │VictoriaM │  │ Loki/    │  │    Tempo/Jaeger      │  │    │
│  │  │(metrics) │  │OpenSearch│  │      (traces)        │  │    │
│  │  └────┬─────┘  └────┬─────┘  └──────────┬───────────┘  │    │
│  └───────┼─────────────┼───────────────────┼──────────────┘    │
│          │             │                   │                     │
│          ▼             ▼                   ▼                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   Receipt Layer                          │    │
│  │  ┌──────────────────────────────────────────────────┐  │    │
│  │  │           Observability Receipt Emitter           │  │    │
│  │  │   (hourly snapshots, alerts, SLOs, anomalies)     │  │    │
│  │  └──────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

6.2 Retention Policies

| Data Type | Hot Storage | Warm Storage | Cold/Archive | Receipt |
|---|---|---|---|---|
| Metrics (raw) | 7 days | 30 days | 1 year | Hourly |
| Metrics (1h agg) | 30 days | 1 year | 5 years | Hourly |
| Logs (all) | 7 days | 30 days | 1 year | Hourly |
| Logs (error+) | 30 days | 1 year | 5 years | Hourly |
| Traces (sampled) | 7 days | 30 days | | Per-trace |
| Traces (errors) | 30 days | 1 year | 5 years | Per-trace |
| Alerts | Indefinite | Indefinite | Indefinite | Per-event |
| SLO Reports | Indefinite | Indefinite | Indefinite | Daily |
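
One way to read the table is as age boundaries: data sits in hot storage up to the first limit, warm up to the second, and cold up to the third. The sketch below encodes three rows under that reading. The cumulative interpretation, the 365-day year, the dictionary keys, and the storage_tier helper are all assumptions:

```python
from datetime import timedelta
from typing import Optional

# (hot, warm, cold) upper age bounds per data type, transcribed from the table above.
# None means no cold tier is listed for that data type.
RETENTION = {
    "metrics_raw": (timedelta(days=7), timedelta(days=30), timedelta(days=365)),
    "logs_error": (timedelta(days=30), timedelta(days=365), timedelta(days=5 * 365)),
    "traces_sampled": (timedelta(days=7), timedelta(days=30), None),
}


def storage_tier(data_type: str, age: timedelta) -> Optional[str]:
    """Return 'hot', 'warm', or 'cold' for data of the given age, or None once expired."""
    for tier, limit in zip(("hot", "warm", "cold"), RETENTION[data_type]):
        if limit is not None and age <= limit:
            return tier
    return None


print(storage_tier("metrics_raw", timedelta(days=3)))      # hot
print(storage_tier("metrics_raw", timedelta(days=90)))     # cold
print(storage_tier("traces_sampled", timedelta(days=90)))  # None
```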

6.3 Sampling Strategy

{
  "sampling_rules": [
    {
      "name": "always_sample_errors",
      "condition": "status == 'error' OR level >= 'error'",
      "rate": 1.0
    },
    {
      "name": "always_sample_slow",
      "condition": "duration_ms > 5000",
      "rate": 1.0
    },
    {
      "name": "always_sample_sensitive",
      "condition": "service IN ['treasury', 'identity', 'offsec']",
      "rate": 1.0
    },
    {
      "name": "default_traces",
      "condition": "true",
      "rate": 0.1
    }
  ]
}
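
The rules read naturally as an ordered list in which the first matching condition sets the rate, with default_traces as the catch-all. The sketch below assumes that first-match order and hand-translates the condition strings into Python predicates. The deterministic hash-based decision is an illustrative choice, not something the document specifies.

```python
import hashlib

SENSITIVE = {"treasury", "identity", "offsec"}

# (name, predicate, rate): predicates are hand-translated from the JSON conditions above.
RULES = [
    ("always_sample_errors",
     lambda t: t.get("status") == "error" or t.get("level") in {"error", "fatal"}, 1.0),
    ("always_sample_slow", lambda t: t.get("duration_ms", 0) > 5000, 1.0),
    ("always_sample_sensitive", lambda t: t.get("service") in SENSITIVE, 1.0),
    ("default_traces", lambda t: True, 0.1),
]


def sampling_rate(trace: dict) -> tuple[str, float]:
    """Return (rule name, rate) of the first matching rule (first-match order is an assumption)."""
    for name, matches, rate in RULES:
        if matches(trace):
            return name, rate
    raise AssertionError("unreachable: default_traces matches everything")


def should_sample(trace: dict) -> bool:
    """Deterministic decision: map the trace_id hash into [0, 1) and compare with the rate."""
    _, rate = sampling_rate(trace)
    digest = hashlib.sha256(trace["trace_id"].encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


print(sampling_rate({"trace_id": "t1", "status": "ok", "duration_ms": 2345, "service": "treasury"}))
# ('always_sample_sensitive', 1.0)
print(sampling_rate({"trace_id": "t2", "status": "ok", "duration_ms": 120, "service": "portal"}))
# ('default_traces', 0.1)
```

Hashing the trace_id rather than drawing a random number means every service reaches the same keep-or-drop decision for a given trace.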

7. Alerting Framework

7.1 Alert Rules

groups:
  - name: vaultmesh-critical
    rules:
      - alert: NodeDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is down"
          runbook: https://docs.vaultmesh.io/runbooks/node-down

      - alert: AnchorBacklogHigh
        expr: guardian_anchor_queue_depth > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Anchor queue depth is {{ $value }}"

      - alert: SLOBudgetBurning
        expr: slo_error_budget_remaining_percent < 25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SLO {{ $labels.slo }} error budget at {{ $value }}%"

7.2 Notification Channels

| Severity | Channels | Response Time |
|---|---|---|
| critical | PagerDuty, SMS, Slack #critical | Immediate |
| high | PagerDuty, Slack #alerts | 15 minutes |
| warning | Slack #alerts, Email | 1 hour |
| info | Slack #ops | Best effort |
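
Routing by severity amounts to a lookup. A small sketch with the channels transcribed from the table; the fallback to the info route for unknown severities and the channels_for helper are assumptions:

```python
# Channel routing transcribed from the table above.
ROUTES = {
    "critical": ["PagerDuty", "SMS", "Slack #critical"],
    "high": ["PagerDuty", "Slack #alerts"],
    "warning": ["Slack #alerts", "Email"],
    "info": ["Slack #ops"],
}


def channels_for(severity: str) -> list[str]:
    """Return the channels for a severity; unknown severities fall back to 'info' (an assumption)."""
    return ROUTES.get(severity.lower(), ROUTES["info"])


print(channels_for("warning"))  # ['Slack #alerts', 'Email']
```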

8. Integration Points

| System | Integration |
|---|---|
| Guardian | Emits anchor metrics/traces; alerts on anchor failures |
| Treasury | Transaction metrics; latency SLOs; receipt throughput |
| Identity | Auth event logs; failed login alerts; session metrics |
| Mesh | Node health metrics; route latency; topology change logs |
| OffSec | Security event correlation; incident timeline enrichment |
| Oracle | Query latency metrics; confidence score distributions |
| Automation | Workflow execution traces; n8n performance metrics |

9. Future Extensions

  • AI-powered anomaly detection: ML models for predictive alerting
  • Distributed tracing visualization: Real-time trace graphs in Portal
  • Log pattern mining: Automatic extraction of error patterns
  • Chaos engineering integration: Correlate chaos experiments with observability data
  • Cost attribution: Resource usage per scroll/service for Treasury billing
  • Compliance dashboards: Real-time compliance posture visualization