vaultsovereign/vm-core

Vault Sovereign 110d644e10 Initialize repository snapshot

2025-12-27 00:10:32 +00:00

25 KiB

VAULTMESH-OBSERVABILITY-ENGINE.md

Civilization Ledger Telemetry Primitive

Every metric tells a story. Every trace has a receipt.

Observability is VaultMesh's nervous system — capturing metrics, logs, and traces across all nodes and services, with cryptographic attestation that the telemetry itself hasn't been tampered with.

1. Scroll Definition

| Property | Value |
| --- | --- |
| Scroll Name | `Observability` |
| JSONL Path | `receipts/observability/observability_events.jsonl` |
| Root File | `ROOT.observability.txt` |
| Receipt Types | `obs_metric_snapshot`, `obs_log_batch`, `obs_trace_complete`, `obs_alert_fired`, `obs_alert_resolved`, `obs_slo_report`, `obs_anomaly_detected` |

2. Core Concepts

2.1 Metrics

Metrics are time-series numerical measurements from nodes and services.

{
  "metric_id": "metric:brick-01:cpu:2025-12-06T14:30:00Z",
  "node": "did:vm:node:brick-01",
  "timestamp": "2025-12-06T14:30:00Z",
  "metrics": {
    "cpu_percent": 23.5,
    "memory_percent": 67.2,
    "disk_percent": 45.8,
    "network_rx_bytes": 1234567890,
    "network_tx_bytes": 987654321,
    "open_file_descriptors": 342,
    "goroutines": 156
  },
  "labels": {
    "environment": "production",
    "region": "eu-west",
    "service": "guardian"
  },
  "collection_method": "prometheus_scrape",
  "scrape_duration_ms": 45
}

Metric categories:

system — CPU, memory, disk, network
application — request rates, latencies, error rates
business — receipts/hour, anchors/day, oracle queries
security — auth attempts, failed logins, blocked IPs
mesh — route latencies, node health, capability usage

2.2 Logs

Logs are structured event records from all system components.

{
  "log_id": "log:guardian:2025-12-06T14:30:15.123Z",
  "timestamp": "2025-12-06T14:30:15.123Z",
  "level": "info",
  "service": "guardian",
  "node": "did:vm:node:brick-01",
  "message": "Anchor cycle completed successfully",
  "attributes": {
    "cycle_id": "anchor-cycle-2025-12-06-001",
    "receipts_anchored": 47,
    "scrolls_included": ["treasury", "mesh", "identity"],
    "duration_ms": 1234,
    "backend": "bitcoin"
  },
  "trace_id": "trace-abc123...",
  "span_id": "span-def456...",
  "caller": "guardian/anchor.go:234"
}

Log levels:

trace — verbose debugging (not retained long-term)
debug — debugging information
info — normal operations
warn — unexpected but handled conditions
error — errors requiring attention
fatal — system failures
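
The ordering of these levels is what makes filters such as `--level error` and the "error+" retention tier well defined. A minimal sketch, assuming the levels rank in the order listed above; the `LogLevel` enum and `at_least` helper are illustrative names rather than VaultMesh APIs, and the `warn` sample message is invented:

```python
from enum import IntEnum


class LogLevel(IntEnum):
    """Severity order taken from the list above, lowest first."""
    TRACE = 0
    DEBUG = 1
    INFO = 2
    WARN = 3
    ERROR = 4
    FATAL = 5


def at_least(records, minimum):
    """Keep records whose level is at or above `minimum`."""
    return [r for r in records if LogLevel[r["level"].upper()] >= minimum]


records = [
    {"level": "info", "message": "Anchor cycle completed successfully"},
    {"level": "warn", "message": "retrying anchor submission"},  # invented sample
    {"level": "error", "message": "anchor failed"},
]
errors_and_up = at_least(records, LogLevel.ERROR)
```

With this ordering, an "error+" selection keeps `error` and `fatal` records and drops everything below.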

2.3 Traces

Traces track request flows across distributed components.

{
  "trace_id": "trace-abc123...",
  "name": "treasury_settlement",
  "start_time": "2025-12-06T14:30:00.000Z",
  "end_time": "2025-12-06T14:30:02.345Z",
  "duration_ms": 2345,
  "status": "ok",
  "spans": [
    {
      "span_id": "span-001",
      "parent_span_id": null,
      "name": "http_request",
      "service": "portal",
      "node": "did:vm:node:portal-01",
      "start_time": "2025-12-06T14:30:00.000Z",
      "duration_ms": 2340,
      "attributes": {
        "http.method": "POST",
        "http.url": "/treasury/settle",
        "http.status_code": 200
      }
    },
    {
      "span_id": "span-002",
      "parent_span_id": "span-001",
      "name": "validate_settlement",
      "service": "treasury-engine",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.100Z",
      "duration_ms": 150,
      "attributes": {
        "settlement_id": "settle-2025-12-06-001",
        "accounts_involved": 3
      }
    },
    {
      "span_id": "span-003",
      "parent_span_id": "span-001",
      "name": "emit_receipt",
      "service": "ledger",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.250Z",
      "duration_ms": 50,
      "attributes": {
        "receipt_type": "treasury_settlement",
        "scroll": "treasury"
      }
    },
    {
      "span_id": "span-004",
      "parent_span_id": "span-001",
      "name": "anchor_request",
      "service": "guardian",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.300Z",
      "duration_ms": 2000,
      "attributes": {
        "backend": "bitcoin",
        "txid": "btc:abc123..."
      }
    }
  ],
  "tags": ["treasury", "settlement", "anchor"]
}
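
One way to read a trace like this is to ask which span spent the most time in its own work rather than waiting on children. The sketch below computes a per-span "self time" (own duration minus the durations of direct children) over the four spans above. It assumes sibling spans do not overlap, which holds for this example; `find_bottleneck` is a hypothetical helper, not the implementation behind `vm-obs trace analyze`:

```python
from collections import defaultdict


def find_bottleneck(spans):
    """Return the span with the largest self time."""
    child_total = defaultdict(int)
    for span in spans:
        if span["parent_span_id"] is not None:
            child_total[span["parent_span_id"]] += span["duration_ms"]

    def self_time(span):
        # Own duration minus time attributed to direct children.
        return span["duration_ms"] - child_total[span["span_id"]]

    return max(spans, key=self_time)


spans = [
    {"span_id": "span-001", "parent_span_id": None, "name": "http_request", "duration_ms": 2340},
    {"span_id": "span-002", "parent_span_id": "span-001", "name": "validate_settlement", "duration_ms": 150},
    {"span_id": "span-003", "parent_span_id": "span-001", "name": "emit_receipt", "duration_ms": 50},
    {"span_id": "span-004", "parent_span_id": "span-001", "name": "anchor_request", "duration_ms": 2000},
]
bottleneck = find_bottleneck(spans)
```

For this trace the root span has only 140 ms of self time (2340 minus 150, 50 and 2000), so the anchor request dominates.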

2.4 Alerts

Alerts are triggered conditions requiring attention.

{
  "alert_id": "alert-2025-12-06-001",
  "name": "HighCPUUsage",
  "severity": "warning",
  "status": "firing",
  "fired_at": "2025-12-06T14:35:00Z",
  "node": "did:vm:node:brick-02",
  "rule": {
    "expression": "cpu_percent > 80 for 5m",
    "threshold": 80,
    "duration": "5m"
  },
  "current_value": 87.3,
  "labels": {
    "environment": "production",
    "region": "eu-west"
  },
  "annotations": {
    "summary": "CPU usage above 80% for 5 minutes",
    "runbook": "https://docs.vaultmesh.io/runbooks/high-cpu"
  },
  "notified": ["slack:ops-channel", "pagerduty:on-call"]
}
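
The rule `cpu_percent > 80 for 5m` only fires once the condition has held for the whole duration. A minimal sketch of that check, assuming one sample per minute and a simplified reading in which every sample in the trailing window must exceed the threshold; `is_firing` is a hypothetical helper and all sample values except the final 87.3 are invented:

```python
from datetime import datetime, timedelta


def is_firing(samples, threshold, hold):
    """True if every sample in the trailing `hold` window exceeds `threshold`
    and the samples reach back far enough to cover that window."""
    if not samples:
        return False
    samples = sorted(samples)
    end = samples[-1][0]
    start = end - hold
    if samples[0][0] > start:
        return False  # not enough history to cover the window
    window = [value for ts, value in samples if ts >= start]
    return all(value > threshold for value in window)


t0 = datetime(2025, 12, 6, 14, 29)
values = [78.0, 81.2, 83.5, 84.9, 86.0, 86.8, 87.3]  # 14:29 .. 14:35
samples = [(t0 + timedelta(minutes=i), v) for i, v in enumerate(values)]

firing_at_1435 = is_firing(samples, threshold=80, hold=timedelta(minutes=5))
firing_at_1434 = is_firing(samples[:-1], threshold=80, hold=timedelta(minutes=5))
```

Evaluated at 14:34 the window still contains the 78.0 sample, so the alert stays pending; at 14:35 the full five minutes are above 80 and it fires.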

2.5 SLO Reports

SLO (Service Level Objective) Reports track reliability targets.

{
  "slo_id": "slo:anchor-latency-p99",
  "name": "Anchor Latency P99",
  "description": "99th percentile anchor latency under 30 seconds",
  "target": 0.999,
  "window": "30d",
  "report_period": {
    "start": "2025-11-06T00:00:00Z",
    "end": "2025-12-06T00:00:00Z"
  },
  "achieved": 0.9995,
  "status": "met",
  "error_budget": {
    "total_minutes": 43.2,
    "consumed_minutes": 21.6,
    "remaining_percent": 50.0
  },
  "breakdown": {
    "total_requests": 125000,
    "good_requests": 124937,
    "bad_requests": 63
  },
  "trend": "stable"
}
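
The error budget figures follow from the target and window. A short worked sketch, assuming the budget is expressed as minutes of the window (30 days is 43,200 minutes) and that `achieved` is the ratio of good to total requests; `error_budget` is an illustrative helper:

```python
def error_budget(target, window_days, consumed_minutes):
    """Return (total budget in minutes, remaining budget in percent)."""
    total = (1 - target) * window_days * 24 * 60
    remaining_percent = 100 * (1 - consumed_minutes / total)
    return round(total, 1), round(remaining_percent, 1)


total_minutes, remaining_percent = error_budget(0.999, 30, 21.6)
achieved = 124937 / 125000  # good_requests / total_requests
```

This reproduces the values in the report: 43.2 budget minutes, 50.0 percent remaining, and an achieved ratio that rounds to 0.9995.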

3. Mapping to Eternal Pattern

3.1 Experience Layer (L1)

CLI (vm-obs):

# Metrics
vm-obs metrics query --node brick-01 --metric cpu_percent --last 1h
vm-obs metrics list --node brick-01
vm-obs metrics export --from 2025-12-01 --to 2025-12-06 --format prometheus

# Logs
vm-obs logs query --service guardian --level error --last 24h
vm-obs logs tail --node brick-01 --follow
vm-obs logs search "anchor failed" --from 2025-12-01

# Traces
vm-obs trace show trace-abc123
vm-obs trace search --service treasury --duration ">1s" --last 24h
vm-obs trace analyze trace-abc123 --find-bottleneck

# Alerts
vm-obs alert list --status firing
vm-obs alert show alert-2025-12-06-001
vm-obs alert ack alert-2025-12-06-001 --comment "investigating"
vm-obs alert silence --node brick-02 --duration 1h --reason "maintenance"

# SLOs
vm-obs slo list
vm-obs slo show slo:anchor-latency-p99
vm-obs slo report --period 30d --format markdown

# Dashboards
vm-obs dashboard list
vm-obs dashboard show system-overview
vm-obs dashboard export system-overview --format grafana

MCP Tools:

obs_metrics_query — query metrics for a node/service
obs_logs_search — search logs with filters
obs_trace_get — retrieve trace details
obs_alert_status — current alert status
obs_slo_summary — SLO compliance summary
obs_health_check — overall system health

Portal HTTP:

GET /obs/metrics — query metrics
GET /obs/logs — search logs
GET /obs/traces — list traces
GET /obs/traces/{trace_id} — trace details
GET /obs/alerts — list alerts
POST /obs/alerts/{id}/ack — acknowledge alert
POST /obs/alerts/silence — create silence
GET /obs/slos — list SLOs
GET /obs/slos/{id}/report — SLO report
GET /obs/health — system health

3.2 Engine Layer (L2)

Step 1 — Plan → Implicit (Continuous Collection)

Unlike discrete operations, observability collection is continuous. However, certain operations have explicit contracts:

Alert Acknowledgment Contract:

{
  "operation_id": "obs-op-2025-12-06-001",
  "operation_type": "alert_acknowledge",
  "alert_id": "alert-2025-12-06-001",
  "acknowledged_by": "did:vm:user:sovereign",
  "acknowledged_at": "2025-12-06T14:40:00Z",
  "comment": "Investigating high CPU on brick-02, likely due to anchor backlog",
  "escalation_suppressed": true,
  "follow_up_required": true,
  "follow_up_deadline": "2025-12-06T16:00:00Z"
}

SLO Definition Contract:

{
  "operation_id": "obs-op-2025-12-06-002",
  "operation_type": "slo_create",
  "initiated_by": "did:vm:user:sovereign",
  "slo": {
    "id": "slo:oracle-availability",
    "name": "Oracle Availability",
    "description": "Oracle service uptime",
    "indicator": {
      "type": "availability",
      "good_query": "oracle_up == 1",
      "total_query": "count(oracle_requests)"
    },
    "target": 0.999,
    "window": "30d"
  }
}

Step 2 — Execute → Continuous Collection

Metrics, logs, and traces are collected continuously via:

Prometheus scraping (metrics)
Fluent Bit/Vector (logs)
OpenTelemetry SDK (traces)

State is maintained in time-series databases and search indices, not as discrete state files.

Step 3 — Seal → Receipts

Metric Snapshot Receipt (hourly):

{
  "type": "obs_metric_snapshot",
  "snapshot_id": "metrics-2025-12-06-14",
  "timestamp": "2025-12-06T14:00:00Z",
  "period": {
    "start": "2025-12-06T13:00:00Z",
    "end": "2025-12-06T14:00:00Z"
  },
  "nodes_reporting": 5,
  "metrics_collected": 15000,
  "aggregates": {
    "avg_cpu_percent": 34.5,
    "max_cpu_percent": 87.3,
    "avg_memory_percent": 62.1,
    "total_receipts_emitted": 1247,
    "total_anchors_completed": 12
  },
  "storage_path": "telemetry/metrics/2025-12-06/hour-14.parquet",
  "content_hash": "blake3:aaa111...",
  "tags": ["observability", "metrics", "hourly"],
  "root_hash": "blake3:bbb222..."
}
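
The `aggregates` block can be derived from the raw samples collected during the period. A minimal sketch, assuming samples shaped like the metric records in section 2.1; the `aggregate` function is hypothetical and the brick-02 memory value is invented for illustration:

```python
from statistics import mean


def aggregate(samples):
    """Summarise per-node samples into snapshot-style aggregates."""
    cpu = [s["cpu_percent"] for s in samples]
    mem = [s["memory_percent"] for s in samples]
    return {
        "nodes_reporting": len({s["node"] for s in samples}),
        "avg_cpu_percent": round(mean(cpu), 1),
        "max_cpu_percent": max(cpu),
        "avg_memory_percent": round(mean(mem), 1),
    }


samples = [
    {"node": "did:vm:node:brick-01", "cpu_percent": 23.5, "memory_percent": 67.2},
    {"node": "did:vm:node:brick-02", "cpu_percent": 87.3, "memory_percent": 57.0},  # memory value invented
]
aggregates = aggregate(samples)
```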

Log Batch Receipt (hourly):

{
  "type": "obs_log_batch",
  "batch_id": "logs-2025-12-06-14",
  "timestamp": "2025-12-06T14:00:00Z",
  "period": {
    "start": "2025-12-06T13:00:00Z",
    "end": "2025-12-06T14:00:00Z"
  },
  "log_counts": {
    "trace": 0,
    "debug": 12456,
    "info": 45678,
    "warn": 234,
    "error": 12,
    "fatal": 0
  },
  "services_reporting": ["guardian", "treasury", "portal", "oracle", "mesh"],
  "storage_path": "telemetry/logs/2025-12-06/hour-14.jsonl.gz",
  "content_hash": "blake3:ccc333...",
  "tags": ["observability", "logs", "hourly"],
  "root_hash": "blake3:ddd444..."
}
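
The `log_counts` and `services_reporting` fields are simple tallies over the batch. A minimal sketch, assuming log records shaped like the example in section 2.2; `summarize_batch` is an illustrative name and the sample records are invented:

```python
from collections import Counter

LEVELS = ("trace", "debug", "info", "warn", "error", "fatal")


def summarize_batch(records):
    """Count records per level (zero-filled) and list reporting services."""
    counts = Counter(r["level"] for r in records)
    return {
        "log_counts": {level: counts.get(level, 0) for level in LEVELS},
        "services_reporting": sorted({r["service"] for r in records}),
    }


records = [
    {"level": "info", "service": "guardian"},
    {"level": "info", "service": "portal"},
    {"level": "error", "service": "guardian"},
]
summary = summarize_batch(records)
```

Zero-filling every level keeps the receipt shape stable even when a level did not occur in the period.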

Trace Complete Receipt (for significant traces):

{
  "type": "obs_trace_complete",
  "trace_id": "trace-abc123...",
  "timestamp": "2025-12-06T14:30:02.345Z",
  "name": "treasury_settlement",
  "duration_ms": 2345,
  "status": "ok",
  "span_count": 4,
  "services_involved": ["portal", "treasury-engine", "ledger", "guardian"],
  "nodes_involved": ["portal-01", "brick-01"],
  "triggered_by": "did:vm:user:sovereign",
  "business_context": {
    "settlement_id": "settle-2025-12-06-001",
    "amount": "1000.00 USD"
  },
  "tags": ["observability", "trace", "treasury", "settlement"],
  "root_hash": "blake3:eee555..."
}
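
Several fields of this receipt can be derived directly from the trace in section 2.3. A minimal sketch, assuming node names in the receipt are the last segment of the node DID; `summarize_trace` is a hypothetical helper:

```python
def summarize_trace(trace):
    """Derive span count, services and short node names in first-seen order."""
    services, nodes = [], []
    for span in trace["spans"]:
        if span["service"] not in services:
            services.append(span["service"])
        node = span["node"].rsplit(":", 1)[-1]  # did:vm:node:brick-01 -> brick-01
        if node not in nodes:
            nodes.append(node)
    return {
        "trace_id": trace["trace_id"],
        "span_count": len(trace["spans"]),
        "services_involved": services,
        "nodes_involved": nodes,
    }


trace = {
    "trace_id": "trace-abc123...",
    "spans": [
        {"service": "portal", "node": "did:vm:node:portal-01"},
        {"service": "treasury-engine", "node": "did:vm:node:brick-01"},
        {"service": "ledger", "node": "did:vm:node:brick-01"},
        {"service": "guardian", "node": "did:vm:node:brick-01"},
    ],
}
summary = summarize_trace(trace)
```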

Alert Fired Receipt:

{
  "type": "obs_alert_fired",
  "alert_id": "alert-2025-12-06-001",
  "timestamp": "2025-12-06T14:35:00Z",
  "name": "HighCPUUsage",
  "severity": "warning",
  "node": "did:vm:node:brick-02",
  "rule_expression": "cpu_percent > 80 for 5m",
  "current_value": 87.3,
  "threshold": 80,
  "notifications_sent": ["slack:ops-channel", "pagerduty:on-call"],
  "tags": ["observability", "alert", "fired", "cpu"],
  "root_hash": "blake3:fff666..."
}

Alert Resolved Receipt:

{
  "type": "obs_alert_resolved",
  "alert_id": "alert-2025-12-06-001",
  "timestamp": "2025-12-06T15:10:00Z",
  "name": "HighCPUUsage",
  "fired_at": "2025-12-06T14:35:00Z",
  "duration_minutes": 35,
  "resolved_by": "automatic",
  "resolution_value": 42.1,
  "acknowledged": true,
  "acknowledged_by": "did:vm:user:sovereign",
  "root_cause": "anchor backlog cleared",
  "tags": ["observability", "alert", "resolved"],
  "root_hash": "blake3:ggg777..."
}

SLO Report Receipt (daily):

{
  "type": "obs_slo_report",
  "report_id": "slo-report-2025-12-06",
  "timestamp": "2025-12-06T00:00:00Z",
  "period": {
    "start": "2025-11-06T00:00:00Z",
    "end": "2025-12-06T00:00:00Z"
  },
  "slos": [
    {
      "slo_id": "slo:anchor-latency-p99",
      "target": 0.999,
      "achieved": 0.9995,
      "status": "met"
    },
    {
      "slo_id": "slo:oracle-availability",
      "target": 0.999,
      "achieved": 0.9987,
      "status": "at_risk"
    }
  ],
  "overall_status": "healthy",
  "error_budget_status": "sufficient",
  "report_path": "reports/slo/2025-12-06.json",
  "tags": ["observability", "slo", "daily-report"],
  "root_hash": "blake3:hhh888..."
}

Anomaly Detection Receipt:

{
  "type": "obs_anomaly_detected",
  "anomaly_id": "anomaly-2025-12-06-001",
  "timestamp": "2025-12-06T14:45:00Z",
  "detection_method": "statistical",
  "metric": "treasury.receipts_per_minute",
  "node": "did:vm:node:brick-01",
  "expected_range": {"min": 10, "max": 50},
  "observed_value": 2,
  "deviation_sigma": 4.2,
  "confidence": 0.98,
  "possible_causes": [
    "upstream service degradation",
    "network partition",
    "configuration change"
  ],
  "correlated_events": ["alert-2025-12-06-001"],
  "tags": ["observability", "anomaly", "treasury"],
  "root_hash": "blake3:iii999..."
}
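
The numbers in this receipt are consistent with one simple statistical reading: if `expected_range` is taken to be the mean plus or minus three standard deviations, then the mean is 30, sigma is about 6.67, and an observed value of 2 lies 4.2 sigma away. That interpretation is an assumption, not something the receipt states; `deviation_sigma` is an illustrative helper:

```python
def deviation_sigma(observed, expected_min, expected_max, band_sigmas=3.0):
    """Distance of `observed` from the band centre, in standard deviations.

    Assumes the expected range spans mean +/- `band_sigmas` * sigma.
    """
    mean = (expected_min + expected_max) / 2
    sigma = (expected_max - expected_min) / (2 * band_sigmas)
    return abs(observed - mean) / sigma


z = round(deviation_sigma(2, 10, 50), 1)
is_anomaly = z > 3.0
```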

3.3 Ledger Layer (L3)

Receipt Types:

| Type | When Emitted |
| --- | --- |
| `obs_metric_snapshot` | Hourly metric aggregation |
| `obs_log_batch` | Hourly log batch sealed |
| `obs_trace_complete` | Significant trace completed |
| `obs_alert_fired` | Alert triggered |
| `obs_alert_resolved` | Alert resolved |
| `obs_slo_report` | Daily SLO report |
| `obs_anomaly_detected` | Statistical anomaly detected |

Merkle Coverage:

All receipts append to receipts/observability/observability_events.jsonl
ROOT.observability.txt updated after each append
Guardian anchors Observability root in anchor cycles
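
The append-then-update-root cycle can be sketched as follows. This is an illustration under stated assumptions, not the VaultMesh implementation: BLAKE2b from the standard library stands in for BLAKE3, receipts are hashed as canonical JSON, and an odd node is paired with itself when building the tree. The source does not specify any of these choices:

```python
import hashlib
import json


def digest(data: bytes) -> bytes:
    # BLAKE2b is a stand-in here; BLAKE3 is not in the standard library.
    return hashlib.blake2b(data, digest_size=32).digest()


def leaf_hash(receipt: dict) -> bytes:
    """Hash a receipt as canonical JSON (sorted keys, no extra whitespace)."""
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return digest(canonical.encode("utf-8"))


def merkle_root(leaves: list) -> bytes:
    """Pairwise-hash leaves up to a single root (assumed construction)."""
    if not leaves:
        return digest(b"")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pair an odd node with itself
        level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


receipts = [
    {"type": "obs_alert_fired", "alert_id": "alert-2025-12-06-001"},
    {"type": "obs_anomaly_detected", "anomaly_id": "anomaly-2025-12-06-001"},
    {"type": "obs_alert_resolved", "alert_id": "alert-2025-12-06-001"},
]
root = merkle_root([leaf_hash(r) for r in receipts])
root_hex = root.hex()

# Changing any earlier receipt changes the root.
tampered = [dict(receipts[0], alert_id="alert-2025-12-06-002")] + receipts[1:]
tampered_root = merkle_root([leaf_hash(r) for r in tampered])
```

Because every receipt contributes to the root, anchoring the root is enough to detect later edits to any line of the JSONL file.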

4. Query Interface

observability_query_events.py:

# Metric snapshots
vm-obs query --type metric_snapshot --from 2025-12-01

# Log batches with errors
vm-obs query --type log_batch --filter "log_counts.error > 0"

# Traces over 5 seconds
vm-obs query --type trace_complete --filter "duration_ms > 5000"

# All alerts for a node
vm-obs query --type alert_fired,alert_resolved --node brick-02

# SLO reports with missed targets
vm-obs query --type slo_report --filter "overall_status != 'healthy'"

# Anomalies in last 7 days
vm-obs query --type anomaly_detected --last 7d

# Export for analysis
vm-obs query --from 2025-12-01 --format parquet > observability_dec.parquet
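
Because the scroll is plain JSONL, filters like these reduce to a scan over parsed lines. A minimal sketch, assuming receipts carry the full `obs_`-prefixed `type` shown in section 3.2 and that filters are supplied as Python predicates; `query` is a hypothetical function, not the contents of `observability_query_events.py`, and the second log batch is invented:

```python
import json


def query(lines, receipt_type=None, predicate=None):
    """Yield receipts matching an optional type and an optional predicate."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        receipt = json.loads(line)
        if receipt_type is not None and receipt.get("type") != receipt_type:
            continue
        if predicate is not None and not predicate(receipt):
            continue
        yield receipt


lines = [
    '{"type": "obs_log_batch", "batch_id": "logs-2025-12-06-14", "log_counts": {"error": 12}}',
    '{"type": "obs_log_batch", "batch_id": "logs-2025-12-06-15", "log_counts": {"error": 0}}',
    '{"type": "obs_trace_complete", "trace_id": "trace-abc123...", "duration_ms": 2345}',
]
with_errors = list(query(lines, "obs_log_batch", lambda r: r["log_counts"]["error"] > 0))
slow_traces = list(query(lines, "obs_trace_complete", lambda r: r["duration_ms"] > 5000))
```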

Correlation Tool:

# Correlate events around a timestamp
vm-obs correlate --timestamp "2025-12-06T14:35:00Z" --window 35m

# Output:
# Timeline around 2025-12-06T14:35:00Z (±35m):
#
# 14:20:00 [metric] brick-02 cpu_percent starts rising
# 14:25:00 [log] guardian: "anchor queue depth increasing"
# 14:30:00 [trace] trace-abc123 completed (2345ms, normal)
# 14:30:00 [metric] brick-02 cpu_percent crosses 80%
# 14:35:00 [alert] HighCPUUsage fired on brick-02
# 14:40:00 [log] guardian: "processing backlog"
# 14:45:00 [anomaly] treasury.receipts_per_minute low
# 14:50:00 [log] guardian: "backlog cleared"
# 15:10:00 [alert] HighCPUUsage resolved on brick-02
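
Correlation of this kind amounts to selecting receipts whose timestamps fall inside a symmetric window and sorting them. A minimal sketch over receipts from section 3.2, assuming ISO 8601 timestamps with a `Z` suffix; `correlate` and `parse_ts` are illustrative names:

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def correlate(events, center, window):
    """Return events within +/- `window` of `center`, oldest first."""
    lo, hi = center - window, center + window
    hits = [e for e in events if lo <= parse_ts(e["timestamp"]) <= hi]
    return sorted(hits, key=lambda e: parse_ts(e["timestamp"]))


events = [
    {"type": "obs_anomaly_detected", "timestamp": "2025-12-06T14:45:00Z"},
    {"type": "obs_slo_report", "timestamp": "2025-12-06T00:00:00Z"},
    {"type": "obs_alert_fired", "timestamp": "2025-12-06T14:35:00Z"},
    {"type": "obs_trace_complete", "timestamp": "2025-12-06T14:30:02.345Z"},
]
timeline = correlate(events, parse_ts("2025-12-06T14:35:00Z"), timedelta(minutes=15))
```

The daily SLO report at midnight falls outside the window and is dropped; the remaining three receipts come back in time order.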

5. Design Gate Checklist

| Question | Observability Answer |
| --- | --- |
| Clear entrypoint? | ✅ CLI (`vm-obs`), MCP tools, Portal HTTP |
| Contract produced? | ✅ Implicit (continuous) + explicit for alert acks, SLO definitions |
| State object? | ✅ Time-series DBs, search indices (continuous state) |
| Receipts emitted? | ✅ Seven receipt types covering all observability events |
| Append-only JSONL? | ✅ `receipts/observability/observability_events.jsonl` |
| Merkle root? | ✅ `ROOT.observability.txt` |
| Guardian anchor path? | ✅ Observability root included in ProofChain |
| Query tool? | ✅ `observability_query_events.py` + correlation tool |

6. Data Pipeline

6.1 Collection Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         BRICK Nodes                              │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐            │
│  │ brick-01│  │ brick-02│  │ brick-03│  │portal-01│            │
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘            │
│       │            │            │            │                   │
│       ▼            ▼            ▼            ▼                   │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Collection Layer                      │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐  │    │
│  │  │Prometheus│  │Fluent Bit│  │OpenTelemetry Collector│ │    │
│  │  │ (metrics)│  │  (logs)  │  │       (traces)        │  │    │
│  │  └────┬─────┘  └────┬─────┘  └──────────┬───────────┘  │    │
│  └───────┼─────────────┼───────────────────┼──────────────┘    │
│          │             │                   │                     │
│          ▼             ▼                   ▼                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    Storage Layer                         │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐  │    │
│  │  │VictoriaM │  │ Loki/    │  │    Tempo/Jaeger      │  │    │
│  │  │(metrics) │  │OpenSearch│  │      (traces)        │  │    │
│  │  └────┬─────┘  └────┬─────┘  └──────────┬───────────┘  │    │
│  └───────┼─────────────┼───────────────────┼──────────────┘    │
│          │             │                   │                     │
│          ▼             ▼                   ▼                     │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   Receipt Layer                          │    │
│  │  ┌──────────────────────────────────────────────────┐  │    │
│  │  │           Observability Receipt Emitter           │  │    │
│  │  │   (hourly snapshots, alerts, SLOs, anomalies)     │  │    │
│  │  └──────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

6.2 Retention Policies

| Data Type | Hot Storage | Warm Storage | Cold/Archive | Receipt |
| --- | --- | --- | --- | --- |
| Metrics (raw) | 7 days | 30 days | 1 year | Hourly |
| Metrics (1h agg) | 30 days | 1 year | 5 years | Hourly |
| Logs (all) | 7 days | 30 days | 1 year | Hourly |
| Logs (error+) | 30 days | 1 year | 5 years | Hourly |
| Traces (sampled) | 7 days | 30 days | — | Per-trace |
| Traces (errors) | 30 days | 1 year | 5 years | Per-trace |
| Alerts | Indefinite | Indefinite | Indefinite | Per-event |
| SLO Reports | Indefinite | Indefinite | Indefinite | Daily |
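
A retention table like this can be turned into a tier lookup by record age. The sketch below assumes the durations are cumulative ages (hot until 7 days old, warm until 30 days old, and so on), which the table does not state explicitly; the dictionary keys and `storage_tier` are illustrative names:

```python
INF = float("inf")

# (hot, warm, cold) upper age bounds in days; None means no such tier.
RETENTION_DAYS = {
    "metrics_raw": (7, 30, 365),
    "metrics_1h_agg": (30, 365, 5 * 365),
    "logs_all": (7, 30, 365),
    "logs_error": (30, 365, 5 * 365),
    "traces_sampled": (7, 30, None),
    "traces_errors": (30, 365, 5 * 365),
    "alerts": (INF, INF, INF),
    "slo_reports": (INF, INF, INF),
}


def storage_tier(data_type, age_days):
    """Return 'hot', 'warm', 'cold' or 'expired' for a record of the given age."""
    hot, warm, cold = RETENTION_DAYS[data_type]
    if age_days <= hot:
        return "hot"
    if age_days <= warm:
        return "warm"
    if cold is not None and age_days <= cold:
        return "cold"
    return "expired"
```

Under this reading, a 45-day-old sampled trace has expired because that row has no cold tier, while alerts never leave hot storage.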

6.3 Sampling Strategy

{
  "sampling_rules": [
    {
      "name": "always_sample_errors",
      "condition": "status == 'error' OR level >= 'error'",
      "rate": 1.0
    },
    {
      "name": "always_sample_slow",
      "condition": "duration_ms > 5000",
      "rate": 1.0
    },
    {
      "name": "always_sample_sensitive",
      "condition": "service IN ['treasury', 'identity', 'offsec']",
      "rate": 1.0
    },
    {
      "name": "default_traces",
      "condition": "true",
      "rate": 0.1
    }
  ]
}
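
These rules read naturally as a first-match list: the first rule whose condition holds decides the sampling rate. A minimal sketch under that assumption, with the string conditions hand-translated into Python predicates rather than parsed; `sampling_decision` is a hypothetical helper and the `rng` parameter exists only to make the example deterministic:

```python
import random

# (name, predicate, rate), evaluated in order; first match wins (assumed).
RULES = [
    ("always_sample_errors",
     lambda e: e.get("status") == "error" or e.get("level") in {"error", "fatal"}, 1.0),
    ("always_sample_slow", lambda e: e.get("duration_ms", 0) > 5000, 1.0),
    ("always_sample_sensitive",
     lambda e: e.get("service") in {"treasury", "identity", "offsec"}, 1.0),
    ("default_traces", lambda e: True, 0.1),
]


def sampling_decision(event, rng=random.random):
    """Return (matched rule name, keep?) for an event."""
    for name, matches, rate in RULES:
        if matches(event):
            return name, rng() < rate
    return None, False


slow = sampling_decision({"service": "portal", "duration_ms": 6200})
sensitive = sampling_decision({"service": "treasury", "duration_ms": 40})
kept = sampling_decision({"service": "portal", "duration_ms": 40}, rng=lambda: 0.05)
dropped = sampling_decision({"service": "portal", "duration_ms": 40}, rng=lambda: 0.5)
```

A rate of 1.0 always keeps the event because `random.random()` returns values strictly below 1.0; only the default rule drops traffic.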

7. Alerting Framework

7.1 Alert Rules

groups:
  - name: vaultmesh-critical
    rules:
      - alert: NodeDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is down"
          runbook: https://docs.vaultmesh.io/runbooks/node-down

      - alert: AnchorBacklogHigh
        expr: guardian_anchor_queue_depth > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Anchor queue depth is {{ $value }}"

      - alert: SLOBudgetBurning
        expr: slo_error_budget_remaining_percent < 25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SLO {{ $labels.slo }} error budget at {{ $value }}%"

7.2 Notification Channels

| Severity | Channels | Response Time |
| --- | --- | --- |
| `critical` | PagerDuty, SMS, Slack #critical | Immediate |
| `high` | PagerDuty, Slack #alerts | 15 minutes |
| `warning` | Slack #alerts, Email | 1 hour |
| `info` | Slack #ops | Best effort |
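
The routing table maps directly onto a lookup keyed by severity. A minimal sketch; falling back to the `info` channels for an unknown severity is an assumption, and `channels_for` is an illustrative name:

```python
ROUTES = {
    "critical": ["PagerDuty", "SMS", "Slack #critical"],
    "high": ["PagerDuty", "Slack #alerts"],
    "warning": ["Slack #alerts", "Email"],
    "info": ["Slack #ops"],
}


def channels_for(severity):
    """Return notification channels for a severity (unknown -> info channels)."""
    return ROUTES.get(severity, ROUTES["info"])


warning_channels = channels_for("warning")
```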

8. Integration Points

| System | Integration |
| --- | --- |
| Guardian | Emits anchor metrics/traces; alerts on anchor failures |
| Treasury | Transaction metrics; latency SLOs; receipt throughput |
| Identity | Auth event logs; failed login alerts; session metrics |
| Mesh | Node health metrics; route latency; topology change logs |
| OffSec | Security event correlation; incident timeline enrichment |
| Oracle | Query latency metrics; confidence score distributions |
| Automation | Workflow execution traces; n8n performance metrics |

9. Future Extensions

AI-powered anomaly detection: ML models for predictive alerting
Distributed tracing visualization: Real-time trace graphs in Portal
Log pattern mining: Automatic extraction of error patterns
Chaos engineering integration: Correlate chaos experiments with observability
Cost attribution: Resource usage per scroll/service for Treasury billing
Compliance dashboards: Real-time compliance posture visualization
