# VAULTMESH-OBSERVABILITY-ENGINE.md

**Civilization Ledger Telemetry Primitive**

> *Every metric tells a story. Every trace has a receipt.*

Observability is VaultMesh's nervous system — capturing metrics, logs, and traces across all nodes and services, with cryptographic attestation that the telemetry itself hasn't been tampered with.

---

## 1. Scroll Definition

| Property          | Value |
| ----------------- | ----- |
| **Scroll Name**   | `Observability` |
| **JSONL Path**    | `receipts/observability/observability_events.jsonl` |
| **Root File**     | `ROOT.observability.txt` |
| **Receipt Types** | `obs_metric_snapshot`, `obs_log_batch`, `obs_trace_complete`, `obs_alert_fired`, `obs_alert_resolved`, `obs_slo_report`, `obs_anomaly_detected` |

---

## 2. Core Concepts

### 2.1 Metrics

**Metrics** are time-series numerical measurements from nodes and services.

```json
{
  "metric_id": "metric:brick-01:cpu:2025-12-06T14:30:00Z",
  "node": "did:vm:node:brick-01",
  "timestamp": "2025-12-06T14:30:00Z",
  "metrics": {
    "cpu_percent": 23.5,
    "memory_percent": 67.2,
    "disk_percent": 45.8,
    "network_rx_bytes": 1234567890,
    "network_tx_bytes": 987654321,
    "open_file_descriptors": 342,
    "goroutines": 156
  },
  "labels": {
    "environment": "production",
    "region": "eu-west",
    "service": "guardian"
  },
  "collection_method": "prometheus_scrape",
  "scrape_duration_ms": 45
}
```

**Metric categories**:
- `system` — CPU, memory, disk, network
- `application` — request rates, latencies, error rates
- `business` — receipts/hour, anchors/day, oracle queries
- `security` — auth attempts, failed logins, blocked IPs
- `mesh` — route latencies, node health, capability usage

### 2.2 Logs

**Logs** are structured event records from all system components.

```json
{
  "log_id": "log:guardian:2025-12-06T14:30:15.123Z",
  "timestamp": "2025-12-06T14:30:15.123Z",
  "level": "info",
  "service": "guardian",
  "node": "did:vm:node:brick-01",
  "message": "Anchor cycle completed successfully",
  "attributes": {
    "cycle_id": "anchor-cycle-2025-12-06-001",
    "receipts_anchored": 47,
    "scrolls_included": ["treasury", "mesh", "identity"],
    "duration_ms": 1234,
    "backend": "bitcoin"
  },
  "trace_id": "trace-abc123...",
  "span_id": "span-def456...",
  "caller": "guardian/anchor.go:234"
}
```

**Log levels**:
- `trace` — verbose debugging (not retained long-term)
- `debug` — debugging information
- `info` — normal operations
- `warn` — unexpected but handled conditions
- `error` — errors requiring attention
- `fatal` — system failures

### 2.3 Traces

**Traces** track request flows across distributed components.

```json
{
  "trace_id": "trace-abc123...",
  "name": "treasury_settlement",
  "start_time": "2025-12-06T14:30:00.000Z",
  "end_time": "2025-12-06T14:30:02.345Z",
  "duration_ms": 2345,
  "status": "ok",
  "spans": [
    {
      "span_id": "span-001",
      "parent_span_id": null,
      "name": "http_request",
      "service": "portal",
      "node": "did:vm:node:portal-01",
      "start_time": "2025-12-06T14:30:00.000Z",
      "duration_ms": 2340,
      "attributes": {
        "http.method": "POST",
        "http.url": "/treasury/settle",
        "http.status_code": 200
      }
    },
    {
      "span_id": "span-002",
      "parent_span_id": "span-001",
      "name": "validate_settlement",
      "service": "treasury-engine",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.100Z",
      "duration_ms": 150,
      "attributes": {
        "settlement_id": "settle-2025-12-06-001",
        "accounts_involved": 3
      }
    },
    {
      "span_id": "span-003",
      "parent_span_id": "span-001",
      "name": "emit_receipt",
      "service": "ledger",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.250Z",
      "duration_ms": 50,
      "attributes": {
        "receipt_type": "treasury_settlement",
        "scroll": "treasury"
      }
    },
    {
      "span_id": "span-004",
      "parent_span_id": "span-001",
      "name": "anchor_request",
      "service": "guardian",
      "node": "did:vm:node:brick-01",
      "start_time": "2025-12-06T14:30:00.300Z",
      "duration_ms": 2000,
      "attributes": {
        "backend": "bitcoin",
        "txid": "btc:abc123..."
      }
    }
  ],
  "tags": ["treasury", "settlement", "anchor"]
}
```

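To make the span tree concrete, the following is a minimal Python sketch of how a bottleneck could be located in the trace above. The `self_times` and `find_bottleneck` helpers are hypothetical and not part of `vm-obs`. "Self time" is assumed here to mean a span's duration minus the summed durations of its direct children, which is only meaningful when sibling spans do not overlap, as is the case in this example.

```python
from collections import defaultdict

# Span fields copied from the example trace above (attributes omitted).
SPANS = [
    {"span_id": "span-001", "parent_span_id": None, "name": "http_request", "duration_ms": 2340},
    {"span_id": "span-002", "parent_span_id": "span-001", "name": "validate_settlement", "duration_ms": 150},
    {"span_id": "span-003", "parent_span_id": "span-001", "name": "emit_receipt", "duration_ms": 50},
    {"span_id": "span-004", "parent_span_id": "span-001", "name": "anchor_request", "duration_ms": 2000},
]


def self_times(spans):
    """Map span_id -> duration minus the summed durations of direct children."""
    child_total = defaultdict(int)
    for span in spans:
        if span["parent_span_id"] is not None:
            child_total[span["parent_span_id"]] += span["duration_ms"]
    return {
        s["span_id"]: max(0, s["duration_ms"] - child_total[s["span_id"]])
        for s in spans
    }


def find_bottleneck(spans):
    """Return the span with the largest self time (hypothetical helper)."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s["span_id"]])


bottleneck = find_bottleneck(SPANS)
print(bottleneck["name"], self_times(SPANS)[bottleneck["span_id"]])
```

In this trace the root span keeps only 140 ms of its own time, so `anchor_request` at 2000 ms dominates the request.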
### 2.4 Alerts

**Alerts** are triggered conditions requiring attention.

```json
{
  "alert_id": "alert-2025-12-06-001",
  "name": "HighCPUUsage",
  "severity": "warning",
  "status": "firing",
  "fired_at": "2025-12-06T14:35:00Z",
  "node": "did:vm:node:brick-02",
  "rule": {
    "expression": "cpu_percent > 80 for 5m",
    "threshold": 80,
    "duration": "5m"
  },
  "current_value": 87.3,
  "labels": {
    "environment": "production",
    "region": "eu-west"
  },
  "annotations": {
    "summary": "CPU usage above 80% for 5 minutes",
    "runbook": "https://docs.vaultmesh.io/runbooks/high-cpu"
  },
  "notified": ["slack:ops-channel", "pagerduty:on-call"]
}
```

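The rule `cpu_percent > 80 for 5m` combines a threshold with a hold duration. A minimal Python sketch of that semantics follows, assuming the condition must hold continuously up to the latest sample. The `is_firing` function and the per-minute sample values are invented for illustration, and the real evaluation is presumably done by the alerting backend rather than by code like this.

```python
from datetime import datetime, timedelta


def is_firing(samples, threshold, hold):
    """True if the value has stayed above `threshold` for at least `hold`,
    measured up to the most recent sample.

    samples: list of (timestamp, value) tuples sorted by timestamp.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            breach_start = breach_start or ts  # remember when the breach began
        else:
            breach_start = None  # condition cleared; restart the clock
    return breach_start is not None and samples[-1][0] - breach_start >= hold


# Hypothetical samples, one per minute starting at 14:29.
start = datetime(2025, 12, 6, 14, 29)
values = [78.0, 81.2, 83.5, 84.0, 85.9, 86.4, 87.3]
samples = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]

print(is_firing(samples, threshold=80, hold=timedelta(minutes=5)))
```

Dropping the last sample leaves only four minutes of continuous breach, so the same call on `samples[:-1]` reports that the alert is not yet firing.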
### 2.5 SLO Reports

**SLO (Service Level Objective) Reports** track reliability targets.

```json
{
  "slo_id": "slo:anchor-latency-p99",
  "name": "Anchor Latency P99",
  "description": "99th percentile anchor latency under 30 seconds",
  "target": 0.999,
  "window": "30d",
  "report_period": {
    "start": "2025-11-06T00:00:00Z",
    "end": "2025-12-06T00:00:00Z"
  },
  "achieved": 0.9995,
  "status": "met",
  "error_budget": {
    "total_minutes": 43.2,
    "consumed_minutes": 21.6,
    "remaining_percent": 50.0
  },
  "breakdown": {
    "total_requests": 125000,
    "good_requests": 124937,
    "bad_requests": 63
  },
  "trend": "stable"
}
```

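The `error_budget` figures follow from the target and the window: a 30-day window has 43,200 minutes, and a 0.999 target leaves 0.1% of that (43.2 minutes) as budget. The sketch below reproduces the arithmetic under the assumption that the budget is expressed as a time-equivalent share of the window. The `error_budget` function name is hypothetical, and results are rounded to one decimal place to match the report.

```python
from datetime import timedelta


def error_budget(target, achieved, window):
    """Error budget for an SLO, expressed in minutes of the window."""
    window_minutes = window.total_seconds() / 60
    total = window_minutes * (1 - target)       # minutes the SLO allows to be "bad"
    consumed = window_minutes * (1 - achieved)  # minutes actually spent
    remaining = 100 * (1 - consumed / total) if total else 0.0
    return {
        "total_minutes": round(total, 1),
        "consumed_minutes": round(consumed, 1),
        "remaining_percent": round(remaining, 1),
    }


budget = error_budget(target=0.999, achieved=0.9995, window=timedelta(days=30))
print(budget)
```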
---

## 3. Mapping to Eternal Pattern

### 3.1 Experience Layer (L1)

**CLI** (`vm-obs`):
```bash
# Metrics
vm-obs metrics query --node brick-01 --metric cpu_percent --last 1h
vm-obs metrics list --node brick-01
vm-obs metrics export --from 2025-12-01 --to 2025-12-06 --format prometheus

# Logs
vm-obs logs query --service guardian --level error --last 24h
vm-obs logs tail --node brick-01 --follow
vm-obs logs search "anchor failed" --from 2025-12-01

# Traces
vm-obs trace show trace-abc123
vm-obs trace search --service treasury --duration ">1s" --last 24h
vm-obs trace analyze trace-abc123 --find-bottleneck

# Alerts
vm-obs alert list --status firing
vm-obs alert show alert-2025-12-06-001
vm-obs alert ack alert-2025-12-06-001 --comment "investigating"
vm-obs alert silence --node brick-02 --duration 1h --reason "maintenance"

# SLOs
vm-obs slo list
vm-obs slo show slo:anchor-latency-p99
vm-obs slo report --period 30d --format markdown

# Dashboards
vm-obs dashboard list
vm-obs dashboard show system-overview
vm-obs dashboard export system-overview --format grafana
```

**MCP Tools**:
- `obs_metrics_query` — query metrics for a node/service
- `obs_logs_search` — search logs with filters
- `obs_trace_get` — retrieve trace details
- `obs_alert_status` — current alert status
- `obs_slo_summary` — SLO compliance summary
- `obs_health_check` — overall system health

**Portal HTTP**:
- `GET /obs/metrics` — query metrics
- `GET /obs/logs` — search logs
- `GET /obs/traces` — list traces
- `GET /obs/traces/{trace_id}` — trace details
- `GET /obs/alerts` — list alerts
- `POST /obs/alerts/{id}/ack` — acknowledge alert
- `POST /obs/alerts/silence` — create silence
- `GET /obs/slos` — list SLOs
- `GET /obs/slos/{id}/report` — SLO report
- `GET /obs/health` — system health

---

### 3.2 Engine Layer (L2)

#### Step 1 — Plan → Implicit (Continuous Collection)

Unlike discrete operations, observability collection is continuous. However, certain operations have explicit contracts:

**Alert Acknowledgment Contract**:
```json
{
  "operation_id": "obs-op-2025-12-06-001",
  "operation_type": "alert_acknowledge",
  "alert_id": "alert-2025-12-06-001",
  "acknowledged_by": "did:vm:user:sovereign",
  "acknowledged_at": "2025-12-06T14:40:00Z",
  "comment": "Investigating high CPU on brick-02, likely due to anchor backlog",
  "escalation_suppressed": true,
  "follow_up_required": true,
  "follow_up_deadline": "2025-12-06T16:00:00Z"
}
```

**SLO Definition Contract**:
```json
{
  "operation_id": "obs-op-2025-12-06-002",
  "operation_type": "slo_create",
  "initiated_by": "did:vm:user:sovereign",
  "slo": {
    "id": "slo:oracle-availability",
    "name": "Oracle Availability",
    "description": "Oracle service uptime",
    "indicator": {
      "type": "availability",
      "good_query": "oracle_up == 1",
      "total_query": "count(oracle_requests)"
    },
    "target": 0.999,
    "window": "30d"
  }
}
```

#### Step 2 — Execute → Continuous Collection

Metrics, logs, and traces are collected continuously via:
- Prometheus scraping (metrics)
- Fluent Bit/Vector (logs)
- OpenTelemetry SDK (traces)

State is maintained in time-series databases and search indices, not as discrete state files.

#### Step 3 — Seal → Receipts

**Metric Snapshot Receipt** (hourly):
```json
{
  "type": "obs_metric_snapshot",
  "snapshot_id": "metrics-2025-12-06-14",
  "timestamp": "2025-12-06T14:00:00Z",
  "period": {
    "start": "2025-12-06T13:00:00Z",
    "end": "2025-12-06T14:00:00Z"
  },
  "nodes_reporting": 5,
  "metrics_collected": 15000,
  "aggregates": {
    "avg_cpu_percent": 34.5,
    "max_cpu_percent": 87.3,
    "avg_memory_percent": 62.1,
    "total_receipts_emitted": 1247,
    "total_anchors_completed": 12
  },
  "storage_path": "telemetry/metrics/2025-12-06/hour-14.parquet",
  "content_hash": "blake3:aaa111...",
  "tags": ["observability", "metrics", "hourly"],
  "root_hash": "blake3:bbb222..."
}
```

**Log Batch Receipt** (hourly):
```json
{
  "type": "obs_log_batch",
  "batch_id": "logs-2025-12-06-14",
  "timestamp": "2025-12-06T14:00:00Z",
  "period": {
    "start": "2025-12-06T13:00:00Z",
    "end": "2025-12-06T14:00:00Z"
  },
  "log_counts": {
    "trace": 0,
    "debug": 12456,
    "info": 45678,
    "warn": 234,
    "error": 12,
    "fatal": 0
  },
  "services_reporting": ["guardian", "treasury", "portal", "oracle", "mesh"],
  "storage_path": "telemetry/logs/2025-12-06/hour-14.jsonl.gz",
  "content_hash": "blake3:ccc333...",
  "tags": ["observability", "logs", "hourly"],
  "root_hash": "blake3:ddd444..."
}
```

**Trace Complete Receipt** (for significant traces):
```json
{
  "type": "obs_trace_complete",
  "trace_id": "trace-abc123...",
  "timestamp": "2025-12-06T14:30:02.345Z",
  "name": "treasury_settlement",
  "duration_ms": 2345,
  "status": "ok",
  "span_count": 4,
  "services_involved": ["portal", "treasury-engine", "ledger", "guardian"],
  "nodes_involved": ["portal-01", "brick-01"],
  "triggered_by": "did:vm:user:sovereign",
  "business_context": {
    "settlement_id": "settle-2025-12-06-001",
    "amount": "1000.00 USD"
  },
  "tags": ["observability", "trace", "treasury", "settlement"],
  "root_hash": "blake3:eee555..."
}
```

**Alert Fired Receipt**:
```json
{
  "type": "obs_alert_fired",
  "alert_id": "alert-2025-12-06-001",
  "timestamp": "2025-12-06T14:35:00Z",
  "name": "HighCPUUsage",
  "severity": "warning",
  "node": "did:vm:node:brick-02",
  "rule_expression": "cpu_percent > 80 for 5m",
  "current_value": 87.3,
  "threshold": 80,
  "notifications_sent": ["slack:ops-channel", "pagerduty:on-call"],
  "tags": ["observability", "alert", "fired", "cpu"],
  "root_hash": "blake3:fff666..."
}
```

**Alert Resolved Receipt**:
```json
{
  "type": "obs_alert_resolved",
  "alert_id": "alert-2025-12-06-001",
  "timestamp": "2025-12-06T15:10:00Z",
  "name": "HighCPUUsage",
  "fired_at": "2025-12-06T14:35:00Z",
  "duration_minutes": 35,
  "resolved_by": "automatic",
  "resolution_value": 42.1,
  "acknowledged": true,
  "acknowledged_by": "did:vm:user:sovereign",
  "root_cause": "anchor backlog cleared",
  "tags": ["observability", "alert", "resolved"],
  "root_hash": "blake3:ggg777..."
}
```

**SLO Report Receipt** (daily):
```json
{
  "type": "obs_slo_report",
  "report_id": "slo-report-2025-12-06",
  "timestamp": "2025-12-06T00:00:00Z",
  "period": {
    "start": "2025-11-06T00:00:00Z",
    "end": "2025-12-06T00:00:00Z"
  },
  "slos": [
    {
      "slo_id": "slo:anchor-latency-p99",
      "target": 0.999,
      "achieved": 0.9995,
      "status": "met"
    },
    {
      "slo_id": "slo:oracle-availability",
      "target": 0.999,
      "achieved": 0.9987,
      "status": "at_risk"
    }
  ],
  "overall_status": "healthy",
  "error_budget_status": "sufficient",
  "report_path": "reports/slo/2025-12-06.json",
  "tags": ["observability", "slo", "daily-report"],
  "root_hash": "blake3:hhh888..."
}
```

**Anomaly Detection Receipt**:
```json
{
  "type": "obs_anomaly_detected",
  "anomaly_id": "anomaly-2025-12-06-001",
  "timestamp": "2025-12-06T14:45:00Z",
  "detection_method": "statistical",
  "metric": "treasury.receipts_per_minute",
  "node": "did:vm:node:brick-01",
  "expected_range": {"min": 10, "max": 50},
  "observed_value": 2,
  "deviation_sigma": 4.2,
  "confidence": 0.98,
  "possible_causes": [
    "upstream service degradation",
    "network partition",
    "configuration change"
  ],
  "correlated_events": ["alert-2025-12-06-001"],
  "tags": ["observability", "anomaly", "treasury"],
  "root_hash": "blake3:iii999..."
}
```

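The `deviation_sigma` field can be read as a z-score. The sketch below is one way such a value might be computed, assuming the detector compares the observed value against the mean and sample standard deviation of a recent baseline. The document does not specify the actual statistical method, so the function names, the baseline values, and the 3-sigma threshold are all invented for illustration and are not meant to reproduce the 4.2 in the receipt.

```python
import statistics


def deviation_sigma(baseline, observed):
    """Distance of `observed` from the baseline mean, in standard deviations."""
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation; needs >= 2 points
    if stdev == 0:
        return 0.0 if observed == mean else float("inf")
    return abs(observed - mean) / stdev


def is_anomalous(baseline, observed, threshold_sigma=3.0):
    """Flag values that sit at least `threshold_sigma` deviations from the mean."""
    return deviation_sigma(baseline, observed) >= threshold_sigma


# Hypothetical receipts-per-minute baseline.
baseline = [28, 30, 32, 29, 31]
print(is_anomalous(baseline, 2), is_anomalous(baseline, 31))
```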
---

### 3.3 Ledger Layer (L3)

**Receipt Types**:

| Type                   | When Emitted                 |
| ---------------------- | ---------------------------- |
| `obs_metric_snapshot`  | Hourly metric aggregation    |
| `obs_log_batch`        | Hourly log batch sealed      |
| `obs_trace_complete`   | Significant trace completed  |
| `obs_alert_fired`      | Alert triggered              |
| `obs_alert_resolved`   | Alert resolved               |
| `obs_slo_report`       | Daily SLO report             |
| `obs_anomaly_detected` | Statistical anomaly detected |

**Merkle Coverage**:
- All receipts append to `receipts/observability/observability_events.jsonl`
- `ROOT.observability.txt` updated after each append
- Guardian anchors Observability root in anchor cycles

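The append-then-update-root cycle can be sketched as follows. This is an illustration under stated assumptions, not the VaultMesh implementation: the tree construction (pairwise hashing, duplicating the last node on odd levels) is assumed because the document does not define it, the function names are hypothetical, and BLAKE2b from the Python standard library stands in for BLAKE3, which requires a third-party package.

```python
import hashlib
import json


def leaf_hash(data):
    # Stand-in digest: BLAKE2b-256 from the standard library, not BLAKE3.
    return hashlib.blake2b(data, digest_size=32).digest()


def merkle_root(lines):
    """Fold a list of JSONL lines (bytes) into a single root digest."""
    if not lines:
        return leaf_hash(b"")
    level = [leaf_hash(line) for line in lines]
    while len(level) > 1:
        if len(level) % 2:  # odd count: duplicate the last node
            level.append(level[-1])
        level = [leaf_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def append_receipt(lines, receipt):
    """Append a receipt as compact, key-sorted JSON and return the new root as hex."""
    lines.append(json.dumps(receipt, sort_keys=True, separators=(",", ":")).encode())
    return merkle_root(lines).hex()


log = []
root_1 = append_receipt(log, {"type": "obs_alert_fired", "alert_id": "alert-2025-12-06-001"})
root_2 = append_receipt(log, {"type": "obs_alert_resolved", "alert_id": "alert-2025-12-06-001"})
print(root_1)
print(root_2)
```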
---

## 4. Query Interface

`observability_query_events.py`:

```bash
# Metric snapshots
vm-obs query --type metric_snapshot --from 2025-12-01

# Log batches with errors
vm-obs query --type log_batch --filter "log_counts.error > 0"

# Traces over 5 seconds
vm-obs query --type trace_complete --filter "duration_ms > 5000"

# All alerts for a node
vm-obs query --type alert_fired,alert_resolved --node brick-02

# SLO reports with missed targets
vm-obs query --type slo_report --filter "overall_status != 'healthy'"

# Anomalies in last 7 days
vm-obs query --type anomaly_detected --last 7d

# Export for analysis
vm-obs query --from 2025-12-01 --format parquet > observability_dec.parquet
```

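Because the scroll is an append-only JSONL file, queries such as those above reduce to a filtered scan. A minimal sketch of that scan follows, assuming one JSON receipt per line. The `query_receipts` function is hypothetical, and its Python predicates stand in for the `--filter` expressions, whose real syntax is not specified here.

```python
import json


def query_receipts(lines, types=None, predicate=None):
    """Yield receipts whose type is in `types` and that satisfy `predicate`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        receipt = json.loads(line)
        if types is not None and receipt.get("type") not in types:
            continue
        if predicate is not None and not predicate(receipt):
            continue
        yield receipt


# Two abbreviated receipts standing in for observability_events.jsonl.
SAMPLE = [
    '{"type": "obs_log_batch", "batch_id": "logs-2025-12-06-14", "log_counts": {"error": 12}}',
    '{"type": "obs_trace_complete", "trace_id": "trace-abc123...", "duration_ms": 2345}',
]

with_errors = list(query_receipts(
    SAMPLE,
    types={"obs_log_batch"},
    predicate=lambda r: r["log_counts"]["error"] > 0,
))
slow = list(query_receipts(
    SAMPLE,
    types={"obs_trace_complete"},
    predicate=lambda r: r["duration_ms"] > 5000,
))
print(len(with_errors), len(slow))
```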
**Correlation Tool**:
```bash
# Correlate events around a timestamp
vm-obs correlate --timestamp "2025-12-06T14:35:00Z" --window 35m

# Output:
# Timeline around 2025-12-06T14:35:00Z (±35m):
#
# 14:20:00 [metric] brick-02 cpu_percent starts rising
# 14:25:00 [log] guardian: "anchor queue depth increasing"
# 14:30:00 [trace] trace-abc123 completed (2345ms, normal)
# 14:30:00 [metric] brick-02 cpu_percent crosses 80%
# 14:35:00 [alert] HighCPUUsage fired on brick-02
# 14:40:00 [log] guardian: "processing backlog"
# 14:45:00 [anomaly] treasury.receipts_per_minute low
# 14:50:00 [log] guardian: "backlog cleared"
# 15:10:00 [alert] HighCPUUsage resolved on brick-02
```

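Conceptually, correlation is a time-window filter over events from every signal type, followed by a chronological sort. The sketch below illustrates that idea only. The `correlate` and `at` helpers, the event dictionaries, and the 13:00 entry are invented, and a 35-minute half-window is chosen so that the 15:10 resolution falls inside it.

```python
from datetime import datetime, timedelta


def correlate(events, center, window):
    """Return events within +/- `window` of `center`, oldest first."""
    lo, hi = center - window, center + window
    return sorted(
        (e for e in events if lo <= e["timestamp"] <= hi),
        key=lambda e: e["timestamp"],
    )


def at(hhmm):
    """Build a timestamp on 2025-12-06 from an 'HH:MM' string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2025, 12, 6, hour, minute)


EVENTS = [
    {"timestamp": at("15:10"), "kind": "alert", "text": "HighCPUUsage resolved on brick-02"},
    {"timestamp": at("14:35"), "kind": "alert", "text": "HighCPUUsage fired on brick-02"},
    {"timestamp": at("14:45"), "kind": "anomaly", "text": "treasury.receipts_per_minute low"},
    {"timestamp": at("13:00"), "kind": "log", "text": "unrelated earlier event"},  # hypothetical
]

timeline = correlate(EVENTS, at("14:35"), timedelta(minutes=35))
for event in timeline:
    print(event["timestamp"].strftime("%H:%M:%S"), f"[{event['kind']}]", event["text"])
```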
---

## 5. Design Gate Checklist

| Question              | Observability Answer |
| --------------------- | -------------------- |
| Clear entrypoint?     | ✅ CLI (`vm-obs`), MCP tools, Portal HTTP |
| Contract produced?    | ✅ Implicit (continuous) + explicit for alert acks, SLO definitions |
| State object?         | ✅ Time-series DBs, search indices (continuous state) |
| Receipts emitted?     | ✅ Seven receipt types covering all observability events |
| Append-only JSONL?    | ✅ `receipts/observability/observability_events.jsonl` |
| Merkle root?          | ✅ `ROOT.observability.txt` |
| Guardian anchor path? | ✅ Observability root included in ProofChain |
| Query tool?           | ✅ `observability_query_events.py` + correlation tool |

---

## 6. Data Pipeline

### 6.1 Collection Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                           BRICK Nodes                           │
│   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐            │
│   │ brick-01│  │ brick-02│  │ brick-03│  │portal-01│            │
│   └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘            │
│        │            │            │            │                 │
│        ▼            ▼            ▼            ▼                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    Collection Layer                     │   │
│   │  ┌──────────┐  ┌──────────┐  ┌───────────────────────┐  │   │
│   │  │Prometheus│  │Fluent Bit│  │OpenTelemetry Collector│  │   │
│   │  │ (metrics)│  │  (logs)  │  │       (traces)        │  │   │
│   │  └────┬─────┘  └────┬─────┘  └───────────┬───────────┘  │   │
│   └───────┼─────────────┼────────────────────┼──────────────┘   │
│           │             │                    │                  │
│           ▼             ▼                    ▼                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                      Storage Layer                      │   │
│   │  ┌──────────┐  ┌──────────┐  ┌───────────────────────┐  │   │
│   │  │VictoriaM │  │  Loki/   │  │     Tempo/Jaeger      │  │   │
│   │  │(metrics) │  │OpenSearch│  │       (traces)        │  │   │
│   │  └────┬─────┘  └────┬─────┘  └───────────┬───────────┘  │   │
│   └───────┼─────────────┼────────────────────┼──────────────┘   │
│           │             │                    │                  │
│           ▼             ▼                    ▼                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                      Receipt Layer                      │   │
│   │  ┌──────────────────────────────────────────────────┐   │   │
│   │  │          Observability Receipt Emitter           │   │   │
│   │  │   (hourly snapshots, alerts, SLOs, anomalies)    │   │   │
│   │  └──────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
```

### 6.2 Retention Policies

| Data Type        | Hot Storage | Warm Storage | Cold/Archive | Receipt Cadence |
| ---------------- | ----------- | ------------ | ------------ | --------------- |
| Metrics (raw)    | 7 days      | 30 days      | 1 year       | Hourly          |
| Metrics (1h agg) | 30 days     | 1 year       | 5 years      | Hourly          |
| Logs (all)       | 7 days      | 30 days      | 1 year       | Hourly          |
| Logs (error+)    | 30 days     | 1 year       | 5 years      | Hourly          |
| Traces (sampled) | 7 days      | 30 days      | —            | Per-trace       |
| Traces (errors)  | 30 days     | 1 year       | 5 years      | Per-trace       |
| Alerts           | Indefinite  | Indefinite   | Indefinite   | Per-event       |
| SLO Reports      | Indefinite  | Indefinite   | Indefinite   | Daily           |

### 6.3 Sampling Strategy

```json
{
  "sampling_rules": [
    {
      "name": "always_sample_errors",
      "condition": "status == 'error' OR level >= 'error'",
      "rate": 1.0
    },
    {
      "name": "always_sample_slow",
      "condition": "duration_ms > 5000",
      "rate": 1.0
    },
    {
      "name": "always_sample_sensitive",
      "condition": "service IN ['treasury', 'identity', 'offsec']",
      "rate": 1.0
    },
    {
      "name": "default_traces",
      "condition": "true",
      "rate": 0.1
    }
  ]
}
```

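One plausible reading of these rules is first-match-wins, with `default_traces` acting as the catch-all. The sketch below encodes that assumption. The rule conditions are hand-translated into Python lambdas, with `level >= 'error'` read as `error` or `fatal` per the log levels in section 2.2. The keep/drop decision hashes the trace ID so that every span of a trace gets the same answer. Evaluation order, the hashing scheme, and all function names are assumptions rather than documented behavior.

```python
import hashlib

# (name, condition, rate) in priority order; mirrors the JSON rules above.
RULES = [
    ("always_sample_errors",
     lambda t: t.get("status") == "error" or t.get("level") in ("error", "fatal"), 1.0),
    ("always_sample_slow", lambda t: t.get("duration_ms", 0) > 5000, 1.0),
    ("always_sample_sensitive",
     lambda t: t.get("service") in ("treasury", "identity", "offsec"), 1.0),
    ("default_traces", lambda t: True, 0.1),
]


def match_rule(trace):
    """Return (name, rate) of the first rule whose condition holds."""
    for name, condition, rate in RULES:
        if condition(trace):
            return name, rate
    raise LookupError("no rule matched")  # unreachable while a catch-all rule exists


def should_sample(trace):
    """Deterministic decision: map the trace ID into [0, 1) and compare to the rate."""
    _, rate = match_rule(trace)
    digest = hashlib.sha256(trace["trace_id"].encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical traces for illustration.
slow = {"trace_id": "trace-001", "status": "ok", "duration_ms": 7200, "service": "portal"}
fast = {"trace_id": "trace-002", "status": "ok", "duration_ms": 120, "service": "portal"}
print(match_rule(slow), should_sample(slow))
print(match_rule(fast))
```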
---

## 7. Alerting Framework

### 7.1 Alert Rules

```yaml
groups:
  - name: vaultmesh-critical
    rules:
      - alert: NodeDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is down"
          runbook: https://docs.vaultmesh.io/runbooks/node-down

      - alert: AnchorBacklogHigh
        expr: guardian_anchor_queue_depth > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Anchor queue depth is {{ $value }}"

      - alert: SLOBudgetBurning
        expr: slo_error_budget_remaining_percent < 25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SLO {{ $labels.slo }} error budget at {{ $value }}%"
```

### 7.2 Notification Channels

| Severity   | Channels                        | Response Time |
| ---------- | ------------------------------- | ------------- |
| `critical` | PagerDuty, SMS, Slack #critical | Immediate     |
| `high`     | PagerDuty, Slack #alerts        | 15 minutes    |
| `warning`  | Slack #alerts, Email            | 1 hour        |
| `info`     | Slack #ops                      | Best effort   |

---

## 8. Integration Points

| System         | Integration                                              |
| -------------- | -------------------------------------------------------- |
| **Guardian**   | Emits anchor metrics/traces; alerts on anchor failures   |
| **Treasury**   | Transaction metrics; latency SLOs; receipt throughput    |
| **Identity**   | Auth event logs; failed login alerts; session metrics    |
| **Mesh**       | Node health metrics; route latency; topology change logs |
| **OffSec**     | Security event correlation; incident timeline enrichment |
| **Oracle**     | Query latency metrics; confidence score distributions    |
| **Automation** | Workflow execution traces; n8n performance metrics       |

---

## 9. Future Extensions

- **AI-powered anomaly detection**: ML models for predictive alerting
- **Distributed tracing visualization**: Real-time trace graphs in Portal
- **Log pattern mining**: Automatic extraction of error patterns
- **Chaos engineering integration**: Correlate chaos experiments with observability
- **Cost attribution**: Resource usage per scroll/service for Treasury billing
- **Compliance dashboards**: Real-time compliance posture visualization