# Mesh Observatory

Prometheus + Grafana monitoring stack for Cloudflare infrastructure state.

## Components

| Component | Port | Description |
|-----------|------|-------------|
| Prometheus | 9090 | Metrics collection and storage |
| Grafana | 3000 | Visualization dashboards |
| Metrics Exporter | 9100 | Custom Cloudflare metrics |

## Quick Start

### 1. Configure Environment

```bash
cp .env.example .env
# Edit .env with your credentials
```

Required environment variables:

```
CLOUDFLARE_API_TOKEN=
CLOUDFLARE_ZONE_ID=
CLOUDFLARE_ACCOUNT_ID=
GRAFANA_PASSWORD=
```

### 2. Start Stack

```bash
docker-compose up -d
```

### 3. Access Dashboards

- Grafana: http://localhost:3000 (admin / $GRAFANA_PASSWORD)
- Prometheus: http://localhost:9090

## Dashboards

| Dashboard | UID | Description |
|-----------|-----|-------------|
| Cloudflare Mesh Overview | cf-overview | Main command center |
| DNS Health | cf-dns | DNS records, DNSSEC, types |
| Tunnel Status | cf-tunnel | Tunnel health, connections |
| Invariants & Compliance | cf-invariants | Invariant pass/fail, anomalies |
| Security Settings | cf-security | SSL, TLS, Access apps |
| ProofChain & Anchors | cf-proofchain | Merkle roots, snapshot freshness |

## Metrics Reference

### DNS Metrics

- `cloudflare_dns_records_total` - Total DNS records
- `cloudflare_dns_records_proxied` - Proxied records count
- `cloudflare_dns_records_unproxied` - DNS-only records count
- `cloudflare_dns_records_by_type{type="A|AAAA|CNAME|..."}` - Records by type
- `cloudflare_dnssec_enabled` - DNSSEC status (0/1)

### Tunnel Metrics

- `cloudflare_tunnels_total` - Total active tunnels
- `cloudflare_tunnels_healthy` - Tunnels with active connections
- `cloudflare_tunnels_unhealthy` - Tunnels without connections
- `cloudflare_tunnel_connections_total` - Total tunnel connections

### Zone Settings

- `cloudflare_zone_ssl_strict` - SSL mode is strict (0/1)
- `cloudflare_zone_tls_version_secure` - TLS 1.2+ enforced (0/1)
- `cloudflare_zone_always_https` - HTTPS redirect enabled (0/1)
- `cloudflare_zone_browser_check` - Browser integrity check (0/1)

### Access Metrics

- `cloudflare_access_apps_total` - Total Access applications
- `cloudflare_access_apps_by_type{type="..."}` - Apps by type

### Invariant Metrics

- `cloudflare_invariants_total` - Total invariant checks
- `cloudflare_invariants_passed` - Passing invariants
- `cloudflare_invariants_failed` - Failing invariants
- `cloudflare_invariants_pass_rate` - Pass percentage
- `cloudflare_invariant_report_age_seconds` - Report freshness

### Snapshot Metrics

- `cloudflare_snapshot_age_seconds` - Seconds since last snapshot
- `cloudflare_snapshot_merkle_root_set` - Merkle root present (0/1)

### Anomaly Metrics

- `cloudflare_anomalies_total` - Total anomaly receipts
- `cloudflare_anomalies_last_24h` - Recent anomalies

## Drift Visualizer

Standalone tool for comparing state sources.

### Usage

```bash
python3 drift-visualizer.py \
  --snapshot ../snapshots/cloudflare-latest.json \
  --manifest ../cloudflare_dns_manifest.md \
  --output-dir ../reports
```

### Output

- `drift-report-.json` - Machine-readable diff
- `drift-report-.html` - Visual HTML report

## Directory Structure

```
observatory/
├── docker-compose.yml          # Stack definition
├── Dockerfile.exporter         # Metrics exporter container
├── prometheus.yml              # Prometheus config
├── metrics-exporter.py         # Custom exporter
├── drift-visualizer.py         # Drift analysis tool
├── datasources/                # Grafana datasource provisioning
│   └── prometheus.yml
├── dashboards/                 # Grafana dashboard provisioning
│   ├── dashboards.yml
│   ├── cloudflare-overview.json
│   ├── dns-health.json
│   ├── tunnel-status.json
│   ├── invariants.json
│   ├── security-settings.json
│   └── proofchain.json
└── rules/                      # Prometheus alerting rules (optional)
```

## Integration with CI/CD

The metrics exporter reads from:

- `../snapshots/` - State snapshots from state-reconciler.py
- `../anomalies/` - Anomaly receipts from invariant-checker.py

Ensure these directories are populated by the GitLab CI pipeline or systemd services.

## Alerting (Optional)

Create alerting rules in `rules/alerts.yml`:

```yaml
groups:
  - name: cloudflare
    rules:
      - alert: InvariantFailure
        expr: cloudflare_invariants_failed > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cloudflare invariant check failing"

      - alert: TunnelUnhealthy
        expr: cloudflare_tunnels_unhealthy > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cloudflare tunnel has no connections"

      - alert: SnapshotStale
        expr: cloudflare_snapshot_age_seconds > 7200
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cloudflare state snapshot older than 2 hours"
```
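
The `SnapshotStale` rule above depends on `cloudflare_snapshot_age_seconds` being fresh and correct. As a rough illustration of how such a gauge could be derived, the sketch below takes the newest `*.json` file in a snapshot directory and renders its age in the Prometheus text exposition format. This is not the code in `metrics-exporter.py`: the helper names `snapshot_age_seconds` and `render_gauge` are hypothetical, and using the file modification time (rather than a timestamp stored inside the snapshot) is an assumption.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def snapshot_age_seconds(snapshot_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Return seconds since the newest *.json file in snapshot_dir was modified.

    Hypothetical helper: assumes file mtime is a good proxy for snapshot time.
    Returns None when the directory is missing or holds no snapshots.
    """
    if not snapshot_dir.is_dir():
        return None
    mtimes = [p.stat().st_mtime for p in snapshot_dir.glob("*.json")]
    if not mtimes:
        return None
    current = time.time() if now is None else now
    # Clamp at zero so clock skew never produces a negative age.
    return max(0.0, current - max(mtimes))


def render_gauge(name: str, help_text: str, value: float) -> str:
    """Format a single gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


if __name__ == "__main__":
    # Demo against a throwaway directory instead of the real ../snapshots/.
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp) / "cloudflare-latest.json"
        snapshot.write_text("{}")
        # Backdate the file by three hours, past the 7200 s alert threshold.
        three_hours_ago = time.time() - 3 * 3600
        os.utime(snapshot, (three_hours_ago, three_hours_ago))
        age = snapshot_age_seconds(Path(tmp))
        print(render_gauge(
            "cloudflare_snapshot_age_seconds",
            "Seconds since last snapshot",
            age,
        ))
```

Returning `None` for an empty or missing directory lets a caller skip the metric entirely instead of reporting an age of zero, which would otherwise hide a stalled pipeline from the `SnapshotStale` rule.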