Initial commit: Cloudflare infrastructure with WAF Intelligence
- Complete Cloudflare Terraform configuration (DNS, WAF, tunnels, access)
- WAF Intelligence MCP server with threat analysis and ML classification
- GitOps automation with PR workflows and drift detection
- Observatory monitoring stack with Prometheus/Grafana
- IDE operator rules for governed development
- Security playbooks and compliance frameworks
- Autonomous remediation and state reconciliation
26
observatory/.env.example
Normal file
@@ -0,0 +1,26 @@
# Cloudflare Mesh Observatory Environment
# Copy to .env and fill in values

# Cloudflare API Credentials
CLOUDFLARE_API_TOKEN=
CLOUDFLARE_ZONE_ID=
CLOUDFLARE_ACCOUNT_ID=

# Grafana Admin Password
GRAFANA_PASSWORD=changeme

# ==============================================
# Phase 5B - Alerting Configuration
# ==============================================

# Slack Integration
# Create incoming webhook: https://api.slack.com/messaging/webhooks
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ

# PagerDuty Integration
# Create service integration: https://support.pagerduty.com/docs/services-and-integrations
PAGERDUTY_SERVICE_KEY=

# Email (SMTP) Settings
SMTP_USERNAME=
SMTP_PASSWORD=
19
observatory/Dockerfile.exporter
Normal file
@@ -0,0 +1,19 @@
# Cloudflare Metrics Exporter Container
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
RUN pip install --no-cache-dir requests

# Copy exporter script
COPY metrics-exporter.py /app/

# Non-root user
RUN useradd -r -s /sbin/nologin exporter
USER exporter

EXPOSE 9100

ENTRYPOINT ["python3", "/app/metrics-exporter.py"]
CMD ["--port", "9100"]
171
observatory/README.md
Normal file
@@ -0,0 +1,171 @@
|
||||
# Mesh Observatory
|
||||
|
||||
Prometheus + Grafana monitoring stack for Cloudflare infrastructure state.
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Port | Description |
|-----------|------|-------------|
| Prometheus | 9090 | Metrics collection and storage |
| Grafana | 3000 | Visualization dashboards |
| Metrics Exporter | 9100 | Custom Cloudflare metrics |
|
||||
|
||||
## Quick Start
|
||||
|
||||
### 1. Configure Environment
|
||||
|
||||
```bash
cp .env.example .env
# Edit .env with your credentials
```

Required environment variables:
```
CLOUDFLARE_API_TOKEN=<your-token>
CLOUDFLARE_ZONE_ID=<your-zone-id>
CLOUDFLARE_ACCOUNT_ID=<your-account-id>
GRAFANA_PASSWORD=<secure-password>
```
|
||||
|
||||
### 2. Start Stack
|
||||
|
||||
```bash
docker-compose up -d
```
|
||||
|
||||
### 3. Access Dashboards
|
||||
|
||||
- Grafana: http://localhost:3000 (admin / $GRAFANA_PASSWORD)
|
||||
- Prometheus: http://localhost:9090
|
||||
|
||||
## Dashboards
|
||||
|
||||
| Dashboard | UID | Description |
|-----------|-----|-------------|
| Cloudflare Mesh Overview | cf-overview | Main command center |
| DNS Health | cf-dns | DNS records, DNSSEC, types |
| Tunnel Status | cf-tunnel | Tunnel health, connections |
| Invariants & Compliance | cf-invariants | Invariant pass/fail, anomalies |
| Security Settings | cf-security | SSL, TLS, Access apps |
| ProofChain & Anchors | cf-proofchain | Merkle roots, snapshot freshness |
|
||||
|
||||
## Metrics Reference
|
||||
|
||||
### DNS Metrics
|
||||
- `cloudflare_dns_records_total` - Total DNS records
|
||||
- `cloudflare_dns_records_proxied` - Proxied records count
|
||||
- `cloudflare_dns_records_unproxied` - DNS-only records count
|
||||
- `cloudflare_dns_records_by_type{type="A|AAAA|CNAME|..."}` - Records by type
|
||||
- `cloudflare_dnssec_enabled` - DNSSEC status (0/1)
|
||||
|
||||
### Tunnel Metrics
|
||||
- `cloudflare_tunnels_total` - Total active tunnels
|
||||
- `cloudflare_tunnels_healthy` - Tunnels with active connections
|
||||
- `cloudflare_tunnels_unhealthy` - Tunnels without connections
|
||||
- `cloudflare_tunnel_connections_total` - Total tunnel connections
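The exporter source is not shown here, so the following is only a minimal sketch of how the four tunnel gauges could be derived from a list of tunnel objects. It assumes each tunnel is a dict with a `connections` list (possibly empty or missing); the function name `tunnel_health_counts` is hypothetical.

```python
from typing import Iterable, Mapping


def tunnel_health_counts(tunnels: Iterable[Mapping]) -> dict:
    """Count tunnels by whether they report at least one connection."""
    total = healthy = connections = 0
    for tunnel in tunnels:
        # Assumption: "connections" is a list; treat a missing key as no connections.
        conns = tunnel.get("connections") or []
        total += 1
        connections += len(conns)
        if conns:
            healthy += 1
    return {
        "cloudflare_tunnels_total": total,
        "cloudflare_tunnels_healthy": healthy,
        "cloudflare_tunnels_unhealthy": total - healthy,
        "cloudflare_tunnel_connections_total": connections,
    }
```

With this definition, healthy and unhealthy always sum to the total, which keeps the `TunnelUnhealthy` style of alert easy to reason about.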
|
||||
|
||||
### Zone Settings
|
||||
- `cloudflare_zone_ssl_strict` - SSL mode is strict (0/1)
|
||||
- `cloudflare_zone_tls_version_secure` - TLS 1.2+ enforced (0/1)
|
||||
- `cloudflare_zone_always_https` - HTTPS redirect enabled (0/1)
|
||||
- `cloudflare_zone_browser_check` - Browser integrity check (0/1)
|
||||
|
||||
### Access Metrics
|
||||
- `cloudflare_access_apps_total` - Total Access applications
|
||||
- `cloudflare_access_apps_by_type{type="..."}` - Apps by type
|
||||
|
||||
### Invariant Metrics
|
||||
- `cloudflare_invariants_total` - Total invariant checks
|
||||
- `cloudflare_invariants_passed` - Passing invariants
|
||||
- `cloudflare_invariants_failed` - Failing invariants
|
||||
- `cloudflare_invariants_pass_rate` - Pass percentage
|
||||
- `cloudflare_invariant_report_age_seconds` - Report freshness
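As a rough illustration of how the pass rate relates to the passed and failed counts, here is a small sketch. It assumes the rate is expressed on a 0-100 scale and that an empty report yields 0; both are assumptions, and the helper name `invariant_pass_rate` is hypothetical.

```python
def invariant_pass_rate(passed: int, failed: int) -> float:
    """Return the percentage of passing invariants (0-100)."""
    total = passed + failed
    if total == 0:
        # Assumption: report 0 rather than dividing by zero when no checks ran.
        return 0.0
    return 100.0 * passed / total
```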
|
||||
|
||||
### Snapshot Metrics
|
||||
- `cloudflare_snapshot_age_seconds` - Seconds since last snapshot
|
||||
- `cloudflare_snapshot_merkle_root_set` - Merkle root present (0/1)
|
||||
|
||||
### Anomaly Metrics
|
||||
- `cloudflare_anomalies_total` - Total anomaly receipts
|
||||
- `cloudflare_anomalies_last_24h` - Recent anomalies
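Prometheus scrapes these metrics in its text exposition format. The sketch below shows one way a gauge, with or without labels, could be rendered using only the standard library. The helper `render_gauge` is hypothetical and is not taken from `metrics-exporter.py`.

```python
def _escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge family.

    samples is a list of (labels_dict, value) pairs; use {} for no labels.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{_escape_label(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_gauge(
        "cloudflare_dns_records_by_type",
        "Records by type",
        [({"type": "A"}, 3), ({"type": "CNAME"}, 2)],
    ))
```

Emitting one `# HELP` and `# TYPE` line per metric family, followed by all of its samples, matches what the labelled metrics above (for example `cloudflare_dns_records_by_type`) need.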
|
||||
|
||||
## Drift Visualizer
|
||||
|
||||
Standalone tool for comparing state sources.
|
||||
|
||||
### Usage
|
||||
|
||||
```bash
python3 drift-visualizer.py \
  --snapshot ../snapshots/cloudflare-latest.json \
  --manifest ../cloudflare_dns_manifest.md \
  --output-dir ../reports
```
|
||||
|
||||
### Output
|
||||
|
||||
- `drift-report-<timestamp>.json` - Machine-readable diff
|
||||
- `drift-report-<timestamp>.html` - Visual HTML report
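The internals of `drift-visualizer.py` are not shown here. As a sketch of the general idea, drift between two state sources can be classified into missing, unexpected, and changed records. The data shape below (a dict keyed by `(name, type)` with the record content as value) and the function name `diff_records` are assumptions for illustration only.

```python
def diff_records(expected: dict, actual: dict) -> dict:
    """Compare expected (manifest) records against actual (snapshot) records.

    Both arguments map (name, record_type) -> content.
    """
    missing = sorted(key for key in expected if key not in actual)
    unexpected = sorted(key for key in actual if key not in expected)
    changed = sorted(
        key for key in expected
        if key in actual and expected[key] != actual[key]
    )
    return {"missing": missing, "unexpected": unexpected, "changed": changed}


if __name__ == "__main__":
    manifest = {("www", "CNAME"): "example.com", ("old", "A"): "192.0.2.20"}
    snapshot = {("www", "CNAME"): "example.com", ("new", "TXT"): "v=spf1 -all"}
    print(diff_records(manifest, snapshot))
```

A report with all three lists empty would indicate that the snapshot and the manifest agree.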
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
observatory/
├── docker-compose.yml          # Stack definition
├── Dockerfile.exporter         # Metrics exporter container
├── prometheus.yml              # Prometheus config
├── metrics-exporter.py         # Custom exporter
├── drift-visualizer.py         # Drift analysis tool
├── datasources/                # Grafana datasource provisioning
│   └── prometheus.yml
├── dashboards/                 # Grafana dashboard provisioning
│   ├── dashboards.yml
│   ├── cloudflare-overview.json
│   ├── dns-health.json
│   ├── tunnel-status.json
│   ├── invariants.json
│   ├── security-settings.json
│   └── proofchain.json
└── rules/                      # Prometheus alerting rules (optional)
```
|
||||
|
||||
## Integration with CI/CD
|
||||
|
||||
The metrics exporter reads from:
|
||||
- `../snapshots/` - State snapshots from state-reconciler.py
|
||||
- `../anomalies/` - Anomaly receipts from invariant-checker.py
|
||||
|
||||
Ensure these directories are populated by the GitLab CI pipeline or systemd services.
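If those directories stop being refreshed, a freshness metric such as `cloudflare_snapshot_age_seconds` is what would reveal it. The sketch below assumes freshness is taken from the modification time of the newest `*.json` file in the snapshot directory. The real exporter may instead read a timestamp from inside the snapshot, and the helper name `snapshot_age_seconds` is hypothetical.

```python
import time
from pathlib import Path
from typing import Optional


def snapshot_age_seconds(snapshot_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Return seconds since the newest snapshot file was modified.

    Returns None when the directory contains no *.json files.
    """
    files = list(Path(snapshot_dir).glob("*.json"))
    if not files:
        return None
    newest_mtime = max(path.stat().st_mtime for path in files)
    current = time.time() if now is None else now
    # Clamp at zero in case of clock skew between writer and reader.
    return max(0.0, current - newest_mtime)
```

Returning `None` for an empty directory lets the caller decide whether to omit the metric or export a sentinel value.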
|
||||
|
||||
## Alerting (Optional)
|
||||
|
||||
Create alerting rules in `rules/alerts.yml`:
|
||||
|
||||
```yaml
groups:
  - name: cloudflare
    rules:
      - alert: InvariantFailure
        expr: cloudflare_invariants_failed > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cloudflare invariant check failing"

      - alert: TunnelUnhealthy
        expr: cloudflare_tunnels_unhealthy > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cloudflare tunnel has no connections"

      - alert: SnapshotStale
        expr: cloudflare_snapshot_age_seconds > 7200
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cloudflare state snapshot older than 2 hours"
```
|
||||
365
observatory/alertmanager/alertmanager.yml
Normal file
@@ -0,0 +1,365 @@
|
||||
# Alertmanager Configuration for Cloudflare Mesh Observatory
|
||||
# Phase 5B - Alerts & Escalation
|
||||
|
||||
global:
|
||||
# Default SMTP settings (override in receivers)
|
||||
smtp_smarthost: 'smtp.example.com:587'
|
||||
smtp_from: 'cloudflare-alerts@yourdomain.com'
|
||||
smtp_auth_username: '${SMTP_USERNAME}'
|
||||
smtp_auth_password: '${SMTP_PASSWORD}'
|
||||
smtp_require_tls: true
|
||||
|
||||
# Slack API URL (set via environment)
|
||||
slack_api_url: '${SLACK_WEBHOOK_URL}'
|
||||
|
||||
# PagerDuty Events API endpoint
|
||||
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
|
||||
|
||||
# Resolve timeout
|
||||
resolve_timeout: 5m
|
||||
|
||||
# Templates for notifications
|
||||
templates:
|
||||
- '/etc/alertmanager/templates/*.tmpl'
|
||||
|
||||
# Routing tree
|
||||
route:
|
||||
# Default receiver
|
||||
receiver: 'slack-default'
|
||||
|
||||
# Group alerts by these labels
|
||||
group_by: ['alertname', 'severity', 'component']
|
||||
|
||||
# Wait before sending first notification
|
||||
group_wait: 30s
|
||||
|
||||
# Wait before sending notification about new alerts in group
|
||||
group_interval: 5m
|
||||
|
||||
# Wait before re-sending notification
|
||||
repeat_interval: 4h
|
||||
|
||||
# Child routes for different severities and components
|
||||
routes:
|
||||
# ============================================
|
||||
# CRITICAL ALERTS - Immediate PagerDuty
|
||||
# ============================================
|
||||
- match:
|
||||
severity: critical
|
||||
receiver: 'pagerduty-critical'
|
||||
group_wait: 10s
|
||||
repeat_interval: 1h
|
||||
continue: true # Also send to Slack
|
||||
|
||||
- match:
|
||||
severity: critical
|
||||
receiver: 'slack-critical'
|
||||
group_wait: 10s
|
||||
|
||||
# ============================================
|
||||
# TUNNEL ALERTS
|
||||
# ============================================
|
||||
- match:
|
||||
component: tunnel
|
||||
receiver: 'slack-tunnels'
|
||||
routes:
|
||||
- match:
|
||||
severity: critical
|
||||
receiver: 'pagerduty-critical'
|
||||
continue: true
|
||||
- match:
|
||||
severity: critical
|
||||
receiver: 'slack-critical'
|
||||
|
||||
# ============================================
|
||||
# DNS ALERTS
|
||||
# ============================================
|
||||
- match:
|
||||
component: dns
|
||||
receiver: 'slack-dns'
|
||||
routes:
|
||||
- match:
|
||||
severity: critical
|
||||
receiver: 'pagerduty-critical'
|
||||
continue: true
|
||||
- match:
|
||||
alertname: DNSHijackDetected
|
||||
receiver: 'pagerduty-critical'
|
||||
|
||||
# ============================================
|
||||
# WAF ALERTS
|
||||
# ============================================
|
||||
- match:
|
||||
component: waf
|
||||
receiver: 'slack-waf'
|
||||
routes:
|
||||
- match:
|
||||
severity: critical
|
||||
receiver: 'pagerduty-critical'
|
||||
continue: true
|
||||
- match:
|
||||
alertname: WAFMassiveAttack
|
||||
receiver: 'pagerduty-critical'
|
||||
|
||||
# ============================================
|
||||
# INVARIANT ALERTS (Security Policy Violations)
|
||||
# ============================================
|
||||
- match:
|
||||
component: invariant
|
||||
receiver: 'slack-security'
|
||||
routes:
|
||||
- match:
|
||||
severity: critical
|
||||
receiver: 'pagerduty-critical'
|
||||
continue: true
|
||||
|
||||
# ============================================
|
||||
# PROOFCHAIN ALERTS
|
||||
# ============================================
|
||||
- match:
|
||||
component: proofchain
|
||||
receiver: 'slack-proofchain'
|
||||
routes:
|
||||
- match:
|
||||
severity: critical
|
||||
receiver: 'pagerduty-critical'
|
||||
|
||||
# ============================================
|
||||
# WARNING ALERTS - Slack only
|
||||
# ============================================
|
||||
- match:
|
||||
severity: warning
|
||||
receiver: 'slack-warnings'
|
||||
repeat_interval: 8h
|
||||
|
||||
# ============================================
|
||||
# INFO ALERTS - Daily digest
|
||||
# ============================================
|
||||
- match:
|
||||
severity: info
|
||||
receiver: 'email-daily'
|
||||
group_wait: 1h
|
||||
repeat_interval: 24h
|
||||
|
||||
# ============================================
|
||||
# PHASE 6 - GITOPS DRIFT REMEDIATION
|
||||
# Route drift alerts to GitOps webhook for auto-PR
|
||||
# ============================================
|
||||
- match:
|
||||
alertname: DNSDriftDetected
|
||||
receiver: 'gitops-drift-pr'
|
||||
continue: true # Keep evaluating the routes that follow
|
||||
|
||||
- match:
|
||||
alertname: WAFRuleMissing
|
||||
receiver: 'gitops-drift-pr'
|
||||
continue: true
|
||||
|
||||
- match:
|
||||
alertname: FirewallRuleMissing
|
||||
receiver: 'gitops-drift-pr'
|
||||
continue: true
|
||||
|
||||
- match:
|
||||
alertname: TunnelConfigChanged
|
||||
receiver: 'gitops-drift-pr'
|
||||
continue: true
|
||||
|
||||
- match_re:
|
||||
alertname: '.*(Drift|Mismatch|Changed).*'
|
||||
receiver: 'gitops-drift-pr'
|
||||
continue: true
|
||||
|
||||
# Inhibition rules - suppress lower severity when higher fires
|
||||
inhibit_rules:
|
||||
# If critical fires, suppress warning for same alert
|
||||
- source_match:
|
||||
severity: 'critical'
|
||||
target_match:
|
||||
severity: 'warning'
|
||||
equal: ['alertname', 'component']
|
||||
|
||||
# If warning fires, suppress info for same alert
|
||||
- source_match:
|
||||
severity: 'warning'
|
||||
target_match:
|
||||
severity: 'info'
|
||||
equal: ['alertname', 'component']
|
||||
|
||||
# Suppress all tunnel alerts if Cloudflare API is down
|
||||
- source_match:
|
||||
alertname: 'CloudflareAPIDown'
|
||||
target_match:
|
||||
component: 'tunnel'
|
||||
equal: []
|
||||
|
||||
# Suppress DNS alerts during planned maintenance
|
||||
- source_match:
|
||||
alertname: 'PlannedMaintenance'
|
||||
target_match:
|
||||
component: 'dns'
|
||||
equal: []
|
||||
|
||||
# Receivers definition
|
||||
receivers:
|
||||
# ============================================
|
||||
# SLACK RECEIVERS
|
||||
# ============================================
|
||||
- name: 'slack-default'
|
||||
slack_configs:
|
||||
- channel: '#cloudflare-alerts'
|
||||
send_resolved: true
|
||||
title: '{{ template "slack.cloudflare.title" . }}'
|
||||
text: '{{ template "slack.cloudflare.text" . }}'
|
||||
color: '{{ template "slack.cloudflare.color" . }}'
|
||||
actions:
|
||||
- type: button
|
||||
text: 'Runbook'
|
||||
url: '{{ template "slack.cloudflare.runbook" . }}'
|
||||
- type: button
|
||||
text: 'Grafana'
|
||||
url: 'http://localhost:3000/d/cf-overview'
|
||||
|
||||
- name: 'slack-critical'
|
||||
slack_configs:
|
||||
- channel: '#cloudflare-critical'
|
||||
send_resolved: true
|
||||
title: '{{ template "slack.cloudflare.title" . }}'
|
||||
text: '{{ template "slack.cloudflare.text" . }}'
|
||||
color: 'danger'
|
||||
actions:
|
||||
- type: button
|
||||
text: 'Runbook'
|
||||
url: '{{ template "slack.cloudflare.runbook" . }}'
|
||||
- type: button
|
||||
text: 'Grafana'
|
||||
url: 'http://localhost:3000/d/cf-overview'
|
||||
|
||||
- name: 'slack-warnings'
|
||||
slack_configs:
|
||||
- channel: '#cloudflare-alerts'
|
||||
send_resolved: true
|
||||
title: '{{ template "slack.cloudflare.title" . }}'
|
||||
text: '{{ template "slack.cloudflare.text" . }}'
|
||||
color: 'warning'
|
||||
|
||||
- name: 'slack-tunnels'
|
||||
slack_configs:
|
||||
- channel: '#cloudflare-tunnels'
|
||||
send_resolved: true
|
||||
title: '{{ template "slack.cloudflare.title" . }}'
|
||||
text: '{{ template "slack.cloudflare.text" . }}'
|
||||
color: '{{ template "slack.cloudflare.color" . }}'
|
||||
actions:
|
||||
- type: button
|
||||
text: 'Tunnel Playbook'
|
||||
url: 'https://wiki.internal/playbooks/tunnel-rotation'
|
||||
- type: button
|
||||
text: 'Tunnel Dashboard'
|
||||
url: 'http://localhost:3000/d/cf-tunnel'
|
||||
|
||||
- name: 'slack-dns'
|
||||
slack_configs:
|
||||
- channel: '#cloudflare-dns'
|
||||
send_resolved: true
|
||||
title: '{{ template "slack.cloudflare.title" . }}'
|
||||
text: '{{ template "slack.cloudflare.text" . }}'
|
||||
color: '{{ template "slack.cloudflare.color" . }}'
|
||||
actions:
|
||||
- type: button
|
||||
text: 'DNS Playbook'
|
||||
url: 'https://wiki.internal/playbooks/dns-compromise'
|
||||
- type: button
|
||||
text: 'DNS Dashboard'
|
||||
url: 'http://localhost:3000/d/cf-dns'
|
||||
|
||||
- name: 'slack-waf'
|
||||
slack_configs:
|
||||
- channel: '#cloudflare-waf'
|
||||
send_resolved: true
|
||||
title: '{{ template "slack.cloudflare.title" . }}'
|
||||
text: '{{ template "slack.cloudflare.text" . }}'
|
||||
color: '{{ template "slack.cloudflare.color" . }}'
|
||||
actions:
|
||||
- type: button
|
||||
text: 'WAF Playbook'
|
||||
url: 'https://wiki.internal/playbooks/waf-incident'
|
||||
- type: button
|
||||
text: 'WAF Dashboard'
|
||||
url: 'http://localhost:3000/d/cf-security'
|
||||
|
||||
- name: 'slack-security'
|
||||
slack_configs:
|
||||
- channel: '#cloudflare-security'
|
||||
send_resolved: true
|
||||
title: '{{ template "slack.cloudflare.title" . }}'
|
||||
text: '{{ template "slack.cloudflare.text" . }}'
|
||||
color: '{{ template "slack.cloudflare.color" . }}'
|
||||
actions:
|
||||
- type: button
|
||||
text: 'Invariants Dashboard'
|
||||
url: 'http://localhost:3000/d/cf-invariants'
|
||||
|
||||
- name: 'slack-proofchain'
|
||||
slack_configs:
|
||||
- channel: '#cloudflare-proofchain'
|
||||
send_resolved: true
|
||||
title: '{{ template "slack.cloudflare.title" . }}'
|
||||
text: '{{ template "slack.cloudflare.text" . }}'
|
||||
color: '{{ template "slack.cloudflare.color" . }}'
|
||||
actions:
|
||||
- type: button
|
||||
text: 'Proofchain Dashboard'
|
||||
url: 'http://localhost:3000/d/cf-proofchain'
|
||||
|
||||
# ============================================
|
||||
# PAGERDUTY RECEIVERS
|
||||
# ============================================
|
||||
- name: 'pagerduty-critical'
|
||||
pagerduty_configs:
|
||||
- service_key: '${PAGERDUTY_SERVICE_KEY}'
|
||||
send_resolved: true
|
||||
description: '{{ template "pagerduty.cloudflare.description" . }}'
|
||||
severity: 'critical'
|
||||
client: 'Cloudflare Mesh Observatory'
|
||||
client_url: 'http://localhost:3000'
|
||||
details:
|
||||
alertname: '{{ .GroupLabels.alertname }}'
|
||||
component: '{{ .GroupLabels.component }}'
|
||||
severity: '{{ .GroupLabels.severity }}'
|
||||
summary: '{{ .CommonAnnotations.summary }}'
|
||||
runbook: '{{ .CommonAnnotations.runbook_url }}'
|
||||
|
||||
# ============================================
|
||||
# EMAIL RECEIVERS
|
||||
# ============================================
|
||||
- name: 'email-daily'
|
||||
email_configs:
|
||||
- to: 'cloudflare-team@yourdomain.com'
|
||||
send_resolved: true
|
||||
html: '{{ template "email.cloudflare.html" . }}'
|
||||
headers:
|
||||
Subject: '[Cloudflare] Daily Alert Digest - {{ .Status | toUpper }}'
|
||||
|
||||
# ============================================
|
||||
# WEBHOOK RECEIVERS (for custom integrations)
|
||||
# ============================================
|
||||
- name: 'webhook-remediation'
|
||||
webhook_configs:
|
||||
- url: 'http://autonomous-remediator:8080/webhook/alert'
|
||||
send_resolved: true
|
||||
max_alerts: 10
|
||||
|
||||
# ============================================
|
||||
# PHASE 6 - GITOPS WEBHOOK RECEIVER
|
||||
# ============================================
|
||||
- name: 'gitops-drift-pr'
|
||||
webhook_configs:
|
||||
- url: '${GITOPS_WEBHOOK_URL:-http://gitops-webhook:8080/webhook/alert}'
|
||||
send_resolved: false # Only fire on new alerts, not resolved
|
||||
max_alerts: 5
|
||||
http_config:
|
||||
# Optional: Add bearer token or basic auth
|
||||
# authorization:
|
||||
# type: Bearer
|
||||
# credentials: '${GITOPS_WEBHOOK_TOKEN}'
|
||||
326
observatory/alertmanager/templates/email.tmpl
Normal file
@@ -0,0 +1,326 @@
|
||||
{{/* Email notification templates for Cloudflare Mesh Observatory */}}
|
||||
|
||||
{{/* HTML email template */}}
|
||||
{{ define "email.cloudflare.html" }}
|
||||
<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<style>
|
||||
body {
|
||||
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
|
||||
line-height: 1.6;
|
||||
color: #333;
|
||||
max-width: 800px;
|
||||
margin: 0 auto;
|
||||
padding: 20px;
|
||||
}
|
||||
.header {
|
||||
background: linear-gradient(135deg, #F6821F 0%, #F38020 100%);
|
||||
color: white;
|
||||
padding: 20px;
|
||||
border-radius: 8px 8px 0 0;
|
||||
text-align: center;
|
||||
}
|
||||
.header h1 {
|
||||
margin: 0;
|
||||
font-size: 24px;
|
||||
}
|
||||
.status-badge {
|
||||
display: inline-block;
|
||||
padding: 4px 12px;
|
||||
border-radius: 20px;
|
||||
font-size: 12px;
|
||||
font-weight: bold;
|
||||
text-transform: uppercase;
|
||||
margin-top: 10px;
|
||||
}
|
||||
.status-firing { background: #dc3545; color: white; }
|
||||
.status-resolved { background: #28a745; color: white; }
|
||||
.content {
|
||||
background: #fff;
|
||||
border: 1px solid #e0e0e0;
|
||||
border-top: none;
|
||||
padding: 20px;
|
||||
border-radius: 0 0 8px 8px;
|
||||
}
|
||||
.alert-card {
|
||||
background: #f8f9fa;
|
||||
border-left: 4px solid #F6821F;
|
||||
padding: 15px;
|
||||
margin: 15px 0;
|
||||
border-radius: 0 4px 4px 0;
|
||||
}
|
||||
.alert-card.critical { border-left-color: #dc3545; }
|
||||
.alert-card.warning { border-left-color: #ffc107; }
|
||||
.alert-card.info { border-left-color: #17a2b8; }
|
||||
.alert-card.resolved { border-left-color: #28a745; }
|
||||
.alert-title {
|
||||
font-size: 16px;
|
||||
font-weight: bold;
|
||||
color: #333;
|
||||
margin-bottom: 10px;
|
||||
}
|
||||
.alert-meta {
|
||||
font-size: 12px;
|
||||
color: #666;
|
||||
margin-bottom: 10px;
|
||||
}
|
||||
.alert-meta span {
|
||||
display: inline-block;
|
||||
margin-right: 15px;
|
||||
}
|
||||
.label {
|
||||
display: inline-block;
|
||||
background: #e9ecef;
|
||||
padding: 2px 8px;
|
||||
border-radius: 4px;
|
||||
font-size: 11px;
|
||||
margin: 2px;
|
||||
}
|
||||
.description {
|
||||
margin: 10px 0;
|
||||
padding: 10px;
|
||||
background: white;
|
||||
border-radius: 4px;
|
||||
}
|
||||
.runbook-link {
|
||||
display: inline-block;
|
||||
background: #F6821F;
|
||||
color: white;
|
||||
padding: 8px 16px;
|
||||
border-radius: 4px;
|
||||
text-decoration: none;
|
||||
font-size: 14px;
|
||||
margin-top: 10px;
|
||||
}
|
||||
.runbook-link:hover {
|
||||
background: #e67316;
|
||||
}
|
||||
.summary-table {
|
||||
width: 100%;
|
||||
border-collapse: collapse;
|
||||
margin: 20px 0;
|
||||
}
|
||||
.summary-table th, .summary-table td {
|
||||
padding: 10px;
|
||||
text-align: left;
|
||||
border-bottom: 1px solid #e0e0e0;
|
||||
}
|
||||
.summary-table th {
|
||||
background: #f8f9fa;
|
||||
font-weight: 600;
|
||||
}
|
||||
.footer {
|
||||
text-align: center;
|
||||
font-size: 12px;
|
||||
color: #888;
|
||||
margin-top: 20px;
|
||||
padding-top: 20px;
|
||||
border-top: 1px solid #e0e0e0;
|
||||
}
|
||||
.footer a {
|
||||
color: #F6821F;
|
||||
text-decoration: none;
|
||||
}
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<div class="header">
|
||||
<h1>Cloudflare Mesh Observatory</h1>
|
||||
<span class="status-badge status-{{ .Status }}">{{ .Status }}</span>
|
||||
</div>
|
||||
|
||||
<div class="content">
|
||||
<h2>Alert Summary</h2>
|
||||
|
||||
<table class="summary-table">
|
||||
<tr>
|
||||
<th>Status</th>
|
||||
<td>{{ .Status | toUpper }}</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Alert Name</th>
|
||||
<td>{{ .CommonLabels.alertname }}</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Severity</th>
|
||||
<td>{{ .CommonLabels.severity | toUpper }}</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Component</th>
|
||||
<td>{{ .CommonLabels.component }}</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Firing Alerts</th>
|
||||
<td>{{ .Alerts.Firing | len }}</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Resolved Alerts</th>
|
||||
<td>{{ .Alerts.Resolved | len }}</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<h2>Alert Details</h2>
|
||||
|
||||
{{ range .Alerts }}
|
||||
<div class="alert-card {{ .Labels.severity }}{{ if eq .Status "resolved" }} resolved{{ end }}">
|
||||
<div class="alert-title">
|
||||
{{ .Labels.alertname }}
|
||||
<span class="status-badge status-{{ .Status }}" style="font-size: 10px; padding: 2px 8px;">{{ .Status }}</span>
|
||||
</div>
|
||||
|
||||
<div class="alert-meta">
|
||||
<span><strong>Severity:</strong> {{ .Labels.severity }}</span>
|
||||
<span><strong>Component:</strong> {{ .Labels.component }}</span>
|
||||
<span><strong>Started:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}</span>
|
||||
{{ if eq .Status "resolved" }}
|
||||
<span><strong>Resolved:</strong> {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}</span>
|
||||
{{ end }}
|
||||
</div>
|
||||
|
||||
<div class="description">
|
||||
<strong>Summary:</strong> {{ .Annotations.summary }}<br>
|
||||
<strong>Description:</strong> {{ .Annotations.description }}
|
||||
</div>
|
||||
|
||||
<div style="margin-top: 10px;">
|
||||
<strong>Labels:</strong><br>
|
||||
{{ range .Labels.SortedPairs }}
|
||||
<span class="label">{{ .Name }}: {{ .Value }}</span>
|
||||
{{ end }}
|
||||
</div>
|
||||
|
||||
{{ if .Annotations.runbook_url }}
|
||||
<a href="{{ .Annotations.runbook_url }}" class="runbook-link">View Runbook</a>
|
||||
{{ end }}
|
||||
</div>
|
||||
{{ end }}
|
||||
|
||||
<h2>Quick Links</h2>
|
||||
<ul>
|
||||
<li><a href="http://localhost:3000">Grafana Dashboard</a></li>
|
||||
<li><a href="http://localhost:9090">Prometheus</a></li>
|
||||
<li><a href="https://dash.cloudflare.com">Cloudflare Dashboard</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<div class="footer">
|
||||
<p>
|
||||
This alert was generated by <strong>Cloudflare Mesh Observatory</strong><br>
|
||||
<a href="http://localhost:9093">Alertmanager</a> |
|
||||
<a href="http://localhost:3000">Grafana</a> |
|
||||
<a href="http://localhost:9090">Prometheus</a>
|
||||
</p>
|
||||
<p>
|
||||
Generated at {{ .ExternalURL }}
|
||||
</p>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
{{ end }}
|
||||
|
||||
{{/* Plain text email template */}}
|
||||
{{ define "email.cloudflare.text" }}
|
||||
================================================================================
|
||||
CLOUDFLARE MESH OBSERVATORY - ALERT {{ .Status | toUpper }}
|
||||
================================================================================
|
||||
|
||||
Status: {{ .Status | toUpper }}
|
||||
Alert: {{ .CommonLabels.alertname }}
|
||||
Severity: {{ .CommonLabels.severity | toUpper }}
|
||||
Component: {{ .CommonLabels.component }}
|
||||
|
||||
Firing: {{ .Alerts.Firing | len }} alerts
|
||||
Resolved: {{ .Alerts.Resolved | len }} alerts
|
||||
|
||||
================================================================================
|
||||
ALERT DETAILS
|
||||
================================================================================
|
||||
|
||||
{{ range .Alerts }}
|
||||
--------------------------------------------------------------------------------
|
||||
{{ .Labels.alertname }} [{{ .Status | toUpper }}]
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Severity: {{ .Labels.severity }}
|
||||
Component: {{ .Labels.component }}
|
||||
Started: {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
|
||||
{{ if eq .Status "resolved" }}Resolved: {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}{{ end }}
|
||||
|
||||
Summary: {{ .Annotations.summary }}
|
||||
|
||||
Description: {{ .Annotations.description }}
|
||||
|
||||
Labels:
|
||||
{{ range .Labels.SortedPairs }} - {{ .Name }}: {{ .Value }}
|
||||
{{ end }}
|
||||
|
||||
{{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }}
|
||||
|
||||
{{ end }}
|
||||
|
||||
================================================================================
|
||||
QUICK LINKS
|
||||
================================================================================
|
||||
|
||||
Grafana: http://localhost:3000
|
||||
Prometheus: http://localhost:9090
|
||||
Alertmanager: http://localhost:9093
|
||||
Cloudflare: https://dash.cloudflare.com
|
||||
|
||||
================================================================================
|
||||
Generated by Cloudflare Mesh Observatory
|
||||
{{ end }}
|
||||
|
||||
{{/* Daily digest email template */}}
|
||||
{{ define "email.cloudflare.digest" }}
|
||||
<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<style>
|
||||
/* Same styles as above */
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<div class="header">
|
||||
<h1>Daily Alert Digest</h1>
|
||||
<p>{{ now.Format "Monday, January 2, 2006" }}</p>
|
||||
</div>
|
||||
|
||||
<div class="content">
|
||||
<h2>24-Hour Summary</h2>
|
||||
|
||||
<table class="summary-table">
|
||||
<tr>
|
||||
<th>Metric</th>
|
||||
<th>Count</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Total Alerts</td>
|
||||
<td>{{ len .Alerts }}</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Currently Firing</td>
|
||||
<td>{{ .Alerts.Firing | len }}</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Resolved</td>
|
||||
<td>{{ .Alerts.Resolved | len }}</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<h2>Alerts by Severity</h2>
|
||||
<!-- Alert breakdown would go here -->
|
||||
|
||||
<h2>Alerts by Component</h2>
|
||||
<!-- Component breakdown would go here -->
|
||||
</div>
|
||||
|
||||
<div class="footer">
|
||||
<p>This is an automated daily digest from Cloudflare Mesh Observatory</p>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
{{ end }}
|
||||
169
observatory/alertmanager/templates/pagerduty.tmpl
Normal file
@@ -0,0 +1,169 @@
|
||||
{{/* PagerDuty notification templates for Cloudflare Mesh Observatory */}}
|
||||
|
||||
{{/* Main description template */}}
|
||||
{{ define "pagerduty.cloudflare.description" -}}
|
||||
[{{ .CommonLabels.severity | toUpper }}] {{ .CommonLabels.alertname }} - {{ .CommonAnnotations.summary }}
|
||||
{{- end }}
|
||||
|
||||
{{/* Detailed incident description */}}
|
||||
{{ define "pagerduty.cloudflare.details" -}}
|
||||
{{ range .Alerts }}
|
||||
Alert: {{ .Labels.alertname }}
|
||||
Severity: {{ .Labels.severity }}
|
||||
Component: {{ .Labels.component }}
|
||||
|
||||
Summary: {{ .Annotations.summary }}
|
||||
|
||||
Description: {{ .Annotations.description }}
|
||||
|
||||
Labels:
|
||||
{{ range .Labels.SortedPairs -}}
|
||||
{{ .Name }}: {{ .Value }}
|
||||
{{ end }}
|
||||
|
||||
Started: {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
|
||||
{{ if eq .Status "resolved" }}Resolved: {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}{{ end }}
|
||||
|
||||
Runbook: {{ if .Annotations.runbook_url }}{{ .Annotations.runbook_url }}{{ else }}https://wiki.internal/playbooks/cloudflare{{ end }}
|
||||
|
||||
---
|
||||
{{ end }}
|
||||
{{- end }}

{{/* Critical tunnel incident */}}
{{ define "pagerduty.cloudflare.tunnel.critical" -}}
CRITICAL TUNNEL FAILURE

Tunnel: {{ .CommonLabels.tunnel_name }} ({{ .CommonLabels.tunnel_id }})
Zone: {{ .CommonLabels.zone }}

All tunnel connections have failed. Services behind this tunnel are UNREACHABLE.

Immediate Actions Required:
1. Check cloudflared daemon status on origin server
2. Verify network path to Cloudflare edge
3. Review recent configuration changes
4. Consider emergency tunnel rotation

Impact: {{ .CommonAnnotations.impact }}
ETA to degradation: IMMEDIATE

Escalation Chain:
1. On-call Infrastructure Engineer
2. Platform Team Lead
3. Security Team (if compromise suspected)
{{- end }}

{{/* Critical DNS incident */}}
{{ define "pagerduty.cloudflare.dns.critical" -}}
CRITICAL DNS INCIDENT

Type: {{ .CommonLabels.alertname }}
Zone: {{ .CommonLabels.zone }}
Record: {{ .CommonLabels.record_name }}

{{ if eq .CommonLabels.alertname "DNSHijackDetected" -}}
POTENTIAL DNS HIJACK DETECTED

This is a SECURITY INCIDENT. DNS records do not match expected configuration.

Immediate Actions:
1. Verify DNS resolution from multiple locations
2. Check Cloudflare dashboard for unauthorized changes
3. Review audit logs for suspicious activity
4. Engage security incident response

DO NOT dismiss without verification.
{{- else -}}
DNS configuration drift detected. Records have changed from expected baseline.

Actions:
1. Compare current vs expected records
2. Determine if change was authorized
3. Restore from known-good state if needed
{{- end }}
{{- end }}

{{/* Critical WAF incident */}}
{{ define "pagerduty.cloudflare.waf.critical" -}}
CRITICAL WAF INCIDENT

Attack Type: {{ .CommonLabels.attack_type }}
Source: {{ .CommonLabels.source_ip }}
Request Volume: {{ .CommonLabels.request_count }} requests

{{ if eq .CommonLabels.alertname "WAFMassiveAttack" -}}
MASSIVE ATTACK IN PROGRESS

Request volume significantly exceeds baseline. This may indicate:
- DDoS attack
- Credential stuffing
- Application-layer attack

Immediate Actions:
1. Review attack traffic patterns
2. Consider enabling Under Attack Mode
3. Tighten rate limiting thresholds
4. Block attacking IPs if identified

Current Mitigation: {{ .CommonAnnotations.current_mitigation }}
{{- else -}}
WAF rule bypass detected. Malicious traffic may be reaching origin.

Actions:
1. Analyze bypassed requests
2. Tighten rule specificity
3. Add supplementary blocking rules
{{- end }}
{{- end }}

{{/* Critical invariant violation */}}
{{ define "pagerduty.cloudflare.invariant.critical" -}}
SECURITY INVARIANT VIOLATION

Invariant: {{ .CommonLabels.invariant_name }}
Category: {{ .CommonLabels.category }}

A critical security invariant has been violated. This indicates:
- Unauthorized configuration change
- Potential security misconfiguration
- Compliance violation

Violation Details:
- Expected: {{ .CommonLabels.expected_value }}
- Actual: {{ .CommonLabels.actual_value }}
- Impact: {{ .CommonAnnotations.impact }}

Affected Frameworks: {{ .CommonLabels.frameworks }}

This violation requires immediate investigation and remediation.
{{- end }}

{{/* Critical proofchain incident */}}
{{ define "pagerduty.cloudflare.proofchain.critical" -}}
PROOFCHAIN INTEGRITY FAILURE

Chain: {{ .CommonLabels.chain_name }}
Receipt Type: {{ .CommonLabels.receipt_type }}

CRITICAL: Proofchain integrity verification has FAILED.

This indicates one of:
1. Ledger tampering
2. Receipt corruption
3. Chain fork
4. Hash collision (extremely unlikely)

Integrity Details:
- Last Valid Hash: {{ .CommonLabels.last_valid_hash }}
- Expected Hash: {{ .CommonLabels.expected_hash }}
- Computed Hash: {{ .CommonLabels.computed_hash }}

IMMEDIATE ACTIONS:
1. HALT all new receipt generation
2. Preserve current state for forensics
3. Identify last known-good checkpoint
4. Engage proofchain administrator

This is a potential SECURITY INCIDENT if tampering is suspected.
{{- end }}
|
||||
200
observatory/alertmanager/templates/slack.tmpl
Normal file
@@ -0,0 +1,200 @@
{{/* Slack notification templates for Cloudflare Mesh Observatory */}}

{{/* Title template */}}
{{ define "slack.cloudflare.title" -}}
{{ if eq .Status "firing" }}{{ .Alerts.Firing | len }} FIRING{{ end }}{{ if and (eq .Status "resolved") (gt (.Alerts.Resolved | len) 0) }}{{ .Alerts.Resolved | len }} RESOLVED{{ end }} | {{ .CommonLabels.alertname }}
{{- end }}

{{/* Color template based on severity */}}
{{ define "slack.cloudflare.color" -}}
{{ if eq .Status "resolved" }}good{{ else if eq .CommonLabels.severity "critical" }}danger{{ else if eq .CommonLabels.severity "warning" }}warning{{ else }}#439FE0{{ end }}
{{- end }}

{{/* Main text body */}}
{{ define "slack.cloudflare.text" -}}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity | toUpper }}
*Component:* {{ .Labels.component }}
*Status:* {{ .Status | toUpper }}

*Summary:* {{ .Annotations.summary }}

*Description:* {{ .Annotations.description }}

{{ if .Annotations.runbook_url }}*Runbook:* <{{ .Annotations.runbook_url }}|View Playbook>{{ end }}

*Labels:*
{{ range .Labels.SortedPairs -}}
- {{ .Name }}: `{{ .Value }}`
{{ end }}

*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ if eq .Status "resolved" }}*Resolved:* {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}{{ end }}

---
{{ end }}
{{- end }}

{{/* Runbook URL template */}}
{{ define "slack.cloudflare.runbook" -}}
{{ if .CommonAnnotations.runbook_url }}{{ .CommonAnnotations.runbook_url }}{{ else }}https://wiki.internal/playbooks/cloudflare{{ end }}
{{- end }}

{{/* Compact alert list for summary */}}
{{ define "slack.cloudflare.alertlist" -}}
{{ range . }}
- {{ .Labels.alertname }} ({{ .Labels.severity }})
{{ end }}
{{- end }}

{{/* Tunnel-specific template */}}
{{ define "slack.cloudflare.tunnel" -}}
{{ range .Alerts }}
*Tunnel Alert*

*Tunnel ID:* {{ .Labels.tunnel_id }}
*Tunnel Name:* {{ .Labels.tunnel_name }}
*Status:* {{ .Status | toUpper }}

{{ .Annotations.description }}

*Action Required:*
{{ if eq .Labels.alertname "TunnelDown" }}
1. Check cloudflared service status
2. Verify network connectivity
3. Run tunnel rotation if unrecoverable
{{ else if eq .Labels.alertname "TunnelRotationDue" }}
1. Schedule maintenance window
2. Execute tunnel rotation protocol
3. Verify new tunnel connectivity
{{ end }}

---
{{ end }}
{{- end }}

{{/* DNS-specific template */}}
{{ define "slack.cloudflare.dns" -}}
{{ range .Alerts }}
*DNS Alert*

*Record:* {{ .Labels.record_name }}
*Type:* {{ .Labels.record_type }}
*Zone:* {{ .Labels.zone }}
*Status:* {{ .Status | toUpper }}

{{ .Annotations.description }}

*Immediate Actions:*
{{ if eq .Labels.alertname "DNSHijackDetected" }}
1. CRITICAL: Potential DNS hijack detected
2. Immediately verify DNS resolution
3. Check Cloudflare audit logs
4. Engage incident response team
{{ else if eq .Labels.alertname "DNSDriftDetected" }}
1. Compare current vs expected records
2. Check for unauthorized changes
3. Run state reconciler if needed
{{ end }}

---
{{ end }}
{{- end }}

{{/* WAF-specific template */}}
{{ define "slack.cloudflare.waf" -}}
{{ range .Alerts }}
*WAF Alert*

*Rule ID:* {{ .Labels.rule_id }}
*Action:* {{ .Labels.action }}
*Source:* {{ .Labels.source_ip }}
*Status:* {{ .Status | toUpper }}

{{ .Annotations.description }}

*Threat Intelligence:*
- Request Count: {{ .Labels.request_count }}
- Block Rate: {{ .Labels.block_rate }}%
- Attack Type: {{ .Labels.attack_type }}

*Recommended Actions:*
{{ if eq .Labels.alertname "WAFMassiveAttack" }}
1. Verify attack is not false positive
2. Consider enabling Under Attack Mode
3. Review and adjust rate limiting
4. Document attack patterns
{{ else if eq .Labels.alertname "WAFRuleBypass" }}
1. Analyze bypassed requests
2. Tighten rule specificity
3. Add supplementary rules
{{ end }}

---
{{ end }}
{{- end }}

{{/* Security/Invariant template */}}
{{ define "slack.cloudflare.security" -}}
{{ range .Alerts }}
*Security Invariant Violation*

*Invariant:* {{ .Labels.invariant_name }}
*Category:* {{ .Labels.category }}
*Status:* {{ .Status | toUpper }}

{{ .Annotations.description }}

*Violation Details:*
- Expected: {{ .Labels.expected_value }}
- Actual: {{ .Labels.actual_value }}
- First Seen: {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}

*Compliance Impact:*
This violation may affect:
- {{ reReplaceAll "," "\n- " .Labels.frameworks }}

*Remediation Steps:*
1. Review invariant definition
2. Check for authorized changes
3. Run autonomous remediator or manual fix
4. Document change justification

---
{{ end }}
{{- end }}

{{/* Proofchain template */}}
{{ define "slack.cloudflare.proofchain" -}}
{{ range .Alerts }}
*Proofchain Alert*

*Chain:* {{ .Labels.chain_name }}
*Receipt Type:* {{ .Labels.receipt_type }}
*Status:* {{ .Status | toUpper }}

{{ .Annotations.description }}

*Integrity Details:*
- Last Valid Hash: {{ .Labels.last_valid_hash }}
- Expected Hash: {{ .Labels.expected_hash }}
- Computed Hash: {{ .Labels.computed_hash }}

*This indicates potential:*
- Ledger tampering
- Receipt corruption
- Chain fork
- Missing anchors

*Immediate Actions:*
1. DO NOT write new receipts until resolved
2. Identify last known-good state
3. Investigate discrepancy source
4. Contact proofchain administrator

---
{{ end }}
{{- end }}
|
||||
415
observatory/dashboards/cloudflare-overview.json
Normal file
@@ -0,0 +1,415 @@
{
  "annotations": {
    "list": []
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "red", "value": 1}
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
      "id": 1,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "10.2.2",
      "targets": [
        {
          "expr": "cloudflare_invariants_failed",
          "refId": "A"
        }
      ],
      "title": "Invariant Failures",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null}
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "targets": [
        {
          "expr": "cloudflare_dns_records_total",
          "refId": "A"
        }
      ],
      "title": "DNS Records",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "red", "value": null},
              {"color": "green", "value": 1}
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
      "id": 3,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "targets": [
        {
          "expr": "cloudflare_tunnels_healthy",
          "refId": "A"
        }
      ],
      "title": "Healthy Tunnels",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 3600},
              {"color": "red", "value": 7200}
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "targets": [
        {
          "expr": "cloudflare_snapshot_age_seconds",
          "refId": "A"
        }
      ],
      "title": "Snapshot Age",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [
            {"options": {"0": {"color": "red", "index": 0, "text": "OFF"}}, "type": "value"},
            {"options": {"1": {"color": "green", "index": 1, "text": "ON"}}, "type": "value"}
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [{"color": "green", "value": null}]
          }
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
      "id": 5,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "targets": [
        {
          "expr": "cloudflare_dnssec_enabled",
          "refId": "A"
        }
      ],
      "title": "DNSSEC",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 1},
              {"color": "red", "value": 5}
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
      "id": 6,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "targets": [
        {
          "expr": "cloudflare_anomalies_last_24h",
          "refId": "A"
        }
      ],
      "title": "Anomalies (24h)",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {"type": "linear"},
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {"group": "A", "mode": "none"},
            "thresholdsStyle": {"mode": "off"}
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [{"color": "green", "value": null}]
          }
        },
        "overrides": []
      },
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
      "id": 7,
      "options": {
        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
        "tooltip": {"mode": "single", "sort": "none"}
      },
      "targets": [
        {
          "expr": "cloudflare_invariants_passed",
          "legendFormat": "Passed",
          "refId": "A"
        },
        {
          "expr": "cloudflare_invariants_failed",
          "legendFormat": "Failed",
          "refId": "B"
        }
      ],
      "title": "Invariant Status Over Time",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {"type": "linear"},
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {"group": "A", "mode": "none"},
            "thresholdsStyle": {"mode": "off"}
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [{"color": "green", "value": null}]
          }
        },
        "overrides": []
      },
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
      "id": 8,
      "options": {
        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
        "tooltip": {"mode": "single", "sort": "none"}
      },
      "targets": [
        {
          "expr": "cloudflare_tunnels_healthy",
          "legendFormat": "Healthy",
          "refId": "A"
        },
        {
          "expr": "cloudflare_tunnels_unhealthy",
          "legendFormat": "Unhealthy",
          "refId": "B"
        }
      ],
      "title": "Tunnel Health Over Time",
      "type": "timeseries"
    }
  ],
  "refresh": "1m",
  "schemaVersion": 38,
  "style": "dark",
  "tags": ["cloudflare", "mesh", "overview"],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-24h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "utc",
  "title": "Cloudflare Mesh Overview",
  "uid": "cf-overview",
  "version": 1,
  "weekStart": ""
}
|
||||
14
observatory/dashboards/dashboards.yml
Normal file
@@ -0,0 +1,14 @@
# Grafana Dashboard Provisioning
apiVersion: 1

providers:
  - name: 'Cloudflare Mesh'
    orgId: 1
    folder: 'Cloudflare'
    folderUid: 'cloudflare'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards
|
||||
195
observatory/dashboards/dns-health.json
Normal file
@@ -0,0 +1,195 @@
{
  "annotations": {"list": []},
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
      "id": 1,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_dns_records_total", "refId": "A"}],
      "title": "Total Records",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "orange", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_dns_records_proxied", "refId": "A"}],
      "title": "Proxied Records",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "blue", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
      "id": 3,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_dns_records_unproxied", "refId": "A"}],
      "title": "DNS-Only Records",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [
            {"options": {"0": {"color": "red", "index": 0, "text": "DISABLED"}}, "type": "value"},
            {"options": {"1": {"color": "green", "index": 1, "text": "ACTIVE"}}, "type": "value"}
          ],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_dnssec_enabled", "refId": "A"}],
      "title": "DNSSEC Status",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "palette-classic"},
          "custom": {"hideFrom": {"legend": false, "tooltip": false, "viz": false}},
          "mappings": []
        },
        "overrides": []
      },
      "gridPos": {"h": 10, "w": 12, "x": 0, "y": 4},
      "id": 5,
      "options": {
        "displayLabels": ["name", "value"],
        "legend": {"displayMode": "list", "placement": "right", "showLegend": true},
        "pieType": "pie",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "tooltip": {"mode": "single", "sort": "none"}
      },
      "targets": [
        {"expr": "cloudflare_dns_records_by_type{type=\"A\"}", "legendFormat": "A", "refId": "A"},
        {"expr": "cloudflare_dns_records_by_type{type=\"AAAA\"}", "legendFormat": "AAAA", "refId": "B"},
        {"expr": "cloudflare_dns_records_by_type{type=\"CNAME\"}", "legendFormat": "CNAME", "refId": "C"},
        {"expr": "cloudflare_dns_records_by_type{type=\"TXT\"}", "legendFormat": "TXT", "refId": "D"},
        {"expr": "cloudflare_dns_records_by_type{type=\"MX\"}", "legendFormat": "MX", "refId": "E"},
        {"expr": "cloudflare_dns_records_by_type{type=\"SRV\"}", "legendFormat": "SRV", "refId": "F"}
      ],
      "title": "Records by Type",
      "type": "piechart"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "palette-classic"},
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {"type": "linear"},
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {"group": "A", "mode": "none"},
            "thresholdsStyle": {"mode": "off"}
          },
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 10, "w": 12, "x": 12, "y": 4},
      "id": 6,
      "options": {
        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
        "tooltip": {"mode": "single", "sort": "none"}
      },
      "targets": [
        {"expr": "cloudflare_dns_records_total", "legendFormat": "Total", "refId": "A"},
        {"expr": "cloudflare_dns_records_proxied", "legendFormat": "Proxied", "refId": "B"}
      ],
      "title": "DNS Records Over Time",
      "type": "timeseries"
    }
  ],
  "refresh": "1m",
  "schemaVersion": 38,
  "style": "dark",
  "tags": ["cloudflare", "dns"],
  "templating": {"list": []},
  "time": {"from": "now-24h", "to": "now"},
  "timepicker": {},
  "timezone": "utc",
  "title": "DNS Health",
  "uid": "cf-dns",
  "version": 1,
  "weekStart": ""
}
|
||||
238
observatory/dashboards/invariants.json
Normal file
@@ -0,0 +1,238 @@
{
  "annotations": {"list": []},
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "blue", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
      "id": 1,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_invariants_total", "refId": "A"}],
      "title": "Total Invariants",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_invariants_passed", "refId": "A"}],
      "title": "Passed",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [
            {"color": "green", "value": null},
            {"color": "red", "value": 1}
          ]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
      "id": 3,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_invariants_failed", "refId": "A"}],
      "title": "Failed",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "max": 100,
          "min": 0,
          "thresholds": {"mode": "absolute", "steps": [
            {"color": "red", "value": null},
            {"color": "yellow", "value": 80},
            {"color": "green", "value": 95}
          ]},
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_invariants_pass_rate", "refId": "A"}],
      "title": "Pass Rate",
      "type": "stat"
    },
|
||||
{
|
||||
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {"mode": "palette-classic"},
|
||||
"custom": {
|
||||
"axisCenteredZero": false,
|
||||
"axisColorMode": "text",
|
||||
"axisLabel": "",
|
||||
"axisPlacement": "auto",
|
||||
"barAlignment": 0,
|
||||
"drawStyle": "line",
|
||||
"fillOpacity": 20,
|
||||
"gradientMode": "none",
|
||||
"hideFrom": {"legend": false, "tooltip": false, "viz": false},
|
||||
"insertNulls": false,
|
||||
"lineInterpolation": "stepAfter",
|
||||
"lineWidth": 2,
|
||||
"pointSize": 5,
|
||||
"scaleDistribution": {"type": "linear"},
|
||||
"showPoints": "never",
|
||||
"spanNulls": false,
|
||||
"stacking": {"group": "A", "mode": "none"},
|
||||
"thresholdsStyle": {"mode": "off"}
|
||||
},
|
||||
"mappings": [],
|
||||
"thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
|
||||
},
|
||||
"overrides": [
|
||||
{
|
||||
"matcher": {"id": "byName", "options": "Failed"},
|
||||
"properties": [{"id": "color", "value": {"fixedColor": "red", "mode": "fixed"}}]
|
||||
},
|
||||
{
|
||||
"matcher": {"id": "byName", "options": "Passed"},
|
||||
"properties": [{"id": "color", "value": {"fixedColor": "green", "mode": "fixed"}}]
|
||||
}
|
||||
]
|
||||
},
|
||||
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
|
||||
"id": 5,
|
||||
"options": {
|
||||
"legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
|
||||
"tooltip": {"mode": "single", "sort": "none"}
|
||||
},
|
||||
"targets": [
|
||||
{"expr": "cloudflare_invariants_passed", "legendFormat": "Passed", "refId": "A"},
|
||||
{"expr": "cloudflare_invariants_failed", "legendFormat": "Failed", "refId": "B"}
|
||||
],
|
||||
"title": "Invariant Status Over Time",
|
||||
"type": "timeseries"
|
||||
},
|
||||
{
|
||||
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {"mode": "thresholds"},
|
||||
"mappings": [],
|
||||
"thresholds": {"mode": "absolute", "steps": [
|
||||
{"color": "green", "value": null},
|
||||
{"color": "yellow", "value": 3600},
|
||||
{"color": "red", "value": 7200}
|
||||
]},
|
||||
"unit": "s"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"gridPos": {"h": 6, "w": 12, "x": 0, "y": 14},
|
||||
"id": 6,
|
||||
"options": {
|
||||
"colorMode": "value",
|
||||
"graphMode": "area",
|
||||
"justifyMode": "auto",
|
||||
"orientation": "auto",
|
||||
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
|
||||
"textMode": "auto"
|
||||
},
|
||||
"targets": [{"expr": "cloudflare_invariant_report_age_seconds", "refId": "A"}],
|
||||
"title": "Report Age",
|
||||
"type": "stat"
|
||||
},
|
||||
{
|
||||
"datasource": {"type": "prometheus", "uid": "prometheus"},
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {"mode": "thresholds"},
|
||||
"mappings": [],
|
||||
"thresholds": {"mode": "absolute", "steps": [
|
||||
{"color": "green", "value": null},
|
||||
{"color": "yellow", "value": 1},
|
||||
{"color": "red", "value": 5}
|
||||
]}
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 14},
|
||||
"id": 7,
|
||||
"options": {
|
||||
"colorMode": "value",
|
||||
"graphMode": "area",
|
||||
"justifyMode": "auto",
|
||||
"orientation": "auto",
|
||||
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
|
||||
"textMode": "auto"
|
||||
},
|
||||
"targets": [{"expr": "cloudflare_anomalies_last_24h", "refId": "A"}],
|
||||
"title": "Anomalies (Last 24h)",
|
||||
"type": "stat"
|
||||
}
|
||||
],
|
||||
"refresh": "1m",
|
||||
"schemaVersion": 38,
|
||||
"style": "dark",
|
||||
"tags": ["cloudflare", "invariants", "compliance"],
|
||||
"templating": {"list": []},
|
||||
"time": {"from": "now-7d", "to": "now"},
|
||||
"timepicker": {},
|
||||
"timezone": "utc",
|
||||
"title": "Invariants & Compliance",
|
||||
"uid": "cf-invariants",
|
||||
"version": 1,
|
||||
"weekStart": ""
|
||||
}
|
||||
217
observatory/dashboards/proofchain.json
Normal file
@@ -0,0 +1,217 @@
{
  "annotations": {"list": []},
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [
            {"options": {"0": {"color": "red", "index": 0, "text": "MISSING"}}, "type": "value"},
            {"options": {"1": {"color": "green", "index": 1, "text": "SET"}}, "type": "value"}
          ],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
      "id": 1,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_snapshot_merkle_root_set", "refId": "A"}],
      "title": "Merkle Root",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [
            {"color": "green", "value": null},
            {"color": "yellow", "value": 3600},
            {"color": "red", "value": 7200}
          ]},
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_snapshot_age_seconds", "refId": "A"}],
      "title": "Snapshot Age",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "blue", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
      "id": 3,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_anomalies_total", "refId": "A"}],
      "title": "Total Anomalies",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [
            {"color": "green", "value": null},
            {"color": "yellow", "value": 1},
            {"color": "red", "value": 5}
          ]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_anomalies_last_24h", "refId": "A"}],
      "title": "Anomalies (24h)",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "palette-classic"},
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 2,
            "pointSize": 5,
            "scaleDistribution": {"type": "linear"},
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {"group": "A", "mode": "none"},
            "thresholdsStyle": {"mode": "off"}
          },
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]},
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 4},
      "id": 5,
      "options": {
        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
        "tooltip": {"mode": "single", "sort": "none"}
      },
      "targets": [
        {"expr": "cloudflare_snapshot_age_seconds", "legendFormat": "Snapshot Age", "refId": "A"},
        {"expr": "cloudflare_invariant_report_age_seconds", "legendFormat": "Report Age", "refId": "B"}
      ],
      "title": "Data Freshness",
      "type": "timeseries"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "palette-classic"},
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "bars",
            "fillOpacity": 80,
            "gradientMode": "none",
            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {"type": "linear"},
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {"group": "A", "mode": "none"},
            "thresholdsStyle": {"mode": "off"}
          },
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 12},
      "id": 6,
      "options": {
        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
        "tooltip": {"mode": "single", "sort": "none"}
      },
      "targets": [
        {"expr": "cloudflare_anomalies_last_24h", "legendFormat": "Anomalies", "refId": "A"}
      ],
      "title": "Anomaly Timeline",
      "type": "timeseries"
    }
  ],
  "refresh": "1m",
  "schemaVersion": 38,
  "style": "dark",
  "tags": ["cloudflare", "proofchain", "vaultmesh"],
  "templating": {"list": []},
  "time": {"from": "now-7d", "to": "now"},
  "timepicker": {},
  "timezone": "utc",
  "title": "ProofChain & Anchors",
  "uid": "cf-proofchain",
  "version": 1,
  "weekStart": ""
}
245
observatory/dashboards/security-settings.json
Normal file
@@ -0,0 +1,245 @@
{
  "annotations": {"list": []},
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [
            {"options": {"0": {"color": "red", "index": 0, "text": "OFF"}}, "type": "value"},
            {"options": {"1": {"color": "green", "index": 1, "text": "ON"}}, "type": "value"}
          ],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
      "id": 1,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_zone_ssl_strict", "refId": "A"}],
      "title": "SSL Strict",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [
            {"options": {"0": {"color": "red", "index": 0, "text": "WEAK"}}, "type": "value"},
            {"options": {"1": {"color": "green", "index": 1, "text": "SECURE"}}, "type": "value"}
          ],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_zone_tls_version_secure", "refId": "A"}],
      "title": "TLS Version",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [
            {"options": {"0": {"color": "red", "index": 0, "text": "OFF"}}, "type": "value"},
            {"options": {"1": {"color": "green", "index": 1, "text": "ON"}}, "type": "value"}
          ],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 8, "y": 0},
      "id": 3,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_zone_always_https", "refId": "A"}],
      "title": "Always HTTPS",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [
            {"options": {"0": {"color": "red", "index": 0, "text": "OFF"}}, "type": "value"},
            {"options": {"1": {"color": "green", "index": 1, "text": "ON"}}, "type": "value"}
          ],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 12, "y": 0},
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_zone_browser_check", "refId": "A"}],
      "title": "Browser Check",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [
            {"options": {"0": {"color": "red", "index": 0, "text": "DISABLED"}}, "type": "value"},
            {"options": {"1": {"color": "green", "index": 1, "text": "ACTIVE"}}, "type": "value"}
          ],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 16, "y": 0},
      "id": 5,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_dnssec_enabled", "refId": "A"}],
      "title": "DNSSEC",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "blue", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 4, "x": 20, "y": 0},
      "id": 6,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_access_apps_total", "refId": "A"}],
      "title": "Access Apps",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "description": "Security posture score based on enabled security features",
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "max": 6,
          "min": 0,
          "thresholds": {"mode": "absolute", "steps": [
            {"color": "red", "value": null},
            {"color": "yellow", "value": 3},
            {"color": "green", "value": 5}
          ]}
        },
        "overrides": []
      },
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
      "id": 7,
      "options": {
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "targets": [
        {
          "expr": "cloudflare_zone_ssl_strict + cloudflare_zone_tls_version_secure + cloudflare_zone_always_https + cloudflare_zone_browser_check + cloudflare_dnssec_enabled + (cloudflare_tunnels_healthy > bool 0)",
          "refId": "A"
        }
      ],
      "title": "Security Score",
      "type": "gauge"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "palette-classic"},
          "custom": {"hideFrom": {"legend": false, "tooltip": false, "viz": false}},
          "mappings": []
        },
        "overrides": []
      },
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
      "id": 8,
      "options": {
        "displayLabels": ["name", "value"],
        "legend": {"displayMode": "list", "placement": "right", "showLegend": true},
        "pieType": "pie",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "tooltip": {"mode": "single", "sort": "none"}
      },
      "targets": [
        {"expr": "cloudflare_access_apps_by_type{type=\"self_hosted\"}", "legendFormat": "Self-Hosted", "refId": "A"},
        {"expr": "cloudflare_access_apps_by_type{type=\"saas\"}", "legendFormat": "SaaS", "refId": "B"},
        {"expr": "cloudflare_access_apps_by_type{type=\"ssh\"}", "legendFormat": "SSH", "refId": "C"},
        {"expr": "cloudflare_access_apps_by_type{type=\"vnc\"}", "legendFormat": "VNC", "refId": "D"},
        {"expr": "cloudflare_access_apps_by_type{type=\"bookmark\"}", "legendFormat": "Bookmark", "refId": "E"}
      ],
      "title": "Access Apps by Type",
      "type": "piechart"
    }
  ],
  "refresh": "1m",
  "schemaVersion": 38,
  "style": "dark",
  "tags": ["cloudflare", "security", "access"],
  "templating": {"list": []},
  "time": {"from": "now-24h", "to": "now"},
  "timepicker": {},
  "timezone": "utc",
  "title": "Security Settings",
  "uid": "cf-security",
  "version": 1,
  "weekStart": ""
}
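The Security Score gauge in security-settings.json adds five on/off feature gauges and a tunnel term on a 0 to 6 scale, turning yellow at 3 and green at 5. The arithmetic can be sketched in Python as below. This is an illustration only: `security_score` and `score_color` are hypothetical names that do not exist in this repository, and the sketch assumes each zone metric is exported as 0 or 1 and that healthy tunnels are meant to contribute at most one point (the gauge maximum is 6).

```python
def security_score(ssl_strict, tls_secure, always_https,
                   browser_check, dnssec, tunnels_healthy):
    """Return a 0-6 score: five 0/1 feature gauges plus one tunnel point."""
    feature_flags = [ssl_strict, tls_secure, always_https, browser_check, dnssec]
    # Count the tunnel term as 1 when any tunnel is healthy, so several
    # healthy tunnels cannot push the score past the gauge maximum.
    tunnel_point = 1 if tunnels_healthy > 0 else 0
    return sum(feature_flags) + tunnel_point


def score_color(score):
    """Map a score to the gauge threshold colors (yellow from 3, green from 5)."""
    if score >= 5:
        return "green"
    if score >= 3:
        return "yellow"
    return "red"
```

With all five features on and three healthy tunnels the score stays at 6 rather than 8, which is the behaviour the gauge's `max` implies.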
204
observatory/dashboards/tunnel-status.json
Normal file
@@ -0,0 +1,204 @@
{
  "annotations": {"list": []},
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
      "id": 1,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_tunnels_total", "refId": "A"}],
      "title": "Total Tunnels",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [
            {"color": "red", "value": null},
            {"color": "green", "value": 1}
          ]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_tunnels_healthy", "refId": "A"}],
      "title": "Healthy Tunnels",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [
            {"color": "green", "value": null},
            {"color": "red", "value": 1}
          ]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
      "id": 3,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_tunnels_unhealthy", "refId": "A"}],
      "title": "Unhealthy Tunnels",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "blue", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "textMode": "auto"
      },
      "targets": [{"expr": "cloudflare_tunnel_connections_total", "refId": "A"}],
      "title": "Total Connections",
      "type": "stat"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "palette-classic"},
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": {"legend": false, "tooltip": false, "viz": false},
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 2,
            "pointSize": 5,
            "scaleDistribution": {"type": "linear"},
            "showPoints": "never",
            "spanNulls": false,
            "stacking": {"group": "A", "mode": "none"},
            "thresholdsStyle": {"mode": "off"}
          },
          "mappings": [],
          "thresholds": {"mode": "absolute", "steps": [{"color": "green", "value": null}]}
        },
        "overrides": []
      },
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 4},
      "id": 5,
      "options": {
        "legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
        "tooltip": {"mode": "single", "sort": "none"}
      },
      "targets": [
        {"expr": "cloudflare_tunnels_healthy", "legendFormat": "Healthy", "refId": "A"},
        {"expr": "cloudflare_tunnels_unhealthy", "legendFormat": "Unhealthy", "refId": "B"},
        {"expr": "cloudflare_tunnel_connections_total", "legendFormat": "Connections", "refId": "C"}
      ],
      "title": "Tunnel Health Over Time",
      "type": "timeseries"
    },
    {
      "datasource": {"type": "prometheus", "uid": "prometheus"},
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "mappings": [],
          "max": 100,
          "min": 0,
          "thresholds": {"mode": "absolute", "steps": [
            {"color": "red", "value": null},
            {"color": "yellow", "value": 50},
            {"color": "green", "value": 80}
          ]},
          "unit": "percent"
        },
        "overrides": []
      },
      "gridPos": {"h": 6, "w": 12, "x": 0, "y": 14},
      "id": 6,
      "options": {
        "orientation": "auto",
        "reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
        "showThresholdLabels": false,
        "showThresholdMarkers": true
      },
      "pluginVersion": "10.2.2",
      "targets": [
        {
          "expr": "(cloudflare_tunnels_healthy / cloudflare_tunnels_total) * 100",
          "refId": "A"
        }
      ],
      "title": "Tunnel Health Percentage",
      "type": "gauge"
    }
  ],
  "refresh": "1m",
  "schemaVersion": 38,
  "style": "dark",
  "tags": ["cloudflare", "tunnel"],
  "templating": {"list": []},
  "time": {"from": "now-24h", "to": "now"},
  "timepicker": {},
  "timezone": "utc",
  "title": "Tunnel Status",
  "uid": "cf-tunnel",
  "version": 1,
  "weekStart": ""
}
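Each dashboard file places its panels with `gridPos` on Grafana's 24-column grid and gives every panel a numeric `id`. A small layout check along the following lines could catch a duplicated id or a panel that runs past the right edge before a dashboard is provisioned. It is a sketch under stated assumptions: `check_panels` is a hypothetical helper, not part of this commit, and it inspects only the `id` and `gridPos` keys seen in the files above.

```python
from typing import Any, Dict, List

GRID_COLUMNS = 24  # Grafana lays dashboard panels out on a 24-column grid


def check_panels(dashboard: Dict[str, Any]) -> List[str]:
    """Return human-readable layout problems found in a dashboard dict."""
    problems: List[str] = []
    seen_ids = set()
    for panel in dashboard.get("panels", []):
        panel_id = panel.get("id")
        if panel_id in seen_ids:
            problems.append(f"duplicate panel id {panel_id}")
        seen_ids.add(panel_id)
        pos = panel.get("gridPos", {})
        # A panel overflows when its left edge plus its width exceeds the grid.
        if pos.get("x", 0) + pos.get("w", 0) > GRID_COLUMNS:
            problems.append(f"panel {panel_id} overflows the {GRID_COLUMNS}-column grid")
    return problems
```

In practice the dict would come from `json.load` on one of the files under `observatory/dashboards/`; an empty list means no problem of these two kinds was found.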
13
observatory/datasources/prometheus.yml
Normal file
@@ -0,0 +1,13 @@
# Grafana Datasource Provisioning
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "60s"
      httpMethod: POST
123
observatory/docker-compose.yml
Normal file
@@ -0,0 +1,123 @@
# Cloudflare Mesh Observatory Docker Stack
# Prometheus + Grafana + Alertmanager + Custom Metrics Exporter
# Phase 5B - Full Observability + Alerting

services:
  # Prometheus - Metrics Collection
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: cf-prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    networks:
      - observatory
    depends_on:
      - alertmanager
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 10s
      retries: 3

  # Alertmanager - Alert Routing & Notifications
  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: cf-alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - ./alertmanager/templates:/etc/alertmanager/templates:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.listen-address=:9093'
      - '--cluster.listen-address='
    environment:
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
      - PAGERDUTY_SERVICE_KEY=${PAGERDUTY_SERVICE_KEY}
      - SMTP_USERNAME=${SMTP_USERNAME}
      - SMTP_PASSWORD=${SMTP_PASSWORD}
    networks:
      - observatory
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9093/-/healthy"]
      interval: 30s
      timeout: 10s
      retries: 3

  # Grafana - Visualization
  grafana:
    image: grafana/grafana:10.2.2
    container_name: cf-grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-changeme}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=%(protocol)s://%(domain)s:%(http_port)s/
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./datasources:/etc/grafana/provisioning/datasources:ro
    networks:
      - observatory
    depends_on:
      - prometheus
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3

  # Cloudflare Metrics Exporter
  metrics-exporter:
    build:
      context: .
      dockerfile: Dockerfile.exporter
    container_name: cf-metrics-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    environment:
      - CLOUDFLARE_API_TOKEN=${CLOUDFLARE_API_TOKEN}
      - CLOUDFLARE_ZONE_ID=${CLOUDFLARE_ZONE_ID}
      - CLOUDFLARE_ACCOUNT_ID=${CLOUDFLARE_ACCOUNT_ID}
      - SNAPSHOT_DIR=/data/snapshots
      - ANOMALY_DIR=/data/anomalies
    volumes:
      - ../snapshots:/data/snapshots:ro
      - ../anomalies:/data/anomalies:ro
    networks:
      - observatory
    healthcheck:
      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:9100/health', timeout=5)"]  # python:3.11-slim ships without wget
      interval: 30s
      timeout: 10s
      retries: 3

networks:
  observatory:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
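Every service in docker-compose.yml uses the same healthcheck pattern: request an HTTP endpoint inside the container and treat any failure as unhealthy. The same probe can be written with the Python standard library, which is useful in images that carry Python but no `wget`. This is a minimal sketch: `is_healthy` is a hypothetical helper, not code from this repository, and the endpoint paths are the ones named in the compose file.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 10.0) -> bool:
    """Return True if a GET to url answers with a 2xx status within timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP 4xx/5xx, or a malformed URL.
        return False
```

For example, `is_healthy("http://localhost:9100/health")` would mirror the exporter's check when run inside that container, assuming the exporter serves that path.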
344
observatory/drift-visualizer.py
Normal file
@@ -0,0 +1,344 @@
#!/usr/bin/env python3
"""
Drift Visualizer
Compares Terraform state, DNS manifest, and live Cloudflare configuration.
Outputs JSON diff and HTML report.

Usage:
    python3 drift-visualizer.py --snapshot <path> --manifest <path> --output-dir <dir>
"""

import argparse
import html
import json
import os
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Set, Tuple

OUTPUT_DIR = os.path.join(os.path.dirname(os.path.dirname(__file__)), "reports")


class DriftAnalyzer:
    """Analyzes drift between different state sources."""

    def __init__(self):
        self.diffs: List[Dict[str, Any]] = []

    def compare_dns_records(
        self,
        source_name: str,
        source_records: List[Dict],
        target_name: str,
        target_records: List[Dict]
    ) -> List[Dict[str, Any]]:
        """Compare DNS records between two sources."""
        diffs = []

        # Build lookup maps
        source_map = {(r.get("type"), r.get("name")): r for r in source_records}
        target_map = {(r.get("type"), r.get("name")): r for r in target_records}

        all_keys = set(source_map.keys()) | set(target_map.keys())

        for key in all_keys:
            rtype, name = key
            source_rec = source_map.get(key)
            target_rec = target_map.get(key)

            if source_rec and not target_rec:
                diffs.append({
                    "type": "missing",
                    "source": source_name,
                    "target": target_name,
                    "record_type": rtype,
                    "record_name": name,
                    "detail": f"Record exists in {source_name} but not in {target_name}",
                    "severity": "high",
                })
            elif target_rec and not source_rec:
                diffs.append({
                    "type": "extra",
                    "source": source_name,
                    "target": target_name,
                    "record_type": rtype,
                    "record_name": name,
                    "detail": f"Record exists in {target_name} but not in {source_name}",
                    "severity": "medium",
                })
            else:
                # Both exist - check for content/config drift
                content_diff = self._compare_record_content(source_rec, target_rec)
                if content_diff:
                    diffs.append({
                        "type": "modified",
                        "source": source_name,
                        "target": target_name,
                        "record_type": rtype,
                        "record_name": name,
                        "detail": content_diff,
                        "source_value": source_rec,
                        "target_value": target_rec,
                        "severity": "medium",
                    })

        return diffs

    def _compare_record_content(self, rec1: Dict, rec2: Dict) -> Optional[str]:
        """Compare record content and return diff description."""
        diffs = []

        if rec1.get("content") != rec2.get("content"):
            diffs.append(f"content: {rec1.get('content')} -> {rec2.get('content')}")

        if rec1.get("proxied") != rec2.get("proxied"):
            diffs.append(f"proxied: {rec1.get('proxied')} -> {rec2.get('proxied')}")

        if rec1.get("ttl") != rec2.get("ttl"):
            diffs.append(f"ttl: {rec1.get('ttl')} -> {rec2.get('ttl')}")

        return "; ".join(diffs) if diffs else None

    def compare_settings(
        self,
        source_name: str,
        source_settings: Dict,
        target_name: str,
        target_settings: Dict
    ) -> List[Dict[str, Any]]:
        """Compare zone settings."""
        diffs = []
        all_keys = set(source_settings.keys()) | set(target_settings.keys())

        for key in all_keys:
            src_val = source_settings.get(key)
            tgt_val = target_settings.get(key)

            if src_val != tgt_val:
                diffs.append({
                    "type": "setting_drift",
                    "source": source_name,
                    "target": target_name,
                    "setting": key,
                    "source_value": src_val,
                    "target_value": tgt_val,
                    "severity": "medium" if key in ("ssl", "min_tls_version") else "low",
                })

        return diffs

    def analyze(
        self,
        snapshot: Optional[Dict] = None,
        manifest: Optional[Dict] = None,
        terraform_state: Optional[Dict] = None
    ) -> Dict[str, Any]:
        """Run full drift analysis."""
        self.diffs = []
        comparisons = []

        # Snapshot vs Manifest
        if snapshot and manifest:
            snapshot_dns = snapshot.get("state", {}).get("dns", {}).get("records", [])
            manifest_dns = manifest.get("records", [])

            dns_diffs = self.compare_dns_records(
                "manifest", manifest_dns,
                "cloudflare", snapshot_dns
            )
            self.diffs.extend(dns_diffs)
            comparisons.append("manifest_vs_cloudflare")

        # Summary
        high = len([d for d in self.diffs if d.get("severity") == "high"])
        medium = len([d for d in self.diffs if d.get("severity") == "medium"])
        low = len([d for d in self.diffs if d.get("severity") == "low"])

        return {
            "analysis_type": "drift_report",
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "comparisons": comparisons,
            "summary": {
                "total_diffs": len(self.diffs),
                "high_severity": high,
                "medium_severity": medium,
                "low_severity": low,
                "drift_detected": len(self.diffs) > 0,
            },
            "diffs": self.diffs,
        }


def generate_html_report(analysis: Dict[str, Any]) -> str:
    """Generate HTML visualization of drift report."""
    timestamp = analysis.get("timestamp", "")
    summary = analysis.get("summary", {})
    diffs = analysis.get("diffs", [])

    # CSS styles
    css = """
    <style>
        body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
               max-width: 1200px; margin: 0 auto; padding: 20px; background: #0d1117; color: #c9d1d9; }
        h1 { color: #58a6ff; border-bottom: 1px solid #30363d; padding-bottom: 10px; }
        h2 { color: #8b949e; }
        .summary { display: flex; gap: 20px; margin: 20px 0; }
        .card { background: #161b22; padding: 20px; border-radius: 8px; border: 1px solid #30363d; flex: 1; }
        .card h3 { margin-top: 0; color: #58a6ff; }
        .stat { font-size: 2em; font-weight: bold; }
        .high { color: #f85149; }
        .medium { color: #d29922; }
        .low { color: #3fb950; }
        .ok { color: #3fb950; }
        table { width: 100%; border-collapse: collapse; margin: 20px 0; }
        th, td { padding: 12px; text-align: left; border-bottom: 1px solid #30363d; }
        th { background: #161b22; color: #8b949e; }
        tr:hover { background: #161b22; }
        .badge { padding: 4px 8px; border-radius: 4px; font-size: 0.8em; font-weight: bold; }
        .badge-high { background: #f85149; color: white; }
        .badge-medium { background: #d29922; color: black; }
        .badge-low { background: #238636; color: white; }
        .badge-missing { background: #f85149; }
        .badge-extra { background: #d29922; }
        .badge-modified { background: #1f6feb; color: white; }
        .no-drift { text-align: center; padding: 40px; color: #3fb950; }
        code { background: #21262d; padding: 2px 6px; border-radius: 4px; }
    </style>
    """

    # Header
    html_parts = [
        "<!DOCTYPE html>",
        "<html><head>",
        "<meta charset='utf-8'>",
        "<title>Cloudflare Drift Report</title>",
        css,
        "</head><body>",
        "<h1>Cloudflare Drift Report</h1>",
        f"<p>Generated: {html.escape(str(timestamp))}</p>",
    ]

    # Summary cards
    # Default to 0 so a missing count renders as "ok" rather than "high".
    total_diffs = summary.get("total_diffs", 0)
    total_class = "ok" if total_diffs == 0 else "high"
    html_parts.append("<div class='summary'>")
    html_parts.append(f"""
    <div class='card'>
        <h3>Total Diffs</h3>
        <div class='stat {total_class}'>{total_diffs}</div>
    </div>
    """)
    html_parts.append(f"""
    <div class='card'>
        <h3>High Severity</h3>
        <div class='stat high'>{summary.get("high_severity", 0)}</div>
    </div>
    """)
    html_parts.append(f"""
    <div class='card'>
        <h3>Medium Severity</h3>
        <div class='stat medium'>{summary.get("medium_severity", 0)}</div>
    </div>
    """)
    html_parts.append(f"""
    <div class='card'>
        <h3>Low Severity</h3>
        <div class='stat low'>{summary.get("low_severity", 0)}</div>
    </div>
    """)
    html_parts.append("</div>")

    # Diffs table
    if diffs:
        html_parts.append("<h2>Drift Details</h2>")
        html_parts.append("<table>")
        html_parts.append("""
        <tr>
            <th>Type</th>
            <th>Severity</th>
            <th>Record</th>
            <th>Detail</th>
        </tr>
        """)

        for diff in diffs:
            # Escape every value before interpolating it into markup or attributes.
            dtype = html.escape(str(diff.get("type", "unknown")))
            severity = html.escape(str(diff.get("severity", "low")))
            record = f"{diff.get('record_type', '')} {diff.get('record_name', '')}"
            detail = html.escape(str(diff.get("detail", "")))

            html_parts.append(f"""
            <tr>
                <td><span class='badge badge-{dtype}'>{dtype}</span></td>
                <td><span class='badge badge-{severity}'>{severity.upper()}</span></td>
                <td><code>{html.escape(record)}</code></td>
                <td>{detail}</td>
            </tr>
            """)

        html_parts.append("</table>")
    else:
        html_parts.append("<div class='no-drift'>No drift detected. Configuration is in sync.</div>")

    html_parts.append("</body></html>")
    return "\n".join(html_parts)


def main():
    parser = argparse.ArgumentParser(description="Drift Visualizer")
    parser.add_argument("--snapshot", help="Path to state snapshot JSON")
    parser.add_argument("--manifest", help="Path to DNS manifest JSON")
    parser.add_argument("--output-dir", default=OUTPUT_DIR, help="Output directory")
    parser.add_argument("--format", choices=["json", "html", "both"], default="both",
                        help="Output format")
    args = parser.parse_args()

    # Load files
    snapshot = None
    manifest = None

    if args.snapshot:
        with open(args.snapshot) as f:
            snapshot = json.load(f)

    if args.manifest:
        with open(args.manifest) as f:
            manifest = json.load(f)

    # analyze() only compares when both inputs are present. With one missing it
    # would report "no drift" without having compared anything, so fail early.
    # parser.error() prints to stderr and exits with status 2.
    if not snapshot or not manifest:
        parser.error("both --snapshot and --manifest are required")

    # Ensure output directory
    os.makedirs(args.output_dir, exist_ok=True)

    # Run analysis
    analyzer = DriftAnalyzer()
    analysis = analyzer.analyze(snapshot=snapshot, manifest=manifest)

    # Output
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")

    if args.format in ("json", "both"):
        json_path = os.path.join(args.output_dir, f"drift-report-{timestamp}.json")
        with open(json_path, "w") as f:
            json.dump(analysis, f, indent=2)
        print(f"JSON report: {json_path}")

    if args.format in ("html", "both"):
        html_content = generate_html_report(analysis)
        html_path = os.path.join(args.output_dir, f"drift-report-{timestamp}.html")
        # The report declares charset=utf-8, so write it as UTF-8 explicitly.
        with open(html_path, "w", encoding="utf-8") as f:
            f.write(html_content)
        print(f"HTML report: {html_path}")

    # Summary
    summary = analysis.get("summary", {})
    print("\nDrift Summary:")
    print(f"  Total diffs: {summary.get('total_diffs', 0)}")
    print(f"  High: {summary.get('high_severity', 0)}")
    print(f"  Medium: {summary.get('medium_severity', 0)}")
    print(f"  Low: {summary.get('low_severity', 0)}")

    # Exit code 1 signals drift so CI jobs can gate on it.
    return 0 if summary.get("total_diffs", 0) == 0 else 1


if __name__ == "__main__":
    raise SystemExit(main())
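Review note on `compare_dns_records`: the lookup maps are keyed on `(type, name)`, so when one name carries several records of the same type (round-robin A records, multiple MX or TXT entries) only the last one survives in each map. The sketch below shows one possible alternative that compares the full set of contents per key. It assumes record dicts shaped like the ones used above; `index_records` and `diff_record_sets` are hypothetical helpers, not part of this commit.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

RecordKey = Tuple[str, str]


def index_records(records: List[dict]) -> Dict[RecordKey, FrozenSet[str]]:
    """Group record contents by (type, name) so multi-value sets survive."""
    grouped = defaultdict(set)
    for rec in records:
        grouped[(rec.get("type"), rec.get("name"))].add(rec.get("content"))
    return {key: frozenset(values) for key, values in grouped.items()}


def diff_record_sets(source: List[dict], target: List[dict]) -> List[dict]:
    """Report contents present on only one side for each (type, name) key."""
    src, tgt = index_records(source), index_records(target)
    diffs = []
    # Sort by string form so the output order is stable between runs.
    for key in sorted(set(src) | set(tgt), key=lambda k: (str(k[0]), str(k[1]))):
        only_src = src.get(key, frozenset()) - tgt.get(key, frozenset())
        only_tgt = tgt.get(key, frozenset()) - src.get(key, frozenset())
        if only_src or only_tgt:
            diffs.append({
                "record_type": key[0],
                "record_name": key[1],
                "only_in_source": sorted(only_src, key=str),
                "only_in_target": sorted(only_tgt, key=str),
            })
    return diffs
```

With a manifest that lists `mx1` and `mx2` for the same name and a live zone that only has `mx1`, this reports `mx2` as present only in the source. The `(type, name)` maps would instead compare whichever MX record happened to be listed last, so the result depends on record order.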
351
observatory/escalation-matrix.yml
Normal file
@@ -0,0 +1,351 @@
# Cloudflare Mesh Observatory - Escalation Matrix
# Phase 5B - Alerts & Escalation
#
# This matrix defines who gets notified for what, and when to escalate.
# Used by Alertmanager routing and for human reference.

---
version: "1.0"
last_updated: "2024-01-01"

# ==============================================================================
# SEVERITY DEFINITIONS
# ==============================================================================
severity_definitions:
  critical:
    description: "Service down, security incident, or data integrity issue"
    response_time: "15 minutes"
    notification_channels: ["pagerduty", "slack-critical", "phone"]
    escalation_after: "30 minutes"

  warning:
    description: "Degraded service, policy violation, or impending issue"
    response_time: "1 hour"
    notification_channels: ["slack"]
    escalation_after: "4 hours"

  info:
    description: "Informational, audit, or metric threshold"
    response_time: "Next business day"
    notification_channels: ["email-digest"]
    escalation_after: null

# ==============================================================================
# ESCALATION CHAINS
# ==============================================================================
escalation_chains:
  infrastructure:
    name: "Infrastructure Team"
    stages:
      - stage: 1
        delay: "0m"
        contacts: ["infra-oncall"]
        channels: ["pagerduty", "slack"]
      - stage: 2
        delay: "30m"
        contacts: ["infra-lead"]
        channels: ["pagerduty", "phone"]
      - stage: 3
        delay: "1h"
        contacts: ["platform-director"]
        channels: ["phone"]

  security:
    name: "Security Team"
    stages:
      - stage: 1
        delay: "0m"
        contacts: ["security-oncall"]
        channels: ["pagerduty", "slack-security"]
      - stage: 2
        delay: "15m"
        contacts: ["security-lead", "ciso"]
        channels: ["pagerduty", "phone"]

  platform:
    name: "Platform Team"
    stages:
      - stage: 1
        delay: "0m"
        contacts: ["platform-oncall"]
        channels: ["slack"]
      - stage: 2
        delay: "1h"
        contacts: ["platform-lead"]
        channels: ["pagerduty"]

# ==============================================================================
# COMPONENT -> ESCALATION CHAIN MAPPING
# ==============================================================================
component_ownership:
  tunnel:
    primary_chain: infrastructure
    backup_chain: platform
    slack_channel: "#cloudflare-tunnels"
    playbooks:
      - "TUNNEL-ROTATION-PROTOCOL.md"

  dns:
    primary_chain: infrastructure
    backup_chain: security  # DNS can be security-related
    slack_channel: "#cloudflare-dns"
    playbooks:
      - "DNS-COMPROMISE-PLAYBOOK.md"

  waf:
    primary_chain: security
    backup_chain: infrastructure
    slack_channel: "#cloudflare-waf"
    playbooks:
      - "waf_incident_playbook.md"

  invariant:
    primary_chain: security
    backup_chain: platform
    slack_channel: "#cloudflare-security"
    playbooks:
      - "SECURITY-INVARIANTS.md"

  proofchain:
    primary_chain: platform
    backup_chain: security
    slack_channel: "#cloudflare-proofchain"
    playbooks:
      - "proofchain-incident.md"

# ==============================================================================
# ALERT -> RESPONSE MAPPING
# ==============================================================================
alert_responses:
  # TUNNEL ALERTS
  TunnelDown:
    severity: critical
    escalation_chain: infrastructure
    immediate_actions:
      - "Check cloudflared service status"
      - "Verify network connectivity to origin"
      - "Check Cloudflare status page"
    playbook: "TUNNEL-ROTATION-PROTOCOL.md"
    auto_remediation: false  # Manual intervention required

  AllTunnelsDown:
    severity: critical
    escalation_chain: infrastructure
    immediate_actions:
      - "DECLARE INCIDENT"
      - "Check all cloudflared instances"
      - "Verify DNS resolution"
      - "Check for Cloudflare outage"
    playbook: "TUNNEL-ROTATION-PROTOCOL.md"
    auto_remediation: false

  TunnelRotationDue:
    severity: warning
    escalation_chain: platform
    immediate_actions:
      - "Schedule maintenance window"
      - "Prepare new tunnel credentials"
    playbook: "TUNNEL-ROTATION-PROTOCOL.md"
    auto_remediation: true  # Can be auto-scheduled

  # DNS ALERTS
  DNSHijackDetected:
    severity: critical
    escalation_chain: security
    immediate_actions:
      - "DECLARE SECURITY INCIDENT"
      - "Verify DNS resolution from multiple locations"
      - "Check Cloudflare audit logs"
      - "Preserve evidence"
    playbook: "DNS-COMPROMISE-PLAYBOOK.md"
    auto_remediation: false  # NEVER auto-remediate security incidents

  DNSDriftDetected:
    severity: warning
    escalation_chain: infrastructure
    immediate_actions:
      - "Run state reconciler"
      - "Identify changed records"
      - "Verify authorization"
    playbook: "DNS-COMPROMISE-PLAYBOOK.md"
    auto_remediation: true  # Can auto-reconcile if authorized

  # WAF ALERTS
  WAFMassiveAttack:
    severity: critical
    escalation_chain: security
    immediate_actions:
      - "Verify attack is real (not false positive)"
      - "Consider Under Attack Mode"
      - "Check rate limiting"
      - "Document attack patterns"
    playbook: "waf_incident_playbook.md"
    auto_remediation: false

  WAFRuleBypass:
    severity: critical
    escalation_chain: security
    immediate_actions:
      - "Analyze bypassed requests"
      - "Tighten rule immediately"
      - "Check for related vulnerabilities"
    playbook: "waf_incident_playbook.md"
    auto_remediation: false

  WAFDisabled:
    severity: critical
    escalation_chain: security
    immediate_actions:
      - "IMMEDIATELY investigate why WAF is disabled"
      - "Re-enable unless documented exception"
      - "Review audit logs"
    playbook: "waf_incident_playbook.md"
    auto_remediation: true  # Auto-enable WAF

  # INVARIANT ALERTS
  SSLModeDowngraded:
    severity: critical
    escalation_chain: security
    immediate_actions:
      - "Restore Full (Strict) SSL mode"
      - "Investigate who made the change"
      - "Review audit logs"
    playbook: null
    auto_remediation: true  # Auto-restore SSL mode

  AccessPolicyViolation:
    severity: critical
    escalation_chain: security
    immediate_actions:
      - "Review access attempt"
      - "Block if malicious"
      - "Notify affected user if legitimate"
    playbook: null
    auto_remediation: false

  # PROOFCHAIN ALERTS
  ProofchainIntegrityFailure:
    severity: critical
    escalation_chain: security
    immediate_actions:
      - "HALT all new receipt generation"
      - "Preserve current state"
      - "Identify last known-good checkpoint"
      - "Do NOT attempt auto-recovery"
    playbook: null
    auto_remediation: false  # NEVER auto-remediate integrity failures

  ReceiptHashMismatch:
    severity: critical
    escalation_chain: security
    immediate_actions:
      - "Identify affected receipt"
      - "Compare against backup"
      - "Preserve for forensics"
    playbook: null
    auto_remediation: false

# ==============================================================================
# CONTACTS
# ==============================================================================
contacts:
  infra-oncall:
    name: "Infrastructure On-Call"
    pagerduty_service: "PXXXXXX"
    slack_handle: "@infra-oncall"
    schedule: "follow-the-sun"

  infra-lead:
    name: "Infrastructure Team Lead"
    pagerduty_user: "UXXXXXX"
    phone: "+1-XXX-XXX-XXXX"
    email: "infra-lead@company.com"

  security-oncall:
    name: "Security On-Call"
    pagerduty_service: "PXXXXXX"
    slack_handle: "@security-oncall"
    schedule: "24x7"

  security-lead:
    name: "Security Team Lead"
    pagerduty_user: "UXXXXXX"
    phone: "+1-XXX-XXX-XXXX"
    email: "security-lead@company.com"

  ciso:
    name: "Chief Information Security Officer"
    phone: "+1-XXX-XXX-XXXX"
    email: "ciso@company.com"

  platform-oncall:
    name: "Platform On-Call"
    pagerduty_service: "PXXXXXX"
    slack_handle: "@platform-oncall"

  platform-lead:
    name: "Platform Team Lead"
    pagerduty_user: "UXXXXXX"
    email: "platform-lead@company.com"

  platform-director:
    name: "Platform Director"
    phone: "+1-XXX-XXX-XXXX"
    email: "platform-director@company.com"

# ==============================================================================
# NOTIFICATION CHANNELS
# ==============================================================================
channels:
  slack:
    default: "#cloudflare-alerts"
    critical: "#cloudflare-critical"
    tunnels: "#cloudflare-tunnels"
    dns: "#cloudflare-dns"
    waf: "#cloudflare-waf"
    security: "#cloudflare-security"
    proofchain: "#cloudflare-proofchain"

  pagerduty:
    integration_key: "${PAGERDUTY_SERVICE_KEY}"
    escalation_policy: "cloudflare-infrastructure"

  email:
    daily_digest: "cloudflare-team@company.com"
    weekly_report: "platform-leadership@company.com"

# ==============================================================================
# AUTO-REMEDIATION POLICIES
# ==============================================================================
auto_remediation:
  enabled: true
  require_confirmation_for:
    - "critical"
    - "security_incident"
  never_auto_remediate:
    - "ProofchainIntegrityFailure"
    - "ReceiptHashMismatch"
    - "DNSHijackDetected"
    - "WAFRuleBypass"
  max_auto_remediations_per_hour: 5
  cooldown_period: "10m"

# ==============================================================================
# MAINTENANCE WINDOWS
# ==============================================================================
maintenance_windows:
  weekly_rotation:
    schedule: "0 3 * * SUN"  # 3 AM Sunday
    duration: "2h"
    suppress_alerts:
      - "TunnelDown"
      - "TunnelDegraded"
    notify_channel: "#cloudflare-alerts"

  monthly_patch:
    schedule: "0 2 15 * *"  # 2 AM on the 15th
    duration: "4h"
    suppress_alerts:
      - "TunnelDown"
      - "CloudflaredOutdated"
    notify_channel: "#cloudflare-alerts"
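Review note: the matrix expresses its auto-remediation policy in two places, a per-alert `auto_remediation` flag and the global `never_auto_remediate` / `require_confirmation_for` lists. A small cross-check can keep the two from drifting apart. The sketch below works on an already-parsed mapping; the dict literals copy a few entries from the file, `check_policy` is a hypothetical helper, and reading `require_confirmation_for` as "a human confirms before the automated fix runs" is my assumption about the intended semantics.

```python
from typing import Dict, List


def check_policy(alert_responses: Dict[str, dict], policy: dict) -> Dict[str, List[str]]:
    """Cross-check per-alert auto_remediation flags against the global policy."""
    never = set(policy.get("never_auto_remediate", []))
    confirm = set(policy.get("require_confirmation_for", []))
    findings: Dict[str, List[str]] = {"conflicts": [], "needs_confirmation": []}
    for name, response in sorted(alert_responses.items()):
        if not response.get("auto_remediation"):
            continue
        if name in never:
            # Flagged for auto-remediation but globally forbidden.
            findings["conflicts"].append(name)
        elif response.get("severity") in confirm:
            # Allowed, but only after confirmation (assumed semantics).
            findings["needs_confirmation"].append(name)
    return findings


# A few entries copied from the matrix above.
ALERTS = {
    "TunnelRotationDue": {"severity": "warning", "auto_remediation": True},
    "DNSHijackDetected": {"severity": "critical", "auto_remediation": False},
    "WAFDisabled": {"severity": "critical", "auto_remediation": True},
    "SSLModeDowngraded": {"severity": "critical", "auto_remediation": True},
}
POLICY = {
    "require_confirmation_for": ["critical", "security_incident"],
    "never_auto_remediate": [
        "ProofchainIntegrityFailure",
        "ReceiptHashMismatch",
        "DNSHijackDetected",
        "WAFRuleBypass",
    ],
}

print(check_policy(ALERTS, POLICY))
```

For these entries the check finds no conflicts and lists `SSLModeDowngraded` and `WAFDisabled` as critical alerts that are marked for auto-remediation and so fall under the confirmation rule.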
355
observatory/metrics-exporter.py
Normal file
@@ -0,0 +1,355 @@
#!/usr/bin/env python3
"""
Cloudflare Metrics Exporter for Prometheus
Exports Cloudflare state and invariant status as Prometheus metrics.

Usage:
    python3 metrics-exporter.py --port 9100

Environment Variables:
    CLOUDFLARE_API_TOKEN - API token
    CLOUDFLARE_ZONE_ID - Zone ID
    CLOUDFLARE_ACCOUNT_ID - Account ID
    SNAPSHOT_DIR - Directory containing state snapshots
    ANOMALY_DIR - Directory containing invariant reports
"""

import argparse
import glob
import json
import os
import time
from datetime import datetime, timezone
from http.server import HTTPServer, BaseHTTPRequestHandler
from typing import Any, Dict, List, Optional

import requests

# Configuration
CF_API_BASE = "https://api.cloudflare.com/client/v4"
DEFAULT_PORT = 9100
SCRAPE_INTERVAL = 60  # seconds


class CloudflareMetricsCollector:
    """Collects Cloudflare metrics for Prometheus export."""

    REQUEST_TIMEOUT = 30  # seconds; keeps a stalled API call from blocking a scrape

    def __init__(self, api_token: str, zone_id: str, account_id: str,
                 snapshot_dir: str, anomaly_dir: str):
        self.api_token = api_token
        self.zone_id = zone_id
        self.account_id = account_id
        self.snapshot_dir = snapshot_dir
        self.anomaly_dir = anomaly_dir
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json"
        })
        self.metrics: Dict[str, Any] = {}
        self.last_scrape = 0

    def _cf_request(self, endpoint: str) -> Dict[str, Any]:
        """Make Cloudflare API request."""
        url = f"{CF_API_BASE}{endpoint}"
        response = self.session.get(url, timeout=self.REQUEST_TIMEOUT)
        response.raise_for_status()
        return response.json()

    def _get_latest_file(self, pattern: str) -> Optional[str]:
        """Get most recent file matching pattern."""
        files = glob.glob(pattern)
        if not files:
            return None
        return max(files, key=os.path.getmtime)

    def collect_dns_metrics(self):
        """Collect DNS record metrics."""
        try:
            data = self._cf_request(f"/zones/{self.zone_id}/dns_records?per_page=500")
            records = data.get("result", [])

            # Count by type
            type_counts = {}
            proxied_count = 0
            unproxied_count = 0

            for r in records:
                rtype = r.get("type", "UNKNOWN")
                type_counts[rtype] = type_counts.get(rtype, 0) + 1
                if r.get("proxied"):
                    proxied_count += 1
                else:
                    unproxied_count += 1

            self.metrics["dns_records_total"] = len(records)
            self.metrics["dns_records_proxied"] = proxied_count
            self.metrics["dns_records_unproxied"] = unproxied_count

            for rtype, count in type_counts.items():
                self.metrics[f"dns_records_by_type{{type=\"{rtype}\"}}"] = count

        except Exception:
            self.metrics["dns_scrape_errors_total"] = self.metrics.get("dns_scrape_errors_total", 0) + 1

    def collect_dnssec_metrics(self):
        """Collect DNSSEC status."""
        try:
            data = self._cf_request(f"/zones/{self.zone_id}/dnssec")
            result = data.get("result", {})
            status = result.get("status", "unknown")

            self.metrics["dnssec_enabled"] = 1 if status == "active" else 0

        except Exception:
            self.metrics["dnssec_enabled"] = -1

    def collect_tunnel_metrics(self):
        """Collect tunnel metrics."""
        try:
            data = self._cf_request(f"/accounts/{self.account_id}/cfd_tunnel")
            tunnels = data.get("result", [])

            active = 0
            healthy = 0
            total_connections = 0

            for t in tunnels:
                if not t.get("deleted_at"):
                    active += 1
                    # Check connections
                    try:
                        conn_data = self._cf_request(
                            f"/accounts/{self.account_id}/cfd_tunnel/{t['id']}/connections"
                        )
                        conns = conn_data.get("result", [])
                        if conns:
                            healthy += 1
                            total_connections += len(conns)
                    except Exception:
                        pass

            self.metrics["tunnels_total"] = active
            self.metrics["tunnels_healthy"] = healthy
            self.metrics["tunnels_unhealthy"] = active - healthy
            self.metrics["tunnel_connections_total"] = total_connections

        except Exception:
            self.metrics["tunnel_scrape_errors_total"] = self.metrics.get("tunnel_scrape_errors_total", 0) + 1

    def collect_access_metrics(self):
        """Collect Access app metrics."""
        try:
            data = self._cf_request(f"/accounts/{self.account_id}/access/apps")
            apps = data.get("result", [])

            self.metrics["access_apps_total"] = len(apps)

            # Count by type
            type_counts = {}
            for app in apps:
                app_type = app.get("type", "unknown")
                type_counts[app_type] = type_counts.get(app_type, 0) + 1

            for app_type, count in type_counts.items():
                self.metrics[f"access_apps_by_type{{type=\"{app_type}\"}}"] = count

        except Exception:
            self.metrics["access_scrape_errors_total"] = self.metrics.get("access_scrape_errors_total", 0) + 1

    def collect_zone_settings_metrics(self):
        """Collect zone security settings."""
        try:
            data = self._cf_request(f"/zones/{self.zone_id}/settings")
            settings = {s["id"]: s["value"] for s in data.get("result", [])}

            # TLS settings
            ssl = settings.get("ssl", "unknown")
            self.metrics["zone_ssl_strict"] = 1 if ssl in ("strict", "full_strict") else 0

            min_tls = settings.get("min_tls_version", "unknown")
            self.metrics["zone_tls_version_secure"] = 1 if min_tls in ("1.2", "1.3") else 0

            # Security features
            self.metrics["zone_always_https"] = 1 if settings.get("always_use_https") == "on" else 0
            self.metrics["zone_browser_check"] = 1 if settings.get("browser_check") == "on" else 0

        except Exception:
            pass

    def collect_snapshot_metrics(self):
        """Collect metrics from state snapshots."""
        latest = self._get_latest_file(os.path.join(self.snapshot_dir, "cloudflare-*.json"))
        if not latest:
            self.metrics["snapshot_age_seconds"] = -1
            return

        try:
            mtime = os.path.getmtime(latest)
            age = time.time() - mtime
            self.metrics["snapshot_age_seconds"] = int(age)

            with open(latest) as f:
                snapshot = json.load(f)

            integrity = snapshot.get("integrity", {})
            self.metrics["snapshot_merkle_root_set"] = 1 if integrity.get("merkle_root") else 0

        except Exception:
            self.metrics["snapshot_age_seconds"] = -1

    def collect_invariant_metrics(self):
        """Collect metrics from invariant reports."""
        latest = self._get_latest_file(os.path.join(self.anomaly_dir, "invariant-report-*.json"))
        if not latest:
            self.metrics["invariants_total"] = 0
            self.metrics["invariants_passed"] = 0
            self.metrics["invariants_failed"] = 0
            return

        try:
            with open(latest) as f:
                report = json.load(f)

            summary = report.get("summary", {})
            self.metrics["invariants_total"] = summary.get("total", 0)
            self.metrics["invariants_passed"] = summary.get("passed", 0)
            self.metrics["invariants_failed"] = summary.get("failed", 0)
            self.metrics["invariants_pass_rate"] = summary.get("pass_rate", 0)

            # Report age
            mtime = os.path.getmtime(latest)
            self.metrics["invariant_report_age_seconds"] = int(time.time() - mtime)

        except Exception:
            pass

    def collect_anomaly_metrics(self):
        """Count anomaly receipts."""
        anomaly_files = glob.glob(os.path.join(self.anomaly_dir, "anomaly-*.json"))
        self.metrics["anomalies_total"] = len(anomaly_files)

        # Recent anomalies (last 24h)
        recent = 0
        day_ago = time.time() - 86400
        for f in anomaly_files:
            if os.path.getmtime(f) > day_ago:
                recent += 1
        self.metrics["anomalies_last_24h"] = recent

    def collect_all(self):
        """Collect all metrics."""
        now = time.time()
        if now - self.last_scrape < SCRAPE_INTERVAL:
            return  # Rate limit

        self.last_scrape = now
        # Carry the error counters over so they accumulate across scrapes
        # instead of restarting from zero every time the metrics are rebuilt.
        error_counters = {
            key: value for key, value in self.metrics.items()
            if key.endswith("_scrape_errors_total")
        }
        self.metrics = {"scrape_timestamp": int(now), **error_counters}

        self.collect_dns_metrics()
        self.collect_dnssec_metrics()
        self.collect_tunnel_metrics()
        self.collect_access_metrics()
        self.collect_zone_settings_metrics()
        self.collect_snapshot_metrics()
        self.collect_invariant_metrics()
        self.collect_anomaly_metrics()

    def format_prometheus(self) -> str:
        """Format metrics as Prometheus exposition format."""
        lines = [
            "# HELP cloudflare_dns_records_total Total DNS records",
            "# TYPE cloudflare_dns_records_total gauge",
            "# HELP cloudflare_tunnels_total Total active tunnels",
            "# TYPE cloudflare_tunnels_total gauge",
            "# HELP cloudflare_tunnels_healthy Healthy tunnels with connections",
            "# TYPE cloudflare_tunnels_healthy gauge",
            "# HELP cloudflare_invariants_passed Invariants passing",
            "# TYPE cloudflare_invariants_passed gauge",
            "# HELP cloudflare_invariants_failed Invariants failing",
            "# TYPE cloudflare_invariants_failed gauge",
            "",
        ]

        for key, value in self.metrics.items():
            if isinstance(value, (int, float)):
                # Keys may already carry a label set, e.g. dns_records_by_type{type="A"}.
                lines.append(f"cloudflare_{key} {value}")

        # The text exposition format requires the last line to end with a line feed.
        return "\n".join(lines) + "\n"
|
||||
|
||||
|
||||
class MetricsHandler(BaseHTTPRequestHandler):
    """HTTP handler for Prometheus scrapes."""

    collector: CloudflareMetricsCollector = None

    def do_GET(self):
        if self.path == "/metrics":
            self.collector.collect_all()
            output = self.collector.format_prometheus()

            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.end_headers()
            self.wfile.write(output.encode())
        elif self.path == "/health":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass  # Suppress default logging


def main():
    parser = argparse.ArgumentParser(description="Cloudflare Metrics Exporter")
    parser.add_argument("--port", type=int, default=DEFAULT_PORT,
                        help=f"Port to listen on (default: {DEFAULT_PORT})")
    parser.add_argument("--zone-id", default=os.environ.get("CLOUDFLARE_ZONE_ID"))
    parser.add_argument("--account-id", default=os.environ.get("CLOUDFLARE_ACCOUNT_ID"))
    parser.add_argument("--snapshot-dir",
                        default=os.environ.get("SNAPSHOT_DIR", "../snapshots"))
    parser.add_argument("--anomaly-dir",
                        default=os.environ.get("ANOMALY_DIR", "../anomalies"))
    args = parser.parse_args()

    api_token = os.environ.get("CLOUDFLARE_API_TOKEN")
    if not api_token:
        print("Error: CLOUDFLARE_API_TOKEN required")
        return 1

    if not args.zone_id or not args.account_id:
        print("Error: Zone ID and Account ID required")
        return 1

    # Initialize collector
    collector = CloudflareMetricsCollector(
        api_token, args.zone_id, args.account_id,
        args.snapshot_dir, args.anomaly_dir
    )
    MetricsHandler.collector = collector

    # Start server
    server = HTTPServer(("0.0.0.0", args.port), MetricsHandler)
    print(f"Cloudflare Metrics Exporter listening on :{args.port}")
    print("  /metrics - Prometheus metrics")
    print("  /health  - Health check")

    try:
        server.serve_forever()
    except KeyboardInterrupt:
        print("\nShutting down...")
    finally:
        # serve_forever() has already returned here; release the listening socket.
        server.server_close()

    return 0


if __name__ == "__main__":
    # SystemExit works even where the site-provided exit() helper is unavailable.
    raise SystemExit(main())
|
||||
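The exporter above emits a fixed list of `# HELP` / `# TYPE` lines and then every numeric entry of `self.metrics`. The following is a minimal sketch of an alternative layout, not the exporter's actual code: `HELP_TEXT` and `render_metrics` are hypothetical names introduced here for illustration. It keeps each metric's metadata directly above its first sample and skips booleans, which would otherwise be rendered as `True` / `False`.

```python
# Hypothetical helper for illustration; not part of metrics-exporter.py.
HELP_TEXT = {
    "dns_records_total": "Total DNS records",
    "tunnels_total": "Total active tunnels",
}


def render_metrics(metrics: dict, prefix: str = "cloudflare_") -> str:
    """Render numeric metrics in the Prometheus text exposition format."""
    lines = []
    seen = set()  # metric names whose HELP/TYPE lines were already written
    for key, value in metrics.items():
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        # Strip an optional label set to find the bare metric name.
        name = key.split("{", 1)[0]
        if name in HELP_TEXT and name not in seen:
            seen.add(name)
            lines.append(f"# HELP {prefix}{name} {HELP_TEXT[name]}")
            lines.append(f"# TYPE {prefix}{name} gauge")
        lines.append(f"{prefix}{key} {value}")
    # The format expects the final line to end with a line feed.
    return "\n".join(lines) + "\n"
```

Calling `render_metrics({"dns_records_total": 12})` yields the HELP line, the TYPE line and the sample, in that order, each terminated by a newline.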
43
observatory/prometheus.yml
Normal file
@@ -0,0 +1,43 @@
# Prometheus Configuration for Cloudflare Mesh Observatory
# Scrapes metrics from the custom exporter

global:
  scrape_interval: 60s
  evaluation_interval: 60s
  external_labels:
    monitor: 'cloudflare-mesh'

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files - Load all alert rules from the alerts directory
rule_files:
  - /etc/prometheus/alerts/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: /metrics
    scheme: http

  # Cloudflare metrics exporter
  - job_name: 'cloudflare'
    static_configs:
      - targets: ['metrics-exporter:9100']
    metrics_path: /metrics
    scheme: http
    scrape_interval: 60s
    scrape_timeout: 30s
    honor_labels: true

  # Optional: Node exporter for host metrics
  # - job_name: 'node'
  #   static_configs:
  #     - targets: ['node-exporter:9100']
228
observatory/prometheus/alerts/dns-alerts.yml
Normal file
@@ -0,0 +1,228 @@
|
||||
# DNS Alert Rules for Cloudflare Mesh Observatory
|
||||
# Phase 5B - Alerts & Escalation
|
||||
|
||||
groups:
|
||||
- name: dns_alerts
|
||||
interval: 60s
|
||||
rules:
|
||||
# ============================================
|
||||
# CRITICAL - DNS Hijack Detection
|
||||
# ============================================
|
||||
- alert: DNSHijackDetected
|
||||
expr: cloudflare_dns_record_mismatch == 1
|
||||
for: 1m
|
||||
labels:
|
||||
severity: critical
|
||||
component: dns
|
||||
playbook: dns-compromise
|
||||
security_incident: "true"
|
||||
annotations:
|
||||
summary: "POTENTIAL DNS HIJACK: {{ $labels.record_name }}"
|
||||
description: |
|
||||
DNS record {{ $labels.record_name }} ({{ $labels.record_type }}) in zone
|
||||
{{ $labels.zone }} does not match expected value.
|
||||
|
||||
Expected: {{ $labels.expected_value }}
|
||||
Actual: {{ $labels.actual_value }}
|
||||
|
||||
This may indicate DNS hijacking or unauthorized modification.
|
||||
TREAT AS SECURITY INCIDENT until verified.
|
||||
impact: "Traffic may be routed to unauthorized destinations"
|
||||
runbook_url: "https://wiki.internal/playbooks/dns-compromise"
|
||||
|
||||
# ============================================
|
||||
# CRITICAL - Critical DNS Record Missing
|
||||
# ============================================
|
||||
- alert: CriticalDNSRecordMissing
|
||||
expr: cloudflare_dns_critical_record_exists == 0
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
component: dns
|
||||
playbook: dns-compromise
|
||||
annotations:
|
||||
summary: "Critical DNS record missing: {{ $labels.record_name }}"
|
||||
description: |
|
||||
Critical DNS record {{ $labels.record_name }} ({{ $labels.record_type }})
|
||||
is missing from zone {{ $labels.zone }}.
|
||||
This record is marked as critical in the DNS manifest.
|
||||
impact: "Service reachability may be affected"
|
||||
runbook_url: "https://wiki.internal/playbooks/dns-compromise"
|
||||
|
||||
# ============================================
|
||||
# WARNING - DNS Drift Detected
|
||||
# ============================================
|
||||
- alert: DNSDriftDetected
|
||||
expr: cloudflare_dns_drift_count > 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: dns
|
||||
annotations:
|
||||
summary: "DNS drift detected in zone {{ $labels.zone }}"
|
||||
description: |
|
||||
{{ $value }} DNS records in zone {{ $labels.zone }} differ from
|
||||
the expected baseline configuration.
|
||||
|
||||
Run state reconciler to identify specific changes.
|
||||
runbook_url: "https://wiki.internal/playbooks/dns-compromise"
|
||||
|
||||
# ============================================
|
||||
# WARNING - DNS Record TTL Mismatch
|
||||
# ============================================
|
||||
- alert: DNSTTLMismatch
|
||||
expr: cloudflare_dns_ttl_mismatch == 1
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
component: dns
|
||||
annotations:
|
||||
summary: "DNS TTL mismatch: {{ $labels.record_name }}"
|
||||
description: |
|
||||
DNS record {{ $labels.record_name }} has unexpected TTL.
|
||||
Expected: {{ $labels.expected_ttl }}s
|
||||
Actual: {{ $labels.actual_ttl }}s
|
||||
|
||||
This may affect caching behavior and failover timing.
|
||||
|
||||
# ============================================
|
||||
# WARNING - DNS Propagation Slow
|
||||
# ============================================
|
||||
- alert: DNSPropagationSlow
|
||||
expr: cloudflare_dns_propagation_time_seconds > 300
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: dns
|
||||
annotations:
|
||||
summary: "Slow DNS propagation for {{ $labels.record_name }}"
|
||||
description: |
|
||||
DNS changes for {{ $labels.record_name }} are taking longer than
|
||||
5 minutes to propagate.
|
||||
Current propagation time: {{ $value | humanizeDuration }}
|
||||
|
||||
# ============================================
|
||||
# CRITICAL - DNS Propagation Failed
|
||||
# ============================================
|
||||
- alert: DNSPropagationFailed
|
||||
expr: cloudflare_dns_propagation_time_seconds > 900
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
component: dns
|
||||
annotations:
|
||||
summary: "DNS propagation failed for {{ $labels.record_name }}"
|
||||
description: |
|
||||
DNS changes for {{ $labels.record_name }} have not propagated
|
||||
after 15 minutes. This may indicate a configuration issue.
|
||||
|
||||
# ============================================
|
||||
# WARNING - Unexpected DNS Record
|
||||
# ============================================
|
||||
- alert: UnexpectedDNSRecord
|
||||
expr: cloudflare_dns_unexpected_record == 1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: dns
|
||||
annotations:
|
||||
summary: "Unexpected DNS record: {{ $labels.record_name }}"
|
||||
description: |
|
||||
DNS record {{ $labels.record_name }} ({{ $labels.record_type }}) exists
|
||||
but is not defined in the DNS manifest.
|
||||
This may be an unauthorized addition.
|
||||
|
||||
# ============================================
|
||||
# INFO - DNS Record Added
|
||||
# ============================================
|
||||
- alert: DNSRecordAdded
|
||||
expr: delta(cloudflare_dns_records_total[1h]) > 0
|
||||
for: 0m
|
||||
labels:
|
||||
severity: info
|
||||
component: dns
|
||||
annotations:
|
||||
summary: "DNS record added in zone {{ $labels.zone }}"
|
||||
description: |
|
||||
{{ $value }} new DNS record(s) detected in zone {{ $labels.zone }}
|
||||
in the last hour. Verify this was authorized.
|
||||
|
||||
# ============================================
|
||||
# INFO - DNS Record Removed
|
||||
# ============================================
|
||||
- alert: DNSRecordRemoved
|
||||
expr: -delta(cloudflare_dns_records_total[1h]) > 0
|
||||
for: 0m
|
||||
labels:
|
||||
severity: info
|
||||
component: dns
|
||||
annotations:
|
||||
summary: "DNS record removed from zone {{ $labels.zone }}"
|
||||
description: |
|
||||
{{ $value }} DNS record(s) removed from zone {{ $labels.zone }}
|
||||
in the last hour. Verify this was authorized.
|
||||
|
||||
# ============================================
|
||||
# WARNING - DNSSEC Disabled
|
||||
# ============================================
|
||||
- alert: DNSSECDisabled
|
||||
expr: cloudflare_zone_dnssec_enabled == 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: dns
|
||||
annotations:
|
||||
summary: "DNSSEC disabled for zone {{ $labels.zone }}"
|
||||
description: |
|
||||
DNSSEC is not enabled for zone {{ $labels.zone }}.
|
||||
This reduces protection against DNS spoofing attacks.
|
||||
|
||||
# ============================================
|
||||
# WARNING - Zone Transfer Enabled
|
||||
# ============================================
|
||||
- alert: ZoneTransferEnabled
|
||||
expr: cloudflare_zone_axfr_enabled == 1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: dns
|
||||
annotations:
|
||||
summary: "Zone transfer (AXFR) enabled for {{ $labels.zone }}"
|
||||
description: |
|
||||
Zone transfer is enabled for {{ $labels.zone }}.
|
||||
This exposes DNS records to potential enumeration.
|
||||
Disable unless explicitly required.
|
||||
|
||||
# ============================================
|
||||
# WARNING - DNS Query Spike
|
||||
# ============================================
|
||||
- alert: DNSQuerySpike
|
||||
expr: |
|
||||
rate(cloudflare_dns_queries_total[5m])
|
||||
> 3 * avg_over_time(rate(cloudflare_dns_queries_total[5m])[24h:5m])
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: dns
|
||||
annotations:
|
||||
summary: "DNS query spike for zone {{ $labels.zone }}"
|
||||
description: |
|
||||
DNS queries for zone {{ $labels.zone }} are 3x above the 24-hour average.
|
||||
This may indicate a DDoS attack or misconfigured client.
|
||||
|
||||
# ============================================
|
||||
# WARNING - High DNS Error Rate
|
||||
# ============================================
|
||||
- alert: HighDNSErrorRate
|
||||
expr: |
|
||||
rate(cloudflare_dns_errors_total[5m])
|
||||
/ rate(cloudflare_dns_queries_total[5m]) > 0.01
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
component: dns
|
||||
annotations:
|
||||
summary: "High DNS error rate for zone {{ $labels.zone }}"
|
||||
description: |
|
||||
DNS error rate exceeds 1% for zone {{ $labels.zone }}.
|
||||
Current error rate: {{ $value | humanizePercentage }}
|
||||
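The last two rules compare a 5-minute query rate against three times its 24-hour average, and an error ratio against 1%. Prometheus evaluates both server-side; the sketch below only illustrates the threshold arithmetic in Python. The function names and the plain list of past rate samples are assumptions made for this example.

```python
from statistics import fmean


def is_query_spike(current_rate: float, past_rates: list, factor: float = 3.0) -> bool:
    """True when the current rate exceeds `factor` times the historical mean."""
    if not past_rates:
        return False  # no baseline yet, so nothing to compare against
    return current_rate > factor * fmean(past_rates)


def error_ratio_exceeded(error_rate: float, query_rate: float,
                         threshold: float = 0.01) -> bool:
    """True when errors / queries is above the threshold."""
    if query_rate == 0:
        # Mirrors PromQL division: 0/0 is NaN (comparison is false),
        # while x/0 with x > 0 is +Inf (comparison is true).
        return error_rate > 0
    return error_rate / query_rate > threshold
```

With a baseline of 10 queries per second, a current rate of 40 counts as a spike while 30 does not, because the comparison is strict.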
284
observatory/prometheus/alerts/invariant-alerts.yml
Normal file
@@ -0,0 +1,284 @@
|
||||
# Security Invariant Alert Rules for Cloudflare Mesh Observatory
|
||||
# Phase 5B - Alerts & Escalation
|
||||
|
||||
groups:
|
||||
- name: invariant_alerts
|
||||
interval: 60s
|
||||
rules:
|
||||
# ============================================
|
||||
# CRITICAL - SSL Mode Downgrade
|
||||
# ============================================
|
||||
- alert: SSLModeDowngraded
|
||||
expr: cloudflare_zone_ssl_mode != 1 # 1 = Full (Strict)
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
component: invariant
|
||||
invariant_name: ssl_strict_mode
|
||||
category: encryption
|
||||
frameworks: "SOC2,PCI-DSS,ISO27001"
|
||||
annotations:
|
||||
summary: "SSL mode is not Full (Strict) for {{ $labels.zone }}"
|
||||
description: |
|
||||
Zone {{ $labels.zone }} SSL mode has been changed from Full (Strict).
|
||||
Current mode: {{ $labels.ssl_mode }}
|
||||
|
||||
This weakens TLS security and may allow MITM attacks.
|
||||
This is a compliance violation for multiple frameworks.
|
||||
impact: "Reduced TLS security, potential MITM vulnerability"
|
||||
runbook_url: "https://wiki.internal/invariants/ssl-mode"
|
||||
|
||||
# ============================================
|
||||
# CRITICAL - Always Use HTTPS Disabled
|
||||
# ============================================
|
||||
- alert: HTTPSNotEnforced
|
||||
expr: cloudflare_zone_always_use_https == 0
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
component: invariant
|
||||
invariant_name: always_use_https
|
||||
category: encryption
|
||||
frameworks: "SOC2,PCI-DSS,HIPAA"
|
||||
annotations:
|
||||
summary: "Always Use HTTPS disabled for {{ $labels.zone }}"
|
||||
description: |
|
||||
Zone {{ $labels.zone }} allows HTTP traffic.
|
||||
This may expose sensitive data in transit.
|
||||
impact: "Data transmitted over unencrypted connections"
|
||||
runbook_url: "https://wiki.internal/invariants/https-enforcement"
|
||||
|
||||
# ============================================
|
||||
# CRITICAL - TLS Version Below Minimum
|
||||
# ============================================
|
||||
- alert: TLSVersionTooLow
|
||||
expr: cloudflare_zone_min_tls_version < 1.2
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
component: invariant
|
||||
invariant_name: min_tls_version
|
||||
category: encryption
|
||||
frameworks: "PCI-DSS,NIST"
|
||||
annotations:
|
||||
summary: "Minimum TLS version below 1.2 for {{ $labels.zone }}"
|
||||
description: |
|
||||
Zone {{ $labels.zone }} allows TLS versions below 1.2.
|
||||
Current minimum: TLS {{ $labels.min_tls }}
|
||||
|
||||
TLS 1.0 and 1.1 have known vulnerabilities.
|
||||
PCI-DSS requires TLS 1.2 minimum.
|
||||
impact: "Vulnerable TLS versions allowed"
|
||||
runbook_url: "https://wiki.internal/invariants/tls-version"
|
||||
|
||||
# ============================================
|
||||
# WARNING - HSTS Not Enabled
|
||||
# ============================================
|
||||
- alert: HSTSNotEnabled
|
||||
expr: cloudflare_zone_hsts_enabled == 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: invariant
|
||||
invariant_name: hsts_enabled
|
||||
category: encryption
|
||||
frameworks: "SOC2,OWASP"
|
||||
annotations:
|
||||
summary: "HSTS not enabled for {{ $labels.zone }}"
|
||||
description: |
|
||||
HTTP Strict Transport Security is not enabled for {{ $labels.zone }}.
|
||||
This allows SSL stripping attacks.
|
||||
runbook_url: "https://wiki.internal/invariants/hsts"
|
||||
|
||||
# ============================================
|
||||
# WARNING - Security Headers Missing
|
||||
# ============================================
|
||||
- alert: SecurityHeadersMissing
|
||||
expr: cloudflare_zone_security_headers_score < 0.8
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: invariant
|
||||
invariant_name: security_headers
|
||||
category: headers
|
||||
frameworks: "OWASP,SOC2"
|
||||
annotations:
|
||||
summary: "Security headers score below threshold for {{ $labels.zone }}"
|
||||
description: |
|
||||
Zone {{ $labels.zone }} security headers score: {{ $value }}
|
||||
Expected minimum: 0.8
|
||||
|
||||
Missing headers may include: CSP, X-Frame-Options, X-Content-Type-Options
|
||||
runbook_url: "https://wiki.internal/invariants/security-headers"
|
||||
|
||||
# ============================================
|
||||
# CRITICAL - Origin IP Exposed
|
||||
# ============================================
|
||||
- alert: OriginIPExposed
|
||||
expr: cloudflare_origin_ip_exposed == 1
|
||||
for: 1m
|
||||
labels:
|
||||
severity: critical
|
||||
component: invariant
|
||||
invariant_name: origin_hidden
|
||||
category: network
|
||||
frameworks: "SOC2"
|
||||
annotations:
|
||||
summary: "Origin IP may be exposed for {{ $labels.zone }}"
|
||||
description: |
|
||||
DNS or headers may be exposing the origin server IP.
|
||||
Exposed via: {{ $labels.exposure_method }}
|
||||
|
||||
Attackers can bypass Cloudflare protection by attacking origin directly.
|
||||
impact: "Origin server exposed to direct attacks"
|
||||
runbook_url: "https://wiki.internal/invariants/origin-protection"
|
||||
|
||||
# ============================================
|
||||
# WARNING - Rate Limiting Not Configured
|
||||
# ============================================
|
||||
- alert: RateLimitingMissing
|
||||
expr: cloudflare_zone_rate_limiting_rules == 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: invariant
|
||||
invariant_name: rate_limiting
|
||||
category: protection
|
||||
frameworks: "SOC2,OWASP"
|
||||
annotations:
|
||||
summary: "No rate limiting rules for {{ $labels.zone }}"
|
||||
description: |
|
||||
Zone {{ $labels.zone }} has no rate limiting rules configured.
|
||||
This leaves the zone vulnerable to brute force attacks.
|
||||
runbook_url: "https://wiki.internal/invariants/rate-limiting"
|
||||
|
||||
# ============================================
|
||||
# WARNING - Authenticated Origin Pulls Disabled
|
||||
# ============================================
|
||||
- alert: AuthenticatedOriginPullsDisabled
|
||||
expr: cloudflare_zone_authenticated_origin_pulls == 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: invariant
|
||||
invariant_name: aop_enabled
|
||||
category: authentication
|
||||
frameworks: "SOC2,Zero-Trust"
|
||||
annotations:
|
||||
summary: "Authenticated Origin Pulls disabled for {{ $labels.zone }}"
|
||||
description: |
|
||||
Authenticated Origin Pulls is not enabled for {{ $labels.zone }}.
|
||||
Origin cannot verify requests come from Cloudflare.
|
||||
runbook_url: "https://wiki.internal/invariants/authenticated-origin-pulls"
|
||||
|
||||
# ============================================
|
||||
# WARNING - Bot Protection Disabled
|
||||
# ============================================
|
||||
- alert: BotProtectionDisabled
|
||||
expr: cloudflare_zone_bot_management_enabled == 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: invariant
|
||||
invariant_name: bot_management
|
||||
category: protection
|
||||
annotations:
|
||||
summary: "Bot management disabled for {{ $labels.zone }}"
|
||||
description: |
|
||||
Bot management is not enabled for {{ $labels.zone }}.
|
||||
Zone is vulnerable to automated attacks and scraping.
|
||||
runbook_url: "https://wiki.internal/invariants/bot-management"
|
||||
|
||||
# ============================================
|
||||
# CRITICAL - Access Policy Violation
|
||||
# ============================================
|
||||
- alert: AccessPolicyViolation
|
||||
expr: cloudflare_access_policy_violations > 0
|
||||
for: 1m
|
||||
labels:
|
||||
severity: critical
|
||||
component: invariant
|
||||
invariant_name: access_policy
|
||||
category: access_control
|
||||
frameworks: "SOC2,Zero-Trust,ISO27001"
|
||||
annotations:
|
||||
summary: "Access policy violations detected"
|
||||
description: |
|
||||
{{ $value }} access policy violations detected.
|
||||
Policy: {{ $labels.policy_name }}
|
||||
|
||||
Review access logs for unauthorized access attempts.
|
||||
impact: "Potential unauthorized access"
|
||||
runbook_url: "https://wiki.internal/invariants/access-control"
|
||||
|
||||
# ============================================
|
||||
# WARNING - Browser Integrity Check Disabled
|
||||
# ============================================
|
||||
- alert: BrowserIntegrityCheckDisabled
|
||||
expr: cloudflare_zone_browser_integrity_check == 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: invariant
|
||||
invariant_name: browser_integrity_check
|
||||
category: protection
|
||||
annotations:
|
||||
summary: "Browser Integrity Check disabled for {{ $labels.zone }}"
|
||||
description: |
|
||||
Browser Integrity Check is disabled for {{ $labels.zone }}.
|
||||
This allows requests with suspicious headers.
|
||||
|
||||
# ============================================
|
||||
# INFO - Email Obfuscation Disabled
|
||||
# ============================================
|
||||
- alert: EmailObfuscationDisabled
|
||||
expr: cloudflare_zone_email_obfuscation == 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: info
|
||||
component: invariant
|
||||
invariant_name: email_obfuscation
|
||||
category: privacy
|
||||
annotations:
|
||||
summary: "Email obfuscation disabled for {{ $labels.zone }}"
|
||||
description: |
|
||||
Email obfuscation is disabled. Email addresses on pages
|
||||
may be harvested by spam bots.
|
||||
|
||||
# ============================================
|
||||
# WARNING - Development Mode Active
|
||||
# ============================================
|
||||
- alert: DevelopmentModeActive
|
||||
expr: cloudflare_zone_development_mode == 1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: invariant
|
||||
invariant_name: development_mode
|
||||
category: configuration
|
||||
annotations:
|
||||
summary: "Development mode active for {{ $labels.zone }}"
|
||||
description: |
|
||||
Development mode is active for {{ $labels.zone }}.
|
||||
This bypasses Cloudflare's cache and should only be used temporarily.
|
||||
Remember to disable after development is complete.
|
||||
|
||||
# ============================================
|
||||
# CRITICAL - Invariant Check Failure
|
||||
# ============================================
|
||||
- alert: InvariantCheckFailed
|
||||
expr: cloudflare_invariant_check_status == 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
component: invariant
|
||||
category: monitoring
|
||||
annotations:
|
||||
summary: "Invariant checker is failing"
|
||||
description: |
|
||||
The invariant checker script is not running successfully.
|
||||
Last success: {{ $labels.last_success }}
|
||||
Error: {{ $labels.error_message }}
|
||||
|
||||
Security invariants are not being monitored.
|
||||
runbook_url: "https://wiki.internal/invariants/checker-troubleshooting"
|
||||
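Several rules above compare numeric gauges (`cloudflare_zone_ssl_mode != 1`, `cloudflare_zone_min_tls_version < 1.2`), so the exporter has to turn string-valued zone settings into numbers. This commit excerpt does not show that mapping. The sketch below is one possible encoding: it assumes the setting values `off`, `flexible`, `full` and `strict`, and keeps the convention from the rule comment that 1 means Full (Strict). Every other code, and both function names, are hypothetical.

```python
# Hypothetical encoding; only "1 = Full (Strict)" comes from the alert rule.
SSL_MODE_CODES = {"strict": 1, "full": 2, "flexible": 3, "off": 4}


def encode_ssl_mode(value: str) -> int:
    """Map an SSL mode string to a gauge value."""
    # Unknown values map to 0, which still violates the "!= 1" check.
    return SSL_MODE_CODES.get(value, 0)


def encode_min_tls(value) -> float:
    """Map a minimum TLS version string such as "1.2" to a float gauge."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # Unparseable input maps to 0.0, which trips the "< 1.2" check.
        return 0.0
```

Mapping unknown input to a failing value is a deliberate choice in this sketch: a setting the exporter cannot interpret raises an alert instead of passing silently.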
257
observatory/prometheus/alerts/proofchain-alerts.yml
Normal file
@@ -0,0 +1,257 @@
|
||||
# Proofchain Alert Rules for Cloudflare Mesh Observatory
|
||||
# Phase 5B - Alerts & Escalation
|
||||
|
||||
groups:
|
||||
- name: proofchain_alerts
|
||||
interval: 60s
|
||||
rules:
|
||||
# ============================================
|
||||
# CRITICAL - Chain Integrity Failure
|
||||
# ============================================
|
||||
- alert: ProofchainIntegrityFailure
|
||||
expr: cloudflare_proofchain_integrity_valid == 0
|
||||
for: 1m
|
||||
labels:
|
||||
severity: critical
|
||||
component: proofchain
|
||||
security_incident: "true"
|
||||
annotations:
|
||||
summary: "CRITICAL: Proofchain integrity verification FAILED"
|
||||
description: |
|
||||
Proofchain {{ $labels.chain_name }} has failed integrity verification.
|
||||
|
||||
Last valid hash: {{ $labels.last_valid_hash }}
|
||||
Expected hash: {{ $labels.expected_hash }}
|
||||
Computed hash: {{ $labels.computed_hash }}
|
||||
|
||||
This indicates potential:
|
||||
- Ledger tampering
|
||||
- Receipt corruption
|
||||
- Chain fork
|
||||
|
||||
IMMEDIATELY HALT new receipt generation until resolved.
|
||||
impact: "Audit trail integrity compromised"
|
||||
runbook_url: "https://wiki.internal/playbooks/proofchain-incident"
|
||||
|
||||
# ============================================
|
||||
# CRITICAL - Receipt Hash Mismatch
|
||||
# ============================================
|
||||
- alert: ReceiptHashMismatch
|
||||
expr: cloudflare_receipt_hash_valid == 0
|
||||
for: 1m
|
||||
labels:
|
||||
severity: critical
|
||||
component: proofchain
|
||||
security_incident: "true"
|
||||
annotations:
|
||||
summary: "Receipt hash mismatch detected"
|
||||
description: |
|
||||
Receipt {{ $labels.receipt_id }} ({{ $labels.receipt_type }})
|
||||
hash does not match stored value.
|
||||
|
||||
This receipt may have been modified after creation.
|
||||
Investigate for potential tampering.
|
||||
runbook_url: "https://wiki.internal/playbooks/proofchain-incident"
|
||||
|
||||
# ============================================
|
||||
# CRITICAL - Anchor Missing
|
||||
# ============================================
|
||||
- alert: ProofchainAnchorMissing
|
||||
expr: cloudflare_proofchain_anchor_age_hours > 24
|
||||
for: 1h
|
||||
labels:
|
||||
severity: critical
|
||||
component: proofchain
|
||||
annotations:
|
||||
summary: "Proofchain anchor overdue"
|
||||
description: |
|
||||
No proofchain anchor has been created in {{ $value | humanize }} hours.
|
||||
Anchors should be created at least daily.
|
||||
|
||||
This weakens the audit trail's immutability guarantees.
|
||||
runbook_url: "https://wiki.internal/playbooks/proofchain-maintenance"
|
||||
|
||||
# ============================================
|
||||
# WARNING - Receipt Generation Failed
|
||||
# ============================================
|
||||
- alert: ReceiptGenerationFailed
|
||||
expr: increase(cloudflare_receipt_generation_failures_total[1h]) > 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
component: proofchain
|
||||
annotations:
|
||||
summary: "Receipt generation failures detected"
|
||||
description: |
|
||||
{{ $value }} receipt generation failures in the last hour.
|
||||
Receipt type: {{ $labels.receipt_type }}
|
||||
Error: {{ $labels.error_type }}
|
||||
|
||||
Operations are proceeding but not being properly logged.
|
||||
|
||||
# ============================================
|
||||
# WARNING - Chain Growth Stalled
|
||||
# ============================================
|
||||
- alert: ProofchainGrowthStalled
|
||||
expr: increase(cloudflare_proofchain_receipts_total[6h]) == 0
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
component: proofchain
|
||||
annotations:
|
||||
summary: "No new receipts in 6 hours"
|
||||
description: |
|
||||
Proofchain {{ $labels.chain_name }} has not received new receipts
|
||||
in 6 hours. This may indicate:
|
||||
- Receipt generation failure
|
||||
- System not operational
|
||||
- Configuration issue
|
||||
|
||||
Verify receipt generation is working.
|
||||
|
||||
# ============================================
|
||||
# WARNING - Chain Drift from Root
|
||||
# ============================================
|
||||
- alert: ProofchainDrift
|
||||
expr: cloudflare_proofchain_drift_receipts > 100
|
||||
for: 1h
|
||||
labels:
|
||||
severity: warning
|
||||
component: proofchain
|
||||
annotations:
|
||||
summary: "Proofchain has {{ $value }} unanchored receipts"
|
||||
description: |
|
||||
Chain {{ $labels.chain_name }} has {{ $value }} receipts since
|
||||
the last anchor. Consider creating a new anchor to checkpoint
|
||||
the current state.
|
||||
|
||||
# ============================================
|
||||
# INFO - Anchor Created
|
||||
# ============================================
|
||||
- alert: ProofchainAnchorCreated
|
||||
expr: changes(cloudflare_proofchain_anchor_count[1h]) > 0
|
||||
for: 0m
|
||||
labels:
|
||||
severity: info
|
||||
component: proofchain
|
||||
annotations:
|
||||
summary: "New proofchain anchor created"
|
||||
description: |
|
||||
A new anchor has been created for chain {{ $labels.chain_name }}.
|
||||
Anchor hash: {{ $labels.anchor_hash }}
|
||||
Receipts anchored: {{ $labels.receipts_anchored }}
|
||||
|
||||
# ============================================
|
||||
# CRITICAL - Frontier Corruption
|
||||
# ============================================
|
||||
- alert: ProofchainFrontierCorrupt
|
||||
expr: cloudflare_proofchain_frontier_valid == 0
|
||||
for: 1m
|
||||
labels:
|
||||
severity: critical
|
||||
component: proofchain
|
||||
annotations:
|
||||
summary: "Proofchain frontier is corrupt"
|
||||
description: |
|
||||
The frontier (latest state) of chain {{ $labels.chain_name }}
|
||||
cannot be verified. The chain may be in an inconsistent state.
|
||||
|
||||
Do not append new receipts until this is resolved.
|
||||
runbook_url: "https://wiki.internal/playbooks/proofchain-incident"
|
||||
|
||||
# ============================================
|
||||
# WARNING - Receipt Backlog
|
||||
# ============================================
|
||||
- alert: ReceiptBacklog
|
||||
expr: cloudflare_receipt_queue_depth > 100
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
component: proofchain
|
||||
annotations:
|
||||
summary: "Receipt generation backlog"
|
||||
description: |
|
||||
{{ $value }} receipts waiting to be written.
|
||||
This may indicate performance issues or blocked writes.
|
||||
|
||||
# ============================================
|
||||
# CRITICAL - Receipt Queue Overflow
|
||||
# ============================================
|
||||
- alert: ReceiptQueueOverflow
|
||||
expr: cloudflare_receipt_queue_depth > 1000
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
component: proofchain
|
||||
annotations:
|
||||
summary: "Receipt queue overflow imminent"
|
||||
description: |
|
||||
{{ $value }} receipts in queue. Queue may overflow.
|
||||
Some operational events may not be recorded.
|
||||
Investigate and resolve immediately.
|
||||
|
||||
      # ============================================
      # WARNING - Receipt Write Latency High
      # ============================================
      - alert: ReceiptWriteLatencyHigh
        expr: cloudflare_receipt_write_duration_seconds > 5
        for: 5m
        labels:
          severity: warning
          component: proofchain
        annotations:
          summary: "High receipt write latency"
          description: |
            Receipt write operations taking {{ $value | humanize }}s.
            This may cause backlog buildup.
            Check storage performance.

      # ============================================
      # CRITICAL - Storage Near Capacity
      # ============================================
      - alert: ProofchainStorageNearFull
        expr: cloudflare_proofchain_storage_used_bytes / cloudflare_proofchain_storage_total_bytes > 0.9
        for: 1h
        labels:
          severity: critical
          component: proofchain
        annotations:
          summary: "Proofchain storage >90% full"
          description: |
            Proofchain storage is {{ $value | humanizePercentage }} full.
            Expand storage or archive old receipts immediately.

      # ============================================
      # WARNING - Cross-Ledger Verification Failed
      # ============================================
      - alert: CrossLedgerVerificationFailed
        expr: cloudflare_proofchain_cross_verification_valid == 0
        for: 5m
        labels:
          severity: warning
          component: proofchain
        annotations:
          summary: "Cross-ledger verification failed"
          description: |
            Verification between {{ $labels.chain_a }} and {{ $labels.chain_b }}
            has failed. The ledgers may have diverged.

            Investigate the root cause before proceeding.

      # ============================================
      # INFO - Receipt Type Distribution Anomaly
      # ============================================
      - alert: ReceiptDistributionAnomaly
        # sum() on both sides divides the anomaly rate by the total across all
        # types. Without it, one-to-one vector matching pairs the
        # type="anomaly" series only with itself and the ratio is always 1.
        expr: |
          (sum(rate(cloudflare_receipts_by_type_total{type="anomaly"}[1h]))
            / sum(rate(cloudflare_receipts_by_type_total[1h]))) > 0.5
        for: 1h
        labels:
          severity: info
          component: proofchain
        annotations:
          summary: "High proportion of anomaly receipts"
          description: |
            More than 50% of recent receipts are anomaly type.
            This may indicate systemic issues being logged.
            Review recent anomaly receipts for patterns.
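The `ReceiptDistributionAnomaly` rule is meant to fire when anomaly receipts make up more than half of all recent receipts. The ratio it targets can be sketched in Python as below. This is only an illustration: the function name `anomaly_share` and the idea of passing per-type rates as a plain dict are assumptions, not part of this repository.

```python
def anomaly_share(rates_by_type: dict) -> float:
    """Return the fraction of the total receipt rate that has type 'anomaly'.

    rates_by_type maps a receipt type to its per-second rate (hypothetical input).
    """
    total = sum(rates_by_type.values())
    if total == 0:
        # Assumption: treat "no receipts at all" as a share of zero.
        return 0.0
    return rates_by_type.get("anomaly", 0.0) / total


if __name__ == "__main__":
    share = anomaly_share({"anomaly": 3.0, "deploy": 1.0})
    print(f"anomaly share: {share:.2f}, alert condition met: {share > 0.5}")
```

The numerator and denominator are both totals across every series, which is why the share can be compared directly against 0.5.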
210
observatory/prometheus/alerts/tunnel-alerts.yml
Normal file
@@ -0,0 +1,210 @@
# Tunnel Alert Rules for Cloudflare Mesh Observatory
# Phase 5B - Alerts & Escalation

groups:
  - name: tunnel_alerts
    interval: 30s
    rules:
      # ============================================
      # CRITICAL - Tunnel Down
      # ============================================
      - alert: TunnelDown
        expr: cloudflare_tunnel_status == 0
        for: 2m
        labels:
          severity: critical
          component: tunnel
          playbook: tunnel-rotation
        annotations:
          summary: "Cloudflare Tunnel {{ $labels.tunnel_name }} is DOWN"
          description: |
            Tunnel {{ $labels.tunnel_name }} (ID: {{ $labels.tunnel_id }}) has been
            unreachable for more than 2 minutes. Services behind this tunnel are
            likely unreachable.
          impact: "Services behind tunnel are unreachable from the internet"
          runbook_url: "https://wiki.internal/playbooks/tunnel-rotation"

      # ============================================
      # CRITICAL - All Tunnels Down
      # ============================================
      - alert: AllTunnelsDown
        # count() over an empty vector returns no result, so
        # "count(cloudflare_tunnel_status == 1) == 0" can never fire.
        # Assuming cloudflare_tunnel_status is a 0/1 gauge, max() == 0 is
        # true only when every reported tunnel is down.
        expr: max(cloudflare_tunnel_status) == 0
        for: 1m
        labels:
          severity: critical
          component: tunnel
          playbook: tunnel-rotation
        annotations:
          summary: "ALL Cloudflare Tunnels are DOWN"
          description: |
            No healthy tunnels detected. Complete loss of tunnel connectivity.
            This is a P0 incident requiring immediate attention.
          impact: "Complete loss of external connectivity via tunnels"
          runbook_url: "https://wiki.internal/playbooks/tunnel-rotation"

      # ============================================
      # WARNING - Tunnel Degraded
      # ============================================
      - alert: TunnelDegraded
        expr: cloudflare_tunnel_connections < 2
        for: 5m
        labels:
          severity: warning
          component: tunnel
        annotations:
          summary: "Tunnel {{ $labels.tunnel_name }} has reduced connections"
          description: |
            Tunnel {{ $labels.tunnel_name }} has fewer than 2 active connections.
            This may indicate network issues or cloudflared problems.
          runbook_url: "https://wiki.internal/playbooks/tunnel-rotation"

      # ============================================
      # WARNING - Tunnel Rotation Due
      # ============================================
      - alert: TunnelRotationDue
        expr: (time() - cloudflare_tunnel_created_timestamp) > (86400 * 30)
        for: 1h
        labels:
          severity: warning
          component: tunnel
          playbook: tunnel-rotation
        annotations:
          summary: "Tunnel {{ $labels.tunnel_name }} rotation is due"
          description: |
            Tunnel {{ $labels.tunnel_name }} was created more than 30 days ago.
            Per security policy, tunnels should be rotated monthly.
            Age: {{ $value | humanizeDuration }}
          runbook_url: "https://wiki.internal/playbooks/tunnel-rotation"

      # ============================================
      # CRITICAL - Tunnel Rotation Overdue
      # ============================================
      - alert: TunnelRotationOverdue
        expr: (time() - cloudflare_tunnel_created_timestamp) > (86400 * 45)
        for: 1h
        labels:
          severity: critical
          component: tunnel
          playbook: tunnel-rotation
        annotations:
          summary: "Tunnel {{ $labels.tunnel_name }} rotation is OVERDUE"
          description: |
            Tunnel {{ $labels.tunnel_name }} is more than 45 days old.
            This exceeds the maximum rotation interval and represents a
            security policy violation.
            Age: {{ $value | humanizeDuration }}
          runbook_url: "https://wiki.internal/playbooks/tunnel-rotation"

      # ============================================
      # WARNING - Tunnel High Latency
      # ============================================
      - alert: TunnelHighLatency
        expr: cloudflare_tunnel_latency_ms > 500
        for: 5m
        labels:
          severity: warning
          component: tunnel
        annotations:
          summary: "High latency on tunnel {{ $labels.tunnel_name }}"
          description: |
            Tunnel {{ $labels.tunnel_name }} is experiencing latency above 500ms.
            Current latency: {{ $value }}ms
            This may impact user experience.

      # ============================================
      # CRITICAL - Tunnel Very High Latency
      # ============================================
      - alert: TunnelVeryHighLatency
        expr: cloudflare_tunnel_latency_ms > 2000
        for: 2m
        labels:
          severity: critical
          component: tunnel
        annotations:
          summary: "Critical latency on tunnel {{ $labels.tunnel_name }}"
          description: |
            Tunnel {{ $labels.tunnel_name }} latency exceeds 2000ms.
            Current latency: {{ $value }}ms
            Services may be timing out.

      # ============================================
      # WARNING - Tunnel Error Rate High
      # ============================================
      - alert: TunnelHighErrorRate
        expr: |
          rate(cloudflare_tunnel_errors_total[5m])
            / rate(cloudflare_tunnel_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
          component: tunnel
        annotations:
          summary: "High error rate on tunnel {{ $labels.tunnel_name }}"
          description: |
            Tunnel {{ $labels.tunnel_name }} error rate exceeds 5%.
            Current error rate: {{ $value | humanizePercentage }}

      # ============================================
      # CRITICAL - Tunnel Error Rate Critical
      # ============================================
      - alert: TunnelCriticalErrorRate
        expr: |
          rate(cloudflare_tunnel_errors_total[5m])
            / rate(cloudflare_tunnel_requests_total[5m]) > 0.20
        for: 2m
        labels:
          severity: critical
          component: tunnel
        annotations:
          summary: "Critical error rate on tunnel {{ $labels.tunnel_name }}"
          description: |
            Tunnel {{ $labels.tunnel_name }} error rate exceeds 20%.
            Current error rate: {{ $value | humanizePercentage }}
            This indicates severe connectivity issues.

      # ============================================
      # INFO - Tunnel Configuration Changed
      # ============================================
      - alert: TunnelConfigChanged
        expr: changes(cloudflare_tunnel_config_hash[1h]) > 0
        for: 0m
        labels:
          severity: info
          component: tunnel
        annotations:
          summary: "Tunnel {{ $labels.tunnel_name }} configuration changed"
          description: |
            The configuration for tunnel {{ $labels.tunnel_name }} has changed
            in the last hour. Verify this was an authorized change.

      # ============================================
      # WARNING - Cloudflared Version Outdated
      # ============================================
      - alert: CloudflaredOutdated
        expr: cloudflare_cloudflared_version_age_days > 90
        for: 24h
        labels:
          severity: warning
          component: tunnel
        annotations:
          summary: "cloudflared version is outdated"
          description: |
            The cloudflared binary is more than 90 days old.
            Current version age: {{ $value }} days
            Consider upgrading to the latest version for security patches.

      # ============================================
      # WARNING - Tunnel Connection Flapping
      # ============================================
      - alert: TunnelConnectionFlapping
        expr: changes(cloudflare_tunnel_status[10m]) > 3
        for: 10m
        labels:
          severity: warning
          component: tunnel
        annotations:
          summary: "Tunnel {{ $labels.tunnel_name }} is flapping"
          description: |
            Tunnel {{ $labels.tunnel_name }} has changed state {{ $value }} times
            in the last 10 minutes. This indicates instability.
            Check network connectivity and cloudflared logs.
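The two rotation rules in `tunnel-alerts.yml` compare tunnel age against 30-day and 45-day thresholds. A minimal Python sketch of the same arithmetic follows, assuming the creation time is a Unix timestamp like the one `cloudflare_tunnel_created_timestamp` appears to expose. The function name `rotation_state` and its return values are hypothetical and do not exist in the repository.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86400
ROTATION_DUE_DAYS = 30      # threshold used by TunnelRotationDue
ROTATION_OVERDUE_DAYS = 45  # threshold used by TunnelRotationOverdue


def rotation_state(created_timestamp: float, now: Optional[float] = None) -> str:
    """Classify a tunnel as 'ok', 'due', or 'overdue' by its age in seconds."""
    now = time.time() if now is None else now
    age_seconds = now - created_timestamp
    # Strict comparisons, matching the ">" used in the alert expressions.
    if age_seconds > SECONDS_PER_DAY * ROTATION_OVERDUE_DAYS:
        return "overdue"
    if age_seconds > SECONDS_PER_DAY * ROTATION_DUE_DAYS:
        return "due"
    return "ok"


if __name__ == "__main__":
    created = time.time() - 31 * SECONDS_PER_DAY
    print(rotation_state(created))
```

Unlike the sketch, which returns only the most severe state, both alert expressions are true once a tunnel is older than 45 days, so `TunnelRotationDue` and `TunnelRotationOverdue` would fire together unless one is inhibited elsewhere. The sketch also ignores the `for: 1h` pending period.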
266
observatory/prometheus/alerts/waf-alerts.yml
Normal file
@@ -0,0 +1,266 @@
# WAF Alert Rules for Cloudflare Mesh Observatory
# Phase 5B - Alerts & Escalation

groups:
  - name: waf_alerts
    interval: 30s
    rules:
      # ============================================
      # CRITICAL - Massive Attack Detected
      # ============================================
      - alert: WAFMassiveAttack
        expr: |
          rate(cloudflare_waf_blocked_requests_total[5m]) > 1000
        for: 2m
        labels:
          severity: critical
          component: waf
          playbook: waf-incident
        annotations:
          summary: "Massive attack detected - {{ $value | humanize }} blocks/sec"
          description: |
            WAF is blocking more than 1000 requests per second.
            This indicates a significant attack in progress.

            Consider enabling Under Attack Mode if not already active.
          impact: "Potential service degradation under attack load"
          current_mitigation: "WAF blocking enabled"
          runbook_url: "https://wiki.internal/playbooks/waf-incident"

      # ============================================
      # CRITICAL - WAF Rule Bypass Detected
      # ============================================
      - alert: WAFRuleBypass
        expr: cloudflare_waf_bypass_detected == 1
        for: 1m
        labels:
          severity: critical
          component: waf
          playbook: waf-incident
          security_incident: "true"
        annotations:
          summary: "WAF rule bypass detected for rule {{ $labels.rule_id }}"
          description: |
            Malicious traffic matching known attack patterns has bypassed
            WAF rule {{ $labels.rule_id }}.

            Attack type: {{ $labels.attack_type }}
            Bypassed requests: {{ $labels.bypass_count }}

            Review and tighten rule immediately.
          runbook_url: "https://wiki.internal/playbooks/waf-incident"

      # ============================================
      # WARNING - Attack Spike
      # ============================================
      - alert: WAFAttackSpike
        expr: |
          rate(cloudflare_waf_blocked_requests_total[5m])
            > 5 * avg_over_time(rate(cloudflare_waf_blocked_requests_total[5m])[24h:5m])
        for: 5m
        labels:
          severity: warning
          component: waf
        annotations:
          summary: "WAF block rate 5x above normal"
          description: |
            WAF is blocking significantly more requests than the 24-hour average.
            Current rate: {{ $value | humanize }}/s

            This may indicate an attack or new attack pattern.

      # ============================================
      # WARNING - SQL Injection Attempts
      # ============================================
      - alert: WAFSQLiAttack
        expr: rate(cloudflare_waf_sqli_blocks_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
          component: waf
          attack_type: sqli
        annotations:
          summary: "SQL injection attack detected"
          description: |
            WAF is blocking SQL injection attempts at {{ $value | humanize }}/s.
            Source IPs may need to be blocked at firewall level.

      # ============================================
      # WARNING - XSS Attempts
      # ============================================
      - alert: WAFXSSAttack
        expr: rate(cloudflare_waf_xss_blocks_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
          component: waf
          attack_type: xss
        annotations:
          summary: "XSS attack detected"
          description: |
            WAF is blocking cross-site scripting attempts at {{ $value | humanize }}/s.
            Review application input validation.

      # ============================================
      # WARNING - Bot Attack
      # ============================================
      - alert: WAFBotAttack
        expr: rate(cloudflare_waf_bot_blocks_total[5m]) > 100
        for: 5m
        labels:
          severity: warning
          component: waf
          attack_type: bot
        annotations:
          summary: "High bot traffic detected"
          description: |
            WAF is blocking bot traffic at {{ $value | humanize }}/s.
            Consider enabling Bot Fight Mode or stricter challenges.

      # ============================================
      # CRITICAL - Rate Limit Exhaustion
      # ============================================
      - alert: WAFRateLimitExhausted
        expr: cloudflare_waf_rate_limit_triggered == 1
        for: 1m
        labels:
          severity: critical
          component: waf
        annotations:
          summary: "Rate limit triggered for {{ $labels.rule_name }}"
          description: |
            Rate limiting rule {{ $labels.rule_name }} has been triggered.
            Source: {{ $labels.source_ip }}
            Requests blocked: {{ $labels.blocked_count }}

            Legitimate users may be affected.

      # ============================================
      # WARNING - WAF Rule Disabled
      # ============================================
      - alert: WAFRuleDisabled
        expr: cloudflare_waf_rule_enabled == 0
        for: 5m
        labels:
          severity: warning
          component: waf
        annotations:
          summary: "WAF rule {{ $labels.rule_id }} is disabled"
          description: |
            WAF rule {{ $labels.rule_id }} ({{ $labels.rule_name }}) is currently disabled.
            Verify this is intentional and not a misconfiguration.

      # ============================================
      # WARNING - WAF Mode Changed
      # ============================================
      - alert: WAFModeChanged
        expr: changes(cloudflare_waf_mode[1h]) > 0
        for: 0m
        labels:
          severity: warning
          component: waf
        annotations:
          summary: "WAF mode changed for zone {{ $labels.zone }}"
          description: |
            WAF operation mode has changed in the last hour.
            New mode: {{ $labels.mode }}
            Verify this was an authorized change.

      # ============================================
      # INFO - Under Attack Mode Active
      # ============================================
      - alert: UnderAttackModeActive
        expr: cloudflare_zone_under_attack == 1
        for: 0m
        labels:
          severity: info
          component: waf
        annotations:
          summary: "Under Attack Mode is ACTIVE for {{ $labels.zone }}"
          description: |
            Under Attack Mode is currently enabled for zone {{ $labels.zone }}.
            This adds a JavaScript challenge to all visitors.
            Remember to disable it once the attack subsides.

      # ============================================
      # WARNING - Under Attack Mode Extended
      # ============================================
      - alert: UnderAttackModeExtended
        expr: cloudflare_zone_under_attack == 1
        for: 2h
        labels:
          severity: warning
          component: waf
        annotations:
          summary: "Under Attack Mode active for 2+ hours"
          description: |
            Under Attack Mode has been active for {{ $labels.zone }} for more
            than 2 hours. Verify it's still needed, as it impacts user experience.

      # ============================================
      # CRITICAL - WAF Completely Disabled
      # ============================================
      - alert: WAFDisabled
        expr: cloudflare_waf_enabled == 0
        for: 5m
        labels:
          severity: critical
          component: waf
        annotations:
          summary: "WAF is DISABLED for zone {{ $labels.zone }}"
          description: |
            The Web Application Firewall is completely disabled for {{ $labels.zone }}.
            This leaves the zone unprotected against application-layer attacks.

            Enable immediately unless there's a documented exception.

      # ============================================
      # INFO - Low WAF Efficacy
      # ============================================
      - alert: WAFLowEfficacy
        # rate() over the last hour, so the ratio reflects recent traffic
        # rather than lifetime counter totals.
        expr: |
          rate(cloudflare_waf_blocked_requests_total[1h])
            / rate(cloudflare_waf_analyzed_requests_total[1h]) < 0.001
        for: 1h
        labels:
          severity: info
          component: waf
        annotations:
          summary: "Low WAF block rate for {{ $labels.zone }}"
          description: |
            WAF is blocking very few requests (< 0.1%).
            This might indicate rules are too permissive or
            the zone is not receiving attack traffic.

      # ============================================
      # WARNING - Firewall Rule Missing
      # ============================================
      - alert: FirewallRuleMissing
        expr: cloudflare_firewall_critical_rule_exists == 0
        for: 5m
        labels:
          severity: warning
          component: waf
        annotations:
          summary: "Critical firewall rule missing: {{ $labels.rule_name }}"
          description: |
            Expected firewall rule {{ $labels.rule_name }} is not configured.
            This rule is marked as critical in the WAF baseline.

      # ============================================
      # WARNING - High False Positive Rate
      # ============================================
      - alert: WAFHighFalsePositives
        expr: |
          rate(cloudflare_waf_false_positives_total[1h])
            / rate(cloudflare_waf_blocked_requests_total[1h]) > 0.1
        for: 1h
        labels:
          severity: warning
          component: waf
        annotations:
          summary: "High WAF false positive rate"
          description: |
            WAF false positive rate exceeds 10%.
            Current rate: {{ $value | humanizePercentage }}
            Review and tune rules to reduce legitimate traffic blocking.
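The `WAFAttackSpike` rule in `waf-alerts.yml` compares the current block rate against five times its 24-hour average. The comparison can be sketched in Python as follows, assuming the 24-hour history is available as a list of rate samples. The function name `is_attack_spike` and the sample values are hypothetical and are not part of the repository.

```python
from statistics import fmean
from typing import List

SPIKE_FACTOR = 5  # WAFAttackSpike fires above 5x the 24-hour average


def is_attack_spike(current_rate: float, samples_24h: List[float]) -> bool:
    """Return True when current_rate exceeds SPIKE_FACTOR times the mean of samples_24h."""
    if not samples_24h:
        # Assumption: with no history there is no baseline, so report no spike.
        return False
    baseline = fmean(samples_24h)
    return current_rate > SPIKE_FACTOR * baseline


if __name__ == "__main__":
    history = [10.0, 12.0, 8.0]  # blocks per second, illustrative only
    print(is_attack_spike(75.0, history))
```

Because the baseline is a plain average, a sustained attack raises it over time and makes the spike condition harder to meet. That is one reason the rules file pairs this relative check with the absolute `WAFMassiveAttack` threshold.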