Phase 0 Operational Monitoring
Status: Configured 2026-02-14
Applies to: Phase 0 pilot deployment on K3s
Overview
The ICN daemon exposes Prometheus metrics on port 9100. A ServiceMonitor resource configures automated scraping of that endpoint, and a PrometheusRule resource defines alerts for critical security and operational events.
Alert Coverage
1. Byzantine Detection (icn.byzantine_detection)
| Alert | Metric | Threshold | Severity | Meaning |
|---|---|---|---|---|
| ICNByzantineNodeQuarantined | icn_misbehavior_quarantined_peers | > 0 for 1m | warning | Node quarantined for protocol violations |
| ICNByzantineNodeAutoBanned | increase(icn_misbehavior_auto_bans_total[5m]) | > 0 | critical | Critical violation triggered auto-ban |
| ICNHighViolationRate | rate(icn_misbehavior_violations_total[5m]) | > 1/sec for 5m | warning | High Byzantine violation rate |
Test:
# Trigger violation metric (requires special test mode)
kubectl exec -it -n icn deployment/icn-daemon -- \
curl -X POST localhost:9100/test/trigger_violation
2. Authentication & Authorization (icn.auth_security)
| Alert | Metric | Threshold | Severity | Meaning |
|---|---|---|---|---|
| ICNHighAuthFailureRate | rate(icn_gateway_auth_failures_total[5m]) | > 0.5/sec for 2m | warning | Many failed login attempts |
| ICNAuthenticationAttack | rate(icn_gateway_auth_failures_total[1m]) | > 5/sec | critical | Possible credential stuffing attack |
| ICNAuthorizationFailures | rate(icn_gateway_authorization_failures_total[5m]) | > 1/sec for 2m | warning | Users hitting permission boundaries |
| ICNRateLimitAttack | rate(icn_gateway_rate_limit_exceeded_total[1m]) | > 10/sec for 1m | warning | Excessive API abuse |
Test:
# Trigger auth failure metric
for i in {1..10}; do
curl -X POST http://gateway:8080/v1/auth/verify \
-H "Content-Type: application/json" \
-d '{"did": "did:icn:invalid", "challenge_id": "bad", "signature": "fake"}'
done
# Check metric
curl -s http://gateway:9100/metrics | grep icn_gateway_auth_failures_total
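The grep above prints one line per label combination, so the total has to be added up by hand. A minimal sketch of summing those lines, assuming the endpoint serves the standard Prometheus text format (`name{labels} value`, no trailing timestamps); `sum_metric` is a hypothetical helper name and not part of the ICN tooling:

```bash
# sum_metric NAME: read Prometheus text-format metrics on stdin and print the
# sum of all samples whose metric name is exactly NAME.
# Hypothetical helper for illustration; assumes lines carry no timestamps.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)           # drop the label part after the name
      if (metric == name) total += $NF  # last field is the sample value
    }
    END { print total + 0 }
  '
}
```

Usage would look like `curl -s http://gateway:9100/metrics | sum_metric icn_gateway_auth_failures_total`, run once before and once after the loop to confirm the counter moved by the expected amount.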
3. Network Health (icn.network_health)
| Alert | Metric | Threshold | Severity | Meaning |
|---|---|---|---|---|
| ICNNetworkPartition | icn_network_connections_active | < 2 for 2m | warning | Possible network partition |
| ICNNodeIsolated | icn_network_connections_active | == 0 for 1m | critical | Node has no network connections |
| ICNHighMessageFailureRate | rate(icn_network_messages_failed_total[5m]) | > 0.1/sec for 5m | warning | Network message failures |
Test:
# Check current connections
curl -s http://gateway:9100/metrics | grep icn_network_connections_active
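To read the value against the thresholds in the table, the following sketch maps a connection count to the alert condition it would satisfy. It only models the comparison, not the `for` durations, and `connection_status` is a hypothetical helper name:

```bash
# connection_status COUNT: report which network-health condition a given
# icn_network_connections_active value satisfies (thresholds from the table).
# Hypothetical helper; the 1m/2m "for" durations are not modeled here.
connection_status() {
  awk -v n="$1" 'BEGIN {
    n = n + 0                               # force numeric, accepts 3 or 3.0
    if (n == 0)     print "isolated"        # ICNNodeIsolated (== 0)
    else if (n < 2) print "partition-risk"  # ICNNetworkPartition (< 2)
    else            print "ok"
  }'
}
```

A count of 0 satisfies both conditions; the sketch reports only the more severe one.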
4. Ledger Consistency (icn.ledger_consistency)
| Alert | Metric | Threshold | Severity | Meaning |
|---|---|---|---|---|
| ICNLedgerEntriesQuarantined | icn_ledger_entries_quarantined | > 0 for 1m | critical | Ledger fork detected |
| ICNLedgerBalanceInconsistency | abs(icn_ledger_balances_total) | > 0.01 | critical | Sum of balances non-zero |
Test:
# Check ledger balance sum
curl -s http://gateway:9100/metrics | grep icn_ledger_balances_total
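The alert condition is an absolute-value comparison, so a small negative sum is as acceptable as a small positive one. A sketch of the same check in shell, with the 0.01 tolerance taken from the table; `balance_consistent` is a hypothetical helper name:

```bash
# balance_consistent VALUE [TOLERANCE]: exit 0 when |VALUE| <= TOLERANCE,
# exit 1 otherwise. TOLERANCE defaults to 0.01, the alert threshold above.
# Hypothetical helper for illustration.
balance_consistent() {
  awk -v v="$1" -v tol="${2:-0.01}" 'BEGIN {
    v = v + 0; tol = tol + 0   # force numeric comparison
    if (v < 0) v = -v          # absolute value
    exit (v > tol ? 1 : 0)
  }'
}
```

For example, `balance_consistent -0.005 && echo consistent` prints `consistent`, while a value of 0.02 makes the function return a non-zero status.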
5. Compute Layer (icn.compute_layer)
| Alert | Metric | Threshold | Severity | Meaning |
|---|---|---|---|---|
| ICNComputeTaskTimeout | rate(icn_compute_tasks_timeout_total[5m]) | > 0.1/sec for 5m | warning | Compute tasks timing out |
| ICNComputeSignatureFailures | rate(icn_compute_signatures_invalid_total[5m]) | > 0 | critical | Invalid compute signatures (Byzantine executor) |
6. System Resources (icn.system_resources)
| Alert | Metric | Threshold | Severity | Meaning |
|---|---|---|---|---|
| ICNHighMemoryUsage | process_resident_memory_bytes / 1GB | > 2GB for 10m | warning | High memory usage |
| ICNMemoryLeak | rate(process_resident_memory_bytes[1h]) | > 1MB/sec for 1h | critical | Possible memory leak |
Accessing Monitoring
Prometheus
K3s (via port-forward):
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090
Query Examples:
# Auth failures in last 5 minutes
rate(icn_gateway_auth_failures_total[5m])
# Byzantine violations by type
sum by (violation_type) (icn_misbehavior_violations_total)
# Active network connections
icn_network_connections_active
# Governance proposals closed
icn_gateway_governance_proposals_closed_total
Grafana
K3s (via port-forward):
kubectl port-forward -n monitoring svc/grafana 3000:80
# Open http://localhost:3000
# Login: admin / <password from secret>
Get Grafana password:
kubectl get secret -n monitoring grafana -o jsonpath="{.data.admin-password}" | base64 -d
Pre-configured Dashboards:
- ICN Overview (dashboard ID: icn-overview)
- Byzantine Detection (dashboard ID: icn-byzantine)
- Gateway API Metrics (dashboard ID: icn-gateway)
Alert Routing
Alerts are sent to:
- Prometheus Alertmanager (if configured)
- K3s Events (via kube-state-metrics)
- Logs (via tracing span)
To configure Slack/Email alerting, edit deploy/k8s/alertmanager-config.yaml.
Testing Alert Flow
1. Test Auth Failure Alert
# Generate 10 auth failures quickly
for i in {1..10}; do
curl -X POST http://10.8.30.40:30080/v1/auth/verify \
-H "Content-Type: application/json" \
-d "{\"did\": \"did:icn:test$i\", \"challenge_id\": \"bad\", \"signature\": \"fake\"}" &
done
wait
# Wait 2 minutes (for: 2m)
sleep 120
# Check Prometheus for firing alert
curl -s 'http://localhost:9090/api/v1/alerts' | jq '.data.alerts[] | select(.labels.alertname == "ICNHighAuthFailureRate")'
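Whether a burst is large enough to fire the alert depends on the rule's window. As rough arithmetic, ignoring how Prometheus extrapolates at window edges, a counter that grows by N within a window of W seconds gives an average rate of about N/W. Ten failures in a 5m window is roughly 0.033/sec, well below the 0.5/sec threshold listed for ICNHighAuthFailureRate, so this test may need a larger burst. The sketch below computes the smallest burst that exceeds a threshold; `min_events_for_rate` is a hypothetical helper name:

```bash
# min_events_for_rate THRESHOLD_PER_SEC WINDOW_SECONDS
# Print the smallest whole number of events N with N / WINDOW > THRESHOLD.
# Hypothetical helper; approximates rate() as a plain average over the window.
min_events_for_rate() {
  awk -v t="$1" -v w="$2" 'BEGIN {
    print int(t * w) + 1   # one more than the whole part of threshold * window
  }'
}
```

With the thresholds from the Authentication & Authorization table, `min_events_for_rate 0.5 300` gives 151 for ICNHighAuthFailureRate and `min_events_for_rate 5 60` gives 301 for ICNAuthenticationAttack. The rate also has to stay above the threshold for the rule's `for` duration, and exact results depend on scrape timing.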
2. Test Byzantine Quarantine Alert
# Check metric exists
kubectl exec -n icn deployment/icn-daemon -- \
curl -s localhost:9100/metrics | grep icn_misbehavior_quarantined_peers
# If metric > 0, alert should fire after 1 minute
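Rather than waiting a fixed interval and checking once, the check can be polled until the alert appears or a timeout expires. A minimal sketch, assuming `jq` is installed and Prometheus is port-forwarded to localhost:9090 as described under Accessing Monitoring; `wait_until` and `alert_firing` are hypothetical helper names:

```bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-run COMMAND every INTERVAL seconds until it succeeds or TIMEOUT elapses.
# Returns 0 on success, 1 on timeout. Hypothetical helper for illustration.
wait_until() {
  local timeout=$1 interval=$2 elapsed=0
  shift 2
  while ! "$@"; do
    elapsed=$((elapsed + interval))
    [ "$elapsed" -gt "$timeout" ] && return 1
    sleep "$interval"
  done
  return 0
}

# Example predicate (commented out because it needs a live Prometheus):
# alert_firing() {
#   curl -s 'http://localhost:9090/api/v1/alerts' |
#     jq -e --arg n "$1" \
#       '.data.alerts[] | select(.labels.alertname == $n and .state == "firing")' \
#       >/dev/null
# }
# wait_until 180 10 alert_firing ICNByzantineNodeQuarantined
```

The 180-second timeout in the commented example is an arbitrary choice that leaves headroom over the rule's 1m `for` duration plus scrape and evaluation delays.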
3. Test Signature Failure Alert
# This requires actual compute task execution with invalid signature
# See Phase 0 compute validation tests
Metrics Endpoint
Direct access to metrics:
# From within cluster
kubectl exec -n icn deployment/icn-daemon -- curl -s localhost:9100/metrics
# Via port-forward
kubectl port-forward -n icn deployment/icn-daemon 9100:9100
curl http://localhost:9100/metrics
Key Metric Families:
- icn_gateway_auth_*: Authentication/authorization
- icn_misbehavior_*: Byzantine detection
- icn_network_*: Network connections and messages
- icn_ledger_*: Ledger operations and consistency
- icn_compute_*: Distributed compute
- icn_gateway_governance_*: Governance proposals and votes
- process_*: System resource usage (from Prometheus)
Troubleshooting
Alert Not Firing
Check metric is exposed:
kubectl exec -n icn deployment/icn-daemon -- \
curl -s localhost:9100/metrics | grep <metric_name>
Check ServiceMonitor is active:
kubectl get servicemonitor -n icn icn-daemon
kubectl describe servicemonitor -n icn icn-daemon
Check Prometheus target:
# Open http://localhost:9090/targets
# Look for icn/icn-daemon/0 target
Check PrometheusRule:
kubectl get prometheusrule -n monitoring icn-alerts
kubectl describe prometheusrule -n monitoring icn-alerts
Metric Not Incrementing
Verify code path executes:
kubectl logs -n icn deployment/icn-daemon | grep "auth failure\|quarantine\|signature"
Check metric initialization:
# Metrics must be described in init_descriptions()
# See icn-obs/src/metrics/gateway.rs
Force metric increment (if test endpoint exists):
curl -X POST http://gateway:9100/test/increment_metric \
-H "Content-Type: application/json" \
-d '{"metric": "icn_gateway_auth_failures_total", "labels": {"reason": "test"}}'
High Cardinality Warning
If you see:
FX clearing balance metric skipped: currency cardinality limit (100) reached
This is expected cardinality protection, not a failure.
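To see how close a metric is to such a limit, the distinct values of one label can be counted from the scrape output. A sketch, assuming the Prometheus text format, a label literally named `currency` (inferred from the log message), and label values without escaped quotes; `count_label_values` is a hypothetical helper name:

```bash
# count_label_values METRIC LABEL: read Prometheus text-format metrics on
# stdin and print how many distinct values LABEL takes on series of METRIC.
# Hypothetical helper; does not handle escaped quotes inside label values.
count_label_values() {
  awk -v name="$1" -v label="$2" '
    /^#/ { next }                          # skip HELP/TYPE comment lines
    {
      metric = $0
      sub(/[{ ].*/, "", metric)            # name ends at the first brace or space
      if (metric != name) next
      if (match($0, "[{,]" label "=\"[^\"]*\"")) {
        # store the label="value" pair without the leading brace or comma
        seen[substr($0, RSTART + 1, RLENGTH - 1)] = 1
      }
    }
    END { n = 0; for (k in seen) n++; print n }
  '
}
```

Usage would look like `curl -s http://localhost:9100/metrics | count_label_values <metric_name> currency`, with `<metric_name>` replaced by the FX clearing balance metric.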
Phase 0 Checklist
- ServiceMonitor deployed and scraping metrics
- PrometheusRule configured with Phase 0 alerts
- Byzantine quarantine alerts active
- Auth failure alerts active
- Signature validation alerts active
- Network partition alerts active
- Ledger consistency alerts active
- Grafana dashboards configured
- Alert routing to Slack/Email (optional for demo)
- Simulated violation test passed
- Simulated auth attack test passed