# ICN Troubleshooting Runbooks
This document provides step-by-step procedures for diagnosing and resolving common operational issues in ICN deployments.
**Audience**: ICN node operators, system administrators, SRE teams

**Related Documents**:
- Incident Response Playbook - For security incidents and critical failures
- Production Hardening - Security configuration reference
Note: Commands marked with "(future feature)" are planned but not yet implemented. These are included for completeness and will be available in future ICN releases.
## Table of Contents
- Runbook Template
- High Memory Usage
- Gossip Convergence Failure
- Ledger Sync Lag
- Trust Computation Errors
- Gateway Rate Limiting
- Node Won't Start
- Quick Reference: Key Metrics
- Escalation Paths
## Runbook Template
Each runbook follows this standard structure:
```markdown
## [Issue Name]

**Severity**: P1-P3
**Typical Duration**: Time to resolve
**Skills Required**: What expertise is needed

### Symptoms
- Observable indicators of the problem

### Detection
- How to identify this issue (metrics, logs, alerts)

### Diagnosis Steps
1. Step-by-step investigation
2. Decision tree for root cause

### Resolution Actions
- Specific commands and procedures

### Prevention
- How to avoid recurrence

### Escalation
- When and how to escalate
```
## High Memory Usage

**Severity**: P2 (can escalate to P1 if OOM)
**Typical Duration**: 15-60 minutes
**Skills Required**: Linux administration, Kubernetes basics

### Symptoms

- Pod approaching memory limits
- Slow response times
- OOMKilled container restarts
- Alertmanager: `ICNHighMemory` alert firing
- Grafana: Memory usage graph trending upward
### Detection

```bash
# Check current memory usage (K3s)
sudo kubectl -n icn top pods

# Check container limits vs usage
sudo kubectl -n icn describe pod -l app=icn | grep -A 5 "Limits:"

# Check for OOMKilled events
sudo kubectl -n icn get events --field-selector reason=OOMKilled

# Prometheus query
container_memory_working_set_bytes{namespace="icn", container="icnd"} / container_spec_memory_limit_bytes{namespace="icn", container="icnd"}
```
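If `kubectl top` is unavailable (for example, metrics-server is down), roughly the same number can be read from the container's cgroup. A minimal sketch, assuming cgroup v2 inside the pod:

```bash
# Current cgroup memory usage vs. limit (assumes cgroup v2; usage includes page cache)
sudo kubectl -n icn exec deploy/icn-daemon -- sh -c \
  'echo "memory.current=$(cat /sys/fs/cgroup/memory.current) memory.max=$(cat /sys/fs/cgroup/memory.max)"'
```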
### Diagnosis Steps

Identify memory consumers:

```bash
# SSH to pod and check memory breakdown
sudo kubectl -n icn exec deploy/icn-daemon -- sh -c "cat /proc/meminfo"

# Check Sled database cache size
sudo kubectl -n icn exec deploy/icn-daemon -- ls -lh /data/
```

Check for memory leaks (sustained growth over time):

- Review Grafana memory dashboard for trends
- Check if memory grows without corresponding workload increase
- Look for task/thread count growth

Identify workload patterns:

```bash
# Check gossip message rate
curl -s http://10.8.30.40:30090/metrics | grep icn_gossip_messages

# Check active connections
curl -s http://10.8.30.40:30090/metrics | grep icn_network_connections
```

Decision tree:

```
Memory growing steadily?
├── Yes → Possible memory leak → Collect heap profile, restart pod
└── No → Memory spikes?
    ├── Yes → Workload spike → Check gossip/ledger activity
    └── No → Normal operation → Consider increasing limits
```
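To feed the first branch of the tree with data, sample memory at intervals and look for monotonic growth. A minimal sketch (the 60s interval and 10-sample window are arbitrary choices):

```bash
# Sample pod memory every 60s for 10 minutes; steady growth across samples
# suggests a leak, isolated spikes suggest workload bursts.
for i in $(seq 1 10); do
  printf '%s ' "$(date -u +%H:%M:%S)"
  sudo kubectl -n icn top pods -l app=icn --no-headers
  sleep 60
done
```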
### Resolution Actions

Immediate relief (buy time):

```bash
# Restart pod to reclaim memory
sudo kubectl -n icn rollout restart deployment/icn-daemon
```

If Sled cache is too large:

```bash
# Check data directory size
ssh atlas "du -sh /mnt/storage/k8s/icn-data/*"
# Sled will auto-compact, but can trigger compaction via restart
```

If limits are insufficient:

```bash
# Edit deployment to increase memory limit
sudo kubectl -n icn edit deployment icn-daemon
# Change: resources.limits.memory: 2Gi → 4Gi
```

If suspected memory leak:

- Collect memory diagnostics from the host:

  ```bash
  # Capture memory map of the process
  sudo kubectl -n icn exec deploy/icn-daemon -- cat /proc/1/smaps > smaps.txt
  # Capture current metrics
  curl -s http://10.8.30.40:30090/metrics > metrics-snapshot.txt
  ```

- Report to ICN developers with:
  - Memory diagnostics (smaps, metrics snapshot)
  - Memory growth timeline from Grafana
  - Workload characteristics
  - ICN version
### Prevention

- Set appropriate limits: Base on observed peak + 20% buffer (see the sketch after this list)
- Enable memory alerting: Alert at 80% to catch before OOM
- Regular monitoring: Check weekly memory trends
- Keep ICN updated: Memory optimizations in new releases
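For the first bullet, the observed peak can be pulled from Prometheus and the 20% buffer applied in one step. A sketch assuming a reachable Prometheus API; `PROM_URL` is a placeholder:

```bash
PROM_URL="http://prometheus.example:9090"   # placeholder -- your Prometheus server
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=max_over_time(container_memory_working_set_bytes{namespace="icn",container="icnd"}[7d])' \
  | jq -r '.data.result[0].value[1]' \
  | awk '{ printf "observed 7d peak: %.0f MiB, suggested limit: %.0f MiB\n", $1/1048576, $1*1.2/1048576 }'
```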
### Escalation
- P1 escalation if: OOM restarts > 3 in 1 hour
- Contact: ICN development team via GitHub issue
- Provide: Memory profiles, Prometheus data export, container logs
## Gossip Convergence Failure

**Severity**: P2
**Typical Duration**: 30-90 minutes
**Skills Required**: Distributed systems understanding, networking basics

### Symptoms

- Nodes have divergent views of the network
- Messages not propagating to all nodes
- Entry counts differ significantly between nodes
- Dashboard shows uneven topic coverage
- Inconsistent data across cooperative members
### Detection

```bash
# Check gossip metrics
curl -s http://10.8.30.40:30090/metrics | grep icn_gossip

# Key metrics to watch:
# - icn_gossip_entries_received_total (incoming entries)
# - icn_gossip_entries_published_total (outgoing entries)
# - icn_gossip_entries_total (total stored entries)
# - icn_gossip_subscriptions_total (active subscriptions)
# - icn_gossip_subscriptions_rejected_total (rejected subscriptions)
```
### Diagnosis Steps

Check network connectivity:

```bash
# Verify peer count
curl http://10.8.30.40:30080/v1/health | jq '.active_connections'

# Check if peers are reachable
sudo kubectl -n icn exec deploy/icn-daemon -- curl -s localhost:9100/metrics | grep icn_network
```

Check topic subscriptions:

```bash
# List subscribed topics
curl -s http://10.8.30.40:30090/metrics | grep icn_gossip_subscriptions
# Compare subscription counts across nodes (if multi-node)
```

Check message flow balance:

- Compare sent vs received message counts
- Large imbalance indicates sync issues

```bash
# Check message flow balance
curl -s http://10.8.30.40:30090/metrics | grep -E "icn_gossip_(entries_received|entries_published)_total"
```

Check for rejected messages:

```bash
# High rejection rate indicates trust or rate limiting issues
curl -s http://10.8.30.40:30090/metrics | grep rejected
```

Decision tree:

```
Peers connected?
├── No → Network issue → Check network policies, firewall
└── Yes → Messages flowing?
    ├── No → Check subscriptions, topic configuration
    └── Yes → Convergence slow?
        ├── Yes → Compare entry counts across nodes, check anti-entropy
        └── No → False alarm, verify with manual check
```
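The "compare entry counts across nodes" branch can be scripted. A minimal sketch; the `node1`/`node2`/`node3` metrics endpoints are placeholders for your actual hosts:

```bash
# Print icn_gossip_entries_total per node; counts that diverge and don't
# close over time indicate a convergence problem.
NODES="http://node1:9100 http://node2:9100 http://node3:9100"   # placeholders
for n in $NODES; do
  printf '%-24s ' "$n"
  curl -s "$n/metrics" | awk '/^icn_gossip_entries_total[ {]/ {print $2}'
done
```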
### Resolution Actions

If network connectivity issue:

```bash
# Check network policies
sudo kubectl -n icn get networkpolicies

# Verify QUIC port is accessible (default 7777)
sudo kubectl -n icn get svc

# Test peer reachability from pod
sudo kubectl -n icn exec deploy/icn-daemon -- nc -zvu <peer-ip> 7777
```

If subscription issue:

```bash
# Restart to re-establish subscriptions
sudo kubectl -n icn rollout restart deployment/icn-daemon
```

If clock drift:

```bash
# Check system time on nodes
sudo kubectl -n icn exec deploy/icn-daemon -- date
date

# Ensure NTP is configured on host
timedatectl status
```

Trigger anti-entropy: Gossip anti-entropy runs automatically; to accelerate sync, restart the daemon:

```bash
sudo kubectl -n icn rollout restart deployment/icn-daemon
```
### Prevention

- Monitor entry flow: Alert on `icn_gossip_entries_received_total` stagnating (see the sketch after this list)
- Multiple peers: Ensure at least 3 peers for redundancy
- NTP configured: System clocks synchronized
- Network monitoring: Watch for packet loss or latency spikes
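A minimal sketch of the stagnation check, again assuming a reachable Prometheus API (`PROM_URL` is a placeholder); a zero 15-minute receive rate on an otherwise active node is the warning sign:

```bash
PROM_URL="http://prometheus.example:9090"   # placeholder -- your Prometheus server
rate=$(curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=rate(icn_gossip_entries_received_total[15m])' \
  | jq -r '.data.result[0].value[1] // "0"')
# Warn when no entries have been received in the window
awk -v r="$rate" 'BEGIN { exit (r > 0 ? 0 : 1) }' || echo "WARN: gossip intake stagnant (rate=$rate)"
```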
### Escalation
- P1 escalation if: No convergence after 2 hours
- Contact: ICN development team
- Provide: Gossip metrics from all nodes, network topology
## Ledger Sync Lag

**Severity**: P2
**Typical Duration**: 30-120 minutes
**Skills Required**: Understanding of distributed ledgers, database basics

### Symptoms

- Balance queries return stale data
- New transactions not appearing
- `icn_ledger_sync_lag_seconds > 300` alert firing
- Dashboard shows growing sync lag
- Members report transaction visibility delays
### Detection

```bash
# Check ledger metrics
curl -s http://10.8.30.40:30090/metrics | grep icn_ledger

# Key metrics:
# - icn_ledger_sync_lag_seconds
# - icn_ledger_entries_total
# - icn_ledger_quarantine_size
# - icn_ledger_merge_conflicts_total
```
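For cron or scripted health checks, the same lag test can be made machine-readable. A minimal sketch using the raw metrics endpoint and the 300-second alert threshold:

```bash
# Exit non-zero (and print an alert line) when sync lag exceeds 300s
lag=$(curl -s http://10.8.30.40:30090/metrics | awk '/^icn_ledger_sync_lag_seconds[ {]/ {print $2}')
awk -v l="${lag:-0}" 'BEGIN { exit (l < 300 ? 0 : 1) }' || echo "ALERT: ledger sync lag ${lag}s"
```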
### Diagnosis Steps

Check peer connectivity:

```bash
curl http://10.8.30.40:30080/v1/health | jq '.active_connections'
```

Check quarantine size (indicates problematic entries):

```bash
curl -s http://10.8.30.40:30090/metrics | grep quarantine
```

Verify entry propagation:

```bash
# Compare entry counts across nodes
# Node 1:
curl -s http://node1:9100/metrics | grep icn_ledger_entries_total
# Node 2:
curl -s http://node2:9100/metrics | grep icn_ledger_entries_total
```

Check for forks:

- Different entry counts on different nodes
- Conflicting entries for same account

Decision tree:

```
Quarantine growing?
├── Yes → Conflicting entries → See Incident Response for quarantine handling
└── No → Entries propagating?
    ├── No → Check gossip → See Gossip Convergence runbook
    └── Yes → Just slow?
        ├── Yes → High volume → Normal during catch-up
        └── No → Check for network issues
```
### Resolution Actions

If slow propagation (high volume):

```bash
# Monitor progress - should improve over time
watch -n 5 "curl -s http://10.8.30.40:30090/metrics | grep icn_ledger_entries_total"
```

If quarantine issues:

```bash
# List quarantined entries
icnctl ledger quarantine list

# Get details on specific entry
icnctl ledger quarantine get <entry-hash>

# See incident-response.md for quarantine resolution
```

Ledger sync is automatic via the gossip protocol. To accelerate sync after a network partition, restart the daemon to reinitialize gossip connections.

If persistent lag with no progress:

```bash
# Restart to reinitialize sync state
sudo kubectl -n icn rollout restart deployment/icn-daemon
```
### Prevention

- Monitor sync lag: Alert on > 5 minutes
- Regular health checks: Compare entry counts weekly (see the sketch after this list)
- Backup before upgrades: Ensure restore capability
- Test recovery: Monthly sync recovery drill
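One way to script the weekly entry-count comparison; `node1` and `node2` are placeholder endpoints:

```bash
# Diff ledger entry counts between two nodes; any divergence is worth a look
diff <(curl -s http://node1:9100/metrics | grep '^icn_ledger_entries_total') \
     <(curl -s http://node2:9100/metrics | grep '^icn_ledger_entries_total') \
  && echo "entry counts match" \
  || echo "entry counts diverge -- investigate before it becomes sync lag"
```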
### Escalation
- P1 escalation if: Sync lag > 1 hour or balances incorrect
- Contact: ICN development team and cooperative coordinators
- Provide: Ledger metrics, entry counts from all nodes
## Trust Computation Errors

**Severity**: P2
**Typical Duration**: 30-60 minutes
**Skills Required**: Graph algorithms understanding, database basics

### Symptoms

- Trust scores returning errors or unexpected values
- Rate limiting affecting legitimate users
- `icn_trust_computation_errors_total` incrementing
- Access control decisions failing
- Users reporting permission issues
### Detection

```bash
# Check trust metrics
curl -s http://10.8.30.40:30090/metrics | grep icn_trust

# Key metrics:
# - icn_trust_computation_errors_total
# - icn_trust_cache_hits_total / icn_trust_cache_misses_total
# - icn_trust_edges_total
# - icn_trust_computation_duration_seconds
```
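The hit/miss counters are most useful as a ratio. A small sketch that computes it from the raw metrics endpoint:

```bash
# Trust cache hit ratio; a persistently low ratio points at cache sizing
curl -s http://10.8.30.40:30090/metrics \
  | awk '/^icn_trust_cache_hits_total[ {]/ {h=$2} /^icn_trust_cache_misses_total[ {]/ {m=$2}
         END { if (h+m > 0) printf "hit ratio: %.1f%%\n", 100*h/(h+m); else print "no cache traffic yet" }'
```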
### Diagnosis Steps

Check error types:

```bash
sudo kubectl -n icn logs deployment/icn-daemon | grep -i "trust.*error" | tail -20
```

Verify graph consistency:

```bash
# Check edge count
curl -s http://10.8.30.40:30090/metrics | grep icn_trust_edges

# Check for cycles or invalid edges in logs
sudo kubectl -n icn logs deployment/icn-daemon | grep -i "trust.*cycle\|invalid.*edge"
```

Check computation performance:

```bash
# High computation time indicates graph issues
curl -s http://10.8.30.40:30090/metrics | grep icn_trust_computation_duration
```

Decision tree:

```
Errors in logs?
├── Yes → What type?
│   ├── "invalid edge" → Check edge data integrity
│   ├── "cycle detected" → Graph has loops → May need cleanup
│   └── "computation timeout" → Graph too large → Check cache
└── No → Cache issues?
    ├── High cache miss rate → Cache not warming → Check config
    └── Normal → Transient issue → Monitor
```
### Resolution Actions

Clear computation cache (force recalculation):

```bash
# Restart pod to clear in-memory cache
sudo kubectl -n icn rollout restart deployment/icn-daemon
```

Verify edge data via metrics:

```bash
# Check edge count from Prometheus metrics
curl -s http://10.8.30.40:30090/metrics | grep icn_trust_edges
```

If graph corruption suspected:

- Check trust-related logs:

  ```bash
  sudo kubectl -n icn logs deployment/icn-daemon | grep -i "trust.*error\|graph"
  ```

- Contact ICN team with log excerpts and metrics
- May need to restart daemon to rebuild from persisted edges
Note: Direct trust edge management via CLI is planned for a future release. Currently, trust edges are managed through the RPC API or gossip protocol.
### Prevention
- Monitor error rate: Alert on > 1 error/minute
- Cache tuning: Ensure cache is properly sized
- Edge validation: Validate edges on creation
- Regular audits: Weekly trust graph consistency check
### Escalation
- P1 escalation if: Trust computation affecting access control
- Contact: ICN development team
- Provide: Trust graph export, error logs, computation metrics
## Gateway Rate Limiting

**Severity**: P3
**Typical Duration**: 15-30 minutes
**Skills Required**: HTTP/API understanding, rate limiting concepts

### Symptoms

- Clients receiving 429 Too Many Requests
- Legitimate operations being blocked
- `icn_gateway_rate_limit_exceeded_total` incrementing
- User complaints about "too many requests" errors
- API latency spikes due to queuing
### Detection

```bash
# Check rate limiting metrics
curl -s http://10.8.30.40:30090/metrics | grep rate_limit

# Check gateway response codes
curl -s http://10.8.30.40:30090/metrics | grep icn_gateway_requests | grep 429
```
### Diagnosis Steps

Identify rate-limited clients:

```bash
# Check logs for rate limit events
sudo kubectl -n icn logs deployment/icn-daemon | grep -i "rate.limit\|429" | tail -20
```
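To see which clients are limited most often, aggregate DIDs from those log lines. A minimal sketch, assuming the daemon logs the client DID on the rate-limit line:

```bash
# Rank client DIDs by number of rate-limit log entries
sudo kubectl -n icn logs deployment/icn-daemon \
  | grep -i "rate.limit\|429" \
  | grep -oE 'did:icn:[A-Za-z0-9._:-]+' \
  | sort | uniq -c | sort -rn | head
```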
Check client trust scores:

```bash
# Low trust = lower rate limits
# Check trust metrics for affected DIDs
```

Analyze request patterns:

```bash
# Check request rate by endpoint
curl -s http://10.8.30.40:30090/metrics | grep icn_gateway_requests_total
```

Decision tree:

```
Single client rate limited?
├── Yes → Check if legitimate
│   ├── Legitimate → Consider whitelist or trust boost
│   └── Abuse → Keep limits, consider block
└── No → Multiple clients affected?
    ├── Yes → Rate limits too aggressive → Adjust globally
    └── No → Spike in traffic → Normal protection working
```
### Resolution Actions

Adjust rate limits (if too aggressive):

```bash
# Edit gateway configuration
sudo kubectl -n icn edit configmap icn-config
# Find rate_limit section and adjust values

# Restart to apply
sudo kubectl -n icn rollout restart deployment/icn-daemon
```

Whitelist trusted client (future feature):

```bash
icnctl gateway whitelist add did:icn:<client-did>
```

If abuse detected:

```bash
# Block abusive client (future feature)
icnctl gateway block did:icn:<abusive-did>

# Or reduce trust to minimize rate limit
icnctl trust set did:icn:<client-did> --score 0.1
```

Temporary rate limit increase:

```bash
# Environment variable override
sudo kubectl -n icn set env deployment/icn-daemon ICN_RATE_LIMIT_MULTIPLIER=2.0
```
### Prevention
- Monitor rate limiting: Alert on sustained high rate limit events
- Trust-based limits: Ensure trust scores correctly reflect client reliability
- Capacity planning: Ensure adequate resources for expected load
- Client education: Document rate limits for API consumers
### Escalation
- P1 escalation if: Rate limiting affecting critical cooperative operations
- Contact: ICN operations team, then development if config changes needed
- Provide: Rate limit metrics, affected client DIDs, traffic patterns
## Node Won't Start

**Severity**: P1
**Typical Duration**: 15-60 minutes
**Skills Required**: Linux administration, Kubernetes debugging

### Symptoms

- Pod stuck in CrashLoopBackOff or Error state
- Container exits immediately after start
- No health endpoint response
- Startup logs show errors
- Previous container logs show crash
### Detection

```bash
# Check pod status
sudo kubectl -n icn get pods

# Check pod events
sudo kubectl -n icn describe pod -l app=icn

# Check container logs
sudo kubectl -n icn logs deployment/icn-daemon --previous
```
### Diagnosis Steps

Check container status:

```bash
sudo kubectl -n icn get pods -o jsonpath='{.items[0].status.containerStatuses[0]}'
```

Review startup logs:

```bash
sudo kubectl -n icn logs deployment/icn-daemon --previous | head -50
```

Check configuration:

```bash
# View current config
sudo kubectl -n icn get configmap icn-config -o yaml
```

Verify keystore:

```bash
# Check if keystore file exists
sudo kubectl -n icn exec deploy/icn-daemon -- ls -la /data/keystore* 2>/dev/null || echo "No keystore"
```

Check port availability:

```bash
# Ensure ports aren't already bound
sudo kubectl -n icn exec deploy/icn-daemon -- ss -tlnp
```

Decision tree:

```
Container starting?
├── No → Exit code?
│   ├── 1 → Config error → Check config syntax
│   ├── 137 → OOMKilled → Increase memory
│   └── Other → Check logs for error
└── Yes → Crashing after start?
    ├── Yes → Runtime error
    │   ├── "keystore" error → Keystore issue
    │   ├── "bind" error → Port conflict
    │   └── "permission" error → File permissions
    └── No → Health check failing?
        ├── Yes → Slow startup → Increase probe delays
        └── No → Should be working → Verify service routing
```
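The tree's first question hinges on the last exit code, which can be pulled directly instead of read from `describe` output. A small sketch:

```bash
# Last termination exit code and reason per pod (fields are empty if the container never terminated)
sudo kubectl -n icn get pods -l app=icn -o jsonpath='{range .items[*]}{.metadata.name}{": exit="}{.status.containerStatuses[0].lastState.terminated.exitCode}{" reason="}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```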
### Resolution Actions

If configuration error:

```bash
# Validate config syntax
sudo kubectl -n icn get configmap icn-config -o jsonpath='{.data.config\.toml}' | head -20

# Fix and reapply
sudo kubectl -n icn edit configmap icn-config
sudo kubectl -n icn rollout restart deployment/icn-daemon
```

If keystore issue:

```bash
# Check keystore accessibility
ssh atlas "ls -la /mnt/storage/k8s/icn-data/"

# If missing, restore from backup
# See incident-response.md for restore procedure
```

If port conflict:

```bash
# Check what's using the port
sudo kubectl -n icn exec deploy/icn-daemon -- ss -ulnp | grep 7777
# May need to kill stuck process or wait for cleanup
```

If permission error:

```bash
# Check data directory permissions
ssh atlas "ls -la /mnt/storage/k8s/icn-data/"

# Fix permissions if needed
ssh atlas "sudo chown -R 1000:1000 /mnt/storage/k8s/icn-data/"
```

If OOMKilled:

```bash
# Increase memory limit
sudo kubectl -n icn patch deployment icn-daemon -p '{"spec":{"template":{"spec":{"containers":[{"name":"icnd","resources":{"limits":{"memory":"4Gi"}}}]}}}}'
```

If slow startup (health check timeout):

```bash
# Increase probe delays
sudo kubectl -n icn edit deployment icn-daemon
# Adjust: initialDelaySeconds, periodSeconds, failureThreshold
```
### Prevention
- Config validation: Validate config before deployment
- Backup keystore: Regular encrypted backups
- Resource monitoring: Track resource usage trends
- Staged rollouts: Deploy changes incrementally
### Escalation
- Immediate escalation if: Cannot start after 30 minutes
- Contact: ICN development team
- Provide: Full pod logs, describe output, config dump (sanitized)
## Quick Reference: Key Metrics

| Issue | Key Metric | Alert Threshold |
|---|---|---|
| High Memory | `container_memory_working_set_bytes` | > 85% of limit |
| Gossip Issues | `icn_gossip_subscriptions_rejected_total` | increasing rate |
| Ledger Lag | `icn_ledger_sync_lag_seconds` | > 300 seconds |
| Trust Errors | `icn_trust_computation_errors_total` | > 10/minute |
| Rate Limiting | `icn_gateway_rate_limit_exceeded_total` | > 100/minute |
| Restarts | `kube_pod_container_status_restarts_total` | > 3/hour |
### Quick Prometheus Queries

```promql
# Memory usage percentage
container_memory_working_set_bytes{namespace="icn"} / container_spec_memory_limit_bytes{namespace="icn"}

# Request error rate
rate(icn_gateway_requests_total{status=~"5.."}[5m]) / rate(icn_gateway_requests_total[5m])

# Gossip entry receive rate
rate(icn_gossip_entries_received_total[5m])

# Rate limit events per minute
rate(icn_gateway_rate_limit_exceeded_total[5m]) * 60

# Gossip message latency p99
histogram_quantile(0.99, rate(icn_gossip_message_latency_seconds_bucket[5m]))
```
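These can be run from a shell against the Prometheus HTTP API. A minimal helper, assuming a reachable Prometheus server (`PROM_URL` is a placeholder):

```bash
PROM_URL="http://prometheus.example:9090"   # placeholder -- your Prometheus server
promq() {
  # Run one instant query and print "series<TAB>value" per result
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[] | "\(.metric.pod // .metric.instance // "series")\t\(.value[1])"'
}

# Example: gossip entry receive rate per instance
promq 'rate(icn_gossip_entries_received_total[5m])'
```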
## Escalation Paths

### When to Escalate
| Severity | Criteria | Response Time |
|---|---|---|
| P1 | Service down, data loss risk | Immediate |
| P2 | Degraded service, user impact | 1 hour |
| P3 | Minor issues, no user impact | Next business day |
### Escalation Contacts
- On-call operator: First responder for all issues
- ICN development team: GitHub issues for bugs/features
- Cooperative coordinators: For user-facing impact
### Information to Gather Before Escalating
- Pod status and recent logs
- Relevant Prometheus metrics
- Timeline of events
- Actions already taken
- Current impact assessment
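A minimal collection sketch that bundles the items above into a single archive (the metrics endpoint follows the examples used throughout this document):

```bash
# Bundle pod status, pod details, logs, and a metrics snapshot for escalation
ts=$(date -u +%Y%m%dT%H%M%SZ); dir="icn-escalation-$ts"; mkdir -p "$dir"
sudo kubectl -n icn get pods -o wide > "$dir/pods.txt"
sudo kubectl -n icn describe pod -l app=icn > "$dir/describe.txt"
sudo kubectl -n icn logs deployment/icn-daemon --tail=2000 > "$dir/logs.txt"
curl -s http://10.8.30.40:30090/metrics > "$dir/metrics.txt"
tar czf "$dir.tar.gz" "$dir" && echo "attach $dir.tar.gz when escalating"
```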
## Version History
- 2026-01-04: Initial version with 6 runbooks (#221)