ICN Incident Response Playbook
This document provides operational procedures for responding to common incidents in ICN deployments. While some responses are currently manual or crude in v0.1, having documented procedures is critical for operational readiness.
Audience: ICN node operators, cooperative system administrators, incident responders
Status: Living document - procedures will evolve as ICN matures
Table of Contents
- General Incident Response Framework
- K3s Deployment Quick Reference
- Incident: Node Compromise
- Incident: Ledger Corruption Detected
- Incident: Key Suspected Stolen
- Incident: Network Partition
- Incident: Pod Failure (K3s)
- Incident: Gossip Storm
- Incident: Quarantine Growth
- Incident: Storage Issues (K3s)
- Incident: Backup Verification Failure
- Monitoring and Detection
- Communication Templates
General Incident Response Framework
Severity Levels
P0 - Critical: Identity compromise, data loss, complete service outage
- Response time: Immediate
- Escalation: All hands
P1 - High: Partial service degradation, security concern
- Response time: Within 1 hour
- Escalation: On-call operator
P2 - Medium: Non-critical issues, performance degradation
- Response time: Within 4 hours
- Escalation: Normal channels
P3 - Low: Minor issues, cosmetic problems
- Response time: Next business day
- Escalation: Ticket queue
Response Steps
- Detect: Monitoring alerts, user reports, health checks
- Assess: Determine severity and scope
- Contain: Prevent further damage
- Recover: Restore normal operations
- Document: Record what happened and how it was resolved
- Review: Post-mortem and process improvement
K3s Deployment Quick Reference
This section provides K3s-specific commands for the live homelab deployment.
Note: Paths like /home/matt/projects/icn are specific to the current deployment. Adjust paths according to your local setup.
Cluster Access
# SSH to control plane
ssh ubuntu@10.8.30.40
# All kubectl commands require sudo on control plane
sudo kubectl -n icn <command>
Quick Diagnosis Commands
# Check pod status
sudo kubectl -n icn get pods -o wide
# View pod logs (live)
sudo kubectl -n icn logs -f deployment/icn-daemon
# View pod logs (last 100 lines)
sudo kubectl -n icn logs deployment/icn-daemon --tail=100
# Describe pod for events and status
sudo kubectl -n icn describe pod -l app=icn
# Check resource usage
sudo kubectl -n icn top pods
# View all ICN resources
sudo kubectl -n icn get all
# Check PVC status
sudo kubectl -n icn get pvc
Health Checks
# ICN health endpoint
curl http://10.8.30.40:30080/v1/health
# Prometheus metrics
curl http://10.8.30.40:30090/metrics | head -50
# Grafana dashboard
# Open: http://10.8.30.40:30300
# Check node identity
sudo kubectl -n icn exec deploy/icn-daemon -- /usr/local/bin/icnctl id show
Restart Procedures
# Restart ICN daemon (graceful)
sudo kubectl -n icn rollout restart deployment/icn-daemon
# Wait for rollout to complete
sudo kubectl -n icn rollout status deployment/icn-daemon
# Force delete stuck pod
sudo kubectl -n icn delete pod -l app=icn --force --grace-period=0
# Full redeploy from local
cd /home/matt/projects/icn/deploy/k8s && make full-deploy-dev
Backup & Recovery
# Backup current state (from within pod)
sudo kubectl -n icn exec deploy/icn-daemon -- /usr/local/bin/icnctl backup /data/backup.tar
# Copy backup from pod to local (get pod name first)
POD=$(sudo kubectl -n icn get pod -l app=icn -o jsonpath='{.items[0].metadata.name}')
sudo kubectl -n icn cp $POD:/data/backup.tar /tmp/icn-backup.tar
# View backup location (NFS on Atlas)
ssh atlas "ls -la /mnt/storage/k8s/icn-data/"
Network Diagnostics
# Check service endpoints
sudo kubectl -n icn get endpoints
# Test internal connectivity
sudo kubectl -n icn exec deploy/icn-daemon -- curl localhost:9100/metrics
# View network policies
sudo kubectl -n icn get networkpolicies
# Check node port exposure
sudo kubectl -n icn get svc
Alertmanager
# View active alerts
curl -s http://10.8.30.40:30093/api/v1/alerts | jq '.data[] | {labels: .labels, state: .status.state}'
# Silence an alert (for maintenance)
# Use Alertmanager UI: http://10.8.30.40:30093
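If amtool is installed on an operator machine, a silence can also be created from the command line instead of the UI. This is a sketch; the alert name is taken from the pod failure section below, and the Alertmanager URL matches the NodePort above - adjust both for your setup.
# Silence one alert for 2 hours during planned maintenance
amtool silence add alertname="ICNPodNotReady" \
  --alertmanager.url=http://10.8.30.40:30093 \
  --duration=2h \
  --comment="Planned maintenance on icn-daemon"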
Incident: Node Compromise
Severity: P0 - Critical
Symptoms
- Unauthorized access to node detected
- Suspicious processes running
- Unexpected network traffic
- Alerts from intrusion detection systems
- DID signing messages you didn't authorize
Immediate Actions (First 15 Minutes)
Isolate the node immediately:
# Stop ICNd
systemctl stop icnd
# OR
pkill icnd
# Block network access (if available)
sudo iptables -A INPUT -j DROP
sudo iptables -A OUTPUT -j DROP
Preserve evidence:
# Capture running processes
ps auxf > /tmp/incident-processes.txt
# Capture network connections
netstat -tulpn > /tmp/incident-netstat.txt
# Copy ICN logs
cp -r ~/.icn/logs /tmp/incident-logs-$(date +%Y%m%d-%H%M%S)
# Capture system logs
journalctl -u icnd --since "1 hour ago" > /tmp/incident-journalctl.txt
Notify cooperative members:
- Alert other node operators immediately
- Warn them NOT to trust messages from your DID
- Coordinate on out-of-band communication (Signal, phone, etc.)
Recovery Actions (Next 2 Hours)
Revoke compromised device:
On a trusted device (not the compromised one):
# Restore your identity from backup to a secure device
icnctl --data-dir /secure/path restore /backup/icn-backup.tar
# List devices to identify the compromised one
icnctl device list
# Revoke the compromised device
icnctl device revoke device-compromised-id --reason compromised
Rotate all keys:
# On the secure device, rotate the main key
icnctl id rotate --reason "Compromise detected - rotating all keys"
Audit recent activity:
# Check ledger for unauthorized transactions
icnctl ledger history --limit 100
# Review trust edges - did attacker add malicious trust?
icnctl trust list
# Check deployed contracts
# (Future: icnctl contract list)
Investigation
Determine attack vector:
- Check system logs for unauthorized SSH access
- Review application logs for exploitation attempts
- Examine network logs for command & control traffic
- Check for malware or rootkits
- Review recent software updates or configuration changes
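For the first check above (unauthorized SSH access), the preserved system logs can be searched quickly. The log paths below assume a Debian/Ubuntu-style host; adjust for your distribution.
# Failed and successful SSH logins in the last day (systemd journal)
sudo journalctl -u ssh --since "24 hours ago" | grep -Ei "failed|accepted"
# Or, on hosts that log to auth.log
sudo grep -Ei "failed password|accepted password|accepted publickey" /var/log/auth.log | tail -50
# Currently logged-in users and recent logins
who
last -n 20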
Assess damage:
- What data was accessed?
- Were transactions authorized on your behalf?
- Was trust graph manipulated?
- Were contracts deployed or modified?
Long-Term Actions
Harden the replacement node:
- Reinstall OS from scratch (don't trust compromised system)
- Apply all security patches
- Enable fail2ban or equivalent
- Configure firewall rules (only allow necessary ports)
- Enable audit logging
- Consider using a hardware security module (HSM) for keys
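As a sketch of the firewall step above, assuming a ufw-based host and the default ports referenced elsewhere in this playbook (QUIC on 7777/udp, mDNS on 5353/udp); the management subnet is an example and should match your network.
# Default-deny inbound, allow outbound
sudo ufw default deny incoming
sudo ufw default allow outgoing
# Allow SSH only from the management subnet (example range)
sudo ufw allow from 10.8.0.0/16 to any port 22 proto tcp
# Allow ICN QUIC transport and mDNS discovery
sudo ufw allow 7777/udp
sudo ufw allow 5353/udp
sudo ufw enable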
Post-mortem:
- Document timeline of compromise
- Root cause analysis
- Update security procedures
- Share lessons learned with cooperative
Prevention
- Principle of Least Privilege: Run ICNd as non-root user
- Network Segmentation: Firewall rules limiting access
- Regular Backups: Daily encrypted backups to off-site location
- Monitoring: Set up alerts for unusual activity
- Multi-Device: Use separate devices for different risk profiles
- HSM: Consider hardware security modules for high-value identities
Incident: Ledger Corruption Detected
Severity: P1 - High (can escalate to P0 if widespread)
Symptoms
- Quarantine size growing rapidly (icn_ledger_quarantine_size metric)
- Merge conflict alerts (icn_ledger_merge_conflicts_total)
- Balance inconsistencies reported by users
- Monitoring dashboard shows ledger errors
- Failed double-entry validation
Assessment
Check quarantine status:
# View quarantine size from dashboard
curl http://localhost:8080/v1/health | jq '.ledger_quarantine_size'
# Or check Prometheus metrics
curl http://localhost:9100/metrics | grep icn_ledger_quarantine_size
List quarantined entries (future command):
icnctl ledger quarantine list
Determine scope:
- Is this affecting one account or many?
- Is this a local issue or network-wide?
- What's the time window of affected transactions?
Recovery Procedures
Scenario 1: Small Number of Conflicting Entries
If < 10 entries quarantined:
Inspect each entry:
icnctl ledger quarantine get <entry-hash>
Manual resolution:
- If entry is valid but conflicted: Release from quarantine
- If entry is malicious or erroneous: Drop permanently
# Release valid entry (retry processing)
icnctl ledger quarantine release <entry-hash>
# Drop invalid entry
icnctl ledger quarantine drop <entry-hash>
Verify resolution:
icnctl ledger balance <account-id>
Scenario 2: Large-Scale Corruption
If > 100 entries quarantined or balances severely wrong:
⚠️ This is a critical incident - coordinate with cooperative before proceeding
Stop the daemon:
systemctl stop icnd
Backup current state (even if corrupted):
icnctl backup /backup/corrupted-state-$(date +%Y%m%d-%H%M%S).tar
Restore from last known good backup:
# Identify last good backup (check timestamp and quarantine size)
ls -lh /backup/
# Restore
icnctl restore /backup/icn-backup-20250114.tar --force
Purge quarantine:
icnctl ledger quarantine purge
Restart daemon:
systemctl start icnd
Monitor gossip sync:
- Watch dashboard for entries being re-synced
- Monitor quarantine to see if conflicts reappear
- If they do, there's a systemic issue (see Investigation below)
Scenario 3: Unrecoverable Corruption
If restore doesn't work or no good backup exists:
🚨 Nuclear option - coordinate with entire cooperative
Reconstruct from cooperative consensus:
- Poll all nodes for their ledger state
- Identify the most common version (Byzantine consensus)
- Majority state becomes canonical
Manual ledger reconstruction (requires all members):
- Export ledger data from trusted nodes
- Manually reconcile discrepancies
- Re-import agreed-upon state
This is a last resort and requires a governance decision
Investigation
Why did corruption occur?
Common causes:
- Concurrent updates - Two nodes created conflicting transactions simultaneously
- Clock skew - Node clocks out of sync causing timestamp issues
- Malicious entry - Attacker injected invalid transaction
- Software bug - Double-entry validation logic failure
- Disk corruption - Hardware failure corrupting database
Check for:
# Clock skew
timedatectl status
# Disk errors
dmesg | grep -i error
smartctl -a /dev/sda
# Recent software updates
journalctl -u icnd --since "1 week ago" | grep upgrade
Prevention
- Regular backups: Automated daily backups
- Monitoring: Alert on quarantine size > 10
- Clock sync: NTP properly configured
- Disk health: SMART monitoring enabled
- Testing: Validate ledger integrity weekly
- Redundancy: Multiple nodes per cooperative
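The "alert on quarantine size > 10" item above can also be checked ad hoc against Prometheus. This assumes the Prometheus HTTP API is reachable at the NodePort used earlier in this playbook; adjust the URL for your deployment.
# Returns a non-empty result vector when quarantine exceeds the alert threshold
curl -s 'http://10.8.30.40:30090/api/v1/query' \
  --data-urlencode 'query=icn_ledger_quarantine_size > 10' | jq '.data.result'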
Incident: Key Suspected Stolen
Severity: P0 - Critical
Symptoms
- Unauthorized transactions appearing in ledger
- Messages signed by your DID that you didn't send
- Your device missing or stolen
- Suspicious login attempts
- Passphrase may have been compromised
Immediate Actions (First 30 Minutes)
If you have access to an authorized device:
Revoke the compromised device IMMEDIATELY:
# From a secure device
icnctl device list
icnctl device revoke device-stolen-id --reason lost
Rotate your main key:
icnctl id rotate --reason "Key compromise suspected"
Change your passphrase:
# Export with old passphrase
icnctl id export /tmp/identity-temp.age
# Import with new passphrase (will prompt for new one)
icnctl id import /tmp/identity-temp.age
# Securely delete temp file
shred -u /tmp/identity-temp.age
Notify cooperative:
- Alert all members immediately
- Provide new DID after rotation
- Request they update trust edges
If you DON'T have access to an authorized device:
This is the worst-case scenario - you need social recovery
Contact cooperative members urgently:
- Use out-of-band communication (phone, in-person)
- Verify your identity through established procedures
- Request they revoke trust edges to your compromised DID
Create new identity:
# On a secure device
icnctl --data-dir ~/.icn-new id init
Social recovery (Phase 11.6 - not yet implemented):
# Future: Request guardians to approve identity recovery
icnctl id recover --guardians did:icn:guardian1,did:icn:guardian2
Rebuild trust:
- Request cooperative members add trust edges to new DID
- Re-establish economic relationships
- Accept that old ledger history is tied to compromised DID
⚠️ This is painful - it underscores the importance of a multi-device setup
Key Rotation Ceremony (Planned Migration)
For non-emergency key rotation (e.g., annual security practice):
Schedule rotation window with cooperative:
- Announce rotation 1 week in advance
- Pick low-activity time window
- Ensure all members are available for coordination
Pre-rotation checks:
# Verify all devices are accessible
icnctl device list
# Create backup
icnctl backup /backup/pre-rotation-$(date +%Y%m%d).tar
# Verify backup
icnctl restore /tmp/test-restore /backup/pre-rotation-*.tar
Execute rotation:
icnctl id rotate --reason "Annual key rotation"
Verify rotation:
# Check new DID
icnctl id show
# Verify old key is marked as rotated
icnctl device list
Update external systems:
- Notify cooperative members of new DID
- Update any external databases or directories
- Test signing and encryption with new keys
Post-rotation monitoring (24-48 hours):
- Watch for any messages still signed with old key
- Monitor gossip for identity updates
- Verify all devices received the rotation event
Prevention
- Multi-device setup: Never rely on single device
- Secure passphrase storage: Password manager, not written down
- Device encryption: Full-disk encryption on all devices
- Physical security: Lock devices when unattended
- Social recovery setup: Configure guardians (when available)
- Regular rotation: Annual planned key rotations
- Backup verification: Test restore monthly
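The monthly restore test can be scripted. This is a minimal sketch assuming the icnctl --data-dir restore form shown earlier in this document and a local /backup directory; verify the exact restore syntax against your icnctl version before automating it.
# Restore the newest backup into a throwaway directory and confirm the identity loads
LATEST=$(ls -t /backup/icn-backup-*.tar | head -1)
TESTDIR=$(mktemp -d)
icnctl --data-dir "$TESTDIR" restore "$LATEST" && \
  icnctl --data-dir "$TESTDIR" id show
rm -rf "$TESTDIR"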
Incident: Network Partition
Severity: P1 - High
Symptoms
- Peer count drops to zero
- Gossip sync stalls
- Monitoring shows no network activity
- Can't reach other nodes
Diagnosis
Check network connectivity:
# Test internet connection
ping 8.8.8.8
# Test DNS
nslookup google.com
# Check if ICNd is running
systemctl status icnd
Check ICN peer status:
# View peer count from dashboard
curl http://localhost:8080/v1/health | jq '.active_connections'
# Check network metrics
curl http://localhost:9100/metrics | grep icn_network_connections_active
Check mDNS discovery:
# Verify mDNS is working
avahi-browse -a
Recovery
Restart ICNd:
systemctl restart icnd
Check firewall rules:
# Verify QUIC port is open (default: 7777/udp)
sudo iptables -L -n | grep 7777
# Verify mDNS port is open (5353)
sudo iptables -L -n | grep 5353
Manual peer dial (future feature):
# If mDNS fails, manually dial known peers
icnctl network dial <peer-multiaddr> <peer-did>
Check for split-brain:
- If network partitions, different nodes may have divergent state
- When partition heals, gossip anti-entropy will sync
- Monitor quarantine for conflicts
Prevention
- Multiple network paths: Don't rely on single network link
- Monitoring: Alert on peer count < 2
- Fallback discovery: Manual peer list in config
- Regular testing: Chaos engineering - test partition recovery
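A fallback peer list might look like the following. The file location, section, and key names are illustrative only, not the actual ICN config schema, and the peer addresses are placeholders - check the daemon's configuration documentation before relying on this.
# Illustrative only: append a static bootstrap peer list (key names are hypothetical)
cat >> ~/.icn/config.toml <<'EOF'
[network]
bootstrap_peers = [
  "/ip4/10.8.30.41/udp/7777/quic-v1",
  "/ip4/10.8.30.42/udp/7777/quic-v1",
]
EOF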
Incident: Pod Failure (K3s)
Severity: P1 - High
Symptoms
- Pod not in Running state
- Health endpoint returns 503 or times out
- Grafana shows gaps in metrics
- Alertmanager fires ICNPodNotReady alert
Diagnosis
Check pod status:
ssh ubuntu@10.8.30.40
sudo kubectl -n icn get pods -o wide
Check pod events:
sudo kubectl -n icn describe pod -l app=icn
Common issues in events:
- ImagePullBackOff - Image not available on node
- CrashLoopBackOff - Application crashing repeatedly
- OOMKilled - Out of memory
- Pending - No resources or node selector mismatch
Check logs:
# Current pod logs
sudo kubectl -n icn logs deployment/icn-daemon --tail=200
# Previous container (if crashed)
sudo kubectl -n icn logs deployment/icn-daemon --previous
Check node resources:
sudo kubectl top nodes
sudo kubectl describe nodes | grep -A 5 "Allocated resources"
Recovery
Scenario 1: CrashLoopBackOff
Check logs for crash reason:
sudo kubectl -n icn logs deployment/icn-daemon --previous
Common fixes:
- Config error: Check ConfigMap for typos
- Permission error: Verify PVC is mounted correctly
- Port conflict: Check if ports are available
Restart after fix:
sudo kubectl -n icn rollout restart deployment/icn-daemon
Scenario 2: OOMKilled
Check memory usage before OOM:
sudo kubectl -n icn describe pod -l app=icn | grep -A 3 "Last State"
Increase memory limit (edit deployment):
sudo kubectl -n icn edit deployment icn-daemon
# Change: resources.limits.memory: 2Gi -> 4Gi
Or redeploy with updated manifests:
cd /home/matt/projects/icn/deploy/k8s && make full-deploy-dev
Scenario 3: ImagePullBackOff
Check image availability:
sudo crictl images | grep icn
Sync image to nodes:
cd /home/matt/projects/icn/deploy/k8s
make sync-images
Scenario 4: Stuck Pending
Check for resource constraints:
sudo kubectl -n icn describe pod -l app=icn
Check PVC binding:
sudo kubectl -n icn get pvc
sudo kubectl -n icn describe pvc icn-data
Check NFS server (Atlas):
ssh atlas "systemctl status nfs-kernel-server"
showmount -e 10.8.10.25
Prevention
- Resource limits: Set appropriate CPU/memory limits
- Health probes: Liveness and readiness probes configured
- PodDisruptionBudget: Prevent accidental disruption
- Monitoring: Alert on pod restarts > 3 in 10 minutes
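The restart-rate threshold above can be checked ad hoc with PromQL, assuming kube-state-metrics is scraped by the same Prometheus instance reachable at the NodePort used earlier; adjust the URL for your deployment.
# Non-empty output means a pod in the icn namespace restarted more than 3 times in 10 minutes
curl -s 'http://10.8.30.40:30090/api/v1/query' \
  --data-urlencode 'query=increase(kube_pod_container_status_restarts_total{namespace="icn"}[10m]) > 3' \
  | jq '.data.result'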
Incident: Gossip Storm
Severity: P2 - Medium
Symptoms
- Extremely high network bandwidth usage
- CPU pegged at 100%
- Gossip metrics showing thousands of messages/sec
- Dashboard shows message count exploding
Diagnosis
Check gossip metrics:
curl http://localhost:9100/metrics | grep icn_gossip
Identify problematic topic:
- Look for topic with disproportionate activity
- Check for single peer sending excessive messages
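A quick way to spot a runaway topic or peer is to sample the gossip counters repeatedly and compare how fast each one grows between refreshes.
# Sample gossip counters every 5 seconds; watch for a counter exploding between refreshes
watch -n 5 'curl -s http://localhost:9100/metrics | grep "^icn_gossip"'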
Mitigation
Rate limiting is automatic:
- ICN has trust-based rate limiting built in
- Untrusted peers limited to 10 msg/sec
- Trusted peers limited to 200 msg/sec
If rate limiting insufficient:
# Restart daemon (clears in-memory state)
systemctl restart icnd
Block malicious peer (future feature):
# Remove trust edge to spammer
icnctl trust remove did:icn:spammer
# Block peer entirely
icnctl network block did:icn:spammer
Prevention
- Trust gating: Only subscribe trusted peers to sensitive topics
- Entry limits: Configure max entries per topic
- Monitoring: Alert on unusual message rates
Incident: Quarantine Growth
Severity: P2 - Medium (can escalate)
Symptoms
- icn_ledger_quarantine_size metric growing
- Dashboard shows degraded health
- Merge conflicts incrementing
Investigation
List quarantined entries:
icnctl ledger quarantine list
Identify patterns:
- Same account appearing repeatedly?
- Specific time period?
- Common error type?
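If the quarantine listing is line-oriented, a rough frequency count can surface a repeating account or error type. The awk field number below is a placeholder, since the exact column layout of the list output isn't documented yet.
# Rough frequency count of one column in the quarantine listing (field number is a placeholder)
icnctl ledger quarantine list | awk '{print $2}' | sort | uniq -c | sort -rn | head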
Resolution
Manual review (if < 50 entries):
# Inspect each entry
icnctl ledger quarantine get <hash>
# Release or drop based on validity
icnctl ledger quarantine release <hash>
# OR
icnctl ledger quarantine drop <hash>
Automated cleanup (if > 50 entries):
# Purge expired entries (older than 7 days)
icnctl ledger quarantine purge
Root cause fix:
- If clock skew: Sync NTP
- If malicious: Remove trust edge
- If bug: Report to ICN developers
Incident: Storage Issues (K3s)
Severity: P1 - High
Symptoms
- Write failures in logs (sled error, I/O error)
- PVC shows as Pending or Lost
- NFS mount errors
- Disk space alerts
- Data not persisting across pod restarts
Diagnosis
Check PVC status:
ssh ubuntu@10.8.30.40
sudo kubectl -n icn get pvc
sudo kubectl -n icn describe pvc icn-data
Check disk space on NFS server:
ssh atlas "df -h /mnt/storage"
Check NFS service:
ssh atlas "systemctl status nfs-kernel-server"
ssh atlas "exportfs -v"
Check mount from pod:
sudo kubectl -n icn exec deploy/icn-daemon -- df -h /data
sudo kubectl -n icn exec deploy/icn-daemon -- ls -la /data
Check for Sled database issues:
sudo kubectl -n icn logs deployment/icn-daemon | grep -i "sled\|error\|corrupt"
Recovery
Scenario 1: NFS Server Unreachable
Check network connectivity:
ping 10.8.10.25
Restart NFS service:
ssh atlas "sudo systemctl restart nfs-kernel-server"
Verify exports:
showmount -e 10.8.10.25
Restart ICN pod (to remount):
sudo kubectl -n icn rollout restart deployment/icn-daemon
Scenario 2: Disk Full
Check space usage:
ssh atlas "du -sh /mnt/storage/k8s/icn-data/*"Clean old backups:
ssh atlas "find /mnt/storage/k8s/icn-data/backups -mtime +30 -delete"Compact Sled database:
Note: An icnctl db compact command does not currently exist. Sled performs automatic compaction. If manual compaction is needed, consider stopping the daemon and using Sled tools directly.
Scenario 3: Sled Corruption
Stop the daemon:
sudo kubectl -n icn scale deployment icn-daemon --replicas=0
Backup current state:
ssh atlas "cp -r /mnt/storage/k8s/icn-data /mnt/storage/k8s/icn-data-corrupted-$(date +%Y%m%d)"
Restore from backup:
ssh atlas "ls -la /mnt/storage/k8s/icn-data/backups/"
# Copy latest good backup to data directory
Restart daemon:
sudo kubectl -n icn scale deployment icn-daemon --replicas=1
Prevention
- Monitoring: Alert on disk usage > 80%
- Automated backups: Daily snapshots with rotation
- NFS redundancy: Consider replicated storage
- Health checks: Include storage health in liveness probe
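The 80% threshold in the monitoring item above can be checked directly against the NFS server; the host and path match the deployment described in this playbook.
# Print usage and warn when /mnt/storage crosses 80%
USAGE=$(ssh atlas "df --output=pcent /mnt/storage | tail -1 | tr -dc '0-9'")
if [ "$USAGE" -ge 80 ]; then echo "WARNING: /mnt/storage at ${USAGE}%"; fi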
Incident: Backup Verification Failure
Severity: P2 - Medium (can escalate to P1 if no valid backups exist)
Symptoms
- ICNBackupVerificationFailed alert firing
- Backup verification CronJob failing
- No recent backup completion records
- ICNBackupMissing critical alert (no backup in 26+ hours)
Diagnosis
Check backup job status:
ssh ubuntu@10.8.30.40
sudo kubectl -n icn get jobs -l component=backup
sudo kubectl -n icn get jobs -l component=backup-verify
View backup job logs:
# Get latest backup job
sudo kubectl -n icn logs job/$(sudo kubectl -n icn get jobs -l component=backup-job --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].metadata.name}')
# Get latest verification job
sudo kubectl -n icn logs job/$(sudo kubectl -n icn get jobs -l component=backup-verify-job --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].metadata.name}')
Check backup files directly:
ssh atlas "ls -la /mnt/storage/k8s/icn-backups/"
Check backup PVC:
sudo kubectl -n icn get pvc icn-backups
sudo kubectl -n icn describe pvc icn-backups
Recovery
Scenario 1: Verification Failing but Backups Exist
Run manual verification:
# SSH to a node with backup access
ssh atlas
# Test newest backup
cd /mnt/storage/k8s/icn-backups
NEWEST=$(ls -t icn-backup-*.tar.gz | head -1)
# Verify archive integrity
tar -tzf "$NEWEST" > /dev/null && echo "Archive OK"
# Extract and check contents
mkdir -p /tmp/verify && tar -xzf "$NEWEST" -C /tmp/verify
ls -la /tmp/verify
rm -rf /tmp/verify
If backup is valid, check verification script:
sudo kubectl -n icn get configmap backup-scripts -o yaml
Scenario 2: No Recent Backups
Check CronJob schedule:
sudo kubectl -n icn get cronjob icn-backup -o yaml | grep schedule
Run backup manually:
sudo kubectl -n icn create job --from=cronjob/icn-backup manual-backup-$(date +%Y%m%d-%H%M%S)
# Watch job progress
sudo kubectl -n icn get jobs -w
Check for resource issues:
# Check if backup PVC has space
ssh atlas "df -h /mnt/storage/k8s/icn-backups"
# Check if data PVC is accessible
sudo kubectl -n icn exec deploy/icn-daemon -- ls -la /data
Scenario 3: Backup Storage Full
Check usage:
ssh atlas "du -sh /mnt/storage/k8s/icn-backups/*"Clean old backups (keep at least 3):
ssh atlas "cd /mnt/storage/k8s/icn-backups && ls -t icn-backup-*.tar.gz | tail -n +4 | xargs rm -v"Adjust retention (if needed, edit CronJob):
sudo kubectl -n icn edit cronjob icn-backup # Change: -mtime +7 to -mtime +3 for 3-day retention
Backup Restoration Procedure
⚠️ Full restoration should be coordinated with cooperative - this affects service availability
Stop ICN daemon:
sudo kubectl -n icn scale deployment icn-daemon --replicas=0
Identify backup to restore:
ssh atlas "ls -la /mnt/storage/k8s/icn-backups/"
# Select backup by date - prefer newest verified backup
Backup current state (even if corrupted):
ssh atlas "cp -r /mnt/storage/k8s/icn-data /mnt/storage/k8s/icn-data-pre-restore-$(date +%Y%m%d-%H%M%S)"
Clear current data and restore:
ssh atlas
cd /mnt/storage/k8s
# Clear current data
rm -rf icn-data/*
# Extract backup
tar -xzf icn-backups/icn-backup-YYYYMMDD-HHMMSS.tar.gz -C icn-data
# Verify extraction
ls -la icn-data/
Restart ICN daemon:
sudo kubectl -n icn scale deployment icn-daemon --replicas=1
sudo kubectl -n icn rollout status deployment/icn-daemon
Verify restoration:
# Check health
curl http://10.8.30.40:30080/v1/health
# Check identity
sudo kubectl -n icn exec deploy/icn-daemon -- /usr/local/bin/icnctl id show
# Monitor logs for errors
sudo kubectl -n icn logs -f deployment/icn-daemon
Monitor gossip resync:
- Watch for entries being replayed from network
- Check quarantine for conflicts
- Verify ledger balances
Prevention
- Automated verification: Daily verification CronJob at 6am (4 hours after backup)
- Multiple retention periods: Keep daily (7), weekly (4), monthly (3)
- Off-site backups: Consider replicating to cloud storage
- Alert on age: Critical alert if newest backup > 26 hours old
- Test restores: Monthly restoration drill to verify procedure
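The 26-hour freshness rule above can also be checked from the shell on the backup host, independent of Alertmanager, using the backup path referenced earlier in this section.
# Warn (and exit non-zero) if no backup archive was created in the last 26 hours (1560 minutes)
ssh atlas 'find /mnt/storage/k8s/icn-backups -name "icn-backup-*.tar.gz" -mmin -1560 | grep -q . \
  || { echo "WARNING: newest ICN backup is older than 26 hours"; exit 1; }'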
Monitoring and Detection
Key Metrics to Monitor
Critical Alerts (page on-call):
- icn_ledger_quarantine_size > 100 - Ledger issues
- icn_network_connections_active == 0 - Network partition
- Health endpoint returns 503 - Node unhealthy
Warning Alerts (notify in Slack):
- icn_gossip_subscriptions_rejected_total incrementing - Trust issues
- icn_network_messages_rate_limited_total spiking - Possible attack
- icn_ledger_merge_conflicts_total incrementing - Sync problems
Info Alerts (log for trends):
- Peer count fluctuations
- Gossip topic growth
- Transaction volume changes
Dashboard Checks
Visit http://localhost:8080/ daily and verify:
- ✅ Status: Healthy (green banner)
- ✅ Active connections > 0
- ✅ Quarantine size < 10
- ✅ No unusual spikes in metrics
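These daily checks can also be run as a quick script against the health endpoint. This sketch only uses the JSON fields already referenced in this playbook (active_connections, ledger_quarantine_size); adapt the checks if your health payload differs.
# Minimal daily health sweep
HEALTH=$(curl -sf http://localhost:8080/v1/health) || { echo "health endpoint unreachable"; exit 1; }
echo "$HEALTH" | jq -e '.active_connections > 0' > /dev/null || echo "WARNING: no active connections"
echo "$HEALTH" | jq -e '.ledger_quarantine_size < 10' > /dev/null || echo "WARNING: quarantine size >= 10"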
Health Check Integration
Configure external monitoring:
# Kubernetes liveness probe
http://icn-node:8080/health
# Systemd watchdog
WatchdogSec=60s
# Nagios/Zabbix
curl -f http://localhost:8080/v1/health || exit 1
Communication Templates
Status Update Template
Use this template for ongoing incident updates:
ICN Incident Update - [INCIDENT_ID]
Status: [Investigating | Identified | Monitoring | Resolved]
Severity: [P0 Critical | P1 High | P2 Medium | P3 Low]
Time: [YYYY-MM-DD HH:MM UTC]
Summary:
[Brief description of current status]
Impact:
- Affected services: [List affected components]
- User impact: [Description of user-facing effects]
Current Actions:
- [What is being done right now]
Next Update:
Expected in [X] minutes/hours
---
ICN Operations Team
Initial Incident Notification
Use when first declaring an incident:
Subject: [P0/P1/P2/P3] ICN Incident: [Brief Description]
Team,
We are investigating an incident affecting [component/service].
Detected: [Time UTC]
Severity: [P0-P3]
Initial Symptoms: [What was observed]
Status at this update:
- [What we know so far]
Immediate Actions:
- [Responder name] is investigating
- [Any containment steps taken]
Communication Channel:
[Slack channel / Video call link]
Next Update: [Time]
---
[Responder Name]
Resolution Notification
Use when incident is resolved:
Subject: [RESOLVED] ICN Incident: [Brief Description]
Team,
The incident affecting [component/service] has been resolved.
Timeline:
- Detected: [Time UTC]
- Identified: [Time UTC]
- Resolved: [Time UTC]
- Total Duration: [X hours/minutes]
Root Cause:
[Brief explanation of what caused the incident]
Resolution:
[What was done to fix it]
User Impact:
[Summary of impact during incident]
Follow-up Actions:
- [ ] Post-mortem scheduled for [Date]
- [ ] [Any immediate improvements planned]
---
[Responder Name]
Stakeholder Briefing (Non-Technical)
Use for executive or external stakeholder updates:
Subject: ICN Service Update - [Date]
Summary:
On [Date], the ICN network experienced [brief non-technical description].
The issue was resolved at [Time] after [Duration].
Impact:
- [What users/cooperatives experienced]
- [Any data or transaction concerns]
Resolution:
Our team [brief explanation of fix without technical jargon].
Prevention:
We are implementing [improvements] to prevent recurrence.
Questions:
Please contact [contact person] for additional information.
---
ICN Operations
Emergency Contacts
ICN Development Team:
- GitHub Issues: https://github.com/InterCooperative-Network/icn/issues
- Email: [TBD]
Cooperative Contacts:
- Primary: [Your cooperative's emergency contact]
- Secondary: [Backup contact]
- Out-of-band: [Signal group, phone tree]
Post-Incident Review Template
After resolving an incident, document:
Incident Summary:
- Date/time of detection
- Severity level
- Duration of incident
Timeline:
- When was it first detected?
- What actions were taken and when?
- When was it resolved?
Root Cause:
- What caused the incident?
- Why wasn't it prevented?
- Why wasn't it detected sooner?
Impact:
- How many nodes affected?
- Data loss or corruption?
- Economic impact?
Action Items:
- What monitoring should be added?
- What procedures should be updated?
- What code changes are needed?
Lessons Learned:
- What went well?
- What could be improved?
- How can we prevent this in the future?
Version History
- 2026-01-04: Added backup verification incident procedures, restoration guide (#320)
- 2026-01-04: Added K3s-specific procedures, communication templates (#324)
- 2025-01-14: Initial version (Track B1)