Emergency Node Restart

Summary

Emergency procedure for restarting an unresponsive or misbehaving ICN node.

Use when:

Node stops responding to RPC/API calls
Metrics show unhealthy state
Memory or CPU runaway
Deadlock suspected

Do NOT use when:

Routine maintenance (use rolling restart)
Data corruption suspected (use Data Recovery runbook)

Prerequisites

Access to node (SSH or kubectl)
Keystore passphrase available
Backup exists (check with icnctl verify-backup)

Procedure

Kubernetes Deployment

1. Assess Current State

# Check pod status
kubectl -n icn get pods -o wide

# Check recent events
kubectl -n icn get events --sort-by='.lastTimestamp' | tail -20

# Check resource usage
kubectl -n icn top pod

2. Capture Diagnostics (if possible)

# Save logs before restart
kubectl -n icn logs deployment/icn-daemon --tail=1000 > /tmp/icn-pre-restart.log

# Save metrics snapshot
curl -s http://<node-ip>:9100/metrics > /tmp/icn-pre-restart-metrics.txt

3. Graceful Restart (Preferred)

# Rolling restart - zero downtime
kubectl -n icn rollout restart deployment/icn-daemon

# Wait for completion
kubectl -n icn rollout status deployment/icn-daemon --timeout=300s

4. Force Restart (If Graceful Fails)

# Delete pod to force recreation
kubectl -n icn delete pod -l app=icn-daemon --force --grace-period=0

# Verify new pod starts
kubectl -n icn get pods -w

Systemd Deployment

1. Assess Current State

# Check service status
systemctl status icnd

# Check recent logs
journalctl -u icnd -n 100 --no-pager

# Check resource usage
ps aux | grep icnd

2. Capture Diagnostics

# Save logs
journalctl -u icnd -n 5000 > /tmp/icn-pre-restart.log

3. Graceful Restart

# Send SIGTERM for graceful shutdown
systemctl restart icnd

# Wait and verify
sleep 10
systemctl status icnd

4. Force Restart

# If graceful fails, force kill
systemctl kill -s SIGKILL icnd
systemctl start icnd

Verification

After restart, verify node health:

# 1. Check process is running
pgrep -f icnd || kubectl -n icn get pods

# 2. Check identity accessible
icnctl id show
# OR: kubectl -n icn exec deploy/icn-daemon -- icnctl id show

# 3. Check metrics endpoint
curl -s http://localhost:9100/metrics | grep icn_supervisor_state
# Should show: icn_supervisor_state 2 (running)

# 4. Check gossip connectivity
curl -s http://localhost:9100/metrics | grep icn_gossip_peers
# Should show peers > 0 (unless single node)

# 5. Check RPC responding
icnctl status

Rollback

If restart causes worse problems:

Kubernetes

# Rollback to previous deployment
kubectl -n icn rollout undo deployment/icn-daemon

# Verify rollback
kubectl -n icn rollout status deployment/icn-daemon

Systemd

# If new binary was deployed, revert
# (Assumes previous binary saved)
cp /usr/local/bin/icnd.backup /usr/local/bin/icnd
systemctl restart icnd

Post-Incident

Document what triggered the restart
Review captured logs for root cause
Create issue if bug found
Update monitoring if detection was slow

Data Recovery - If data corruption suspected
Troubleshooting - Common issues
Version Upgrade - If upgrade caused issue

Emergency Node Restart

Summary

Prerequisites

Procedure

Kubernetes Deployment

1. Assess Current State

2. Capture Diagnostics (if possible)

3. Graceful Restart (Preferred)

4. Force Restart (If Graceful Fails)

Systemd Deployment

1. Assess Current State

2. Capture Diagnostics

3. Graceful Restart

4. Force Restart

Verification

Rollback

Kubernetes

Systemd

Post-Incident

Related