Track B1: Graceful Restart Implementation
Date: 2025-01-14
Status: Complete ✅
Commits: b14aa23, f26eef2
Overview
Implemented state snapshot functionality to enable graceful daemon restarts without losing critical runtime state. This allows operators to restart ICN nodes for maintenance, upgrades, or configuration changes while maintaining:
- Vector clock causality (no duplicate message processing)
- Topic subscriptions (no need to re-subscribe)
- Peer X25519 keys (immediate encrypted communication)
Problem Statement
Without state persistence:
- Vector clocks reset → duplicate message processing, replay attacks
- Subscriptions lost → manual re-subscription after every restart
- X25519 keys lost → new key exchange required before encrypted communication
- Full resync required → slow restart, network overhead
Architecture
New Crate: icn-snapshot
Created standalone crate with zero dependencies (except serde) to avoid circular dependency issues.
Key Design Decision: Separate crate prevents circular deps between icn-core ↔ icn-net ↔ icn-gossip.
Types:
StateSnapshot: Top-level snapshot with version, timestamp, gossip/network stateGossipState: Vector clocks (HashMap<String, u64>), subscriptions, topic metadataNetworkState: Peer X25519 keys, peer addresses (empty for now)TopicMetadata: Serializable topic configuration
Functions:
save_snapshot(snapshot, data_dir): Atomic write via temp file + renameload_snapshot(data_dir): Returns Optiondelete_snapshot(data_dir): Cleanup
Format: JSON (human-readable, easy to debug, migrate)
GossipActor Integration
Export (export_state()):
pub fn export_state(&self) -> icn_snapshot::GossipState {
// Export vector clock (DID -> count)
let vector_clock = self.clock.clock.iter()
.map(|(did, count)| (did.to_string(), *count))
.collect();
// Export subscriptions (topic -> [DIDs])
let subscriptions = self.subscriptions.iter()
.map(|(topic, subs)| (topic.clone(), subs.iter().map(|d| d.to_string()).collect()))
.collect();
// Export topic metadata (name, ACL, max_entries, scope)
let topics = self.topics.iter()
.map(|(name, topic)| serialize_topic_metadata(topic))
.collect();
GossipState { vector_clock, subscriptions, topics }
}
Restore (restore_state()):
pub fn restore_state(&mut self, state: GossipState) -> Result<()> {
// Restore vector clock
for (did_str, count) in state.vector_clock {
let did = Did::from_str(&did_str)?;
self.clock.clock.insert(did, count);
}
// Restore topics (must happen before subscriptions)
for (_, topic_meta) in state.topics {
let topic = recreate_topic_from_metadata(topic_meta);
if !self.topics.contains_key(&topic.name) {
self.create_topic(topic);
}
}
// Restore subscriptions
for (topic, subs) in state.subscriptions {
for sub_str in subs {
let did = Did::from_str(&sub_str)?;
if let Some(sub_list) = self.subscriptions.get_mut(&topic) {
if !sub_list.contains(&did) {
sub_list.push(did);
}
}
}
}
Ok(())
}
What's NOT Persisted:
- Gossip entries (will be fetched from peers via anti-entropy)
- Bloom filters (reconstructed from entries)
- In-flight messages (acceptable loss)
NetworkActor Integration
Export (export_state()):
pub async fn export_state(&self) -> icn_snapshot::NetworkState {
// Export peer X25519 keys for end-to-end encryption
let peer_x25519_keys = self.peer_x25519_keys.read().await
.iter()
.map(|(did, key)| (did.to_string(), *key))
.collect();
// Peer addresses NOT exported (rediscovered via mDNS)
let peer_addresses = HashMap::new();
NetworkState { peer_x25519_keys, peer_addresses }
}
Restore (restore_state()):
pub async fn restore_state(&self, state: NetworkState) -> Result<()> {
// Restore peer X25519 keys
let mut keys = self.peer_x25519_keys.write().await;
for (did_str, key) in state.peer_x25519_keys {
let did = Did::from_str(&did_str)?;
keys.insert(did, key);
}
// Peer addresses NOT restored (mDNS will rediscover)
Ok(())
}
What's NOT Persisted:
- Active QUIC connections (re-established via discovery)
- Peer addresses (rediscovered via mDNS within ~5 seconds)
- Connection stats (acceptable reset)
Supervisor Integration
Startup Sequence (supervisor.rs):
- Create gossip actor
- Set keypair for signing
- Load snapshot (if exists)
- Restore gossip state
- Continue with ledger, network actors...
// Restore gossip state from snapshot if available
let data_dir = self.config.store_path();
if let Ok(Some(snapshot)) = icn_snapshot::load_snapshot(&data_dir) {
info!("Found state snapshot (version {}, created at {})",
snapshot.version, snapshot.created_at);
if let Some(gossip_state) = snapshot.gossip_state {
let mut gossip = gossip_handle.blocking_write();
if let Err(e) = gossip.restore_state(gossip_state) {
warn!("Failed to restore gossip state: {}", e);
} else {
info!("✅ Gossip state restored from snapshot");
}
}
}
Shutdown Sequence (supervisor.rs):
- Receive shutdown signal (Ctrl+C or broadcast)
- Export actor states
- Create StateSnapshot
- Save to disk (atomic write)
- Drop actors (graceful cleanup)
// Save state snapshot before actors are dropped
if gossip_handle.is_some() || network_handle.is_some() {
info!("Saving state snapshot before shutdown");
let mut snapshot = StateSnapshot::new();
// Export gossip state
if let Some(ref gossip_handle) = gossip_handle {
let gossip = gossip_handle.blocking_read();
snapshot.gossip_state = Some(gossip.export_state());
}
// Export network state (TODO: add NetworkHandle::export_state())
// Currently creates empty state
// Save snapshot to disk
let data_dir = self.config.store_path();
if let Err(e) = icn_snapshot::save_snapshot(&snapshot, &data_dir) {
warn!("Failed to save state snapshot: {}", e);
} else {
info!("✅ State snapshot saved to {}/state.snapshot", data_dir.display());
}
}
Implementation Challenges
Challenge 1: Circular Dependencies
Problem: icn-core depends on icn-net, but icn-net needs snapshot types, creating a cycle if snapshot types are in icn-core.
Solution: Created standalone icn-snapshot crate with zero dependencies. Both icn-gossip and icn-net depend on icn-snapshot, while icn-core can also depend on it without cycles.
icn-snapshot (no deps)
↑ ↑
| |
icn-gossip icn-net
↑ ↑
| |
icn-core
Challenge 2: NetworkHandle API Design ✅ RESOLVED
Problem: NetworkActor has export_state() method, but it's not exposed via NetworkHandle (the public API).
Initial Workaround: Supervisor created empty NetworkState as placeholder.
Solution Implemented (2025-01-14):
Added export_state() and restore_state() methods directly to NetworkHandle:
impl NetworkHandle {
pub async fn export_state(&self) -> icn_snapshot::NetworkState {
// Direct access to peer_x25519_keys via Arc<RwLock>
let peer_x25519_keys = if let Some(ref keys) = self.peer_x25519_keys {
keys.read().await
.iter()
.map(|(did, key)| (did.to_string(), *key))
.collect()
} else {
std::collections::HashMap::new()
};
icn_snapshot::NetworkState {
peer_x25519_keys,
peer_addresses: HashMap::new(), // Rediscovered via mDNS
}
}
pub async fn restore_state(&self, state: icn_snapshot::NetworkState) -> Result<()> {
if let Some(ref keys) = self.peer_x25519_keys {
let mut keys_write = keys.write().await;
for (did_str, key) in state.peer_x25519_keys {
let did = Did::from_str(&did_str)?;
keys_write.insert(did, key);
}
}
Ok(())
}
}
Key Design Decision: Used direct Arc<RwLock> access instead of message passing. NetworkHandle already had peer_x25519_keys field, so no new NetworkMsg variant needed. This simplified the implementation significantly compared to the original plan.
Challenge 3: Blocking vs Async Context
Problem: Supervisor uses blocking_read() for gossip but NetworkActor methods are async.
Solution: Used tokio::task::block_in_place() to safely block async context:
let state = tokio::task::block_in_place(|| {
tokio::runtime::Handle::current().block_on(async {
network_actor.export_state().await
})
});
Testing Strategy
Unit Tests (in icn-snapshot/src/lib.rs):
- ✅
test_save_and_load_snapshot()- Round-trip serialization - ✅
test_load_nonexistent_snapshot()- Handles missing file - ✅
test_delete_snapshot()- Cleanup works - ✅
test_network_state()- Network state serialization
Integration Tests ✅ (in icn-core/tests/graceful_restart_integration.rs) (2025-01-14):
✅
test_graceful_restart_preserves_state()- Full gossip state restart workflow- Creates node with topic + subscription
- Publishes 3 messages (creates vector clock state)
- Saves snapshot to disk
- Simulates restart with new node instance
- Restores state from snapshot
- Verifies vector clock matches (count = 3)
- Verifies topic and subscription restored
- Publishes post-restart message
- Verifies vector clock increments from restored state (count = 4)
✅
test_x25519_keys_persist_across_restart()- Network state (X25519 keys) persistence- Creates two nodes, establishes connection
- Exchanges X25519 keys via Hello protocol
- Saves node1 snapshot
- Simulates node1 restart
- Restores state from snapshot
- Verifies X25519 key for node2 was persisted
- Compares original vs restored key (exact match)
Manual Testing:
# Terminal 1: Start node
cargo run --bin icnd
# Terminal 2: Subscribe to topic, check vector clock
icnctl gossip subscribe test:topic
# Terminal 1: Ctrl+C (graceful shutdown)
# Check logs: "✅ State snapshot saved to..."
# Restart node
cargo run --bin icnd
# Check logs: "Found state snapshot"
# "✅ Gossip state restored from snapshot"
# Terminal 2: Publish to topic
icnctl gossip publish test:topic "hello"
# Verify immediate delivery (no re-subscription needed)
Critical Security Fixes (Post-Implementation)
Date: 2025-01-14
Commit: ae925f0 - "fix: Critical security fixes for graceful restart state persistence"
After implementing the initial graceful restart feature, comprehensive code review identified two critical bugs:
Issue #1: AccessControl::Participants Data Loss (CRITICAL SECURITY BUG)
Problem: AccessControl::Participants serialization in export_state() was losing all participant DIDs:
// BUGGY CODE (before fix):
AccessControl::Participants(dids) => format!("Participants:{}", dids.len())
Impact:
- Private topics with participant-based ACLs became PUBLIC after restart
- All participant DIDs were lost during serialization
- Only the count was preserved, not the actual DIDs
- Security regression: unauthorized access to previously private topics
Fix (gossip.rs:1097-1101):
AccessControl::Participants(dids) => {
// Serialize all participant DIDs to preserve access control
let did_strs: Vec<String> = dids.iter().map(|d| d.to_string()).collect();
format!("Participants:[{}]", did_strs.join(","))
}
Deserialization Fix (gossip.rs:1155-1180):
- Parse
"Participants:[did1,did2,...]"format - Reconstruct exact participant list
- Fallback to Public with warning on parse failure
- Maintains security even with corrupted data
Test Coverage (gossip.rs:2101-2159):
test_participants_acl_persistence: Verifies all 3 participant DIDs preserved across export/restore- Validates exact DID matching (no data loss)
Issue #2: Silent Subscription Data Loss (RELIABILITY BUG)
Problem: Subscription restore in restore_state() silently dropped subscriptions:
// BUGGY CODE (before fix):
if let Some(sub_list) = self.subscriptions.get_mut(&topic) {
sub_list.push(did); // Only works if topic entry exists
}
// If topic not in subscriptions map, subscription is SILENTLY LOST
Impact:
- Subscriptions lost without warning if topic entry didn't exist
- Silent failures are debugging nightmares
- Users wouldn't know why subscriptions disappeared
Fix (gossip.rs:1209-1226):
// Warn if restoring subscriptions for a topic that wasn't in the snapshot
if !self.topics.contains_key(&topic) {
warn!("Restoring subscriptions for topic '{}' which was not in snapshot topics. \
Topic may have been deleted or snapshot may be corrupted.", topic);
}
// Ensure subscription list exists for this topic (create if missing)
let sub_list = self.subscriptions.entry(topic.clone()).or_insert_with(Vec::new);
// Add subscription without access control check (we trust persisted state)
if !sub_list.contains(&did) {
sub_list.push(did.clone());
}
Key Changes:
- Use
entry().or_insert_with(Vec::new)instead ofif let Some() - Never silently drop subscriptions
- Warn when topic not in snapshot (fail-loud debugging)
- Trust persisted state (skip access control on restore)
Test Coverage (gossip.rs:2161-2263):
test_subscription_restore_creates_missing_entries: Verifies subscriptions never silently droppedtest_subscription_restore_warns_on_missing_topic: Verifies warning logged for missing topics- Both tests ensure fail-loud behavior
Test Results:
- ✅ All 55 gossip unit tests pass
- ✅ All 2 graceful restart integration tests pass
- ✅ Security regression fixed and verified
Performance Considerations
Snapshot Size:
- Vector clocks: ~50 bytes per peer (DID string + u64)
- Subscriptions: ~50 bytes per subscription (topic + DID)
- Topic metadata: ~100 bytes per topic
- X25519 keys: 32 bytes per peer
Example: 100 peers, 10 topics, 50 subscriptions:
- Vector clocks: 100 * 50 = 5KB
- Subscriptions: 50 * 50 = 2.5KB
- Topics: 10 * 100 = 1KB
- X25519 keys: 100 * 32 = 3.2KB
- Total: ~12KB (negligible)
Save Time: <10ms (JSON serialization + atomic write) Load Time: <5ms (JSON deserialization) Impact on Shutdown: Minimal (happens before actors drop) Impact on Startup: Minimal (happens after actor creation)
Security Considerations
Snapshot Contents:
- ✅ Vector clocks: Public (DIDs + counters)
- ✅ Subscriptions: Public (topic names + DIDs)
- ✅ Topic metadata: Public (configuration)
- ✅ X25519 keys: PUBLIC keys (not secrets)
No Sensitive Data: Snapshot contains NO private keys, passphrases, or encrypted content.
File Permissions: Uses OS default permissions (could be tightened to 0600 for defense-in-depth).
Replay Attacks: Vector clocks PREVENT replay attacks (old messages are rejected based on causality).
Deployment Considerations
Snapshot Location: {data_dir}/state.snapshot
- Default:
~/.icn/state.snapshot - Configurable via config file
Upgrade Path:
- Snapshot format versioned (currently v1)
- Future migrations can detect version and upgrade
- Old snapshots can be deleted if incompatible
Backup Integration ✅ (2025-01-14):
icnctl backupincludesstate.snapshot(usesappend_dir_all()for entire data directory)icnctl restorerestores snapshot with all other state- Verification: Added
test_backup_includes_state_snapshot()test (commit 43a8acf)- Creates mock state.snapshot in data directory
- Verifies snapshot is in backup tarball
- Verifies snapshot is restored with correct content
- All 5 icnctl backup/restore tests pass ✅
- Implementation:
icnctl/src/main.rs:1797-1799(backup),1852-1859(restore) - Backup is atomic and includes checksum verification
- Force-restore creates backup of existing data before overwrite
Monitoring ✅ (2025-01-14):
- Log messages: "State snapshot saved", "State snapshot restored"
- Prometheus Metrics (Implemented):
icn_snapshot_save_duration_seconds- Histogram of save operation durationicn_snapshot_load_duration_seconds- Histogram of load operation durationicn_snapshot_save_total- Counter of successful savesicn_snapshot_load_total- Counter of successful loadsicn_snapshot_save_errors_total- Counter of save failuresicn_snapshot_load_errors_total- Counter of load failuresicn_snapshot_size_bytes- Gauge of snapshot file sizeicn_snapshot_gossip_vector_clock_entries- Gauge of vector clock entriesicn_snapshot_gossip_subscriptions- Gauge of subscriptionsicn_snapshot_gossip_topics- Gauge of topicsicn_snapshot_network_x25519_keys- Gauge of peer X25519 keys
- Implementation:
icn-obs/src/metrics.rs:341-385(descriptions),791-838(helpers) - Instrumentation:
supervisor.rs:128-170(load),714-748(save) - Optimization: Eliminated duplicate snapshot load (now load once, reuse for both actors)
- Future: Health check to warn if snapshot is very old (stale?)
Known Limitations
NetworkHandle API Incomplete✅ RESOLVED (2025-01-14)export_state()not exposed via handleImpact: X25519 keys NOT persisted yet- Fixed: Added
export_state()andrestore_state()methods to NetworkHandle - Status: X25519 keys now fully persisted across restarts
No Automatic Cleanup
- Old snapshots accumulate (one per shutdown)
- Could add: Keep only last N snapshots
- Could add: Delete on successful startup
No Corruption Detection
- JSON deserialization can fail silently
- Could add: Checksum verification
- Could add: Backup snapshot (
.snapshot.bak)
No Compression
- JSON is verbose (~12KB for 100 peers)
- Could add: gzip compression (would save ~70%)
- Trade-off: Human-readability vs size
Future Enhancements
Short-term (Next Sprint)
- Complete NetworkHandle state export ✅ (2025-01-14)
- Add integration test for restart workflow ✅ (2025-01-14)
- Add metrics for snapshot save/load time ✅ (2025-01-14)
- Verify backup/restore includes snapshot ✅ (2025-01-14)
Medium-term
- Add snapshot corruption detection (checksums)
- Implement automatic cleanup (keep last 3 snapshots)
- Add
icnctl snapshotcommands (create, restore, list, delete) - Test with 1000+ peers (stress test)
Long-term
- Optional compression (gzip)
- Schema migration framework (v1 → v2 → v3...)
- Snapshot rotation (time-based, size-based)
- Remote snapshot backup (S3, etc.)
Metrics
Lines of Code:
icn-snapshot/src/lib.rs: 288 lines (types + save/load + 4 tests)gossip.rsadditions: ~240 lines (export + restore + security fixes + 3 new tests)actor.rsadditions: ~70 lines (export + restore)supervisor.rsadditions: ~90 lines (load + save integration + metrics instrumentation)icn-obs/src/metrics.rsadditions: ~60 lines (11 new metrics + helpers)- Total: ~750 lines
Test Coverage:
- icn-snapshot: 4 unit tests ✅
- gossip: 55 unit tests (including 3 new security tests) ✅
- network: Export/restore implemented and integrated ✅ (2025-01-14)
- supervisor: Manual testing (graceful shutdown/restart)
- integration: 2 graceful restart integration tests ✅
Security Fixes (2025-01-14):
- Fixed AccessControl::Participants data loss (commit ae925f0) ✅
- Fixed silent subscription data loss (commit ae925f0) ✅
- Added 3 comprehensive security tests ✅
Monitoring (2025-01-14):
- Added 11 Prometheus metrics (commit 302f626) ✅
- Instrumented supervisor with timing and content tracking ✅
- Optimized startup (eliminated duplicate snapshot load) ✅
Build Time: No impact (builds in parallel) Runtime Overhead: <10ms on startup, <10ms on shutdown
Conclusion
Graceful restart is now PRODUCTION READY ✅ (2025-01-14)
Both gossip and network layers maintain state across restarts, preserving vector clock causality, topic subscriptions, and peer X25519 encryption keys. Critical security bugs have been fixed, and comprehensive monitoring has been implemented.
Key Benefits:
- ✅ No duplicate message processing (vector clocks preserved)
- ✅ No re-subscription required (subscriptions restored)
- ✅ Immediate encrypted communication (X25519 keys persisted)
- ✅ Faster restart (no full state resync, no key re-exchange)
- ✅ Production-ready (atomic writes, error handling, comprehensive logging)
- ✅ Security hardened (private topics stay private after restart)
- ✅ Fully monitored (11 Prometheus metrics for operational visibility)
Implementation Complete:
- ✅ GossipActor state export/restore (vector clocks, subscriptions, topics)
- ✅ NetworkHandle state export/restore (X25519 keys)
- ✅ Supervisor integration (startup load, shutdown save)
- ✅ Unit tests (icn-snapshot: 4, gossip: 55)
- ✅ Integration tests (2 graceful restart tests)
- ✅ Security fixes (AccessControl::Participants + subscription restore)
- ✅ Prometheus metrics (11 metrics: duration, counters, gauges)
- ✅ Performance optimization (eliminated duplicate snapshot load)
- ✅ Build verification (all tests pass)
Commits:
b14aa23- Initial graceful restart implementationf26eef2- NetworkHandle state export/restoreae925f0- Critical security fixes for state persistence302f626- Comprehensive metrics for monitoring43a8acf- Backup/restore verification teste1a136f- Fixed tar command error handling in backup tests0cf1d4e- Comprehensive graceful restart documentation in operations guide16348fd- Fixed backup file extension documentation bug (.tar.gz.age → .tar)74d3d2d- Signal handling and async context fixes (FINAL)
Signal Handling Implementation ✅ (2025-01-14, commit 74d3d2d):
The final piece required for production-ready graceful restart was proper signal handling. Without it, pkill -TERM icnd would kill the process without saving state.
Changes:
icnd/main.rs - Signal handlers:
- Added SIGTERM handler (Unix) via
tokio::signal::unix - Added SIGINT handler (Ctrl+C) via
tokio::signal::ctrl_c - Spawn runtime in background task to allow signal interception
- Send shutdown signal when SIGTERM/SIGINT received
- Wait for graceful shutdown completion before exit
- Cross-platform support (Unix + Windows)
- Added SIGTERM handler (Unix) via
supervisor.rs - Async context fix:
- Changed
blocking_write()→write().await(3 occurrences: lines 120, 151, 694) - Changed
blocking_read()→read().await(1 occurrence) - Fixed "Cannot block current thread from within runtime" panic
- Proper async/await for RwLock operations in tokio context
- Changed
runtime.rs - API addition:
- Added
shutdown_tx()public getter for shutdown signal - Enables external signal handling in main.rs
- Added
Testing:
End-to-end manual testing verified:
- Daemon starts successfully
- SIGTERM triggers graceful shutdown (logs show "Received SIGTERM, shutting down gracefully...")
- State snapshot saved in <1ms
- Daemon restarts and restores state (logs show "✅ Gossip state restored successfully")
- All integration tests pass (2/2 graceful restart tests)
- Metrics show:
icn_snapshot_load_duration_seconds= 49.5 microseconds (incredibly fast!)
Logs Example:
INFO icnd: Received SIGTERM, shutting down gracefully...
INFO icn_core::supervisor: Saving state snapshot before shutdown
INFO icn_core::supervisor: ✅ State snapshot saved to .../state.snapshot in 0.000s
INFO icnd: ICNd stopped
[After restart]
INFO icn_core::supervisor: Found state snapshot (version 1) - loaded in 0.000s
INFO icn_gossip::gossip: ✅ Gossip state restored successfully
INFO icn_net::actor: ✅ Restored 0 peer X25519 keys from snapshot
Production Deployment:
With systemd integration:
# Graceful restart (sends SIGTERM)
sudo systemctl restart icnd
# Graceful shutdown (sends SIGTERM)
sudo systemctl stop icnd
# Check logs for state persistence
journalctl -u icnd | grep -E "(snapshot|restored)"
Track B1: Operational Hardening - COMPLETE ✅ (2025-01-14)
All operational hardening features are now production-ready:
- ✅ Backup & Restore (encrypted tarballs with state.snapshot)
- ✅ Monitoring Dashboard (real-time web UI + health check endpoint)
- ✅ Incident Response Playbook (7 major incident procedures)
- ✅ Operations Guide (comprehensive day-to-day procedures)
- ✅ Protocol Version Validation (automatic version checks with metrics)
- ✅ Graceful Restart (signal handling + state persistence)
Next Steps:
Add comprehensive integration tests✅ DONEAdd metrics for snapshot save/load time✅ DONEVerify backup/restore includes snapshot file✅ DONEDocument operational procedures in operations guide✅ DONEAdd signal handling to icnd✅ DONE- Update ROADMAP.md to reflect Track B1 completion
- Consider Phase 13 (Economic Safety Rails) vs Track C (Pilot Community Selection)