ICN Replication Operations Guide
Phase 17: Storage Hardening & Replication
Version: 1.0
Last Updated: 2025-11-24
Overview
ICN's replication system provides fault-tolerant data storage through trust-weighted peer replication. This guide covers configuration, monitoring, and troubleshooting for production deployments.
Architecture
Components
ReplicationManager Actor
- Background health monitoring (runs every 60 seconds by default)
- Detects under-replicated content (count < target)
- Selects trusted peers for replication
- Sends replica requests via gossip protocol
Gossip Protocol Extensions
- ReplicaRequest: Request peers to store content
- ReplicaOffer: Peer confirms willingness to replicate
- ReplicaStatus: Batch health status updates
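As a rough illustration (not the actual wire format; field names here are assumptions), the three message types could be modeled as gossip payload variants:

// Illustrative only; the real ICN gossip message definitions may differ.
enum ReplicationMessage {
    /// Ask a peer to store a copy of the content with the given hash.
    ReplicaRequest { content_hash: [u8; 32], requester_did: String },
    /// Peer confirms it is willing to hold (or already holds) a replica.
    ReplicaOffer { content_hash: [u8; 32], peer_did: String },
    /// Batched health report: (content hash, replica still held) pairs.
    ReplicaStatus { entries: Vec<([u8; 32], bool)> },
}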
Storage Layer
- Replica metadata tracking per content hash
- Health status: Healthy, Stale, Unreachable
- Last-seen timestamps for staleness detection
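A minimal sketch of the per-hash metadata shape, using illustrative field names rather than the actual store schema:

// Hypothetical shape of the replica metadata tracked per content hash.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ReplicaHealth {
    Healthy,
    Stale,
    Unreachable,
}

struct ReplicaRecord {
    peer_did: String,    // which peer holds the copy
    last_seen_unix: u64, // used for staleness detection
    health: ReplicaHealth,
}

struct ReplicaMetadata {
    content_hash: [u8; 32],
    replicas: Vec<ReplicaRecord>,
}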
Data Flow
Content Published
↓
ReplicationManager detects under-replication
↓
Select trusted peers (trust-weighted)
↓
Send ReplicaRequest via gossip
↓
Peers respond with ReplicaOffer
↓
Metadata updated in store
↓
Health monitoring continues (60s loop)
Configuration
Default Configuration
The ReplicationManager uses sensible defaults suitable for most deployments:
ReplicationConfig {
    target_replicas: 3,                    // Aim for 3 copies
    min_trust_class: TrustClass::Partner,  // Require Partner+ trust (0.4+)
    health_check_interval_secs: 60,        // Check every minute
    stale_threshold_secs: 300,             // 5 minutes = Stale
    unreachable_threshold_secs: 900,       // 15 minutes = Unreachable
}
Custom Configuration
For production environments with specific requirements, configure via environment variables or in code:
Environment Variables
# Increase replica target for critical data
export ICN_REPLICATION_TARGET_REPLICAS=5
# Require higher trust for replicas
export ICN_REPLICATION_MIN_TRUST_CLASS=Federated
# More frequent health checks
export ICN_REPLICATION_HEALTH_CHECK_INTERVAL_SECS=30
Programmatic Configuration
use icn_core::ReplicationConfig;
use icn_trust::TrustClass;

let config = ReplicationConfig {
    target_replicas: 5,
    min_trust_class: TrustClass::Federated,
    health_check_interval_secs: 30,
    stale_threshold_secs: 600,        // 10 minutes
    unreachable_threshold_secs: 1800, // 30 minutes
};
// ReplicationManager spawned with custom config in supervisor
Trust Classes and Selection
Trust Class Hierarchy
| Trust Class | Score Range | Typical Relationship |
|---|---|---|
| Isolated | 0.0 - 0.1 | Unknown, untrusted |
| Known | 0.1 - 0.4 | Basic interaction |
| Partner | 0.4 - 0.7 | Default minimum for replicas |
| Federated | 0.7 - 1.0 | Deep collaboration |
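For illustration only (this is a local sketch, not the icn_trust::TrustClass definition), the score boundaries in the table map to classes roughly as follows:

// Illustration of the score boundaries above; not the icn_trust definition.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TrustClass {
    Isolated,
    Known,
    Partner,
    Federated,
}

fn class_for_score(score: f64) -> TrustClass {
    match score {
        s if s >= 0.7 => TrustClass::Federated,
        s if s >= 0.4 => TrustClass::Partner,
        s if s >= 0.1 => TrustClass::Known,
        _ => TrustClass::Isolated,
    }
}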
Selection Algorithm
When ReplicationManager detects under-replication:
- Query gossip for known peers (subscribers + vector clock peers)
- Filter by minimum trust class (default: Partner+)
- Exclude existing replicas (avoid duplicates)
- Sort by trust score (descending - higher trust first)
- Select top N candidates (limit to 5 requests per check)
Example:
Peers:
- Alice: trust=0.2 (Known) → REJECTED (below Partner threshold)
- Bob: trust=0.5 (Partner) → Selected #3
- Carol: trust=0.6 (Partner) → Selected #2
- Dave: trust=0.8 (Federated) → Selected #1 (highest trust)
Request order: Dave, Carol, Bob (up to 5 total)
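The selection steps above can be sketched as follows; the types and function name are illustrative, not the actual ReplicationManager code:

struct PeerCandidate {
    did: String,
    trust_score: f64,
}

// Illustrative candidate selection: filter by trust, drop existing replicas,
// sort by score descending, and cap the number of requests per cycle.
fn select_candidates(
    peers: Vec<PeerCandidate>,
    existing_replicas: &[String],
    min_trust_score: f64, // e.g. 0.4 for Partner+
    max_requests: usize,  // e.g. 5 per health check
) -> Vec<PeerCandidate> {
    let mut candidates: Vec<PeerCandidate> = peers
        .into_iter()
        .filter(|p| p.trust_score >= min_trust_score)
        .filter(|p| !existing_replicas.contains(&p.did))
        .collect();
    // Higher trust first.
    candidates.sort_by(|a, b| b.trust_score.total_cmp(&a.trust_score));
    candidates.truncate(max_requests);
    candidates
}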
Health Monitoring
Health States
| State | Condition | Action |
|---|---|---|
| Healthy | last_seen < 5 minutes | Normal operation |
| Stale | last_seen 5-15 minutes | Warning, may need refresh |
| Unreachable | last_seen > 15 minutes | Likely offline, trigger re-replication |
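A hedged sketch of how a health state could be derived from last_seen and the configured thresholds (return values shown as strings for brevity; the real code would use the health status described above):

use std::time::{Duration, SystemTime};

// Sketch only: derive a health state from how long ago a replica was last seen.
fn health_from_last_seen(
    last_seen: SystemTime,
    stale_threshold: Duration,       // default 300s
    unreachable_threshold: Duration, // default 900s
) -> &'static str {
    let elapsed = SystemTime::now()
        .duration_since(last_seen)
        .unwrap_or(Duration::ZERO); // treat future timestamps (clock skew) as fresh
    if elapsed >= unreachable_threshold {
        "Unreachable"
    } else if elapsed >= stale_threshold {
        "Stale"
    } else {
        "Healthy"
    }
}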
Monitoring Loop
Every 60 seconds (configurable), ReplicationManager:
- Loads all content hashes with replica metadata
- For each content hash:
  - Updates replica health based on last_seen timestamps
  - Counts healthy replicas
  - If count < target_replicas: trigger re-replication
- Logs health check results
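The loop body can be pictured roughly like this (a tokio-style sketch, not the actual ReplicationManager actor):

// Rough shape of the periodic health-check loop (tokio-style sketch,
// not the actual ReplicationManager actor).
async fn health_check_loop(interval_secs: u64) {
    let mut ticker = tokio::time::interval(std::time::Duration::from_secs(interval_secs));
    loop {
        ticker.tick().await;
        // 1. Load replica metadata for all content hashes.
        // 2. Update each replica's health from its last_seen timestamp.
        // 3. Count healthy replicas; if below target, select trusted
        //    candidates and send ReplicaRequest messages (respecting the
        //    cooldown described in the next section).
        // 4. Log a summary of the cycle.
    }
}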
Request Rate Limiting
To avoid flooding peers with duplicate requests, the ReplicationManager tracks recent requests:
- Cooldown period: at least 5 minutes between requests for the same content
- Request limit: 5 peers per health check cycle
- Exponential backoff: if no candidates are found, wait before retrying
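One way to model the per-content cooldown, with assumed names that may differ from the real tracking structure:

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Sketch: remember when replicas were last requested for each content hash
// and skip any hash that is still inside the cooldown window.
struct RequestTracker {
    last_request: HashMap<[u8; 32], Instant>,
    cooldown: Duration, // e.g. 5 minutes
}

impl RequestTracker {
    fn may_request(&self, content_hash: &[u8; 32]) -> bool {
        match self.last_request.get(content_hash) {
            Some(requested_at) => requested_at.elapsed() >= self.cooldown,
            None => true,
        }
    }

    fn record_request(&mut self, content_hash: [u8; 32]) {
        self.last_request.insert(content_hash, Instant::now());
    }
}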
Prometheus Metrics
Current Metrics (Week 3)
| Metric | Type | Description |
|---|---|---|
| icn_replication_health_checks_total | Counter | Total health checks run |
| icn_replication_under_replicated_total | Counter | Content items below target |
| icn_replication_requests_sent_total | Counter | Replica requests sent |
Planned Metrics (Week 4)
| Metric | Type | Description |
|---|---|---|
| icn_replication_replica_count{content_hash} | Gauge | Current replica count per hash |
| icn_replication_health_check_duration_seconds | Histogram | Health check execution time |
| icn_replication_offers_received_total | Counter | Replica offers from peers |
| icn_replication_failures_total{reason} | Counter | Failed replication attempts |
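If the daemon exports these via the Rust prometheus crate (an assumption, not confirmed by this guide), registering and incrementing a counter would look roughly like this:

use prometheus::{register_int_counter, IntCounter};

// Sketch: register the health-check counter once, then bump it each cycle.
fn make_health_check_counter() -> IntCounter {
    register_int_counter!(
        "icn_replication_health_checks_total",
        "Total health checks run"
    )
    .expect("metric registers once")
}

// Inside the health-check loop: counter.inc();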
Monitoring Dashboard (Example)
# Under-replicated content alert
rate(icn_replication_under_replicated_total[5m]) > 0
# Average replica count across all content
avg(icn_replication_replica_count)
# Replication request success rate
rate(icn_replication_offers_received_total[5m])
/ rate(icn_replication_requests_sent_total[5m])
Operational Procedures
Checking Replication Health
Via Prometheus:
curl http://localhost:9100/metrics | grep icn_replication
Via Logs:
# Look for health check summaries
journalctl -u icnd | grep "Replication health check complete"
# Example output:
# Replication health check complete: 127 healthy, 3 under-replicated
Increasing Replica Count
If data loss is a concern, increase target replicas:
Method 1: Environment Variable (restart required)
export ICN_REPLICATION_TARGET_REPLICAS=5
systemctl restart icnd
Method 2: Runtime Configuration (future)
# Planned for Phase 18
icnctl replication set-target 5
Handling Under-Replication Alerts
Diagnosis:
- Check peer connectivity: icnctl network peers
- Check trust relationships: icnctl trust list
- Check recent replica requests in logs
Resolution:
- No trusted peers: Add trust relationships with icnctl trust attest
- Peers offline: Wait for peers to reconnect, or find new peers
- Insufficient capacity: Peers may be at storage limits (Phase 18: quotas)
Manual Re-Replication (Emergency)
If automatic re-replication fails:
# Force health check (triggers re-replication immediately)
# Planned for Phase 17 Week 4
icnctl replication health-check
# Query replica status for specific content
icnctl replication status <content_hash>
# Manually request replica from specific peer
icnctl replication request <content_hash> <peer_did>
Troubleshooting
Symptom: Replication requests never sent
Possible Causes:
- No trusted peers at Partner+ level
- All peers already have replicas
- Request cooldown period active (5 minute minimum)
Diagnosis:
# Check trust graph
icnctl trust list
# Check known peers
icnctl network peers
# Check recent requests in logs
journalctl -u icnd | grep "ReplicaRequest"
Solution:
- Add trust relationships with peers
- Lower min_trust_class (not recommended for production)
- Wait for cooldown period to expire
Symptom: High under-replication count
Possible Causes:
- Network partition (peers offline)
- Storage capacity exhaustion (Phase 18)
- Insufficient trusted peers in network
Diagnosis:
# Check network connectivity
ping <peer_ip>
icnctl network dial <peer_did>
# Check peer trust levels
icnctl trust compute <peer_did>
Solution:
- Increase target_replicas temporarily to spread load
- Add more high-trust peers to network
- Investigate peer storage capacity (Phase 18: quotas)
Symptom: Replicas marked as Stale/Unreachable
Possible Causes:
- Peer temporarily offline (network issue)
- Peer crashed (process died)
- Clock drift (timestamps incorrect)
Diagnosis:
# Check if peer is responsive
icnctl network ping <peer_did>
# Check peer's last message timestamp
journalctl -u icnd | grep <peer_did>
Solution:
- Wait for peer to reconnect (automatic health recovery)
- If a peer is permanently offline, the ReplicationManager will request new replicas from other peers
- Check system clocks (Phase 19: clock synchronization)
Best Practices
Production Deployments
Set target_replicas ≥ 3
- Survives single node failure
- 99.9% durability with proper distribution
Maintain Partner+ trust with multiple peers
- Minimum 5 Partner+ peers for redundancy
- Distribute peers across regions/operators
Monitor under-replication alerts
- Alert when under-replication > 5% of content
- Daily health check summaries
Regular trust graph maintenance
- Review trust relationships quarterly
- Remove trust for inactive peers
- Add trust for new collaborators
Development/Testing
Use lower health_check_interval_secs
- 10-30 seconds for faster iteration
- Reset to 60s for production
Use lower target_replicas
- 2 replicas sufficient for testing
- Reduces resource usage in dev environments
Test failure scenarios
- Kill node processes (simulate crash)
- Partition networks (simulate network failure)
- Verify automatic recovery
Performance Considerations
Network Overhead
Each replica request incurs:
- 1 ReplicaRequest message (~100 bytes)
- 1 ReplicaOffer response (~150 bytes)
- Total: ~250 bytes per request
Typical overhead for 100 content items:
- 100 items × 2 under-replicated × 3 requests/item = 600 requests
- 600 requests × 250 bytes = 150 KB bandwidth
- Spread over 60 seconds = 2.5 KB/sec (negligible)
Storage Overhead
Replica metadata per content hash:
- Content hash: 32 bytes
- Per-replica: ~80 bytes (DID + timestamp + health)
- Total: 32 + (80 × replicas) bytes
Example for 1,000 content items with 3 replicas each:
- 1,000 × (32 + 80×3) = 272 KB metadata
- Negligible compared to actual content size
CPU Overhead
Health check cycle (every 60 seconds):
- Load all replica metadata: ~1ms per 100 items
- Trust score lookups: ~10ms per 100 peers
- Total: ~11ms per cycle (0.018% CPU usage)
Security Considerations
Trust Requirements
Why Partner+ trust for replicas?
- Prevents Sybil attacks (fake nodes offering replicas)
- Ensures data stored with reliable, known peers
- Maintains consistency with trust graph semantics
Lowering min_trust_class risks:
- Known (0.1-0.4): Higher risk of unreliable replicas
- Isolated (0.0-0.1): NOT RECOMMENDED - no trust relationship
Replica Verification
Current (Phase 17):
- Trust-based replica selection
- Health monitoring via last_seen timestamps
Future (Phase 18):
- Cryptographic verification of stored content
- Challenge-response proofs of storage
- Byzantine fault detection for corrupt replicas
Future Enhancements
Phase 18: Pre-Pilot Hardening
- Storage quotas: Per-peer capacity limits
- Byzantine detection: Identify corrupt/malicious replicas
- Conflict resolution: Handle replica disagreements
Phase 19: Post-Pilot Improvements
- Geo-diverse replication: Spread replicas across regions
- Erasure coding: N-of-M recovery (reduce storage 3x)
- Priority-based replication: Critical data replicated first
Related Documentation
- ARCHITECTURE.md Section 7.4 - Technical design
- ROADMAP.md Phase 17 - Implementation timeline
- CHANGELOG.md - Release notes
Support
For operational questions or issues:
- GitHub Issues: https://github.com/InterCooperative-Network/icn/issues
- Development sessions: docs/development/sessions/ (detailed implementation notes)
- Metrics dashboard: http://localhost:9100/metrics (Prometheus)
Document Version: 1.0 (Phase 17 Week 4)
Next Review: After Phase 17 completion and pilot deployment feedback