⚠️ ARCHIVED - This document is from 2025 and has been archived.


ICN System Gaps - Comprehensive Analysis

Date: 2025-12-06
Updated: 2025-12-13
Purpose: Complete inventory of incomplete functionality, architectural issues, and design oversights
Scope: Core systems only (not pilot UX gaps)


Executive Summary

Deep audit of ICN core systems revealed 47 significant gaps across:

  • 15 incomplete features (TODOs/stubs)
  • 12 architectural issues (coupling, layering, ownership)
  • 11 consistency/race conditions
  • 9 trust enforcement gaps

Critical Finding: The infrastructure is ~90% complete, but the remaining 10% includes critical consistency bugs and security model gaps that would cause production failures.

Update 2025-12-07: All 8 Critical issues and all 9 High priority issues have been addressed. See status below.

Update 2025-12-13:

  • C3 (Actor Pause/Resume) verified as implemented.
  • M8 (Floating Point Selection) fixed with deterministic tie-breaking.
  • M3 (Dead-Letter Queue) implemented.
  • M6 (Fork Detection) and A7 (Panics) verified as non-issues.
  • M2 (Profile Query Responses) implemented.
  • M4 (Executor Capacity) implemented.
  • A1 Phase 1 complete (supervisor modularization started).
  • M1 (TURN Relay) implemented with RFC 5766 protocol.
  • M7 (Balance Recomputation Race) fixed with journal versioning.
  • A1 Phase 2 (init_rpc.rs extraction) complete.
  • M5 (Locality/RTT) integrated via LocalityCallback.
  • M9 (Deliberation Clock Skew) fixed with relative timing.
  • A5 (Configuration Sprawl) fixed with SupervisorConfig struct.
  • A6 (Error Swallowing) fixed with supervisor error metrics for observability.
  • A2 (Circular Dependencies) verified as non-issue - dependencies form DAG.
  • A3 (Trust Graph), A4 (Callback Patterns), A8 (Byzantine Detector) verified as appropriate patterns.


Priority 1: CRITICAL (Production Blockers)

These must be fixed before any real-world use.

C1. Ledger Rollback Not Implemented - FIXED

Location: icn-core/src/supervisor.rs:2206, icn-ledger/src/ledger.rs
Issue: Governance proposals for ledger rollback are accepted but never executed
Impact: Emergency recovery impossible
Fix: Implemented Ledger::rollback_to_entry() method with archive storage, balance recomputation, fork index rebuild, and gossip notification. Supervisor now executes rollback when governance proposal is accepted.

C2. Dispute Resolution Not Executed - FIXED

Location: icn-core/src/supervisor.rs:2276
Issue: DisputeResolution proposals accepted but not applied to ledger
Impact: Accepted dispute decisions have no effect
Fix: Supervisor now maps governance DisputeResolutionOutcome to ledger DisputeOutcome and calls DisputeManager::resolve_escalated_dispute() when proposals are accepted.

C3. Actor Pause/Resume Missing (Compute Migration) - FIXED

Location: icn-compute/src/migration_manager.rs, icn-compute/src/actor_runtime.rs
Issue: Actor migration has TODO for "Week 4 integration" - no pause/resume
Impact: Live migration will corrupt actor state
Fix: Implemented actor execution control. MigrationManager has pause_actor(), resume_actor(), and restore_actor() methods that send ActorRuntimeCommand variants. StatefulActorRegistry handles these commands via create_callback() integration. Full state machine: Running → Paused → Migrating → (transferred to target) with proper checkpoint coordination.
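
The state machine named in the fix can be sketched as a small transition function. This is an illustration only: the Command enum and the Transferred state are hypothetical names, since the actual ActorRuntimeCommand variants are not listed in this document.

```rust
/// Lifecycle states from the fix description. `Transferred` is a
/// hypothetical label for "transferred to target".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActorState {
    Running,
    Paused,
    Migrating,
    Transferred,
}

/// Hypothetical command set; the real ActorRuntimeCommand variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Command {
    Pause,
    Resume,
    BeginMigration,
    CompleteTransfer,
}

/// Returns the next state, or an error when the command is not valid in the
/// current state (for example, migrating an actor that is still running).
pub fn transition(state: ActorState, cmd: Command) -> Result<ActorState, String> {
    use ActorState::*;
    use Command::*;
    match (state, cmd) {
        (Running, Pause) => Ok(Paused),
        (Paused, Resume) => Ok(Running),
        (Paused, BeginMigration) => Ok(Migrating),
        (Migrating, CompleteTransfer) => Ok(Transferred),
        (s, c) => Err(format!("invalid transition: {:?} in state {:?}", c, s)),
    }
}
```

Rejecting BeginMigration from Running is the point of the sketch: an actor has to be paused before its state is checkpointed and moved, which is the corruption scenario the original issue describes.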

C4. Gossip Handle Race in Ledger - FIXED

Location: icn-ledger/src/ledger.rs
Issue: gossip.take() during entry append creates window where new entries silently fail to publish
Impact: Entries stored locally but never propagated → split-brain
Fix: Added append_entry_from_sync() method to avoid re-broadcasting entries received from gossip. Removed dangerous .take() pattern.

C5. Trust Penalty Callback Race - FIXED

Location: icn-core/src/supervisor.rs:161-193
Issue: tokio::spawn() without await means trust updates race with gossip updates
Impact: Trust scores diverge across network
Fix: Changed from fire-and-forget tokio::spawn to synchronous tokio::task::block_in_place for trust penalty callback, ensuring updates complete before returning.

C6. Vote Tally Not Synchronized - VERIFIED IMPLEMENTED

Location: icn-core/src/governance/actor.rs:512-559
Issue: Tally computed on-demand, not persisted. Different nodes see different counts.
Impact: Governance proposals may pass on some nodes, fail on others
Status: Already implemented. Governance actor computes tally when closing proposals and broadcasts ProposalClosed message with canonical TallySnapshot via gossip.

C7. Proposal Outcome Not Gossiped - VERIFIED IMPLEMENTED

Location: icn-core/src/governance/actor.rs:553-559, icn-governance/src/message.rs
Issue: When proposal closes, outcome is local only
Impact: Nodes don't know final governance decisions
Status: Already implemented. GovernanceMessage::ProposalClosed variant includes outcome and tally snapshot. Receiving nodes handle and store the outcome (lines 776-791).

C8. RPC/Gateway Has No Trust-Based Rate Limiting - FIXED

Location: icn-rpc/src/server.rs, icn-rpc/src/auth.rs
Issue: All authenticated users get same rate limits regardless of trust
Impact: Low-trust peers can spam API
Fix: Added trust-gated rate limiter to RPC server using icn_net::RateLimiter. Different trust levels get different limits (Isolated: 10/sec, Known: 50/sec, Partner: 100/sec, Federated: 200/sec). Enabled automatically in supervisor.
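
A minimal sketch of per-class limits follows, using the four limits quoted in the fix. It does not reproduce icn_net::RateLimiter, whose API is not shown in this document. The TrustRateLimiter type, the fixed one-second window, and the explicit now_secs parameter are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Trust classes named in the fix description.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum TrustClass {
    Isolated,
    Known,
    Partner,
    Federated,
}

impl TrustClass {
    /// Per-second request limits quoted in the fix description.
    pub fn requests_per_sec(self) -> u32 {
        match self {
            TrustClass::Isolated => 10,
            TrustClass::Known => 50,
            TrustClass::Partner => 100,
            TrustClass::Federated => 200,
        }
    }
}

/// Hypothetical fixed-window limiter keyed by caller DID.
#[derive(Default)]
pub struct TrustRateLimiter {
    // did -> (window start in seconds, requests seen in that window)
    windows: HashMap<String, (u64, u32)>,
}

impl TrustRateLimiter {
    /// Returns true if the request is allowed in the current one-second window.
    pub fn allow(&mut self, did: &str, class: TrustClass, now_secs: u64) -> bool {
        let slot = self.windows.entry(did.to_string()).or_insert((now_secs, 0));
        if slot.0 != now_secs {
            // A new second has started: reset the counter.
            *slot = (now_secs, 0);
        }
        if slot.1 < class.requests_per_sec() {
            slot.1 += 1;
            true
        } else {
            false
        }
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic to test. A production limiter would more likely use a token bucket to smooth bursts at window boundaries.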


Priority 2: HIGH (Correctness Issues)

These cause incorrect behavior but may not immediately crash the system.

H1. Configuration Changes Not Applied - FIXED

Location: icn-core/src/supervisor.rs:2083
Issue: ConfigChange proposals accepted but never take effect
Fix: Implemented config update execution. Supervisor now parses new_config JSON string into GovernanceConfig, calls GovernanceCommand::UpdateDomainConfig which updates the domain and broadcasts DomainUpdated message via gossip.

H2. Membership Updates Not Executed - FIXED

Location: icn-core/src/supervisor.rs:2089
Issue: Member add/remove proposals don't modify actual membership
Fix: Implemented membership update execution. Added GovernanceCommand::UpdateMembership which adds/removes members from MembershipSource::StaticList. Supervisor calls this when Membership proposals are accepted. Gossip broadcasts DomainUpdated.

H3. Replica Threshold Never Checked - FIXED

Location: icn-gossip/src/gossip.rs:1573
Issue: Phase 17 incomplete - replica count below threshold not detected
Fix: Added immediate replica threshold check during ReplicaStatus message handling. Now checks if healthy replica count < 3, emits content_under_replicated_detected_total metric, and logs warning. ReplicationManager handles remediation via its periodic health check.

H4. Partition Healing Incomplete - FIXED

Location: icn-gossip/src/gossip.rs:248
Issue: TODO for PartitionHealRequest/Response - uses empty VectorClock
Impact: Partitions detected but not actually healed
Fix: heal_partition_with_peer() now sends PartitionHealRequest with actual vector clock. Added mark_healing_started/complete to PartitionHealer for tracking. Response handler merges clocks and requests diverged entries.

H5. Ledger Entry Acceptance Has No Trust Check - FIXED

Location: icn-ledger/src/ledger.rs - append_entry()
Issue: Credit limits use trust, but entry acceptance doesn't validate trust
Impact: Malicious peers can spam ledger up to credit limit
Fix: Added trust_graph and min_trust_for_entry fields to Ledger. append_entry_internal() now validates author's trust score against threshold (default 0.1 = Known class). Rejects entries from low-trust authors with metric tracking.
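
The trust gate on entry acceptance might look roughly like the sketch below. Only min_trust_for_entry and the 0.1 default come from the fix description. The HashMap standing in for the trust graph, the AppendError type, the counter field, and the choice to score unknown authors as 0.0 are assumptions.

```rust
use std::collections::HashMap;

/// Default threshold from the fix description (0.1 = Known class).
pub const DEFAULT_MIN_TRUST_FOR_ENTRY: f64 = 0.1;

/// Hypothetical error type for rejected entries.
#[derive(Debug, PartialEq)]
pub enum AppendError {
    LowTrust { author: String, score: f64 },
}

pub struct Ledger {
    // Stand-in for the real trust graph: author DID -> trust score.
    trust_scores: HashMap<String, f64>,
    min_trust_for_entry: f64,
    entries: Vec<(String, String)>,
    /// Stand-in for the rejection metric mentioned in the fix.
    pub rejected_low_trust_total: u64,
}

impl Ledger {
    pub fn new(trust_scores: HashMap<String, f64>) -> Self {
        Ledger {
            trust_scores,
            min_trust_for_entry: DEFAULT_MIN_TRUST_FOR_ENTRY,
            entries: Vec::new(),
            rejected_low_trust_total: 0,
        }
    }

    /// Appends an entry only if the author's trust meets the threshold.
    /// Returns the new entry count on success.
    pub fn append_entry(&mut self, author: &str, payload: &str) -> Result<usize, AppendError> {
        // Assumption: authors missing from the graph are treated as score 0.0.
        let score = self.trust_scores.get(author).copied().unwrap_or(0.0);
        if score < self.min_trust_for_entry {
            self.rejected_low_trust_total += 1;
            return Err(AppendError::LowTrust { author: author.to_string(), score });
        }
        self.entries.push((author.to_string(), payload.to_string()));
        Ok(self.entries.len())
    }
}
```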

H6. Default Trust Thresholds Too Permissive - FIXED

Locations: TLS (0.0), Compute (0.0)
Issue: Default accepts everyone with valid DID
Fix: Updated TrustGatedRateLimitConfig.min_trust_threshold default from 0.0 to 0.1 (Known trust class minimum). Added warning in TLS fallback path to indicate development-only mode. Compute already uses proper defaults (MIN_TRUST_SUBMIT=0.1, MIN_TRUST_EXECUTE=0.3).

H7. Gossip Messages Not Trust-Gated - FIXED

Location: icn-gossip/src/gossip.rs
Issue: Subscriptions check trust, but message flow doesn't
Fix: Added trust validation at start of handle_message(). Messages from senders with trust < 0.1 (Known class) are rejected. Unknown senders are also rejected. Metric messages_rejected_low_trust_total tracks rejections.

H8. Vector Clock Merge Missing Conflict Data - FIXED

Location: icn-gossip/src/partition.rs:145-189
Issue: Merge returns version numbers but no actual conflict entries
Fix: Created VersionGap struct with author_did, local_version, remote_version, detected_at timestamp, and GapDirection (RemoteAhead/LocalAhead/Diverged). Merge now returns Vec<VersionGap> with full context. Added merge_simple() for backward compatibility.
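
As a rough illustration of a merge that reports gaps, the sketch below uses the field and variant names from the fix description. The VectorClock alias, the direction field name, the integer timestamp, and the merge signature are assumptions. The sketch never emits Diverged, because counters alone cannot detect divergence; how the real code detects it is not described here.

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GapDirection {
    RemoteAhead,
    LocalAhead,
    Diverged, // not produced by this sketch
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VersionGap {
    pub author_did: String,
    pub local_version: u64,
    pub remote_version: u64,
    pub detected_at: u64, // hypothetical: milliseconds since epoch
    pub direction: GapDirection,
}

/// Hypothetical representation: author DID -> highest version seen.
pub type VectorClock = HashMap<String, u64>;

/// Merges `remote` into `local` (pointwise max) and reports every author
/// whose versions differed, sorted by DID for a stable order.
pub fn merge(local: &mut VectorClock, remote: &VectorClock, now: u64) -> Vec<VersionGap> {
    let authors: BTreeSet<String> = local.keys().chain(remote.keys()).cloned().collect();
    let mut gaps = Vec::new();
    for did in authors {
        let l = local.get(&did).copied().unwrap_or(0);
        let r = remote.get(&did).copied().unwrap_or(0);
        if l == r {
            continue;
        }
        let direction = if r > l {
            GapDirection::RemoteAhead
        } else {
            GapDirection::LocalAhead
        };
        gaps.push(VersionGap {
            author_did: did.clone(),
            local_version: l,
            remote_version: r,
            detected_at: now,
            direction,
        });
        local.insert(did, l.max(r));
    }
    gaps
}
```

A caller can then request entries for each RemoteAhead gap and offer entries for each LocalAhead gap, which is the context a bare list of version numbers does not provide.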

H9. Task Completion Not Published - VERIFIED IMPLEMENTED

Location: icn-compute/src/actor.rs:1424-1427
Issue: Status updated locally, never gossiped
Status: Already implemented. ComputeActor broadcasts ComputeMessage::TaskResult via send_callback after local execution (line 1425). It also broadcasts TaskCancelled for cancellations (line 1004) and TaskResult with timeout outcome for deadline failures (line 825). The task.rs file only manages local state; gossip publishing is correctly handled at the actor level.


Priority 3: MEDIUM (Quality/Reliability)

These affect robustness but system can function.

M1. NAT Traversal Relay Fallback Missing - FIXED

Location: icn-net/src/turn.rs, icn-net/src/session.rs
Issue: TURN relay not implemented (Phase 4 TODO)
Impact: Nodes behind symmetric NAT can't connect
Fix: Implemented TurnClient with RFC 5766 protocol (allocate, refresh, create_permission). Added TurnConfig with builder pattern to NetworkConfig. SessionManager creates allocation on startup if configured and includes relay address in connection candidates. Added TURN metrics.

M2. Profile Query Responses Not Implemented - FIXED

Location: icn-core/src/supervisor/mod.rs:1363
Issue: Profile queries received but not answered
Fix: Implemented profile query response handler. When a Query message is received, looks up the requested DID (own profile or cached peer profile) and publishes a Response message via gossip.

M3. Dead-Letter Queue Missing - FIXED

Location: icn-core/src/dead_letter.rs
Issue: Failed ledger entries logged but no recovery path
Fix: Implemented DeadLetterQueue with persistent storage, failure type tracking, retry support, and Prometheus metrics. Provides FailedOperation entries with context for manual review or automated retry.

M4. Executor Capacity Not Tracked - FIXED

Location: icn-compute/src/actor.rs:2206
Issue: Scheduler can't make informed placement decisions
Fix: Added capacity field to ExecutorInfo struct. on_capacity_announce() now stores capacity in the executor registry. Added get_executor_capacity() and get_all_executor_capacities() methods for scheduler placement decisions.

M5. Locality/Region Constraints Incomplete - FIXED

Location: icn-compute/src/actor.rs:1931-1937, icn-core/src/supervisor/mod.rs:2613-2643
Issue: Network RTT and blob registry integration missing
Fix: Added LocalityCallback type to ComputeActor that queries network topology for RTT data. Supervisor wires up callback to NeighborSets for live RTT lookup. Placement scoring now uses real network latency data when available.

M6. Fork Detection Index Not Atomic - VERIFIED NON-ISSUE

Location: icn-ledger/src/ledger.rs:119-176
Issue: Entry stored before fork index updated - crash window
Status: The ForkDetector is an in-memory structure that is rebuilt from persistent entries on startup via rebuild_fork_index(). Any crash window is recovered on restart. Not a data consistency issue.

M7. Balance Recomputation Race - FIXED

Location: icn-ledger/src/ledger.rs:531-578
Issue: Full recompute during quarantine can cause lost updates
Fix: Added journal_version tracking to Ledger. recompute_balances() validates snapshot isolation via version check before applying. Added recompute_balances_with_retry() convenience method that retries on version mismatch.
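
The version check amounts to optimistic concurrency control: snapshot the journal at version v, recompute outside the lock, and apply the result only if the journal is still at v. The sketch below illustrates the idea under several assumptions. The Ledger layout, BalanceSnapshot, VersionMismatch, fold_balances, and the Mutex-based free function are all hypothetical. Only journal_version and the name recompute_balances_with_retry come from the fix description.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub struct VersionMismatch {
    pub expected: u64,
    pub found: u64,
}

/// Balances recomputed from the journal as it stood at `version`.
pub struct BalanceSnapshot {
    version: u64,
    balances: HashMap<String, i64>,
}

#[derive(Default)]
pub struct Ledger {
    entries: Vec<(String, i64)>,
    balances: HashMap<String, i64>,
    journal_version: u64,
}

impl Ledger {
    /// Every append bumps the journal version.
    pub fn append(&mut self, account: &str, delta: i64) {
        self.entries.push((account.to_string(), delta));
        *self.balances.entry(account.to_string()).or_insert(0) += delta;
        self.journal_version += 1;
    }

    pub fn snapshot(&self) -> (u64, Vec<(String, i64)>) {
        (self.journal_version, self.entries.clone())
    }

    /// Applies recomputed balances only if no append happened since the snapshot.
    pub fn apply(&mut self, snap: BalanceSnapshot) -> Result<(), VersionMismatch> {
        if snap.version != self.journal_version {
            return Err(VersionMismatch { expected: snap.version, found: self.journal_version });
        }
        self.balances = snap.balances;
        Ok(())
    }

    pub fn balance(&self, account: &str) -> i64 {
        self.balances.get(account).copied().unwrap_or(0)
    }
}

/// Full recompute from a copy of the journal; runs without holding the lock.
pub fn fold_balances(version: u64, entries: &[(String, i64)]) -> BalanceSnapshot {
    let mut balances = HashMap::new();
    for (account, delta) in entries {
        *balances.entry(account.clone()).or_insert(0) += *delta;
    }
    BalanceSnapshot { version, balances }
}

/// Retries the snapshot/recompute/apply cycle on version mismatch.
/// With `max_attempts == 0` it returns a mismatch of 0/0 without trying.
pub fn recompute_balances_with_retry(
    ledger: &Mutex<Ledger>,
    max_attempts: u32,
) -> Result<(), VersionMismatch> {
    let mut last = VersionMismatch { expected: 0, found: 0 };
    for _ in 0..max_attempts {
        let (version, entries) = ledger.lock().unwrap().snapshot();
        let snap = fold_balances(version, &entries);
        match ledger.lock().unwrap().apply(snap) {
            Ok(()) => return Ok(()),
            Err(e) => last = e,
        }
    }
    Err(last)
}
```

A stale snapshot is rejected rather than overwriting balances that a concurrent append has already updated, which is the lost-update case described in the issue.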

M8. Floating Point Offer Selection - FIXED

Location: icn-compute/src/actor.rs:2118-2126
Issue: f64 comparison non-deterministic across platforms
Fix: Implemented deterministic tie-breaking with epsilon-based float comparison (1e-9 threshold) and lexicographic DID comparison as tie-breaker for equal scores.
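
One way to make the selection independent of arrival order is sketched below. The 1e-9 epsilon and the lexicographic DID tie-break come from the fix description. The Offer struct, the select_offer name, the assumption that a higher score wins, and the two-pass approach are illustrative choices, not the project's actual code.

```rust
/// Epsilon from the fix description.
pub const SCORE_EPSILON: f64 = 1e-9;

/// Hypothetical offer shape.
#[derive(Debug, Clone, PartialEq)]
pub struct Offer {
    pub executor_did: String,
    pub score: f64,
}

/// Picks the winning offer. Assumes higher scores are better.
/// Pass 1 finds the best finite score; pass 2 collects every offer within
/// epsilon of it and returns the lexicographically smallest DID.
pub fn select_offer(offers: &[Offer]) -> Option<&Offer> {
    let best = offers
        .iter()
        .map(|o| o.score)
        .filter(|s| s.is_finite())
        .fold(f64::NEG_INFINITY, f64::max);
    offers
        .iter()
        .filter(|o| o.score.is_finite() && (best - o.score).abs() <= SCORE_EPSILON)
        .min_by(|a, b| a.executor_did.cmp(&b.executor_did))
}
```

Epsilon equality is not transitive, so a pairwise comparator passed to a sort or fold can yield different winners for different input orders. Anchoring the tolerance to the global maximum avoids that.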

M9. Deliberation Period Clock Skew - FIXED

Location: icn-compute/src/actor.rs:2052-2063
Issue: 500ms wait uses local wall-clock, not synchronized
Fix: Implemented relative timing based on requested_at timestamp from PlacementRequest. Executors calculate deadline = requested_at + DELIBERATION_PERIOD_MS and wait only the remaining time, ensuring all executors broadcast at approximately the same wall-clock time regardless of network latency.
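
The deadline arithmetic can be sketched as follows. The 500 ms period and the formula deadline = requested_at + DELIBERATION_PERIOD_MS come from the text above. The function name, the millisecond u64 timestamps, and the clamp to one full period (a guard for a local clock that runs behind the requester's) are assumptions.

```rust
use std::time::Duration;

/// Deliberation period from the issue description.
pub const DELIBERATION_PERIOD_MS: u64 = 500;

/// Time an executor should still wait before broadcasting its offer.
/// `requested_at_ms` is the timestamp carried in the placement request;
/// `now_ms` is the executor's current time on the same scale.
pub fn remaining_wait(requested_at_ms: u64, now_ms: u64) -> Duration {
    let deadline = requested_at_ms.saturating_add(DELIBERATION_PERIOD_MS);
    // Zero if the deadline has passed. Assumption: never wait longer than one
    // full period, even if the local clock is behind the requester's clock.
    let remaining = deadline.saturating_sub(now_ms).min(DELIBERATION_PERIOD_MS);
    Duration::from_millis(remaining)
}
```

For example, an executor that receives the request 120 ms after requested_at waits 380 ms, while one that receives it after 40 ms waits 460 ms, so both broadcast at roughly the same moment.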


Priority 4: ARCHITECTURAL (Technical Debt)

These don't cause immediate bugs but make the system harder to maintain.

A1. Supervisor God Object - IN PROGRESS

Location: icn-core/src/supervisor/ (now modular)
Issue: Creates, wires, and manages 12+ subsystems with 38+ lock acquisitions
Impact: Can't test components in isolation, high-risk changes
Status: Phase 1 complete (2025-12-13). Extracted to supervisor/ directory with modules:

  • init_trust.rs - Trust graph and misbehavior detector
  • init_gossip.rs - Gossip actor, partitions, replication
  • init_ledger.rs - Ledger, disputes, contracts
  • registry.rs - Service container types
  • shutdown.rs - Graceful shutdown and snapshot management
  • mod.rs - Main supervisor (reduced from 3571 to 3256 lines, -315 lines)

Remaining: Network/message handlers, governance subscriptions, compute callbacks have deeply embedded closures requiring fuller ServiceRegistry integration

A2. Circular Crate Dependencies - VERIFIED NON-ISSUE

Locations: icn-net ↔ icn-gossip ↔ icn-ledger
Claimed Issue: Can't version or update crates independently
Verification: Analyzed with cargo tree. Dependencies form a DAG, not a cycle:

  • icn-net → icn-ledger, icn-gossip
  • icn-ledger → icn-gossip (no reverse dependency)
  • icn-gossip → icn-identity, icn-trust, etc. (no dependency on icn-net or icn-ledger)

No circular dependencies exist. Crates can be versioned independently.

A3. Multiple Sources of Truth (Trust Graph) - VERIFIED APPROPRIATE PATTERN

Issue: Trust graph shared via Arc<RwLock<>> to 6+ actors without coordination
Analysis: Arc<RwLock<>> IS the coordination mechanism. This is the standard Rust pattern for shared mutable state across actors. The RwLock provides:

  • Multiple concurrent readers
  • Exclusive writer access
  • Automatic coordination via lock acquisition

Alternative patterns (message-passing, CRDT) would add complexity without benefit for in-process actors.
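
The sharing pattern described above can be illustrated with the standard library alone. The TrustGraph stand-in, its methods, and the read_concurrently helper are hypothetical. The real actors presumably run on an async runtime rather than on OS threads.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Minimal stand-in for the trust graph (the real type is not shown here).
#[derive(Default)]
pub struct TrustGraph {
    scores: HashMap<String, f64>,
}

impl TrustGraph {
    pub fn set(&mut self, did: &str, score: f64) {
        self.scores.insert(did.to_string(), score);
    }

    pub fn get(&self, did: &str) -> f64 {
        self.scores.get(did).copied().unwrap_or(0.0)
    }
}

pub type SharedTrustGraph = Arc<RwLock<TrustGraph>>;

/// Spawns `readers` threads that each take a read lock and look up `did`.
/// Read locks can be held concurrently; a writer would wait for all of them.
pub fn read_concurrently(graph: &SharedTrustGraph, did: &str, readers: usize) -> Vec<f64> {
    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let graph = Arc::clone(graph);
            let did = did.to_string();
            thread::spawn(move || {
                // The read guard is released at the end of this statement.
                let score = graph.read().unwrap().get(&did);
                score
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Each clone of the Arc points at the same graph, so a write through one handle is visible to every reader once the write guard is dropped.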

A4. Inconsistent Callback Patterns - VERIFIED APPROPRIATE PATTERN

Issue: Each actor defines own callback types, no common abstraction
Analysis: Actors have different callback needs (different input/output types, sync vs async). A common ActorCallback trait would require either:

  • Excessive generics that hurt ergonomics
  • Runtime type erasure that loses type safety

The current approach provides type-safe, actor-specific callbacks. This is idiomatic Rust.

A5. Configuration Sprawl - FIXED

Issue: Hardcoded values scattered across supervisor.rs
Fix: Created SupervisorConfig struct in config.rs centralizing:

  • candidate_cleanup_interval_secs (default: 300)
  • peer_exchange_delay_ms (default: 500)
  • peer_exchange_max_peers (default: 50)
  • metrics_update_interval_secs (default: 10)
  • shutdown_timeout_secs (default: 5)
  • clock_sync_interval_secs (default: 600)

Updated supervisor.rs to use config values instead of hardcoded literals.
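
A struct with these fields and defaults might look like the sketch below. The field names and default values are taken from the list above. The field types, the derives, and the shutdown_timeout() helper are assumptions.

```rust
use std::time::Duration;

/// Centralized supervisor tunables. Field types are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SupervisorConfig {
    pub candidate_cleanup_interval_secs: u64,
    pub peer_exchange_delay_ms: u64,
    pub peer_exchange_max_peers: usize,
    pub metrics_update_interval_secs: u64,
    pub shutdown_timeout_secs: u64,
    pub clock_sync_interval_secs: u64,
}

impl Default for SupervisorConfig {
    /// Defaults as listed in the fix description.
    fn default() -> Self {
        SupervisorConfig {
            candidate_cleanup_interval_secs: 300,
            peer_exchange_delay_ms: 500,
            peer_exchange_max_peers: 50,
            metrics_update_interval_secs: 10,
            shutdown_timeout_secs: 5,
            clock_sync_interval_secs: 600,
        }
    }
}

impl SupervisorConfig {
    /// Hypothetical convenience accessor returning a typed duration.
    pub fn shutdown_timeout(&self) -> Duration {
        Duration::from_secs(self.shutdown_timeout_secs)
    }
}
```

With a Default impl, call sites can override a single value with struct update syntax, for example SupervisorConfig { shutdown_timeout_secs: 30, ..Default::default() }, and leave the rest at their documented defaults.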

A6. Error Swallowing - FIXED (OBSERVABILITY APPROACH)

Locations: 8+ places in supervisor.rs
Issue: Errors logged but not propagated
Analysis: Most errors occur in async contexts (background tasks, notification handlers) where there's no caller to propagate to. Logging is the appropriate pattern for these cases.
Fix: Added icn_obs::metrics::supervisor module with observability metrics:

  • icn_supervisor_errors_total{operation} - Error counter by operation type
  • icn_supervisor_state - State gauge (0=stopped, 1=starting, 2=running, 3=stopping)
  • icn_supervisor_actors_spawned_total{actor} - Actor spawn counter
  • icn_supervisor_actor_spawn_failures_total{actor} - Spawn failure counter

Key error locations instrumented: metrics_server_start, rpc_server, gateway_server, identity_bundle_missing, gateway_jwt_secret_missing, shutdown_timeout. Errors are now alertable via Prometheus.

A7. Panic! in Production Code - VERIFIED NON-ISSUE

Locations: icn-ledger/sync.rs:86, icn-ledger/dispute.rs:553,625, icn-net/protocol.rs (6 places)
Issue: Panics instead of error returns
Status: All reported panics are inside #[cfg(test)] modules (test code only). No panics exist in production code paths. Verified 2025-12-13.

A8. Byzantine Detector Ownership Unclear - VERIFIED APPROPRIATE PATTERN

Issue: Created in supervisor, shared to Network, Gossip, Ledger
Analysis: Ownership IS explicit:

  • Created: init_trust.rs lines 56-64, part of TrustServices
  • Owned: Supervisor holds the Arc<RwLock>
  • Shared: Passed to NetworkActor, GossipActor, LedgerDeps, ComputeActor via clone()

This is the correct pattern for a component that:

  • Aggregates misbehavior reports from multiple sources
  • Provides unified tracking across actors
  • Applies trust penalties when thresholds are exceeded

The shared ownership design is documented in init_trust.rs with explanatory comments.

Gap Summary by System

System           Critical  High  Medium  Arch  Total
Ledger                  1     2       2     0      5
Governance              2     2       0     0      4
Trust                   1     3       0     1      5
Gossip                  0     2       0     0      2
Compute                 1     1       4     0      6
Network                 0     0       1     0      1
RPC/Gateway             1     0       0     0      1
Core/Supervisor         2     2       2     7     13
Total                   8    12       9     8     47

Recommended Fix Order

Week 1: Critical Consistency Fixes

  1. C4 - Ledger gossip handle race (prevents split-brain)
  2. C5 - Trust penalty callback race (prevents trust divergence)
  3. C6 - Vote tally synchronization (governance correctness)
  4. C7 - Proposal outcome gossip (governance visibility)

Week 2: Critical Feature Completion

  1. C1 - Ledger rollback implementation
  2. C2 - Dispute resolution execution
  3. C8 - Trust-based API rate limiting

Week 3: High Priority Correctness

  1. H4 - Partition healing protocol
  2. H5 - Ledger entry trust validation
  3. H7 - Gossip message trust gating
  4. H1/H2 - Config and membership updates

Week 4: Compute Layer Completion

  1. C3 - Actor pause/resume
  2. H9 - Task completion gossip
  3. M4/M5 - Executor capacity and locality

Week 5+: Architectural Cleanup

  1. A1 - Supervisor refactoring (incremental)
  2. A6/A7 - Error handling cleanup
  3. A2 - Crate dependency cleanup

Test Coverage Needed

// Critical consistency tests
test_ledger_concurrent_append_with_gossip()
test_trust_penalty_vs_gossip_race()
test_proposal_tally_consistency_across_nodes()
test_partition_heal_with_conflicting_entries()
test_task_completion_both_nodes_agree()

// Trust enforcement tests
test_ledger_entry_rejected_low_trust()
test_api_rate_limited_by_trust_class()
test_gossip_message_rejected_low_trust()

// Governance tests
test_proposal_outcome_gossip_propagation()
test_vote_ordering_deterministic()

// Compute tests
test_actor_migration_pause_resume()
test_task_status_gossip_sync()

What's Actually Working Well

Despite the gaps, these systems are solid:

  • Identity & Keystore: Multi-device, age-encrypted, migrations work
  • Network Layer: QUIC/TLS, rate limiting, signed envelopes all good
  • Gossip Core: Vector clocks, subscriptions, anti-entropy work
  • Ledger Core: Double-entry, Merkle-DAG, credit limits work
  • Contract Execution: CCL interpreter, fuel metering work
  • Security Detection: Byzantine detection, reputation, quarantine work
  • Gateway API: REST/WebSocket endpoints, JWT auth work

The gaps are in integration (systems don't talk to each other correctly) and edge cases (concurrent operations, failure recovery).


Conclusion

The ICN codebase is architecturally sound but has critical integration gaps. The most dangerous issues are:

  1. Consistency bugs that cause split-brain (ledger, trust, governance)
  2. Trust enforcement gaps that undermine the security model
  3. Incomplete features marked TODO that are assumed working

Fixing the 8 Critical items is essential before any production use. The 12 High items should follow. The Medium and Architectural items can be addressed incrementally.

Estimated effort: 4-5 weeks for Critical + High priority fixes.