⚠️ ARCHIVED - This document is from 2025 and has been archived.
For current information, see:
- STATE.md - Current project state
- TODO.md - Current tasks
- ARCHITECTURE.md - Current architecture
ICN System Gaps - Comprehensive Analysis
Date: 2025-12-06
Updated: 2025-12-13
Purpose: Complete inventory of incomplete functionality, architectural issues, and design oversights
Scope: Core systems only (not pilot UX gaps)
Executive Summary
Deep audit of ICN core systems revealed 47 significant gaps across:
- 15 incomplete features (TODOs/stubs)
- 12 architectural issues (coupling, layering, ownership)
- 11 consistency/race conditions
- 9 trust enforcement gaps
Critical Finding: The infrastructure is ~90% complete, but the remaining 10% includes critical consistency bugs and security model gaps that would cause production failures.
Update 2025-12-07: All 8 Critical issues and all 9 High priority issues have been addressed. See status below.
Update 2025-12-13:
- C3 (Actor Pause/Resume) verified as implemented.
- M8 (Floating Point Selection) fixed with deterministic tie-breaking.
- M3 (Dead-Letter Queue) implemented.
- M6 (Fork Detection) and A7 (Panics) verified as non-issues.
- M2 (Profile Query Responses) implemented.
- M4 (Executor Capacity) implemented.
- A1 Phase 1 complete (supervisor modularization started).
- M1 (TURN Relay) implemented with RFC 5766 protocol.
- M7 (Balance Recomputation Race) fixed with journal versioning.
- A1 Phase 2 (init_rpc.rs extraction) complete.
- M5 (Locality/RTT) integrated via LocalityCallback.
- M9 (Deliberation Clock Skew) fixed with relative timing.
- A5 (Configuration Sprawl) fixed with SupervisorConfig struct.
- A6 (Error Swallowing) fixed with supervisor error metrics for observability.
- A2 (Circular Dependencies) verified as non-issue: dependencies form a DAG.
- A3 (Trust Graph), A4 (Callback Patterns), A8 (Byzantine Detector) verified as appropriate patterns.
Priority 1: CRITICAL (Production Blockers)
These must be fixed before any real-world use.
C1. Ledger Rollback Not Implemented - FIXED
Location: icn-core/src/supervisor.rs:2206, icn-ledger/src/ledger.rs
Issue: Governance proposals for ledger rollback are accepted but never executed
Impact: Emergency recovery impossible
Fix: Implemented Ledger::rollback_to_entry() method with archive storage, balance recomputation, fork index rebuild, and gossip notification. Supervisor now executes rollback when governance proposal is accepted.
C2. Dispute Resolution Not Executed - FIXED
Location: icn-core/src/supervisor.rs:2276
Issue: DisputeResolution proposals accepted but not applied to ledger
Impact: Accepted dispute decisions have no effect
Fix: Supervisor now maps governance DisputeResolutionOutcome to ledger DisputeOutcome and calls DisputeManager::resolve_escalated_dispute() when proposals are accepted.
C3. Actor Pause/Resume Missing (Compute Migration) - FIXED
Location: icn-compute/src/migration_manager.rs, icn-compute/src/actor_runtime.rs
Issue: Actor migration has TODO for "Week 4 integration" - no pause/resume
Impact: Live migration will corrupt actor state
Fix: Implemented actor execution control. MigrationManager has pause_actor(), resume_actor(), and restore_actor() methods that send ActorRuntimeCommand variants. StatefulActorRegistry handles these commands via create_callback() integration. Full state machine: Running → Paused → Migrating → (transferred to target) with proper checkpoint coordination.
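The state machine named in the C3 fix can be pictured as a small transition function. This is a minimal sketch only: the `ActorState` and `Command` enums and the `transition` function are hypothetical names chosen for illustration, and the real `ActorRuntimeCommand` variants and checkpoint coordination are not reproduced here.

```rust
/// Hypothetical lifecycle states, named after the state machine described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActorState {
    Running,
    Paused,
    Migrating,
    /// Ownership has moved to the target node.
    Transferred,
}

/// Hypothetical commands; the real ActorRuntimeCommand variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Command {
    Pause,
    Resume,
    BeginMigration,
    CompleteTransfer,
}

/// Returns the next state, or an error for a transition this sketch does not allow.
pub fn transition(state: ActorState, cmd: Command) -> Result<ActorState, String> {
    use ActorState::*;
    use Command::*;
    match (state, cmd) {
        (Running, Pause) => Ok(Paused),
        (Paused, Resume) => Ok(Running),
        // Migration may only start from a paused actor, so state is quiescent.
        (Paused, BeginMigration) => Ok(Migrating),
        (Migrating, CompleteTransfer) => Ok(Transferred),
        (s, c) => Err(format!("invalid transition: {:?} while {:?}", c, s)),
    }
}
```

Rejecting `BeginMigration` from `Running` is the point of the sketch: a migration that skips the pause step is the case the original issue describes as corrupting actor state.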
C4. Gossip Handle Race in Ledger - FIXED
Location: icn-ledger/src/ledger.rs
Issue: gossip.take() during entry append creates window where new entries silently fail to publish
Impact: Entries stored locally but never propagated → split-brain
Fix: Added append_entry_from_sync() method to avoid re-broadcasting entries received from gossip. Removed dangerous .take() pattern.
C5. Trust Penalty Callback Race - FIXED
Location: icn-core/src/supervisor.rs:161-193
Issue: tokio::spawn() without await means trust updates race with gossip updates
Impact: Trust scores diverge across network
Fix: Changed from fire-and-forget tokio::spawn to synchronous tokio::task::block_in_place for trust penalty callback, ensuring updates complete before returning.
C6. Vote Tally Not Synchronized - VERIFIED IMPLEMENTED
Location: icn-core/src/governance/actor.rs:512-559
Issue: Tally computed on-demand, not persisted. Different nodes see different counts.
Impact: Governance proposals may pass on some nodes, fail on others
Status: Already implemented. Governance actor computes tally when closing proposals and broadcasts ProposalClosed message with canonical TallySnapshot via gossip.
C7. Proposal Outcome Not Gossiped - VERIFIED IMPLEMENTED
Location: icn-core/src/governance/actor.rs:553-559, icn-governance/src/message.rs
Issue: When proposal closes, outcome is local only
Impact: Nodes don't know final governance decisions
Status: Already implemented. GovernanceMessage::ProposalClosed variant includes outcome and tally snapshot. Receiving nodes handle and store the outcome (lines 776-791).
C8. RPC/Gateway Has No Trust-Based Rate Limiting - FIXED
Location: icn-rpc/src/server.rs, icn-rpc/src/auth.rs
Issue: All authenticated users get same rate limits regardless of trust
Impact: Low-trust peers can spam API
Fix: Added trust-gated rate limiter to RPC server using icn_net::RateLimiter. Different trust levels get different limits (Isolated: 10/sec, Known: 50/sec, Partner: 100/sec, Federated: 200/sec). Enabled automatically in supervisor.
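The per-class limits listed in the C8 fix can be expressed as a simple lookup feeding a limiter. The sketch below is illustrative and does not reproduce `icn_net::RateLimiter`; the `TrustClass` enum, `requests_per_sec`, and the fixed-window `WindowLimiter` are assumed names, and only the four numeric limits come from the text.

```rust
/// Trust classes named in the text; the enum itself is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrustClass {
    Isolated,
    Known,
    Partner,
    Federated,
}

/// Requests per second allowed for each trust class (values from the text).
pub fn requests_per_sec(class: TrustClass) -> u32 {
    match class {
        TrustClass::Isolated => 10,
        TrustClass::Known => 50,
        TrustClass::Partner => 100,
        TrustClass::Federated => 200,
    }
}

/// Hypothetical fixed-window limiter: the owner calls `reset` once per second.
pub struct WindowLimiter {
    limit: u32,
    used: u32,
}

impl WindowLimiter {
    pub fn new(class: TrustClass) -> Self {
        Self { limit: requests_per_sec(class), used: 0 }
    }

    /// Returns true if the request fits in the current window.
    pub fn try_acquire(&mut self) -> bool {
        if self.used < self.limit {
            self.used += 1;
            true
        } else {
            false
        }
    }

    /// Starts a new one-second window.
    pub fn reset(&mut self) {
        self.used = 0;
    }
}
```

A production limiter would more likely use a token bucket keyed by peer DID; the fixed window is used here only to keep the example short.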
Priority 2: HIGH (Correctness Issues)
These cause incorrect behavior but may not immediately crash the system.
H1. Configuration Changes Not Applied - FIXED
Location: icn-core/src/supervisor.rs:2083
Issue: ConfigChange proposals accepted but never take effect
Fix: Implemented config update execution. Supervisor now parses new_config JSON string into GovernanceConfig, calls GovernanceCommand::UpdateDomainConfig which updates the domain and broadcasts DomainUpdated message via gossip.
H2. Membership Updates Not Executed - FIXED
Location: icn-core/src/supervisor.rs:2089
Issue: Member add/remove proposals don't modify actual membership
Fix: Implemented membership update execution. Added GovernanceCommand::UpdateMembership which adds/removes members from MembershipSource::StaticList. Supervisor calls this when Membership proposals are accepted. Gossip broadcasts DomainUpdated.
H3. Replica Threshold Never Checked - FIXED
Location: icn-gossip/src/gossip.rs:1573
Issue: Phase 17 incomplete - replica count below threshold not detected
Fix: Added immediate replica threshold check during ReplicaStatus message handling. Now checks if healthy replica count < 3, emits content_under_replicated_detected_total metric, and logs warning. ReplicationManager handles remediation via its periodic health check.
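The threshold check described in the H3 fix reduces to counting healthy replicas against a minimum. A minimal sketch follows, assuming a hypothetical `ReplicaHealth` enum and `is_under_replicated` helper; only the threshold of 3 is taken from the text, and metric emission and logging are left out.

```rust
/// Minimum healthy replica count, as stated in the text.
pub const MIN_HEALTHY_REPLICAS: usize = 3;

/// Hypothetical per-replica health states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReplicaHealth {
    Healthy,
    Degraded,
    Unreachable,
}

/// Returns true when fewer than MIN_HEALTHY_REPLICAS replicas are healthy.
pub fn is_under_replicated(replicas: &[ReplicaHealth]) -> bool {
    let healthy = replicas
        .iter()
        .filter(|r| **r == ReplicaHealth::Healthy)
        .count();
    healthy < MIN_HEALTHY_REPLICAS
}
```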
H4. Partition Healing Incomplete - FIXED
Location: icn-gossip/src/gossip.rs:248
Issue: TODO for PartitionHealRequest/Response - uses empty VectorClock
Impact: Partitions detected but not actually healed
Fix: heal_partition_with_peer() now sends PartitionHealRequest with actual vector clock. Added mark_healing_started/complete to PartitionHealer for tracking. Response handler merges clocks and requests diverged entries.
H5. Ledger Entry Acceptance Has No Trust Check - FIXED
Location: icn-ledger/src/ledger.rs - append_entry()
Issue: Credit limits use trust, but entry acceptance doesn't validate trust
Impact: Malicious peers can spam ledger up to credit limit
Fix: Added trust_graph and min_trust_for_entry fields to Ledger. append_entry_internal() now validates author's trust score against threshold (default 0.1 = Known class). Rejects entries from low-trust authors with metric tracking.
H6. Default Trust Thresholds Too Permissive - FIXED
Locations: TLS (0.0), Compute (0.0)
Issue: Default accepts everyone with valid DID
Fix: Updated TrustGatedRateLimitConfig.min_trust_threshold default from 0.0 to 0.1 (Known trust class minimum). Added warning in TLS fallback path to indicate development-only mode. Compute already uses proper defaults (MIN_TRUST_SUBMIT=0.1, MIN_TRUST_EXECUTE=0.3).
H7. Gossip Messages Not Trust-Gated - FIXED
Location: icn-gossip/src/gossip.rs
Issue: Subscriptions check trust, but message flow doesn't
Fix: Added trust validation at start of handle_message(). Messages from senders with trust < 0.1 (Known class) are rejected. Unknown senders are also rejected. Metric messages_rejected_low_trust_total tracks rejections.
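The gate described in the H7 fix has two rejection paths: a sender with no trust entry, and a sender whose score is below the threshold. The sketch below illustrates that shape under stated assumptions: the trust graph is modelled as a plain `HashMap`, and `Rejection` and `check_sender_trust` are hypothetical names rather than the real `handle_message()` internals.

```rust
use std::collections::HashMap;

/// Minimum trust for the Known class, as stated in the text.
pub const MIN_TRUST_KNOWN: f64 = 0.1;

/// Hypothetical rejection reasons for an incoming message.
#[derive(Debug, PartialEq)]
pub enum Rejection {
    UnknownSender,
    LowTrust { score: f64 },
}

/// Checks a sender against a score table standing in for the trust graph.
pub fn check_sender_trust(
    scores: &HashMap<String, f64>,
    sender_did: &str,
    min_trust: f64,
) -> Result<(), Rejection> {
    match scores.get(sender_did) {
        // Senders with no entry are rejected outright.
        None => Err(Rejection::UnknownSender),
        Some(&score) if score < min_trust => Err(Rejection::LowTrust { score }),
        Some(_) => Ok(()),
    }
}
```

Returning a typed rejection rather than a bare `bool` lets the caller label a counter such as `messages_rejected_low_trust_total` by reason, though whether the real code does so is not stated in the source.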
H8. Vector Clock Merge Missing Conflict Data - FIXED
Location: icn-gossip/src/partition.rs:145-189
Issue: Merge returns version numbers but no actual conflict entries
Fix: Created VersionGap struct with author_did, local_version, remote_version, detected_at timestamp, and GapDirection (RemoteAhead/LocalAhead/Diverged). Merge now returns Vec<VersionGap> with full context. Added merge_simple() for backward compatibility.
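A merge that reports gaps with context, as the H8 fix describes, might look roughly like the following. The field names follow the text, but everything else is assumed: the clock is modelled as a `HashMap<String, u64>`, `detected_at` is a caller-supplied timestamp, and the function signature is hypothetical. Per-author counters alone cannot show divergence, so this sketch never produces `Diverged`; the real implementation presumably uses additional information for that case.

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GapDirection {
    RemoteAhead,
    LocalAhead,
    /// Not produced by this sketch; counters alone cannot detect it.
    Diverged,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VersionGap {
    pub author_did: String,
    pub local_version: u64,
    pub remote_version: u64,
    /// Caller-supplied timestamp (assumed here to be Unix seconds).
    pub detected_at: u64,
    pub direction: GapDirection,
}

/// Merges `remote` into `local` (element-wise max) and reports every author
/// whose versions differed, in sorted DID order.
pub fn merge(
    local: &mut HashMap<String, u64>,
    remote: &HashMap<String, u64>,
    now: u64,
) -> Vec<VersionGap> {
    let authors: BTreeSet<String> = local.keys().chain(remote.keys()).cloned().collect();
    let mut gaps = Vec::new();
    for author in authors {
        let l = local.get(&author).copied().unwrap_or(0);
        let r = remote.get(&author).copied().unwrap_or(0);
        if l == r {
            continue;
        }
        let direction = if r > l {
            GapDirection::RemoteAhead
        } else {
            GapDirection::LocalAhead
        };
        gaps.push(VersionGap {
            author_did: author.clone(),
            local_version: l,
            remote_version: r,
            detected_at: now,
            direction,
        });
        local.insert(author, l.max(r));
    }
    gaps
}
```

A caller healing a partition would typically request entries from the peer for each `RemoteAhead` gap and offer entries for each `LocalAhead` gap.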
H9. Task Completion Not Published - VERIFIED IMPLEMENTED
Location: icn-compute/src/actor.rs:1424-1427
Issue: Status updated locally, never gossiped
Status: Already implemented. ComputeActor broadcasts ComputeMessage::TaskResult via send_callback after local execution (line 1425). It also broadcasts TaskCancelled for cancellations (line 1004) and TaskResult with timeout outcome for deadline failures (line 825). The task.rs file only manages local state; gossip publishing is correctly handled at the actor level.
Priority 3: MEDIUM (Quality/Reliability)
These affect robustness but system can function.
M1. NAT Traversal Relay Fallback Missing - FIXED
Location: icn-net/src/turn.rs, icn-net/src/session.rs
Issue: TURN relay not implemented (Phase 4 TODO)
Impact: Nodes behind symmetric NAT can't connect
Fix: Implemented TurnClient with RFC 5766 protocol (allocate, refresh, create_permission). Added TurnConfig with builder pattern to NetworkConfig. SessionManager creates allocation on startup if configured and includes relay address in connection candidates. Added TURN metrics.
M2. Profile Query Responses Not Implemented - FIXED
Location: icn-core/src/supervisor/mod.rs:1363
Issue: Profile queries received but not answered
Fix: Implemented profile query response handler. When a Query message is received, looks up the requested DID (own profile or cached peer profile) and publishes a Response message via gossip.
M3. Dead-Letter Queue Missing - FIXED
Location: icn-core/src/dead_letter.rs
Issue: Failed ledger entries logged but no recovery path
Fix: Implemented DeadLetterQueue with persistent storage, failure type tracking, retry support, and Prometheus metrics. Provides FailedOperation entries with context for manual review or automated retry.
M4. Executor Capacity Not Tracked - FIXED
Location: icn-compute/src/actor.rs:2206
Issue: Scheduler can't make informed placement decisions
Fix: Added capacity field to ExecutorInfo struct. on_capacity_announce() now stores capacity in the executor registry. Added get_executor_capacity() and get_all_executor_capacities() methods for scheduler placement decisions.
M5. Locality/Region Constraints Incomplete - FIXED
Location: icn-compute/src/actor.rs:1931-1937, icn-core/src/supervisor/mod.rs:2613-2643
Issue: Network RTT and blob registry integration missing
Fix: Added LocalityCallback type to ComputeActor that queries network topology for RTT data. Supervisor wires up callback to NeighborSets for live RTT lookup. Placement scoring now uses real network latency data when available.
M6. Fork Detection Index Not Atomic - VERIFIED NON-ISSUE
Location: icn-ledger/src/ledger.rs:119-176
Issue: Entry stored before fork index updated - crash window
Status: The ForkDetector is an in-memory structure that is rebuilt from persistent entries on startup via rebuild_fork_index(). Any crash window is recovered on restart. Not a data consistency issue.
M7. Balance Recomputation Race - FIXED
Location: icn-ledger/src/ledger.rs:531-578
Issue: Full recompute during quarantine can cause lost updates
Fix: Added journal_version tracking to Ledger. recompute_balances() validates snapshot isolation via version check before applying. Added recompute_balances_with_retry() convenience method that retries on version mismatch.
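The version check described in the M7 fix is a form of optimistic concurrency: snapshot the journal with its version, compute off to the side, and apply only if the version is unchanged. The sketch below shows that pattern with a heavily simplified `Ledger` (one account, signed amounts) behind a `std::sync::Mutex`. All names other than `journal_version` are hypothetical, and the real `recompute_balances()` signature is not known from the source.

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq, Eq)]
pub struct VersionMismatch {
    pub snapshot_version: u64,
    pub journal_version: u64,
}

/// Simplified stand-in for the ledger: one account, signed entry amounts.
pub struct Ledger {
    journal_version: u64,
    entries: Vec<i64>,
    balance: i64,
}

impl Ledger {
    pub fn new() -> Self {
        Self { journal_version: 0, entries: Vec::new(), balance: 0 }
    }

    /// Every append bumps the journal version.
    pub fn append(&mut self, amount: i64) {
        self.entries.push(amount);
        self.journal_version += 1;
    }

    /// Copies the entries together with the version they were read at.
    pub fn snapshot(&self) -> (u64, Vec<i64>) {
        (self.journal_version, self.entries.clone())
    }

    /// Applies a recomputed balance only if no append happened since the snapshot.
    pub fn apply_recomputed(&mut self, snapshot_version: u64, balance: i64) -> Result<(), VersionMismatch> {
        if snapshot_version != self.journal_version {
            return Err(VersionMismatch { snapshot_version, journal_version: self.journal_version });
        }
        self.balance = balance;
        Ok(())
    }

    pub fn balance(&self) -> i64 {
        self.balance
    }
}

/// Retries the snapshot/compute/apply cycle; always makes at least one attempt.
pub fn recompute_with_retry(ledger: &Mutex<Ledger>, max_attempts: u32) -> Result<i64, VersionMismatch> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        // The lock is released between snapshot and apply, so appends can interleave.
        let (version, entries) = ledger.lock().unwrap().snapshot();
        let balance: i64 = entries.iter().sum();
        match ledger.lock().unwrap().apply_recomputed(version, balance) {
            Ok(()) => return Ok(balance),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => continue,
        }
    }
}
```

The lost-update case from the original issue corresponds to calling `apply_recomputed` with a stale version: the check turns a silent overwrite into an explicit error that the retry wrapper can handle.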
M8. Floating Point Offer Selection - FIXED
Location: icn-compute/src/actor.rs:2118-2126
Issue: f64 comparison non-deterministic across platforms
Fix: Implemented deterministic tie-breaking with epsilon-based float comparison (1e-9 threshold) and lexicographic DID comparison as tie-breaker for equal scores.
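The tie-breaking rule in the M8 fix can be sketched as a comparator that treats scores within 1e-9 of each other as equal and then falls back to the DID. The `Offer` struct and function names are hypothetical, scores are assumed to be finite, and the choice that the lexicographically smaller DID wins a tie is an assumption; the source only says the comparison is lexicographic.

```rust
use std::cmp::Ordering;

/// Epsilon threshold from the text.
pub const SCORE_EPSILON: f64 = 1e-9;

/// Hypothetical offer shape: an executor DID and its placement score.
#[derive(Debug, Clone, PartialEq)]
pub struct Offer {
    pub executor_did: String,
    pub score: f64,
}

/// Orders offers so that the best one is the maximum.
/// Higher score wins; scores within SCORE_EPSILON are treated as tied and
/// the lexicographically smaller DID ranks higher (an assumption).
pub fn compare_offers(a: &Offer, b: &Offer) -> Ordering {
    if (a.score - b.score).abs() < SCORE_EPSILON {
        // Reversed so that the smaller DID compares as Greater.
        b.executor_did.cmp(&a.executor_did)
    } else if a.score > b.score {
        Ordering::Greater
    } else {
        Ordering::Less
    }
}

/// Picks the winning offer, independent of input order for distinct DIDs.
pub fn select_best(offers: &[Offer]) -> Option<&Offer> {
    offers.iter().max_by(|a, b| compare_offers(a, b))
}
```

One caveat worth noting: epsilon equality is not transitive, so chains of scores each within 1e-9 of the next can still order differently depending on which pairs are compared. Quantizing scores to integers before comparison would avoid that, at the cost of changing the scoring representation.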
M9. Deliberation Period Clock Skew - FIXED
Location: icn-compute/src/actor.rs:2052-2063
Issue: 500ms wait uses local wall-clock, not synchronized
Fix: Implemented relative timing based on requested_at timestamp from PlacementRequest. Executors calculate deadline = requested_at + DELIBERATION_PERIOD_MS and wait only the remaining time, ensuring all executors broadcast at approximately the same wall-clock time regardless of network latency.
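The deadline arithmetic in the M9 fix is small enough to show directly. In this sketch, `remaining_wait` is a hypothetical helper, timestamps are assumed to be milliseconds since a shared epoch, and the 500 ms value for `DELIBERATION_PERIOD_MS` is taken from the issue description.

```rust
use std::time::Duration;

/// Deliberation window; 500 ms per the issue description.
pub const DELIBERATION_PERIOD_MS: u64 = 500;

/// Time an executor should still wait before broadcasting its offer.
/// Both arguments are assumed to be milliseconds since a shared epoch.
pub fn remaining_wait(requested_at_ms: u64, now_ms: u64) -> Duration {
    let deadline = requested_at_ms.saturating_add(DELIBERATION_PERIOD_MS);
    // If the deadline has already passed, do not wait at all.
    Duration::from_millis(deadline.saturating_sub(now_ms))
}
```

For example, a request stamped at t = 1000 ms that arrives at t = 1120 ms leaves 380 ms to wait, while one arriving at t = 1600 ms leaves none. Note that the result still depends on each executor's clock agreeing with the requester's, which is presumably where the separate clock sync interval comes in.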
Priority 4: ARCHITECTURAL (Technical Debt)
These don't cause immediate bugs but make the system harder to maintain.
A1. Supervisor God Object - IN PROGRESS
Location: icn-core/src/supervisor/ (now modular)
Issue: Creates, wires, and manages 12+ subsystems with 38+ lock acquisitions
Impact: Can't test components in isolation, high-risk changes
Status: Phase 1 complete (2025-12-13). Extracted to supervisor/ directory with modules:
- init_trust.rs - Trust graph and misbehavior detector
- init_gossip.rs - Gossip actor, partitions, replication
- init_ledger.rs - Ledger, disputes, contracts
- registry.rs - Service container types
- shutdown.rs - Graceful shutdown and snapshot management
- mod.rs - Main supervisor (reduced from 3571 to 3256 lines, -315 lines)
Remaining: Network/message handlers, governance subscriptions, and compute callbacks have deeply embedded closures requiring fuller ServiceRegistry integration
A2. Circular Crate Dependencies - VERIFIED NON-ISSUE
Locations: icn-net ↔ icn-gossip ↔ icn-ledger
Claimed Issue: Can't version or update crates independently
Verification: Analyzed with cargo tree. Dependencies form a DAG, not a cycle:
- icn-net → icn-ledger, icn-gossip
- icn-ledger → icn-gossip (no reverse dependency)
- icn-gossip → icn-identity, icn-trust, etc. (no dependency on icn-net or icn-ledger)
No circular dependencies exist. Crates can be versioned independently.
A3. Multiple Sources of Truth (Trust Graph) - VERIFIED APPROPRIATE PATTERN
Issue: Trust graph shared via Arc<RwLock<>> to 6+ actors without coordination
Analysis: Arc<RwLock<>> IS the coordination mechanism. This is the standard Rust pattern for shared mutable state across actors. The RwLock provides:
- Multiple concurrent readers
- Exclusive writer access
- Automatic coordination via lock acquisition
Alternative patterns (message-passing, CRDT) would add complexity without benefit for in-process actors.
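The shared-state pattern discussed in A3 looks roughly like this in isolation. The sketch uses `std::sync::RwLock` and models the trust graph as a `HashMap` of scores; the `SharedTrust` alias and both helper functions are hypothetical, and the real code may use an async-aware lock instead.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the shared trust graph: DID -> score.
pub type SharedTrust = Arc<RwLock<HashMap<String, f64>>>;

/// Many readers may hold the read lock at the same time.
pub fn read_score(graph: &SharedTrust, did: &str) -> Option<f64> {
    graph.read().unwrap().get(did).copied()
}

/// A writer takes the lock exclusively; scores are clamped at zero.
pub fn apply_penalty(graph: &SharedTrust, did: &str, penalty: f64) {
    let mut scores = graph.write().unwrap();
    if let Some(score) = scores.get_mut(did) {
        *score = (*score - penalty).max(0.0);
    }
}
```

Each actor receives its own handle via `Arc::clone`, and a penalty applied through one handle is visible through every other handle as soon as the write lock is released.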
A4. Inconsistent Callback Patterns - VERIFIED APPROPRIATE PATTERN
Issue: Each actor defines own callback types, no common abstraction
Analysis: Actors have different callback needs (different input/output types, sync vs async). A common ActorCallback trait would require either:
- Excessive generics that hurt ergonomics
- Runtime type erasure that loses type safety
The current approach provides type-safe, actor-specific callbacks. This is idiomatic Rust.
A5. Configuration Sprawl - FIXED
Issue: Hardcoded values scattered across supervisor.rs
Fix: Created SupervisorConfig struct in config.rs centralizing:
- candidate_cleanup_interval_secs (default: 300)
- peer_exchange_delay_ms (default: 500)
- peer_exchange_max_peers (default: 50)
- metrics_update_interval_secs (default: 10)
- shutdown_timeout_secs (default: 5)
- clock_sync_interval_secs (default: 600)
Updated supervisor.rs to use config values instead of hardcoded literals.
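A configuration struct with the fields and defaults listed for A5 might be shaped as follows. The field names and default values come from the text; the field types and the use of a `Default` impl are assumptions, since the real `SupervisorConfig` in config.rs is not shown.

```rust
/// Sketch of the centralized supervisor settings; field types are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SupervisorConfig {
    pub candidate_cleanup_interval_secs: u64,
    pub peer_exchange_delay_ms: u64,
    pub peer_exchange_max_peers: usize,
    pub metrics_update_interval_secs: u64,
    pub shutdown_timeout_secs: u64,
    pub clock_sync_interval_secs: u64,
}

impl Default for SupervisorConfig {
    /// Defaults as listed in the text.
    fn default() -> Self {
        Self {
            candidate_cleanup_interval_secs: 300,
            peer_exchange_delay_ms: 500,
            peer_exchange_max_peers: 50,
            metrics_update_interval_secs: 10,
            shutdown_timeout_secs: 5,
            clock_sync_interval_secs: 600,
        }
    }
}
```

With this shape, a test can override a single value using struct update syntax, for example `SupervisorConfig { shutdown_timeout_secs: 1, ..Default::default() }`, without touching the other defaults.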
A6. Error Swallowing - FIXED (OBSERVABILITY APPROACH)
Locations: 8+ places in supervisor.rs
Issue: Errors logged but not propagated
Analysis: Most errors occur in async contexts (background tasks, notification handlers) where there's no caller to propagate to. Logging is the appropriate pattern for these cases.
Fix: Added icn_obs::metrics::supervisor module with observability metrics:
- icn_supervisor_errors_total{operation} - Error counter by operation type
- icn_supervisor_state - State gauge (0=stopped, 1=starting, 2=running, 3=stopping)
- icn_supervisor_actors_spawned_total{actor} - Actor spawn counter
- icn_supervisor_actor_spawn_failures_total{actor} - Spawn failure counter
Key error locations instrumented: metrics_server_start, rpc_server, gateway_server, identity_bundle_missing, gateway_jwt_secret_missing, shutdown_timeout. Errors are now alertable via Prometheus.
A7. Panic! in Production Code - VERIFIED NON-ISSUE
Locations: icn-ledger/sync.rs:86, icn-ledger/dispute.rs:553,625, icn-net/protocol.rs (6 places)
Issue: Panics instead of error returns
Status: All reported panics are inside #[cfg(test)] modules (test code only). No panics exist in production code paths. Verified 2025-12-13.
A8. Byzantine Detector Ownership Unclear - VERIFIED APPROPRIATE PATTERN
Issue: Created in supervisor, shared to Network, Gossip, Ledger
Analysis: Ownership IS explicit:
- Created: init_trust.rs lines 56-64, part of TrustServices
- Owned: Supervisor holds the Arc<RwLock<>>
- Shared: Passed to NetworkActor, GossipActor, LedgerDeps, ComputeActor via clone()
This is the correct pattern for a component that:
- Aggregates misbehavior reports from multiple sources
- Provides unified tracking across actors
- Applies trust penalties when thresholds are exceeded
Well-documented in init_trust.rs with comments explaining the shared ownership design.
Gap Summary by System
| System | Critical | High | Medium | Arch | Total |
|---|---|---|---|---|---|
| Ledger | 1 | 2 | 2 | 0 | 5 |
| Governance | 2 | 2 | 0 | 0 | 4 |
| Trust | 1 | 3 | 0 | 1 | 5 |
| Gossip | 0 | 2 | 0 | 0 | 2 |
| Compute | 1 | 1 | 4 | 0 | 6 |
| Network | 0 | 0 | 1 | 0 | 1 |
| RPC/Gateway | 1 | 0 | 0 | 0 | 1 |
| Core/Supervisor | 2 | 2 | 2 | 7 | 13 |
| Total | 8 | 12 | 9 | 8 | 47 |
Recommended Fix Order
Week 1: Critical Consistency Fixes
- C4 - Ledger gossip handle race (prevents split-brain)
- C5 - Trust penalty callback race (prevents trust divergence)
- C6 - Vote tally synchronization (governance correctness)
- C7 - Proposal outcome gossip (governance visibility)
Week 2: Critical Feature Completion
- C1 - Ledger rollback implementation
- C2 - Dispute resolution execution
- C8 - Trust-based API rate limiting
Week 3: High Priority Correctness
- H4 - Partition healing protocol
- H5 - Ledger entry trust validation
- H7 - Gossip message trust gating
- H1/H2 - Config and membership updates
Week 4: Compute Layer Completion
- C3 - Actor pause/resume
- H9 - Task completion gossip
- M4/M5 - Executor capacity and locality
Week 5+: Architectural Cleanup
- A1 - Supervisor refactoring (incremental)
- A6/A7 - Error handling cleanup
- A2 - Crate dependency cleanup
Test Coverage Needed
// Critical consistency tests
test_ledger_concurrent_append_with_gossip()
test_trust_penalty_vs_gossip_race()
test_proposal_tally_consistency_across_nodes()
test_partition_heal_with_conflicting_entries()
test_task_completion_both_nodes_agree()
// Trust enforcement tests
test_ledger_entry_rejected_low_trust()
test_api_rate_limited_by_trust_class()
test_gossip_message_rejected_low_trust()
// Governance tests
test_proposal_outcome_gossip_propagation()
test_vote_ordering_deterministic()
// Compute tests
test_actor_migration_pause_resume()
test_task_status_gossip_sync()
What's Actually Working Well
Despite the gaps, these systems are solid:
- Identity & Keystore: Multi-device, age-encrypted, migrations work
- Network Layer: QUIC/TLS, rate limiting, signed envelopes all good
- Gossip Core: Vector clocks, subscriptions, anti-entropy work
- Ledger Core: Double-entry, Merkle-DAG, credit limits work
- Contract Execution: CCL interpreter, fuel metering work
- Security Detection: Byzantine detection, reputation, quarantine work
- Gateway API: REST/WebSocket endpoints, JWT auth work
The gaps are in integration (systems don't talk to each other correctly) and edge cases (concurrent operations, failure recovery).
Conclusion
The ICN codebase is architecturally sound but has critical integration gaps. The most dangerous issues are:
- Consistency bugs that cause split-brain (ledger, trust, governance)
- Trust enforcement gaps that undermine the security model
- Incomplete features marked TODO that are assumed working
Fixing the 8 Critical items is essential before any production use. The 12 High items should follow. The Medium and Architectural items can be addressed incrementally.
Estimated effort: 4-5 weeks for Critical + High priority fixes.