Phase 4: NAT Traversal, Supervisor Modularization & Locality Integration

Date: 2025-12-13
Status: Complete

Overview

This session completed several key infrastructure improvements:

  • TURN relay client for NAT traversal fallback (M1)
  • Supervisor modularization with extracted modules (A1)
  • TypeScript SDK public statistics endpoint
  • Balance recomputation race fix (M7)
  • Locality/RTT integration for compute placement (M5)
  • Deliberation period clock skew fix (M9)
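  • Configuration sprawl fix with SupervisorConfig (A5)
  • Supervisor error metrics for observability (A6)
  • Verification of remaining architectural items (A2, A3, A4, A8)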

Changes Made

1. TURN Relay Client (M1)

Created /home/matt/projects/icn/icn/crates/icn-net/src/turn.rs:

  • TurnConfig: Configuration with builder pattern

    • new(server), with_username(), with_password()
    • with_timeout(), with_allocation_lifetime()
  • TurnClient: Full TURN protocol implementation (RFC 5766)

    • allocate(socket) - Request relay allocation
    • refresh(socket) - Refresh existing allocation
    • create_permission(socket, peer_addr) - Allow peer to send via relay
  • TurnAllocation: Tracks relay and mapped addresses with expiry

  • TurnPermission: Tracks peer permissions with expiry
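
The builder methods listed for TurnConfig could be shaped roughly as follows. This is a minimal sketch, not the contents of turn.rs: the field names, field types, and the 5-second timeout default are assumptions, and the 600-second lifetime mirrors the default allocation lifetime in RFC 5766.

```rust
use std::time::Duration;

/// Hypothetical shape of `TurnConfig`; fields and defaults are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct TurnConfig {
    pub server: String,
    pub username: Option<String>,
    pub password: Option<String>,
    pub timeout: Duration,
    pub allocation_lifetime: Duration,
}

impl TurnConfig {
    /// Start from the server address; credentials stay unset until provided.
    pub fn new(server: impl Into<String>) -> Self {
        Self {
            server: server.into(),
            username: None,
            password: None,
            timeout: Duration::from_secs(5), // assumed default
            allocation_lifetime: Duration::from_secs(600), // RFC 5766 default lifetime
        }
    }

    pub fn with_username(mut self, username: impl Into<String>) -> Self {
        self.username = Some(username.into());
        self
    }

    pub fn with_password(mut self, password: impl Into<String>) -> Self {
        self.password = Some(password.into());
        self
    }

    pub fn with_timeout(mut self, timeout: Duration) -> Self {
        self.timeout = timeout;
        self
    }

    pub fn with_allocation_lifetime(mut self, lifetime: Duration) -> Self {
        self.allocation_lifetime = lifetime;
        self
    }
}
```

Each with_* method consumes and returns the config, so calls chain: TurnConfig::new("turn.example.com:3478").with_username("user").with_password("password").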

Integration Points

  1. Config (icn-core/src/config.rs):

    • Added turn_server, turn_username, turn_password to NetworkConfig
    • Added turn_config() helper method
  2. SessionManager (icn-net/src/session.rs):

    • Added turn_client and relay_addr fields
    • Extended start() to accept turn_config parameter
    • Creates TURN allocation on startup if configured
    • Added relay_addr() getter method
    • Updated connection_candidate() to include relay address
  3. NetworkActor (icn-net/src/actor.rs):

    • Extended spawn() signature with turn_config parameter
    • Passes config to session manager
  4. Supervisor (icn-core/src/supervisor/mod.rs):

    • Reads TURN config from network config
    • Passes to NetworkActor::spawn

2. Supervisor Modularization (A1)

background_tasks.rs (Previously Completed)

Created /home/matt/projects/icn/icn/crates/icn-core/src/supervisor/background_tasks.rs:

  • spawn_clock_sync_task() - Background clock synchronization
  • spawn_metrics_update_task() - Periodic metrics updates
  • steward module with helper functions

init_rpc.rs (New)

Created /home/matt/projects/icn/icn/crates/icn-core/src/supervisor/init_rpc.rs:

  • RpcConfig: Configuration from daemon config
  • RpcDeps: All handles needed for RPC server
  • GatewayConfig: Configuration for gateway server
  • GatewayDeps: Event broadcaster and compute handle
  • spawn_rpc_server() - Creates and spawns RPC server with all handles
  • spawn_gateway_server() - Spawns gateway in dedicated thread

3. TypeScript SDK Enhancement

Updated /home/matt/projects/icn/sdk/typescript/src/:

  • types.ts: Added CoopStatsResponse interface
  • index.ts: Added getCoopStats(coopId) method (no auth required)

4. Balance Recomputation Race Fix (M7) - Previously Completed

  • Added journal_version tracking to Ledger
  • Snapshot validation in recompute_balances()
  • Added recompute_balances_with_retry() convenience method
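
The version check behind this fix can be illustrated with a small optimistic-concurrency sketch. Only journal_version, recompute_balances() and recompute_balances_with_retry() are named in these notes; the Entry type, the StaleSnapshot error, the Mutex layout and the max_attempts parameter are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical journal entry: a signed change to one account.
#[derive(Debug, Clone)]
pub struct Entry {
    pub account: String,
    pub delta: i64,
}

#[derive(Default)]
struct Inner {
    journal: Vec<Entry>,
    journal_version: u64, // bumped on every append
    balances: HashMap<String, i64>,
}

#[derive(Default)]
pub struct Ledger {
    inner: Mutex<Inner>,
}

/// Returned when the journal changed while balances were being recomputed.
#[derive(Debug, PartialEq)]
pub struct StaleSnapshot;

impl Ledger {
    pub fn append(&self, entry: Entry) {
        let mut guard = self.inner.lock().unwrap();
        guard.journal.push(entry);
        guard.journal_version += 1;
    }

    /// Snapshot the journal, recompute outside the lock, and commit only
    /// if no append happened in between.
    pub fn recompute_balances(&self) -> Result<(), StaleSnapshot> {
        let (snapshot, version) = {
            let guard = self.inner.lock().unwrap();
            (guard.journal.clone(), guard.journal_version)
        };
        let mut balances = HashMap::new();
        for entry in &snapshot {
            *balances.entry(entry.account.clone()).or_insert(0) += entry.delta;
        }
        let mut guard = self.inner.lock().unwrap();
        if guard.journal_version != version {
            return Err(StaleSnapshot); // snapshot is out of date, discard the result
        }
        guard.balances = balances;
        Ok(())
    }

    /// Retry the recomputation a bounded number of times.
    pub fn recompute_balances_with_retry(&self, max_attempts: u32) -> Result<(), StaleSnapshot> {
        for _ in 0..max_attempts {
            if self.recompute_balances().is_ok() {
                return Ok(());
            }
        }
        Err(StaleSnapshot)
    }

    pub fn balance(&self, account: &str) -> i64 {
        self.inner.lock().unwrap().balances.get(account).copied().unwrap_or(0)
    }
}
```

The point of the sketch is that a recomputation based on an outdated journal is rejected rather than overwriting balances that a concurrent append has already invalidated.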

Metrics Added

TURN-related metrics in icn-obs/src/metrics.rs:

  • turn_allocation_inc() - Successful allocations
  • turn_allocation_failure_inc(reason) - Failed allocations
  • turn_permission_refresh_inc() - Permission refreshes

Configuration

New TURN config options in icn.toml:

[network]
# TURN relay server for NAT traversal fallback
turn_server = "turn.example.com:3478"
turn_username = "user"      # Optional
turn_password = "password"  # Optional
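
A sketch of how these three options might feed the turn_config() helper on NetworkConfig. Both structs below are reduced stand-ins with assumed field types; the real definitions in icn-core and icn-net may differ.

```rust
/// Hypothetical subset of `NetworkConfig`: only the TURN options are modeled.
#[derive(Debug, Default, Clone)]
pub struct NetworkConfig {
    pub turn_server: Option<String>,
    pub turn_username: Option<String>,
    pub turn_password: Option<String>,
}

/// Minimal stand-in for the real `TurnConfig`.
#[derive(Debug, Clone, PartialEq)]
pub struct TurnConfig {
    pub server: String,
    pub username: Option<String>,
    pub password: Option<String>,
}

impl NetworkConfig {
    /// Returns `None` when no TURN server is set, so callers can skip relay setup.
    pub fn turn_config(&self) -> Option<TurnConfig> {
        let server = self.turn_server.clone()?;
        Some(TurnConfig {
            server,
            username: self.turn_username.clone(),
            password: self.turn_password.clone(),
        })
    }
}
```

Returning an Option matches the note in the Testing section that tests pass None for turn_config.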

Testing

All existing tests were updated to pass the new turn_config parameter (None in tests).

Files updated:

  • icn-net/src/session.rs (test module)
  • icn-net/src/actor.rs (test module)
  • icn-net/tests/did_tls_binding_integration.rs
  • icn-net/tests/encrypted_message_integration.rs
  • icn-net/tests/trust_gated_tls_integration.rs
  • icn-core/tests/*.rs (11 test files)

Architecture Notes

TURN Integration Design

The TURN relay provides fallback connectivity when direct P2P connections fail:

  1. Startup: If TURN is configured, the session manager creates a relay allocation
  2. Relay Address: The allocated address is stored and advertised in connection candidates via gossip
  3. Peer Discovery: Other nodes see the relay address as a connection option
  4. Future Work: Connection fallback logic (try direct -> STUN -> TURN) is not yet implemented
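
Since the fallback logic is still future work, the following is only a hypothetical sketch of the direct -> STUN -> TURN ordering. None of these names (CandidateKind, Candidate, fallback_order, connect_with_fallback) come from the codebase.

```rust
use std::net::SocketAddr;

/// Hypothetical candidate kinds; variant order encodes the preference
/// direct -> STUN (server-reflexive) -> TURN (relay).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum CandidateKind {
    Direct,
    ServerReflexive,
    Relay,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub kind: CandidateKind,
    pub addr: SocketAddr,
}

/// Sort candidates so cheaper paths are attempted before the relay.
pub fn fallback_order(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    candidates.sort_by_key(|c| c.kind); // stable sort keeps ties in input order
    candidates
}

/// Try each candidate in preference order and return the first that connects.
/// `try_connect` stands in for a real connection attempt.
pub fn connect_with_fallback<F>(candidates: Vec<Candidate>, mut try_connect: F) -> Option<Candidate>
where
    F: FnMut(&Candidate) -> bool,
{
    fallback_order(candidates).into_iter().find(|c| try_connect(c))
}
```

With this ordering the relay address is used only after the direct and server-reflexive candidates have failed, which keeps relayed traffic to a minimum.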

Supervisor Module Structure

supervisor/
├── mod.rs              # Main supervisor logic
├── background_tasks.rs # Background task factories
├── init_gossip.rs      # Gossip initialization
├── init_ledger.rs      # Ledger/contract initialization
├── init_rpc.rs         # RPC/Gateway initialization (NEW)
├── init_trust.rs       # Trust graph initialization
├── registry.rs         # Actor registry
└── shutdown.rs         # Graceful shutdown helpers

5. Locality/RTT Integration (M5)

Integrated network topology RTT data into compute placement scoring.

New Types (icn-compute/src/actor.rs):

  • LocalityCallback: Callback type for querying locality data
  • Takes peer DID, returns LocalityContext with RTT, blob info, region data

ComputeActor Changes:

  • Added locality_callback: Option<LocalityCallback> field
  • Added set_locality_callback() setter method
  • on_placement_request() now uses callback instead of LocalityContext::empty()

Supervisor Integration (icn-core/src/supervisor/mod.rs):

  • Creates locality callback that queries NeighborSets.get_rtt(peer)
  • Wires callback to compute actor before spawning
  • Uses blocking_read() for sync callback execution

Impact:

  • Placement offers now include real RTT data when available
  • Better placement decisions based on network latency
  • Foundation for full data locality scoring
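
The callback wiring described above might look like the sketch below. It is a simplified model under several assumptions: the callback signature, the single rtt field on LocalityContext, the record_rtt() and locality_for() helpers, and the "did:icn:..." identifiers are all illustrative, and std::sync::RwLock stands in for the lock whose blocking_read() the supervisor uses.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::time::Duration;

/// Reduced stand-in: the real context also carries blob and region data.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct LocalityContext {
    pub rtt: Option<Duration>,
}

impl LocalityContext {
    pub fn empty() -> Self {
        Self::default()
    }
}

/// Assumed callback shape: peer DID in, locality data out.
pub type LocalityCallback = Arc<dyn Fn(&str) -> LocalityContext + Send + Sync>;

/// Hypothetical RTT table keyed by peer DID.
#[derive(Default)]
pub struct NeighborSets {
    rtts: HashMap<String, Duration>,
}

impl NeighborSets {
    pub fn record_rtt(&mut self, peer: &str, rtt: Duration) {
        self.rtts.insert(peer.to_string(), rtt);
    }

    pub fn get_rtt(&self, peer: &str) -> Option<Duration> {
        self.rtts.get(peer).copied()
    }
}

#[derive(Default)]
pub struct ComputeActor {
    locality_callback: Option<LocalityCallback>,
}

impl ComputeActor {
    pub fn set_locality_callback(&mut self, callback: LocalityCallback) {
        self.locality_callback = Some(callback);
    }

    /// Falls back to an empty context when no callback is wired.
    pub fn locality_for(&self, peer: &str) -> LocalityContext {
        match &self.locality_callback {
            Some(callback) => callback(peer),
            None => LocalityContext::empty(),
        }
    }
}

/// Supervisor-side wiring: the closure reads RTT from the shared neighbor table.
pub fn make_locality_callback(neighbors: Arc<RwLock<NeighborSets>>) -> LocalityCallback {
    Arc::new(move |peer: &str| {
        let guard = neighbors.read().unwrap();
        LocalityContext { rtt: guard.get_rtt(peer) }
    })
}
```

Keeping the callback optional means a compute actor without supervisor wiring behaves as before, scoring with an empty locality context.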

6. Deliberation Period Clock Skew Fix (M9)

Fixed timing issue in placement offer deliberation period.

Problem:

  • Executors waited a fixed 500ms after receiving a request before broadcasting placement offers
  • The wait was timed from local receipt, so network latency was ignored
  • Nodes that received a request late also broadcast late, which put them at a disadvantage

Solution (icn-compute/src/actor.rs):

  • Added DELIBERATION_PERIOD_MS constant (500ms)
  • Use requested_at timestamp from PlacementRequest as reference
  • Calculate deadline = requested_at + DELIBERATION_PERIOD_MS
  • Calculate remaining_ms = deadline - now() (saturating)
  • Sleep only the remaining time, not full 500ms

Result:

  • All executors broadcast at approximately the same wall-clock time, provided their clocks are synchronized
  • Network latency no longer disadvantages distant executors
  • Fairer placement competition
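
The deadline arithmetic from the solution steps can be written as a small pure function. In this sketch the timestamps are assumed to be milliseconds since the Unix epoch, and the function name remaining_deliberation is hypothetical.

```rust
use std::time::Duration;

/// Deliberation window in milliseconds, as described in these notes.
pub const DELIBERATION_PERIOD_MS: u64 = 500;

/// How long an executor should still sleep before broadcasting its offer.
/// `requested_at_ms` comes from the PlacementRequest, `now_ms` from the local clock.
pub fn remaining_deliberation(requested_at_ms: u64, now_ms: u64) -> Duration {
    let deadline_ms = requested_at_ms.saturating_add(DELIBERATION_PERIOD_MS);
    // Saturating subtraction: a request that arrives after the deadline waits 0ms.
    Duration::from_millis(deadline_ms.saturating_sub(now_ms))
}
```

For a request stamped at t=1000ms that arrives at t=1120ms, the executor sleeps 380ms instead of 500ms. If the local clock runs behind the requester's, the computed wait exceeds 500ms; capping it at DELIBERATION_PERIOD_MS would be an extra safeguard that these notes do not describe.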

7. Configuration Sprawl Fix (A5)

Centralized hardcoded configuration values from supervisor.rs.

New Struct (icn-core/src/config.rs):

pub struct SupervisorConfig {
    pub candidate_cleanup_interval_secs: u64,   // default: 300
    pub peer_exchange_delay_ms: u64,            // default: 500
    pub peer_exchange_max_peers: usize,         // default: 50
    pub metrics_update_interval_secs: u64,      // default: 10
    pub shutdown_timeout_secs: u64,             // default: 5
    pub clock_sync_interval_secs: u64,          // default: 600
}

Supervisor Updates:

  • Replaced hardcoded Duration::from_secs(300) with config value
  • Replaced hardcoded Duration::from_millis(500) with config value
  • Replaced hardcoded Some(50) peer exchange limit with config value
  • Replaced hardcoded Duration::from_secs(10) metrics interval with config value
  • Replaced hardcoded Duration::from_secs(5) shutdown timeout with config value

Benefits:

  • Timing values and the peer exchange limit are now configurable via TOML
  • Easier to tune for different environments
  • Values documented with defaults
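
The documented defaults can be captured in a Default implementation, sketched below. The struct fields and default values are taken from the listing above; the Duration accessor methods are assumptions added to show how the hardcoded Duration::from_secs(...) calls could be replaced.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub struct SupervisorConfig {
    pub candidate_cleanup_interval_secs: u64,
    pub peer_exchange_delay_ms: u64,
    pub peer_exchange_max_peers: usize,
    pub metrics_update_interval_secs: u64,
    pub shutdown_timeout_secs: u64,
    pub clock_sync_interval_secs: u64,
}

impl Default for SupervisorConfig {
    /// Defaults as documented next to each field in these notes.
    fn default() -> Self {
        Self {
            candidate_cleanup_interval_secs: 300,
            peer_exchange_delay_ms: 500,
            peer_exchange_max_peers: 50,
            metrics_update_interval_secs: 10,
            shutdown_timeout_secs: 5,
            clock_sync_interval_secs: 600,
        }
    }
}

impl SupervisorConfig {
    // Hypothetical accessors: convert raw numbers to `Duration` in one place.
    pub fn candidate_cleanup_interval(&self) -> Duration {
        Duration::from_secs(self.candidate_cleanup_interval_secs)
    }

    pub fn peer_exchange_delay(&self) -> Duration {
        Duration::from_millis(self.peer_exchange_delay_ms)
    }

    pub fn shutdown_timeout(&self) -> Duration {
        Duration::from_secs(self.shutdown_timeout_secs)
    }
}
```

Struct update syntax then allows overriding a single value while keeping the rest, for example SupervisorConfig { shutdown_timeout_secs: 30, ..Default::default() }.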

8. Error Swallowing Observability Fix (A6)

Added supervisor error metrics to make logged errors observable via Prometheus.

New Metrics Module (icn-obs/src/metrics.rs):

pub mod supervisor {
    pub fn error_inc(operation: &str)
    pub fn startup_phase_inc(phase: &str)
    pub fn state_set(state: u8)  // 0=stopped, 1=starting, 2=running, 3=stopping
    pub fn actor_spawned_inc(actor: &str)
    pub fn actor_spawn_failed_inc(actor: &str)
}

Instrumented Error Locations:

  • metrics_server_start - Metrics server failed to start
  • rpc_server - RPC server error during operation
  • gateway_server - Gateway server error
  • identity_bundle_missing - No identity bundle available
  • gateway_jwt_secret_missing - JWT secret not configured
  • shutdown_timeout - Background tasks didn't complete in time

Actor Spawn Tracking:

  • gossip, ledger, network, compute, rpc_server, gateway
  • Failures tracked with actor_spawn_failed_inc()

Lifecycle State:

  • Supervisor state gauge tracks: stopped(0) → starting(1) → running(2) → stopping(3)

Rationale: Most supervisor errors occur in async contexts where there's no caller to propagate to. Rather than changing the architecture, errors are now countable and alertable via Prometheus metrics.
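
The count-then-log approach can be sketched with the standard library alone. This is not the icn-obs implementation: the in-memory SupervisorMetrics struct, the SupervisorState enum, and report_background_error() are hypothetical stand-ins for whatever Prometheus-backed counters and gauges the real module uses.

```rust
use std::collections::HashMap;
use std::fmt::Display;
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Mutex;

/// Lifecycle states with the numeric values listed in these notes.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SupervisorState {
    Stopped = 0,
    Starting = 1,
    Running = 2,
    Stopping = 3,
}

/// In-memory stand-in for the Prometheus-backed metrics.
#[derive(Default)]
pub struct SupervisorMetrics {
    errors: Mutex<HashMap<String, u64>>,
    state: AtomicU8,
}

impl SupervisorMetrics {
    /// Increment the error counter for one operation label.
    pub fn error_inc(&self, operation: &str) {
        *self.errors.lock().unwrap().entry(operation.to_string()).or_insert(0) += 1;
    }

    pub fn error_count(&self, operation: &str) -> u64 {
        self.errors.lock().unwrap().get(operation).copied().unwrap_or(0)
    }

    pub fn state_set(&self, state: u8) {
        self.state.store(state, Ordering::Relaxed);
    }

    pub fn state(&self) -> u8 {
        self.state.load(Ordering::Relaxed)
    }
}

/// For errors with no caller to propagate to: count first, then log.
pub fn report_background_error(metrics: &SupervisorMetrics, operation: &str, err: &dyn Display) {
    metrics.error_inc(operation);
    eprintln!("supervisor error in {}: {}", operation, err);
}
```

An alert on the per-operation counter (for example on rpc_server) then surfaces failures that previously appeared only in logs.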

9. Architectural Issues Verification (A2, A3, A4, A8)

Analyzed remaining architectural issues - all verified as non-issues or appropriate patterns:

A2. Circular Crate Dependencies - VERIFIED NON-ISSUE

Used cargo tree to analyze. Dependencies form a DAG, not a cycle:

  • icn-net → icn-ledger → icn-gossip
  • No dependency points back up the chain
  • Crates can be versioned independently

A3. Trust Graph Multiple Sources - VERIFIED APPROPRIATE PATTERN

Arc<RwLock<>> IS the coordination mechanism. This is the standard Rust pattern:

  • Multiple concurrent readers via read lock
  • Exclusive writer access via write lock
  • Automatic coordination
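
The reader/writer coordination described above can be demonstrated with a toy trust graph. The TrustGraph contents and the helper names are assumptions, and std::sync::RwLock is used here so the sketch needs no async runtime.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Toy trust graph: peer identifier -> score.
#[derive(Default)]
pub struct TrustGraph {
    scores: HashMap<String, f64>,
}

pub type SharedTrustGraph = Arc<RwLock<TrustGraph>>;

/// Writers take the exclusive lock.
pub fn set_trust(graph: &SharedTrustGraph, peer: &str, score: f64) {
    graph.write().unwrap().scores.insert(peer.to_string(), score);
}

/// Readers share the lock and can run concurrently.
pub fn trust_of(graph: &SharedTrustGraph, peer: &str) -> Option<f64> {
    graph.read().unwrap().scores.get(peer).copied()
}

/// Spawn `readers` threads that all read the same peer's score.
pub fn read_concurrently(graph: &SharedTrustGraph, peer: &str, readers: usize) -> Vec<Option<f64>> {
    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let graph = Arc::clone(graph); // each thread gets its own handle
            let peer = peer.to_string();
            thread::spawn(move || trust_of(&graph, &peer))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Every clone of the Arc points at the same graph, so a write through one handle is visible to all readers once the write lock is released.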

A4. Inconsistent Callback Patterns - VERIFIED APPROPRIATE PATTERN

Actors have different callback needs (input/output types, sync/async). A unified trait would require excessive generics or runtime type erasure. Current approach is idiomatic Rust.

A8. Byzantine Detector Ownership - VERIFIED APPROPRIATE PATTERN

Ownership is explicit and well-documented:

  • Created in init_trust.rs as part of TrustServices
  • Shared via Arc<RwLock<>> to Network, Gossip, Ledger, Compute actors
  • Correct pattern for aggregating reports from multiple sources

Summary

All SYSTEM_GAPS architectural items (A1-A8) are now complete:

  • A1: Supervisor modularization (multiple phases)
  • A2-A4, A7-A8: Verified as non-issues or appropriate patterns
  • A5: Configuration sprawl fixed with SupervisorConfig
  • A6: Error observability via supervisor metrics

Next Steps

From ROADMAP.md:

  • Track C1: Pilot community selection (business track)
  • Any remaining A1 work if needed