Phase 4: NAT Traversal, Supervisor Modularization & Locality Integration
Date: 2025-12-13
Status: Complete
Overview
This session completed several key infrastructure improvements:
- TURN relay client for NAT traversal fallback (M1)
- Supervisor modularization with extracted modules (A1)
- TypeScript SDK public statistics endpoint
- Balance recomputation race fix (M7)
- Locality/RTT integration for compute placement (M5)
- Deliberation period clock skew fix (M9)
- Configuration sprawl fix with SupervisorConfig (A5)
Changes Made
1. TURN Relay Client (M1)
Created /home/matt/projects/icn/icn/crates/icn-net/src/turn.rs:
- TurnConfig: Configuration with builder pattern
  - new(server), with_username(), with_password(), with_timeout(), with_allocation_lifetime()
- TurnClient: Full TURN protocol implementation (RFC 5766)
  - allocate(socket) - Request relay allocation
  - refresh(socket) - Refresh existing allocation
  - create_permission(socket, peer_addr) - Allow peer to send via relay
- TurnAllocation: Tracks relay and mapped addresses with expiry
- TurnPermission: Tracks peer permissions with expiry
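As a rough illustration of the builder pattern listed above, the following sketch shows one way the chained setters could be shaped. The field names, the 5-second timeout default, and the use of `Duration` are assumptions made for this example and are not taken from turn.rs; the 600-second allocation lifetime is the default lifetime named in RFC 5766.

```rust
use std::time::Duration;

/// Hypothetical shape of the TURN configuration; fields and defaults are
/// assumptions for illustration, not the actual contents of turn.rs.
#[derive(Debug, Clone, PartialEq)]
pub struct TurnConfig {
    pub server: String,
    pub username: Option<String>,
    pub password: Option<String>,
    pub timeout: Duration,
    pub allocation_lifetime: Duration,
}

impl TurnConfig {
    /// Start from the server address; everything else is optional.
    pub fn new(server: impl Into<String>) -> Self {
        Self {
            server: server.into(),
            username: None,
            password: None,
            timeout: Duration::from_secs(5), // assumed default
            allocation_lifetime: Duration::from_secs(600), // RFC 5766 default lifetime
        }
    }

    pub fn with_username(mut self, username: impl Into<String>) -> Self {
        self.username = Some(username.into());
        self
    }

    pub fn with_password(mut self, password: impl Into<String>) -> Self {
        self.password = Some(password.into());
        self
    }

    pub fn with_timeout(mut self, timeout: Duration) -> Self {
        self.timeout = timeout;
        self
    }

    pub fn with_allocation_lifetime(mut self, lifetime: Duration) -> Self {
        self.allocation_lifetime = lifetime;
        self
    }
}
```

Each setter consumes and returns the config, so a call site can chain only the options it needs, for example `TurnConfig::new("turn.example.com:3478").with_username("user")`.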
Integration Points
- Config (icn-core/src/config.rs):
  - Added turn_server, turn_username, turn_password to NetworkConfig
  - Added turn_config() helper method
- SessionManager (icn-net/src/session.rs):
  - Added turn_client and relay_addr fields
  - Extended start() to accept turn_config parameter
  - Creates TURN allocation on startup if configured
  - Added relay_addr() getter method
  - Updated connection_candidate() to include relay address
- NetworkActor (icn-net/src/actor.rs):
  - Extended spawn() signature with turn_config parameter
  - Passes config to session manager
- Supervisor (icn-core/src/supervisor/mod.rs):
  - Reads TURN config from network config
  - Passes to NetworkActor::spawn
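The turn_config() helper is what lets the supervisor hand an optional config down the chain. A minimal sketch of how such a helper might look follows; the struct layouts are simplified stand-ins and are assumptions, with only the three field names and the helper name taken from the list above.

```rust
/// Simplified stand-in for the TURN client configuration (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub struct TurnConfig {
    pub server: String,
    pub username: Option<String>,
    pub password: Option<String>,
}

/// Only the TURN-related fields of NetworkConfig are modeled here.
#[derive(Debug, Clone, Default)]
pub struct NetworkConfig {
    pub turn_server: Option<String>,
    pub turn_username: Option<String>,
    pub turn_password: Option<String>,
}

impl NetworkConfig {
    /// Returns `None` when no TURN server is configured, so the result can
    /// be passed straight through to the network layer as an optional value.
    pub fn turn_config(&self) -> Option<TurnConfig> {
        let server = self.turn_server.clone()?;
        Some(TurnConfig {
            server,
            username: self.turn_username.clone(),
            password: self.turn_password.clone(),
        })
    }
}
```

Returning an `Option` keeps the "TURN not configured" case explicit, which matches tests passing `None` for the parameter.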
2. Supervisor Modularization (A1)
background_tasks.rs (Previously Completed)
Created /home/matt/projects/icn/icn/crates/icn-core/src/supervisor/background_tasks.rs:
- spawn_clock_sync_task() - Background clock synchronization
- spawn_metrics_update_task() - Periodic metrics updates
- steward module with helper functions
init_rpc.rs (New)
Created /home/matt/projects/icn/icn/crates/icn-core/src/supervisor/init_rpc.rs:
- RpcConfig: Configuration from daemon config
- RpcDeps: All handles needed for RPC server
- GatewayConfig: Configuration for gateway server
- GatewayDeps: Event broadcaster and compute handle
- spawn_rpc_server() - Creates and spawns RPC server with all handles
- spawn_gateway_server() - Spawns gateway in dedicated thread
3. TypeScript SDK Enhancement
Updated /home/matt/projects/icn/sdk/typescript/src/:
- types.ts: Added CoopStatsResponse interface
- index.ts: Added getCoopStats(coopId) method (no auth required)
4. Balance Recomputation Race Fix (M7) - Previously Completed
- Added journal_version tracking to Ledger
- Snapshot validation in recompute_balances()
- Added recompute_balances_with_retry() convenience method
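The three points above describe an optimistic snapshot-and-validate scheme. A minimal sketch of that idea, using only the standard library, might look like the following. The `Journal` type, the `append()` and `balance()` methods, the error enum, and the `max_attempts` parameter are hypothetical; only the `journal_version` concept and the two recompute method names come from the notes.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Debug, PartialEq)]
pub enum RecomputeError {
    JournalChanged,
    RetriesExhausted,
}

#[derive(Default)]
struct Journal {
    version: u64, // bumped on every append
    entries: Vec<(String, i64)>,
}

#[derive(Default)]
pub struct Ledger {
    journal: RwLock<Journal>,
    balances: RwLock<HashMap<String, i64>>,
}

impl Ledger {
    pub fn append(&self, account: &str, delta: i64) {
        let mut journal = self.journal.write().unwrap();
        journal.entries.push((account.to_string(), delta));
        journal.version += 1;
    }

    /// Recompute from a snapshot, then publish only if the journal is unchanged.
    pub fn recompute_balances(&self) -> Result<(), RecomputeError> {
        // 1. Snapshot the version and entries under a short read lock.
        let (version, entries) = {
            let journal = self.journal.read().unwrap();
            (journal.version, journal.entries.clone())
        };
        // 2. Do the expensive work without holding any lock.
        let mut fresh: HashMap<String, i64> = HashMap::new();
        for (account, delta) in entries {
            *fresh.entry(account).or_insert(0) += delta;
        }
        // 3. Validate: a concurrent append makes the snapshot stale.
        let journal = self.journal.read().unwrap();
        if journal.version != version {
            return Err(RecomputeError::JournalChanged);
        }
        // The read guard is still held, so no append can slip in before publish.
        *self.balances.write().unwrap() = fresh;
        Ok(())
    }

    pub fn recompute_balances_with_retry(&self, max_attempts: u32) -> Result<(), RecomputeError> {
        for _ in 0..max_attempts {
            match self.recompute_balances() {
                Err(RecomputeError::JournalChanged) => continue,
                other => return other,
            }
        }
        Err(RecomputeError::RetriesExhausted)
    }

    pub fn balance(&self, account: &str) -> i64 {
        self.balances.read().unwrap().get(account).copied().unwrap_or(0)
    }
}
```

The retry wrapper only loops on the stale-snapshot case; any other outcome is returned to the caller unchanged.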
Metrics Added
TURN-related metrics in icn-obs/src/metrics.rs:
- turn_allocation_inc() - Successful allocations
- turn_allocation_failure_inc(reason) - Failed allocations
- turn_permission_refresh_inc() - Permission refreshes
Configuration
New TURN config options in icn.toml:
[network]
# TURN relay server for NAT traversal fallback
turn_server = "turn.example.com:3478"
turn_username = "user" # Optional
turn_password = "password" # Optional
Testing
All existing tests were updated to pass the new turn_config parameter (set to None in tests).
Files updated:
- icn-net/src/session.rs (test module)
- icn-net/src/actor.rs (test module)
- icn-net/tests/did_tls_binding_integration.rs
- icn-net/tests/encrypted_message_integration.rs
- icn-net/tests/trust_gated_tls_integration.rs
- icn-core/tests/*.rs (11 test files)
Architecture Notes
TURN Integration Design
The TURN relay provides fallback connectivity when direct P2P connections fail:
- Startup: If TURN is configured, session manager creates allocation
- Relay Address: Stored and included in connection candidates via gossip
- Peer Discovery: Other nodes see relay address as connection option
- Future Work: Connection fallback logic (try direct -> STUN -> TURN)
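The fallback logic listed as future work (direct, then STUN, then TURN) is not implemented in this session. Purely as a sketch of what that ordering could look like, the example below sorts candidates by preference and returns the first one that connects. The `Candidate` and `CandidateKind` types and both functions are hypothetical and do not exist in icn-net.

```rust
use std::net::SocketAddr;

/// Variant order doubles as preference order (derived `Ord`).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum CandidateKind {
    Direct,          // peer's own address
    ServerReflexive, // public address discovered via STUN
    Relay,           // TURN relay address
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Candidate {
    pub kind: CandidateKind,
    pub addr: SocketAddr,
}

/// Order candidates as direct -> STUN -> TURN (stable within each kind).
pub fn fallback_order(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    candidates.sort_by_key(|c| c.kind);
    candidates
}

/// Try each candidate in preference order; return the first that connects.
/// `try_connect` stands in for the real connection attempt.
pub fn connect_with_fallback<F>(candidates: Vec<Candidate>, mut try_connect: F) -> Option<Candidate>
where
    F: FnMut(&Candidate) -> bool,
{
    fallback_order(candidates)
        .into_iter()
        .find(|c| try_connect(c))
}
```

Keeping the relay last reflects its role as a fallback: it is only tried once the cheaper paths have failed.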
Supervisor Module Structure
supervisor/
├── mod.rs # Main supervisor logic
├── background_tasks.rs # Background task factories
├── init_gossip.rs # Gossip initialization
├── init_ledger.rs # Ledger/contract initialization
├── init_rpc.rs # RPC/Gateway initialization (NEW)
├── init_trust.rs # Trust graph initialization
├── registry.rs # Actor registry
└── shutdown.rs # Graceful shutdown helpers
5. Locality/RTT Integration (M5)
Integrated network topology RTT data into compute placement scoring.
New Types (icn-compute/src/actor.rs):
- LocalityCallback: Callback type for querying locality data
  - Takes peer DID, returns LocalityContext with RTT, blob info, region data
ComputeActor Changes:
- Added locality_callback: Option<LocalityCallback> field
- Added set_locality_callback() setter method
- on_placement_request() now uses callback instead of LocalityContext::empty()
Supervisor Integration (icn-core/src/supervisor/mod.rs):
- Creates locality callback that queries NeighborSets.get_rtt(peer)
- Wires callback to compute actor before spawning
- Uses blocking_read() for sync callback execution
Impact:
- Placement offers now include real RTT data when available
- Better placement decisions based on network latency
- Foundation for full data locality scoring
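The callback wiring described in this section can be sketched roughly as follows. This is a simplified model under several assumptions: `LocalityContext` carries only an RTT field, the `record_rtt()`, `locality_for()`, and `make_locality_callback()` helpers are hypothetical, and a `std::sync::RwLock` with `read()` stands in for the async lock and `blocking_read()` used in the supervisor.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Only the RTT part of the locality data is modeled in this sketch.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct LocalityContext {
    pub rtt_ms: Option<u32>,
}

impl LocalityContext {
    pub fn empty() -> Self {
        Self::default()
    }
}

/// Synchronous callback: takes a peer DID, returns its locality data.
pub type LocalityCallback = Arc<dyn Fn(&str) -> LocalityContext + Send + Sync>;

#[derive(Default)]
pub struct NeighborSets {
    rtts: HashMap<String, u32>,
}

impl NeighborSets {
    pub fn record_rtt(&mut self, peer: &str, rtt_ms: u32) {
        self.rtts.insert(peer.to_string(), rtt_ms);
    }

    pub fn get_rtt(&self, peer: &str) -> Option<u32> {
        self.rtts.get(peer).copied()
    }
}

#[derive(Default)]
pub struct ComputeActor {
    locality_callback: Option<LocalityCallback>,
}

impl ComputeActor {
    pub fn set_locality_callback(&mut self, callback: LocalityCallback) {
        self.locality_callback = Some(callback);
    }

    /// Falls back to an empty context when no callback has been wired.
    pub fn locality_for(&self, peer_did: &str) -> LocalityContext {
        match &self.locality_callback {
            Some(callback) => callback(peer_did),
            None => LocalityContext::empty(),
        }
    }
}

/// Supervisor side: build a callback that reads RTTs from the shared sets.
pub fn make_locality_callback(neighbors: Arc<RwLock<NeighborSets>>) -> LocalityCallback {
    Arc::new(move |peer_did: &str| {
        let guard = neighbors.read().unwrap();
        LocalityContext { rtt_ms: guard.get_rtt(peer_did) }
    })
}
```

Because the callback is optional, an actor without one behaves as before and scores placement with an empty context.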
6. Deliberation Period Clock Skew Fix (M9)
Fixed timing issue in placement offer deliberation period.
Problem:
- Executors waited a fixed 500ms before broadcasting placement offers
- This used local wall-clock time, ignoring network latency
- Nodes receiving requests late would broadcast late, disadvantaging them
Solution (icn-compute/src/actor.rs):
- Added DELIBERATION_PERIOD_MS constant (500ms)
- Use requested_at timestamp from PlacementRequest as reference
- Calculate deadline = requested_at + DELIBERATION_PERIOD_MS
- Calculate remaining_ms = deadline - now() (saturating)
- Sleep only the remaining time, not full 500ms
Result:
- All executors broadcast at approximately the same wall-clock time
- Network latency no longer disadvantages distant executors
- Fairer placement competition
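The deadline arithmetic from the Solution list can be sketched as a small pure function. The function name and the choice of Unix milliseconds as the timestamp unit are assumptions for this example; the constant name and the saturating subtraction come from the notes.

```rust
use std::time::Duration;

pub const DELIBERATION_PERIOD_MS: u64 = 500;

/// How long an executor should still wait before broadcasting its offer.
/// Both arguments are assumed to be Unix timestamps in milliseconds.
pub fn remaining_deliberation(requested_at_ms: u64, now_ms: u64) -> Duration {
    // Shared deadline, anchored to the request rather than to local receipt.
    let deadline_ms = requested_at_ms.saturating_add(DELIBERATION_PERIOD_MS);
    // Saturating: a request that arrives after the deadline waits 0 ms.
    Duration::from_millis(deadline_ms.saturating_sub(now_ms))
}
```

For example, an executor that receives the request 120 ms after requested_at sleeps 380 ms, and one that receives it 700 ms late does not sleep at all. The calculation depends on executors having reasonably synchronized clocks, since a requester clock running ahead would lengthen the wait.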
7. Configuration Sprawl Fix (A5)
Centralized hardcoded configuration values from supervisor.rs.
New Struct (icn-core/src/config.rs):
pub struct SupervisorConfig {
pub candidate_cleanup_interval_secs: u64, // default: 300
pub peer_exchange_delay_ms: u64, // default: 500
pub peer_exchange_max_peers: usize, // default: 50
pub metrics_update_interval_secs: u64, // default: 10
pub shutdown_timeout_secs: u64, // default: 5
pub clock_sync_interval_secs: u64, // default: 600
}
Supervisor Updates:
- Replaced hardcoded Duration::from_secs(300) with config value
- Replaced hardcoded Duration::from_millis(500) with config value
- Replaced hardcoded Some(50) peer exchange limit with config value
- Replaced hardcoded Duration::from_secs(10) metrics interval with config value
- Replaced hardcoded Duration::from_secs(5) shutdown timeout with config value
Benefits:
- All timing values now configurable via TOML
- Easier to tune for different environments
- Values documented with defaults
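One way the documented defaults could be encoded is a `Default` implementation, sketched below. The struct fields and default values are taken from the listing in this section; the `Default` impl itself and the `Duration` accessor methods are assumptions about how the supervisor might consume the values.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub struct SupervisorConfig {
    pub candidate_cleanup_interval_secs: u64,
    pub peer_exchange_delay_ms: u64,
    pub peer_exchange_max_peers: usize,
    pub metrics_update_interval_secs: u64,
    pub shutdown_timeout_secs: u64,
    pub clock_sync_interval_secs: u64,
}

impl Default for SupervisorConfig {
    /// Mirrors the defaults documented next to each field.
    fn default() -> Self {
        Self {
            candidate_cleanup_interval_secs: 300,
            peer_exchange_delay_ms: 500,
            peer_exchange_max_peers: 50,
            metrics_update_interval_secs: 10,
            shutdown_timeout_secs: 5,
            clock_sync_interval_secs: 600,
        }
    }
}

impl SupervisorConfig {
    // Hypothetical accessors that replace the old hardcoded Duration values.
    pub fn candidate_cleanup_interval(&self) -> Duration {
        Duration::from_secs(self.candidate_cleanup_interval_secs)
    }

    pub fn peer_exchange_delay(&self) -> Duration {
        Duration::from_millis(self.peer_exchange_delay_ms)
    }

    pub fn shutdown_timeout(&self) -> Duration {
        Duration::from_secs(self.shutdown_timeout_secs)
    }
}
```

With struct update syntax, an environment can override a single value and keep the rest, for example `SupervisorConfig { shutdown_timeout_secs: 30, ..SupervisorConfig::default() }`.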
8. Error Swallowing Observability Fix (A6)
Added supervisor error metrics to make logged errors observable via Prometheus.
New Metrics Module (icn-obs/src/metrics.rs):
pub mod supervisor {
pub fn error_inc(operation: &str)
pub fn startup_phase_inc(phase: &str)
pub fn state_set(state: u8) // 0=stopped, 1=starting, 2=running, 3=stopping
pub fn actor_spawned_inc(actor: &str)
pub fn actor_spawn_failed_inc(actor: &str)
}
Instrumented Error Locations:
- metrics_server_start - Metrics server failed to start
- rpc_server - RPC server error during operation
- gateway_server - Gateway server error
- identity_bundle_missing - No identity bundle available
- gateway_jwt_secret_missing - JWT secret not configured
- shutdown_timeout - Background tasks didn't complete in time
Actor Spawn Tracking:
- gossip, ledger, network, compute, rpc_server, gateway
- Failures tracked with actor_spawn_failed_inc()
Lifecycle State:
- Supervisor state gauge tracks: stopped(0) → starting(1) → running(2) → stopping(3)
Rationale: Most supervisor errors occur in async contexts where there's no caller to propagate to. Rather than changing the architecture, errors are now countable and alertable via Prometheus metrics.
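The "count instead of propagate" idea in the rationale can be illustrated with a standard-library stand-in. This sketch does not use the real Prometheus registry from icn-obs; the `SupervisorMetrics` struct, the `error_count()` and `state()` readers, and the `report_result()` helper are hypothetical. Only the `error_inc` and `state_set` names and the state numbering come from the listing above.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum SupervisorState {
    Stopped = 0,
    Starting = 1,
    Running = 2,
    Stopping = 3,
}

/// In-memory stand-in for the labeled counter and the state gauge.
#[derive(Default)]
pub struct SupervisorMetrics {
    errors: Mutex<HashMap<String, u64>>,
    state: AtomicU8,
}

impl SupervisorMetrics {
    pub fn error_inc(&self, operation: &str) {
        let mut errors = self.errors.lock().unwrap();
        *errors.entry(operation.to_string()).or_insert(0) += 1;
    }

    pub fn error_count(&self, operation: &str) -> u64 {
        self.errors.lock().unwrap().get(operation).copied().unwrap_or(0)
    }

    pub fn state_set(&self, state: SupervisorState) {
        self.state.store(state as u8, Ordering::Relaxed);
    }

    pub fn state(&self) -> u8 {
        self.state.load(Ordering::Relaxed)
    }
}

/// Log and count an error from a context with no caller to propagate to.
pub fn report_result<T, E: std::fmt::Display>(
    metrics: &SupervisorMetrics,
    operation: &str,
    result: Result<T, E>,
) -> Option<T> {
    match result {
        Ok(value) => Some(value),
        Err(err) => {
            eprintln!("{} failed: {}", operation, err);
            metrics.error_inc(operation);
            None
        }
    }
}
```

A background task would wrap its fallible call in `report_result(&metrics, "rpc_server", result)`, so the failure is both logged and visible as a counter an alert can fire on.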
9. Architectural Issues Verification (A2, A3, A4, A8)
Analyzed remaining architectural issues - all verified as non-issues or appropriate patterns:
A2. Circular Crate Dependencies - VERIFIED NON-ISSUE
Used cargo tree to analyze. Dependencies form a DAG, not a cycle:
- icn-net → icn-ledger → icn-gossip
- No reverse dependencies exist
- Crates can be versioned independently
A3. Trust Graph Multiple Sources - VERIFIED APPROPRIATE PATTERN
Arc<RwLock<>> IS the coordination mechanism. This is the standard Rust pattern:
- Multiple concurrent readers via read lock
- Exclusive writer access via write lock
- Automatic coordination
A4. Inconsistent Callback Patterns - VERIFIED APPROPRIATE PATTERN
Actors have different callback needs (input/output types, sync/async). A unified trait would require excessive generics or runtime type erasure. Current approach is idiomatic Rust.
A8. Byzantine Detector Ownership - VERIFIED APPROPRIATE PATTERN
Ownership is explicit and well-documented:
- Created in init_trust.rs as part of TrustServices
- Shared via Arc<RwLock<>> to Network, Gossip, Ledger, Compute actors
- Correct pattern for aggregating reports from multiple sources
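To make the shared-ownership pattern concrete, the sketch below has several threads, standing in for actors, report into one detector through `Arc<RwLock<>>`. The `ByzantineDetector` methods, the `SharedDetector` alias, and the `report_from_actors()` function are hypothetical; the real detector's API is not shown in these notes.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical detector that counts misbehavior reports per peer.
#[derive(Default)]
pub struct ByzantineDetector {
    reports: HashMap<String, u32>,
}

impl ByzantineDetector {
    pub fn report(&mut self, peer: &str) {
        *self.reports.entry(peer.to_string()).or_insert(0) += 1;
    }

    pub fn report_count(&self, peer: &str) -> u32 {
        self.reports.get(peer).copied().unwrap_or(0)
    }
}

/// One owner creates the detector; every actor gets a cloned handle.
pub type SharedDetector = Arc<RwLock<ByzantineDetector>>;

/// Each named actor reports `peer` once from its own thread.
pub fn report_from_actors(detector: &SharedDetector, actors: &[&str], peer: &str) {
    let handles: Vec<_> = actors
        .iter()
        .map(|_actor| {
            let detector = Arc::clone(detector);
            let peer = peer.to_string();
            thread::spawn(move || {
                // Write lock gives this reporter exclusive access briefly.
                detector.write().unwrap().report(&peer);
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```

Readers such as a scoring path take `detector.read()` and can run concurrently with each other, while each report holds the write lock only for the duration of one update.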
Summary
All SYSTEM_GAPS architectural items (A1-A8) are now complete:
- A1: Supervisor modularization (multiple phases)
- A2-A4, A7-A8: Verified as non-issues or appropriate patterns
- A5: Configuration sprawl fixed with SupervisorConfig
- A6: Error observability via supervisor metrics
Next Steps
From ROADMAP.md:
- Track C1: Pilot community selection (business track)
- Any remaining A1 work if needed