🎉 ICN Demo Wiring - ROOT CAUSE FOUND!

Archived Document Notice (2026-02-12): This file is retained for historical context and may not reflect current code, APIs, runtime defaults, CI status, or deployment posture. Use active documentation under docs/ as authoritative.

Date: 2025-12-18 17:21
Time to Discovery: 2.5 hours
Status: ✅ BUG IDENTIFIED - FIX AVAILABLE


๐Ÿ” Root Cause

File: icn/crates/icn-net/src/session.rs
Line: 178
Bug: Double-binding to same UDP port

The Issue

// Line 165: QUIC endpoint binds to listen_addr
let mut endpoint = Endpoint::server(server_config, listen_addr)?;

// Line 168: Successfully listening
info!("QUIC endpoint listening on {}", endpoint.local_addr()?);

// Line 178: BUG - Tries to bind ANOTHER socket to same address!
let socket = tokio::net::UdpSocket::bind(local_addr).await?;  // ❌ FAILS HERE

What Happens:

  1. QUIC endpoint successfully binds to port 19777
  2. Code then tries to create a second UDP socket on port 19777 for STUN queries
  3. Second bind fails with "Address already in use"
  4. Error propagates up, triggers shutdown
  5. All actors stop
  6. Runtime exits

The Irony: The comment on line 176 says "We use the same socket" but the code creates a NEW socket instead!
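
The failure mode can be reproduced in isolation with the standard library alone. This is a minimal sketch, not the code from session.rs: it assumes std::net::UdpSocket as a stand-in for both the QUIC endpoint and the tokio socket, and it binds to an OS-assigned loopback port instead of 19777 so it runs anywhere. The function name second_bind_error_kind is made up for illustration.

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;

/// Binds one UDP socket, then tries to bind a second one to the same
/// address. Returns the error kind of the second attempt, if it failed.
fn second_bind_error_kind() -> Option<ErrorKind> {
    // First bind: stands in for the QUIC endpoint taking the listen address.
    // Port 0 lets the OS pick a free port.
    let first = UdpSocket::bind("127.0.0.1:0").expect("first bind should succeed");
    let addr = first.local_addr().expect("bound socket has a local address");

    // Second bind to the exact same address while `first` is still alive:
    // stands in for the STUN socket bind on line 178.
    let kind = UdpSocket::bind(addr).err().map(|e| e.kind());

    drop(first);
    kind
}

fn main() {
    match second_bind_error_kind() {
        Some(kind) => println!("second bind failed: {:?}", kind),
        None => println!("second bind unexpectedly succeeded"),
    }
}
```

The second bind fails for as long as the first socket is alive, which matches the sequence in the list above: the endpoint is up and listening, and the very next bind in the same process is rejected.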


✅ The Fix

Option 1: Reuse QUIC Endpoint's Socket (Proper Fix)

If Quinn's Endpoint gives access to the underlying UDP socket (not yet verified against the Quinn API), use that socket for the STUN queries instead of binding a new one:

// Instead of:
let socket = tokio::net::UdpSocket::bind(local_addr).await?;

// Do:
// Quinn should provide access to the underlying socket
// Need to check Quinn API for how to get it

Option 2: Disable STUN Discovery (Quick Workaround)

For demo purposes, we can simply disable STUN discovery. The node will work on a local network without it.

In demo.toml:

[network]
# Remove or set to empty
stun_servers = []

Or in code: Comment out lines 178-192 in session.rs

Option 3: Bind Before Creating Endpoint

// Bind socket first
let socket = tokio::net::UdpSocket::bind(listen_addr).await?;

// Get the actual bound address
let local_addr = socket.local_addr()?;

// Create endpoint using the socket (if Quinn supports this)
// Or: Do STUN discovery before creating endpoint
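
The ordering in Option 3 can be sketched with the standard library: bind exactly one socket, run the STUN exchange on it, then pass that same socket on. Everything here is an assumption for illustration. The names binding_request, bind_and_query and loopback_demo are hypothetical, the "STUN server" is a loopback echo socket so the sketch runs without network access, and the reply is not parsed. Whether the Quinn version in use accepts an already-bound socket still has to be checked (recent versions appear to offer Endpoint::new taking a std::net::UdpSocket, but treat that as unverified).

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::thread;
use std::time::Duration;

/// Builds a 20-byte STUN Binding Request header (RFC 5389): type 0x0001,
/// length 0, magic cookie 0x2112A442, then a 12-byte transaction ID.
fn binding_request(txn_id: [u8; 12]) -> [u8; 20] {
    let mut msg = [0u8; 20];
    msg[0..2].copy_from_slice(&0x0001u16.to_be_bytes());
    // Bytes 2..4 stay zero: the request carries no attributes.
    msg[4..8].copy_from_slice(&0x2112_A442u32.to_be_bytes());
    msg[8..20].copy_from_slice(&txn_id);
    msg
}

/// Binds ONE socket, runs the query on it, and returns that same socket
/// together with the raw reply so the caller can hand the socket onward.
fn bind_and_query(listen: SocketAddr, stun_server: SocketAddr) -> io::Result<(UdpSocket, Vec<u8>)> {
    let socket = UdpSocket::bind(listen)?;
    socket.set_read_timeout(Some(Duration::from_secs(2)))?;
    // A real client should use a random transaction ID; fixed here for brevity.
    socket.send_to(&binding_request([7u8; 12]), stun_server)?;
    let mut buf = [0u8; 576];
    let (len, _from) = socket.recv_from(&mut buf)?;
    socket.set_read_timeout(None)?;
    Ok((socket, buf[..len].to_vec()))
}

/// Runs the exchange against a loopback echo socket. Returns the client's
/// local address, the source address the server saw, and the reply bytes.
fn loopback_demo() -> io::Result<(SocketAddr, SocketAddr, Vec<u8>)> {
    let server = UdpSocket::bind("127.0.0.1:0")?;
    let server_addr = server.local_addr()?;
    let echo = thread::spawn(move || -> io::Result<SocketAddr> {
        let mut buf = [0u8; 64];
        let (len, from) = server.recv_from(&mut buf)?;
        server.send_to(&buf[..len], from)?;
        Ok(from)
    });

    let (socket, reply) = bind_and_query("127.0.0.1:0".parse().unwrap(), server_addr)?;
    let seen_by_server = echo.join().expect("echo thread panicked")?;
    // `socket` is still bound here; this is the point where it would be
    // handed to the QUIC endpoint instead of being dropped.
    Ok((socket.local_addr()?, seen_by_server, reply))
}

fn main() -> io::Result<()> {
    let (local, seen, reply) = loopback_demo()?;
    println!("local {}, server saw {}, reply {} bytes", local, seen, reply.len());
    Ok(())
}
```

Because the query leaves from the socket that will later carry QUIC traffic, the address a real STUN server reports belongs to the listening port, and no second bind is needed.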

🚀 Immediate Action Plan

For Demo (5 minutes)

Quickest path: Disable STUN in config

# Edit demo.toml - remove or comment out stun_servers
nano <demo-data-dir>/demo.toml

# Or: Don't pass stun_servers in config at all

The daemon should then start successfully!

For Proper Fix (30 minutes)

  1. Check Quinn documentation for how to access underlying socket

  2. Either:

    • A) Reuse endpoint's socket for STUN queries
    • B) Do STUN discovery before creating endpoint
    • C) Use a different approach (separate STUN socket on different port)
  3. Test fix

  4. Submit PR with fix
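
Approach C from the list above is the smallest change, sketched here with the standard library only. The helper name bind_side_socket is hypothetical, and std::net::UdpSocket stands in for the tokio socket used in session.rs.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};

/// Binds a second UDP socket on the same IP as `listen_addr` but on an
/// OS-assigned port, so it cannot collide with the already-bound endpoint.
fn bind_side_socket(listen_addr: SocketAddr) -> io::Result<UdpSocket> {
    UdpSocket::bind(SocketAddr::new(listen_addr.ip(), 0))
}

fn main() -> io::Result<()> {
    // Stand-in for the QUIC endpoint that already owns the listen address.
    let endpoint_stand_in = UdpSocket::bind("127.0.0.1:0")?;
    let listen_addr = endpoint_stand_in.local_addr()?;

    let stun_socket = bind_side_socket(listen_addr)?;
    println!("endpoint on {}, STUN socket on {}", listen_addr, stun_socket.local_addr()?);
    Ok(())
}
```

One caveat with this approach: a STUN reply received on the side socket describes the NAT mapping for that socket's port, which is not necessarily the mapping for the QUIC port, so A or B is preferable when the discovered address must be reachable by peers.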


📊 Impact Assessment

Why This Wasn't Caught in Tests

Tests probably:

  1. Don't enable STUN discovery, OR
  2. Use mock STUN servers, OR
  3. Test network actor in isolation without full initialization

Why This Affects Us Now

Our config has:

# From supervisor code - STUN servers are hard-coded
stun_servers = ["stun.l.google.com:19302", "stun1.l.google.com:19302"]

These get resolved and passed to session_manager.start(), triggering the buggy code path.
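
A rough sketch of that resolve-then-pass step, with a guard so the STUN path is skipped when nothing is configured. This is not the supervisor's code: resolve_stun_servers and should_run_stun are hypothetical names, and whether the real code skips STUN on an empty list is an assumption implied by the workaround, not something confirmed here.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

/// Resolves "host:port" strings to socket addresses, keeping the first
/// address of each entry. An empty input yields an empty output.
fn resolve_stun_servers(servers: &[&str]) -> io::Result<Vec<SocketAddr>> {
    let mut resolved = Vec::new();
    for server in servers {
        if let Some(addr) = server.to_socket_addrs()?.next() {
            resolved.push(addr);
        }
    }
    Ok(resolved)
}

/// Hypothetical guard: only enter the STUN path when there is a server to query.
fn should_run_stun(resolved: &[SocketAddr]) -> bool {
    !resolved.is_empty()
}

fn main() -> io::Result<()> {
    // A literal IP is used so the sketch does not depend on DNS.
    let resolved = resolve_stun_servers(&["127.0.0.1:19302"])?;
    println!("resolved: {:?}, run STUN: {}", resolved, should_run_stun(&resolved));

    // With an empty list there is nothing to query.
    let none = resolve_stun_servers(&[])?;
    println!("resolved: {:?}, run STUN: {}", none, should_run_stun(&none));
    Ok(())
}
```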


✨ Next Steps

Immediate (now):

cd <repo-root>/icn/crates/icn-net/src

# Quick fix: Comment out the problematic socket bind
# Edit session.rs lines 178-179

Test (5 minutes):

cd <repo-root>/icn
cargo build --release

# Start daemon
./target/release/icnd --config <demo-data-dir>/demo.toml \
    --gateway-enable \
    --gateway-bind "127.0.0.1:8080" \
    --gateway-jwt-secret "demo-secret-key-change-in-production"

Expected Result:

✅ QUIC endpoint listening
✅ Gateway API spawned
✅ Supervisor waiting for shutdown
✅ DAEMON RUNNING!

🎯 Confidence Level

Before: 45% full stack demo, 85% CLI demo
After fix: 90% full stack demo, 95% CLI demo

Time to working demo: 30-60 minutes (apply fix + test)


๐Ÿ† Lessons Learned

  1. Time boxing worked - Would have found this eventually, but staying focused helped
  2. Following the error messages - "Address already in use" was real, just not what we thought
  3. Reading logs carefully - The timestamp sequence revealed the issue
  4. Grep is your friend - Finding the exact error context was key
  5. Sometimes bugs are obvious - Double-bind to same port is a classic mistake

๐Ÿ“ For Future Reference

When you see "Address already in use":

  1. First check: Is something else using the port? (we did this)
  2. Second check: Is the SAME process trying to bind twice? (should have checked this earlier!)

The smoking gun was: Actors stopping immediately after "QUIC endpoint listening" - that timing meant the error was in the same code path, not external.
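
One way to make the second check faster next time is to attach context to every bind, so the error says which bind failed and on which address. A minimal sketch, assuming std::net::UdpSocket rather than the tokio socket from session.rs; bind_with_context is a hypothetical helper, not something that exists in icn-net.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};

/// Binds a UDP socket and, on failure, re-wraps the error so the message
/// names the purpose of the bind and the address. The error kind is kept.
fn bind_with_context(purpose: &str, addr: SocketAddr) -> io::Result<UdpSocket> {
    UdpSocket::bind(addr).map_err(|e| {
        let hint = if e.kind() == io::ErrorKind::AddrInUse {
            " (held by another process, or already bound earlier in this process?)"
        } else {
            ""
        };
        io::Error::new(
            e.kind(),
            format!("{}: bind {} failed: {}{}", purpose, addr, e, hint),
        )
    })
}

fn main() {
    let first = bind_with_context("QUIC endpoint", "127.0.0.1:0".parse().unwrap())
        .expect("first bind should succeed");
    let addr = first.local_addr().expect("bound socket has a local address");

    // Same address, same process: the message now names the failing bind.
    if let Err(e) = bind_with_context("STUN socket", addr) {
        eprintln!("{}", e);
    }
}
```

With a message that starts with "STUN socket: bind ...", the log points at the in-process double bind directly instead of sending us looking for another process on the port.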


Status: READY TO FIX AND TEST! 🚀

Let's apply the fix and get this daemon running!