03: Errors and Tracing — Distributed Visibility

Phase: 1 | Tier: Reader
Patterns introduced: Error Handling, Receipt Pattern, Metrics Integration
Prerequisite: 02-rust-through-icn.md

Why This Matters

"You can't debug what you can't see." Distributed systems multiply this problem: failures happen across nodes, causality is non-obvious, and post-mortems rely on structured logs and metrics.

ICN makes visibility a first-class concern — errors carry structured context, tracing spans show request flows, and metrics expose system health. This layer teaches you to instrument code before building features, not as an afterthought.

→ See manual.md § "Observability First" for the design rationale.

What You'll Read

1. Foundation: `icn/crates/icn-core/src/error.rs`

This is the simplest error pattern — thiserror for domain errors:

#[derive(Debug, thiserror::Error)]
pub enum CoreError {
    #[error("Actor not found: {0}")]
    ActorNotFound(String),
    
    #[error("Shutdown in progress")]
    ShuttingDown,
    
    #[error(transparent)]
    Io(#[from] std::io::Error),
}

Key observations:

#[error("...")] provides human-readable messages
#[from] enables ? operator for automatic error conversion
#[error(transparent)] delegates Display to the inner error
No unwrap() or expect() — always Result<T, E>

2. Structured Errors: `icn/crates/icn-ledger/src/error.rs`

Domain errors add structured fields for better diagnostics:

#[derive(Debug, thiserror::Error)]
pub enum LedgerError {
    #[error("Credit limit exceeded: account={account}, attempted={attempted}, limit={limit}")]
    CreditLimitExceeded {
        account: Did,
        attempted: i64,
        limit: i64,
    },
    
    #[error("Unbalanced entry: sum={sum}")]
    UnbalancedEntry { sum: i64 },
    
    #[error("Entry quarantined: {reason}")]
    Quarantined { reason: String },
}

Why structured fields matter:

Errors become queryable (e.g., "Show all credit limit violations for account X")
Metrics can extract values (e.g., histogram of attempted amounts)
Logs preserve context without string parsing

3. Source Location Tracking: `icn/crates/icn-ccl/src/error.rs`

Contract execution errors include source spans (line/column) for debugging:

#[derive(Debug, thiserror::Error)]
pub enum CclError {
    #[error("Type mismatch at {span:?}: expected {expected}, got {actual}")]
    TypeMismatch {
        span: Span,
        expected: String,
        actual: String,
    },
    
    #[error("Out of fuel at {span:?}")]
    OutOfFuel { span: Span },
}

pub struct Span {
    pub start: usize,
    pub end: usize,
}

Why spans matter: When a contract fails, developers need to know where in the source code the failure occurred. Spans enable IDE integration, syntax highlighting of errors, and precise debugging.

4. End-to-End Error Flow

Trace this sequence:

Validation: Ledger::validate_entry() in icn/crates/icn-ledger/src/ledger.rs:487-520

if account_balance + amount < credit_limit {
    return Err(LedgerError::CreditLimitExceeded {
        account: posting.account.clone(),
        attempted: amount,
        limit: credit_limit,
    });
}

Quarantine: Failed entry is moved to quarantine (not dropped)
```
self.quarantine.insert(entry.id.clone(), entry);
```

Metrics: Failure increments counter

metrics::counter!("ledger_validation_failures_total", 
    "reason" => "credit_limit_exceeded").increment(1);

Gateway: Error propagates to REST API in icn-gateway/src/api/ledger.rs

Err(e) => {
    tracing::error!(error = %e, "Entry validation failed");
    ApiError::ValidationFailed(e.to_string())
}

Client: Receives structured JSON error with code + details

{
    "error": {
        "code": "VALIDATION_FAILED",
        "message": "Credit limit exceeded: ...",
        "details": { "account": "...", "attempted": 1000, "limit": 500 }
    }
}

5. The Lint Rule: Never Panic in Production

File: icn/crates/icn-core/Cargo.toml and all crate manifests

[lints.clippy]
unwrap_used = "deny"
expect_used = "deny"

Escape hatch for tests:

#![deny(clippy::unwrap_used, clippy::expect_used)]

#[cfg(test)]
mod tests {
    // Tests can use unwrap/expect via this attribute
    #![allow(clippy::unwrap_used, clippy::expect_used)]
    
    #[test]
    fn test_happy_path() {
        let result = some_operation().unwrap();  // OK in tests
        assert_eq!(result, expected);
    }
}

Why: Panics in protocol/network/actor paths crash the entire daemon. Result<T, E> forces explicit error handling.

6. Coding Convention from `AGENTS.md`

Pattern:

thiserror for crate-local domain errors (e.g., LedgerError, GossipError)
anyhow at service boundaries (e.g., Gateway API handlers, CLI commands)
Never panic in protocol/network/actor/deserialization paths

Example (anyhow at boundary):

use anyhow::{Context, Result};

pub async fn handle_request(req: Request) -> Result<Response> {
    let entry = parse_entry(&req.body)
        .context("Failed to parse ledger entry")?;
    
    ledger.submit_entry(entry).await
        .context("Failed to submit entry to ledger")?;
    
    Ok(Response::ok())
}

The .context("...") adds human-readable context to errors without changing the error type.

Patterns Introduced

Error Handling Pattern

Used for: Domain-specific errors with structured fields.

→ See patterns.md #2 for full template (thiserror + anyhow).

Receipt Pattern

Used for: Explicit success/failure results (not "accepted" lies).

The quarantine system in icn/crates/icn-ledger/src/ledger.rs is a receipt pattern:

Entries that fail validation aren't silently dropped
They're moved to quarantine with reason
Callers receive explicit rejection, not false success

→ See patterns.md #12 for full template.

Metrics Integration Pattern

Used for: Prometheus metrics at key decision points.

Example from icn/crates/icn-ledger/src/ledger.rs:

metrics::counter!("ledger_entries_total", 
    "status" => "submitted").increment(1);

if validation_fails {
    metrics::counter!("ledger_validation_failures_total",
        "reason" => error_type).increment(1);
}

→ See patterns.md #7 for full template.

What You'll Build

→ Lab: labs/lab-02-error-receipt/

Extend your workspace from Layers 01-02:

Add structured errors with thiserror
Add tracing spans showing request lifecycle
Add receipt pattern: operations return typed receipts, failures include context
Add metrics counters for success/failure paths

Done when: You can see a request flow through tracing output, and errors include structured context.

Checkpoint

You've completed this layer when you can:

Reproduce a bug: Take a failing test, add tracing spans, run it, and explain causality from the output
Instrument a function: Add tracing span, structured error, and metrics to a new function
Explain error patterns: Describe when to use thiserror vs anyhow vs panic
Trace an error end-to-end: Follow an error from validation → quarantine → metrics → gateway → client

Artifact: Submit a tracing output snippet showing an end-to-end request flow with at least 3 nested spans.

Deep Reference

→ reference/module-12-observability.md — Full metrics, tracing, logging guide
→ reference/module-13-security-privacy.md — Security instrumentation patterns
→ docs/architecture/PRODUCTION_HARDENING.md — Error handling best practices
→ tracing crate docs — Tracing span API reference