03: Errors and Tracing — Distributed Visibility
Phase: 1 | Tier: Reader
Patterns introduced: Error Handling, Receipt Pattern, Metrics Integration
Prerequisite: 02-rust-through-icn.md
Why This Matters
"You can't debug what you can't see." Distributed systems multiply this problem: failures happen across nodes, causality is non-obvious, and post-mortems rely on structured logs and metrics.
ICN makes visibility a first-class concern — errors carry structured context, tracing spans show request flows, and metrics expose system health. This layer teaches you to instrument code before building features, not as an afterthought.
→ See manual.md § "Observability First" for the design rationale.
What You'll Read
1. Foundation: icn/crates/icn-core/src/error.rs
This is the simplest error pattern — thiserror for domain errors:
#[derive(Debug, thiserror::Error)]
pub enum CoreError {
    #[error("Actor not found: {0}")]
    ActorNotFound(String),

    #[error("Shutdown in progress")]
    ShuttingDown,

    #[error(transparent)]
    Io(#[from] std::io::Error),
}
Key observations:
- #[error("...")] provides human-readable messages
- #[from] enables the ? operator for automatic error conversion
- #[error(transparent)] delegates Display to the inner error
- No unwrap() or expect(); always return Result<T, E>
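To make those observations concrete, here is a std-only sketch of roughly what the derive stands in for, written by hand. It is an approximation for reading purposes, not the macro's actual expansion, and the read_config helper is hypothetical.

```rust
use std::fmt;
use std::io;

#[derive(Debug)]
pub enum CoreError {
    ActorNotFound(String),
    ShuttingDown,
    Io(io::Error),
}

// What #[error("...")] stands for: a Display impl with one arm per variant.
impl fmt::Display for CoreError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CoreError::ActorNotFound(id) => write!(f, "Actor not found: {id}"),
            CoreError::ShuttingDown => write!(f, "Shutdown in progress"),
            // #[error(transparent)]: defer to the inner error's Display.
            CoreError::Io(inner) => fmt::Display::fmt(inner, f),
        }
    }
}

impl std::error::Error for CoreError {}

// What #[from] stands for: a From impl, which is what the ? operator calls.
impl From<io::Error> for CoreError {
    fn from(e: io::Error) -> Self {
        CoreError::Io(e)
    }
}

// Hypothetical helper: ? converts io::Error into CoreError through From.
pub fn read_config(path: &str) -> Result<String, CoreError> {
    let text = std::fs::read_to_string(path)?;
    Ok(text)
}
```

The point of the sketch is the last function: the caller never writes a conversion, because ? looks up the From impl for the function's error type.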
2. Structured Errors: icn/crates/icn-ledger/src/error.rs
Domain errors add structured fields for better diagnostics:
#[derive(Debug, thiserror::Error)]
pub enum LedgerError {
    #[error("Credit limit exceeded: account={account}, attempted={attempted}, limit={limit}")]
    CreditLimitExceeded {
        account: Did,
        attempted: i64,
        limit: i64,
    },

    #[error("Unbalanced entry: sum={sum}")]
    UnbalancedEntry { sum: i64 },

    #[error("Entry quarantined: {reason}")]
    Quarantined { reason: String },
}
Why structured fields matter:
- Errors become queryable (e.g., "Show all credit limit violations for account X")
- Metrics can extract values (e.g., histogram of attempted amounts)
- Logs preserve context without string parsing
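As a small illustration of "queryable", the sketch below filters a batch of errors by pattern matching on fields rather than parsing message strings. The Did newtype and the violations_for function are hypothetical stand-ins, not ICN's real types.

```rust
// Hypothetical stand-in for ICN's Did type.
#[derive(Debug, Clone, PartialEq)]
pub struct Did(pub String);

#[derive(Debug)]
pub enum LedgerError {
    CreditLimitExceeded { account: Did, attempted: i64, limit: i64 },
    UnbalancedEntry { sum: i64 },
    Quarantined { reason: String },
}

/// Returns the attempted amounts of every credit-limit violation for one account.
pub fn violations_for(errors: &[LedgerError], who: &Did) -> Vec<i64> {
    errors
        .iter()
        .filter_map(|e| match e {
            // The fields are matched directly; no message text is inspected.
            LedgerError::CreditLimitExceeded { account, attempted, .. } if account == who => {
                Some(*attempted)
            }
            _ => None,
        })
        .collect()
}
```

The same match arm could feed a histogram instead of a Vec; the key property is that the numbers never pass through a formatted string.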
3. Source Location Tracking: icn/crates/icn-ccl/src/error.rs
Contract execution errors include source spans (line/column) for debugging:
#[derive(Debug, thiserror::Error)]
pub enum CclError {
    #[error("Type mismatch at {span:?}: expected {expected}, got {actual}")]
    TypeMismatch {
        span: Span,
        expected: String,
        actual: String,
    },

    #[error("Out of fuel at {span:?}")]
    OutOfFuel { span: Span },
}

// Span must implement Debug because the messages above format it with {span:?}.
#[derive(Debug)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}
Why spans matter: When a contract fails, developers need to know where in the source code the failure occurred. Spans enable IDE integration, syntax highlighting of errors, and precise debugging.
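A span of offsets only becomes useful to a person once it is mapped back to a position in the source. The following is a minimal sketch of that mapping, assuming start and end are byte offsets into the contract source (the original does not say which unit they use); line_col and snippet are hypothetical helpers.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Converts a byte offset into a 1-based (line, column) pair.
/// Columns count bytes, so this is only exact for ASCII source.
pub fn line_col(source: &str, offset: usize) -> (usize, usize) {
    let offset = offset.min(source.len());
    let before = &source.as_bytes()[..offset];
    // Line number: one more than the number of newlines before the offset.
    let line = before.iter().filter(|&&b| b == b'\n').count() + 1;
    // Column: distance from the byte after the last newline.
    let line_start = before.iter().rposition(|&b| b == b'\n').map_or(0, |i| i + 1);
    (line, offset - line_start + 1)
}

/// Returns the source text a span covers, or None if the span is out of range.
pub fn snippet(source: &str, span: Span) -> Option<&str> {
    source.get(span.start..span.end)
}
```

Storing offsets and deriving line/column on demand keeps the error type small; the cost is that the source text must still be available when the error is rendered.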
4. End-to-End Error Flow
Trace this sequence:
1. Validation: Ledger::validate_entry() in icn/crates/icn-ledger/src/ledger.rs:487-520

    if account_balance + amount < credit_limit {
        return Err(LedgerError::CreditLimitExceeded {
            account: posting.account.clone(),
            attempted: amount,
            limit: credit_limit,
        });
    }

2. Quarantine: Failed entry is moved to quarantine (not dropped)

    self.quarantine.insert(entry.id.clone(), entry);

3. Metrics: Failure increments counter

    metrics::counter!("ledger_validation_failures_total", "reason" => "credit_limit_exceeded").increment(1);

4. Gateway: Error propagates to REST API in icn-gateway/src/api/ledger.rs

    Err(e) => {
        tracing::error!(error = %e, "Entry validation failed");
        ApiError::ValidationFailed(e.to_string())
    }

5. Client: Receives structured JSON error with code + details

    {
      "error": {
        "code": "VALIDATION_FAILED",
        "message": "Credit limit exceeded: ...",
        "details": { "account": "...", "attempted": 1000, "limit": 500 }
      }
    }
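The gateway snippet in the sequence flattens the error with e.to_string(), while the client response carries separate detail fields. One possible way to bridge the two is to map the error's fields explicitly at the boundary. The sketch below is an assumption about how that could look, not the gateway's actual code: the ApiError struct, the Did newtype, and the to_api_error function are all hypothetical, as is the choice of VALIDATION_FAILED for the unbalanced case.

```rust
// Hypothetical stand-ins; the real Did and ApiError types live in ICN.
#[derive(Debug, Clone)]
pub struct Did(pub String);

#[derive(Debug)]
pub enum LedgerError {
    CreditLimitExceeded { account: Did, attempted: i64, limit: i64 },
    UnbalancedEntry { sum: i64 },
}

#[derive(Debug, PartialEq)]
pub struct ApiError {
    pub code: &'static str,
    pub message: String,
    // Key/value pairs that a JSON layer could serialize as "details".
    pub details: Vec<(&'static str, String)>,
}

/// Maps a domain error to an API error without losing the structured fields.
pub fn to_api_error(err: &LedgerError) -> ApiError {
    match err {
        LedgerError::CreditLimitExceeded { account, attempted, limit } => ApiError {
            code: "VALIDATION_FAILED",
            message: format!(
                "Credit limit exceeded: account={}, attempted={attempted}, limit={limit}",
                account.0
            ),
            details: vec![
                ("account", account.0.clone()),
                ("attempted", attempted.to_string()),
                ("limit", limit.to_string()),
            ],
        },
        LedgerError::UnbalancedEntry { sum } => ApiError {
            code: "VALIDATION_FAILED",
            message: format!("Unbalanced entry: sum={sum}"),
            details: vec![("sum", sum.to_string())],
        },
    }
}
```

Because the match is exhaustive, adding a new LedgerError variant forces a decision about its API representation at compile time.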
5. The Lint Rule: Never Panic in Production
File: icn/crates/icn-core/Cargo.toml and all crate manifests
[lints.clippy]
unwrap_used = "deny"
expect_used = "deny"
Escape hatch for tests:
#![deny(clippy::unwrap_used, clippy::expect_used)]

#[cfg(test)]
mod tests {
    // Tests can use unwrap/expect via this attribute
    #![allow(clippy::unwrap_used, clippy::expect_used)]

    #[test]
    fn test_happy_path() {
        let result = some_operation().unwrap(); // OK in tests
        assert_eq!(result, expected);
    }
}
Why: Panics in protocol/network/actor paths crash the entire daemon. Result<T, E> forces explicit error handling.
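In practice, satisfying the lint usually means replacing each unwrap() with a conversion into a domain error. A minimal sketch, using a hypothetical EntryError type and first_amount function that are not part of ICN:

```rust
#[derive(Debug, PartialEq)]
pub enum EntryError {
    Empty,
    BadAmount(String),
}

/// Parses the first amount in a comma-separated list without panicking.
pub fn first_amount(raw: &str) -> Result<i64, EntryError> {
    // Instead of .next().unwrap(): turn the missing case into an error.
    let first = raw
        .split(',')
        .map(str::trim)
        .find(|s| !s.is_empty())
        .ok_or(EntryError::Empty)?;
    // Instead of .parse().unwrap(): map the parse failure into the domain error.
    first
        .parse::<i64>()
        .map_err(|_| EntryError::BadAmount(first.to_string()))
}
```

Both failure cases now reach the caller as values it must handle, which is exactly what the deny lints are meant to enforce.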
6. Coding Convention from AGENTS.md
Pattern:
- thiserror for crate-local domain errors (e.g., LedgerError, GossipError)
- anyhow at service boundaries (e.g., Gateway API handlers, CLI commands)
- Never panic in protocol/network/actor/deserialization paths
Example (anyhow at boundary):
use anyhow::{Context, Result};

pub async fn handle_request(req: Request) -> Result<Response> {
    let entry = parse_entry(&req.body)
        .context("Failed to parse ledger entry")?;
    ledger.submit_entry(entry).await
        .context("Failed to submit entry to ledger")?;
    Ok(Response::ok())
}
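The .context("...") call wraps the underlying error with a human-readable message. The function keeps returning anyhow's Result, and the original error stays available as the cause in the error chain.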
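To show what "wrapping with context" means mechanically, here is a std-only stand-in. It is deliberately not anyhow's implementation; WithContext, context, render_chain, and parse_amount are hypothetical names used only for this sketch.

```rust
use std::error::Error;
use std::fmt;

/// Std-only stand-in for a context wrapper: a message plus the original cause.
#[derive(Debug)]
pub struct WithContext {
    message: String,
    source: Box<dyn Error>,
}

impl fmt::Display for WithContext {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.message)
    }
}

impl Error for WithContext {
    // Exposing the cause here is what keeps the chain walkable.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Wraps an error with a human-readable message, keeping the original as the source.
pub fn context<T, E>(result: Result<T, E>, message: &str) -> Result<T, WithContext>
where
    E: Error + 'static,
{
    result.map_err(|e| WithContext {
        message: message.to_string(),
        source: Box::new(e),
    })
}

/// Renders the whole cause chain, outermost message first.
pub fn render_chain(err: &dyn Error) -> String {
    let mut parts = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        parts.push(cause.to_string());
        current = cause.source();
    }
    parts.join(": ")
}

/// Hypothetical boundary function: a parse failure gains context on the way out.
pub fn parse_amount(body: &str) -> Result<i64, WithContext> {
    context(body.trim().parse::<i64>(), "Failed to parse ledger entry")
}
```

The outer message says what the service was trying to do; the inner error says what actually went wrong. Logging the full chain at the boundary preserves both.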
Patterns Introduced
Error Handling Pattern
Used for: Domain-specific errors with structured fields.
→ See patterns.md #2 for full template (thiserror + anyhow).
Receipt Pattern
Used for: Explicit success/failure results (not "accepted" lies).
The quarantine system in icn/crates/icn-ledger/src/ledger.rs is a receipt pattern:
- Entries that fail validation aren't silently dropped
- They're moved to quarantine with reason
- Callers receive explicit rejection, not false success
→ See patterns.md #12 for full template.
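The three properties listed for the quarantine system can be sketched in a few lines. This is a simplified, in-memory illustration under assumed types (Entry, Receipt, and this Ledger are hypothetical and far smaller than the real ledger); see patterns.md for the actual template.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub id: String,
    pub postings: Vec<i64>, // signed amounts; a balanced entry sums to zero
}

/// What the caller gets back: an explicit outcome, never a bare "accepted".
#[derive(Debug, PartialEq)]
pub enum Receipt {
    Committed { entry_id: String },
    Quarantined { entry_id: String, reason: String },
}

#[derive(Default)]
pub struct Ledger {
    committed: Vec<Entry>,
    quarantine: HashMap<String, (Entry, String)>,
}

impl Ledger {
    pub fn submit(&mut self, entry: Entry) -> Receipt {
        let sum: i64 = entry.postings.iter().sum();
        let id = entry.id.clone();
        if sum != 0 {
            let reason = format!("Unbalanced entry: sum={sum}");
            // Keep the failed entry and its reason instead of dropping it.
            self.quarantine.insert(id.clone(), (entry, reason.clone()));
            return Receipt::Quarantined { entry_id: id, reason };
        }
        self.committed.push(entry);
        Receipt::Committed { entry_id: id }
    }

    /// Looks up why an entry was quarantined, if it was.
    pub fn quarantined_reason(&self, id: &str) -> Option<&str> {
        self.quarantine.get(id).map(|(_, reason)| reason.as_str())
    }
}
```

Returning an enum rather than a bool or unit means a caller cannot compile code that ignores the difference between committed and quarantined without writing that choice down.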
Metrics Integration Pattern
Used for: Prometheus metrics at key decision points.
Example from icn/crates/icn-ledger/src/ledger.rs:
metrics::counter!("ledger_entries_total",
    "status" => "submitted").increment(1);

if validation_fails {
    metrics::counter!("ledger_validation_failures_total",
        "reason" => error_type).increment(1);
}
→ See patterns.md #7 for full template.
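One detail worth practicing is where the "reason" label value comes from. Deriving it from the error variant keeps the label set small and fixed, whereas interpolating account IDs or messages into labels would create an unbounded number of time series. The sketch below uses an in-memory counter as a stand-in for a real metrics backend; reason_label and FailureCounter are hypothetical, and only the credit_limit_exceeded label value appears in the source.

```rust
use std::collections::BTreeMap;

#[derive(Debug)]
pub enum LedgerError {
    CreditLimitExceeded { attempted: i64, limit: i64 },
    UnbalancedEntry { sum: i64 },
    Quarantined { reason: String },
}

/// Maps each variant to a fixed label value, ignoring the variable fields.
pub fn reason_label(err: &LedgerError) -> &'static str {
    match err {
        LedgerError::CreditLimitExceeded { .. } => "credit_limit_exceeded",
        LedgerError::UnbalancedEntry { .. } => "unbalanced_entry",
        LedgerError::Quarantined { .. } => "quarantined",
    }
}

/// In-memory stand-in for a labelled counter; a real exporter would replace this.
#[derive(Default)]
pub struct FailureCounter {
    by_reason: BTreeMap<&'static str, u64>,
}

impl FailureCounter {
    /// Increments the counter for the error's reason label.
    pub fn record(&mut self, err: &LedgerError) {
        *self.by_reason.entry(reason_label(err)).or_insert(0) += 1;
    }

    /// Returns the current count for a reason, or 0 if none was recorded.
    pub fn get(&self, reason: &str) -> u64 {
        self.by_reason.get(reason).copied().unwrap_or(0)
    }
}
```

The variable parts of the error (amounts, accounts) belong in logs and spans, where high cardinality is cheap; the counter only answers "how often does each kind of failure happen".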
What You'll Build
→ Lab: labs/lab-02-error-receipt/
Extend your workspace from Layers 01-02:
- Add structured errors with thiserror
- Add tracing spans showing request lifecycle
- Add receipt pattern: operations return typed receipts, failures include context
- Add metrics counters for success/failure paths
Done when: You can see a request flow through tracing output, and errors include structured context.
Checkpoint
You've completed this layer when you can:
- Reproduce a bug: Take a failing test, add tracing spans, run it, and explain causality from the output
- Instrument a function: Add tracing span, structured error, and metrics to a new function
- Explain error patterns: Describe when to use thiserror vs anyhow vs panic
- Trace an error end-to-end: Follow an error from validation → quarantine → metrics → gateway → client
Artifact: Submit a tracing output snippet showing an end-to-end request flow with at least 3 nested spans.
Deep Reference
→ reference/module-12-observability.md — Full metrics, tracing, logging guide
→ reference/module-13-security-privacy.md — Security instrumentation patterns
→ docs/architecture/PRODUCTION_HARDENING.md — Error handling best practices
→ tracing crate docs — Tracing span API reference