Phase 14 Gateway API - Production Hardening

Date: 2025-11-16
Phase: 14 (Platform Layer)
Focus: Gateway API production hardening and security improvements

Overview

This phase completed production hardening for the ICN Gateway API: API versioning, per-DID rate limiting, scope-based authorization, and cooperative ownership tied to the authenticated DID. The work turns the gateway from a functional prototype into a production-ready API server with access control and abuse prevention.

Goals

  • ✅ Add API versioning for future evolution
  • ✅ Implement per-DID rate limiting to prevent abuse
  • ✅ Enforce scope-based authorization on all endpoints
  • ✅ Fix cooperative ownership to use authenticated DID

Implementation

1. API Versioning (/v1 Namespacing)

Commit: db1bfc2 (partial), 9a12f76 (middleware fix)

Changes:

  • Wrapped all endpoints under /v1 scope
  • Split into two /v1 scopes: public and protected
  • Public scope: /health, /auth/*, /ws/:coop_id
  • Protected scope: /coops/*, /ledger/* (with auth + rate limiting)

Architecture Decision:

// API v1 - public endpoints (no auth required)
.service(
    web::scope("/v1")
        .service(api::health::health)
        .service(api::auth::challenge)
        .service(api::auth::verify)
        .service(api::websocket::websocket)
)

// API v1 - protected endpoints (auth + rate limiting)
.service(
    web::scope("/v1")
        // ... protected endpoints ...
        .wrap(middleware::from_fn(rate_limit_middleware))  // Runs second
        .wrap(auth)  // Runs first
)

Rationale:

  • Enables backward-compatible API changes in future versions
  • Clean migration path: v1 → v2 without breaking existing clients
  • Follows REST best practices (Stripe, GitHub, etc.)

Critical Bug Fix: The initial implementation applied the middleware in the wrong order, so rate limiting ran before auth. With no TokenClaims in the request yet, the rate limiter returned early and was skipped entirely. Fixed by splitting the routes into separate public and protected scopes and wrapping the protected scope so that auth runs first.

2. Per-DID Rate Limiting

Commit: db1bfc2

Implementation: Token bucket algorithm with configurable parameters

  • Capacity: 100 tokens (burst capacity)
  • Refill rate: 10 tokens/second (600 requests/minute sustained)
  • Cost per request: 1 token
  • Per-DID tracking: Independent buckets using Arc<RwLock<HashMap<String, TokenBucket>>>

Code Structure:

struct TokenBucket {
    tokens: f64,
    capacity: f64,
    refill_rate: f64,  // tokens per second
    last_refill: Instant,
}

pub struct RateLimiter {
    buckets: Arc<RwLock<HashMap<String, TokenBucket>>>,
    config: RateLimitConfig,
}

Algorithm:

  1. Refill tokens based on elapsed time: tokens = min(tokens + elapsed * refill_rate, capacity)
  2. Try to consume tokens: if available >= cost, deduct and allow; else reject
  3. Automatic cleanup of inactive buckets prevents unbounded memory growth
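Steps 1 and 2 of the algorithm can be sketched as follows. This is a minimal illustration, not the code in rate_limit.rs: it assumes the elapsed time is passed in explicitly (instead of being read from last_refill) so the behaviour is deterministic, and the helper burst_allowed is hypothetical.

```rust
/// Minimal token bucket. Elapsed time is supplied by the caller (an assumption
/// made for this sketch) rather than measured with `Instant`.
#[derive(Debug)]
struct TokenBucket {
    tokens: f64,
    capacity: f64,
    refill_rate: f64, // tokens per second
}

impl TokenBucket {
    /// A new bucket starts full.
    fn new(capacity: f64, refill_rate: f64) -> Self {
        Self { tokens: capacity, capacity, refill_rate }
    }

    /// Step 1: add tokens for the elapsed time, capped at capacity.
    fn refill(&mut self, elapsed_secs: f64) {
        self.tokens = (self.tokens + elapsed_secs * self.refill_rate).min(self.capacity);
    }

    /// Step 2: refill, then consume `cost` tokens if available; otherwise reject.
    fn try_consume(&mut self, elapsed_secs: f64, cost: f64) -> bool {
        self.refill(elapsed_secs);
        if self.tokens >= cost {
            self.tokens -= cost;
            true
        } else {
            false
        }
    }
}

/// Hypothetical helper: fires `attempts` back-to-back requests (no time passes)
/// and returns how many were allowed.
fn burst_allowed(bucket: &mut TokenBucket, attempts: usize) -> usize {
    (0..attempts).filter(|_| bucket.try_consume(0.0, 1.0)).count()
}
```

With the documented parameters (capacity 100, refill rate 10 tokens/second), a burst of 150 immediate requests admits 100, and an empty bucket takes 10 seconds to refill completely.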

Integration:

  • Middleware extracts DID from TokenClaims (inserted by JWT auth middleware)
  • Returns HTTP 429 Too Many Requests when limit exceeded
  • Public endpoints are not rate limited: they sit in the separate public scope and carry no TokenClaims
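Per-DID tracking and cleanup of idle buckets might look like the sketch below. It is an assumption-laden model, not the gateway's RateLimiter: the Bucket type, the check and cleanup signatures, the explicit `now` parameter, and the did:icn:... identifiers in any usage are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::time::{Duration, Instant};

/// Simplified per-DID bucket (hypothetical; cost per request is fixed at 1 token).
struct Bucket {
    tokens: f64,
    last_seen: Instant,
}

/// Cloning shares the same bucket map across threads or workers.
#[derive(Clone)]
struct RateLimiter {
    buckets: Arc<RwLock<HashMap<String, Bucket>>>,
    capacity: f64,
    refill_rate: f64, // tokens per second
}

impl RateLimiter {
    fn new(capacity: f64, refill_rate: f64) -> Self {
        Self { buckets: Arc::new(RwLock::new(HashMap::new())), capacity, refill_rate }
    }

    /// Returns true if a request from `did` is allowed at time `now`.
    /// Each DID gets its own bucket, created full on first use.
    fn check(&self, did: &str, now: Instant) -> bool {
        let mut buckets = self.buckets.write().expect("rate limiter lock poisoned");
        let bucket = buckets
            .entry(did.to_string())
            .or_insert(Bucket { tokens: self.capacity, last_seen: now });
        let elapsed = now.saturating_duration_since(bucket.last_seen).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * self.refill_rate).min(self.capacity);
        bucket.last_seen = now;
        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            true
        } else {
            false
        }
    }

    /// Drops buckets idle for longer than `max_idle`; returns how many remain.
    fn cleanup(&self, now: Instant, max_idle: Duration) -> usize {
        let mut buckets = self.buckets.write().expect("rate limiter lock poisoned");
        buckets.retain(|_, b| now.saturating_duration_since(b.last_seen) <= max_idle);
        buckets.len()
    }
}
```

In a middleware, a false result from check would be translated into HTTP 429. Exhausting one DID's bucket leaves every other DID unaffected, which is the isolation property the per-DID test exercises.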

Testing: 5 comprehensive tests

  • test_token_bucket_basic - Basic consumption and rejection
  • test_token_bucket_refill - Time-based refill (500ms sleep)
  • test_token_bucket_cap - Capacity capping
  • test_rate_limiter_per_did - Per-DID isolation
  • test_rate_limiter_cleanup - Inactive bucket cleanup

Floating-Point Precision Handling: Tests use range checks instead of exact equality because the bucket refills by a small, timing-dependent amount between calls during test execution:

// Allow small variance due to refill during test execution
assert!((4.9..=5.1).contains(&after_first));

3. Scope-Based Authorization

Commit: 0119a06

Implementation: require_scope() helper validates JWT scopes against required permissions

Scope Hierarchy:

  • ledger:read - Balance queries and transaction history
  • ledger:write - Payment creation
  • coop:read - View cooperative information
  • coop:write - Create cooperatives
  • coop:admin - Member management and settings changes

Code Pattern:

pub fn require_scope(req: &HttpRequest, required_scope: &str) -> Result<(), GatewayError> {
    let claims = get_claims(req)
        .ok_or_else(|| GatewayError::AuthenticationFailed("No claims found".to_string()))?;

    if !claims.scopes.contains(&required_scope.to_string()) {
        return Err(GatewayError::AuthorizationFailed(
            format!("Missing required scope: {}", required_scope)
        ));
    }

    Ok(())
}

Applied to All Handlers:

  • get_balance → ledger:read
  • create_payment → ledger:write
  • get_history → ledger:read
  • get_coop → coop:read
  • create_coop → coop:write
  • update_settings → coop:admin
  • delete_coop → coop:admin
  • add_member → coop:admin
  • remove_member → coop:admin
  • update_member_role → coop:admin

Testing: 2 authorization failure tests

  • test_authorization_scope_check (ledger) - Wrong scopes rejected with 403
  • test_authorization_scope_check (coops) - Wrong scopes rejected with 403

Test Fixes: All existing tests updated to include proper scopes in TokenClaims:

  • Added missing iat field (issued at timestamp)
  • Added HttpMessage import for extensions_mut() access
  • Created TokenClaims with appropriate scopes for each operation

4. Authenticated DID Extraction for Ownership

Commit: 1ac1ed2

Problem: create_coop handler generated placeholder DIDs instead of using authenticated user's DID

Fix:

// Extract owner DID from authenticated token
use crate::middleware::get_claims;
let claims = get_claims(&http_req)
    .ok_or_else(|| GatewayError::AuthenticationFailed("No claims found".to_string()))?;

let owner: icn_identity::Did = claims.sub.parse()
    .map_err(|e| GatewayError::BadRequest(format!("Invalid DID in token: {}", e)))?;

Security Benefits:

  • Prevents creation of cooperatives with arbitrary/random owners
  • Ensures cooperative owner matches authenticated user
  • Proper authorization chain: auth → scope check → owner extraction

Testing: 1 ownership verification test

  • test_create_coop_uses_authenticated_did - Verifies Alice's DID becomes owner when she creates a coop

Test Results

Final Stats: 38 tests passing

  • 5 rate limiting tests
  • 2 authorization failure tests
  • 1 ownership verification test
  • 30 existing tests (updated with proper TokenClaims)

Test Reliability:

  • All tests pass consistently
  • Floating-point precision issues resolved with range checks
  • No flaky tests or timing dependencies (except intentional sleep in refill test)

Architecture Patterns

Middleware Composition

Critical Lesson: Middleware execution order matters!

  • Wrapping order: last wrapped runs first
  • Correct order: .wrap(rate_limit).wrap(auth) → auth runs first, then rate_limit
  • Rate limiting requires TokenClaims from auth middleware
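The wrapping rule can be modelled without any web framework. The sketch below is a plain-Rust stand-in, not actix-web code: ModelScope, Handler, and base_handler are hypothetical names, and each "middleware" only records its name before calling the layer it wraps.

```rust
/// A handler appends to a log of what ran, in order.
type Handler = Box<dyn Fn(&mut Vec<&'static str>)>;

/// Hypothetical stand-in for a scope with `.wrap()`-style middleware registration.
struct ModelScope {
    handler: Handler,
}

impl ModelScope {
    fn new(handler: Handler) -> Self {
        Self { handler }
    }

    /// Each call puts a new layer around everything registered so far,
    /// so the layer added last ends up outermost and runs first.
    fn wrap(self, name: &'static str) -> Self {
        let inner = self.handler;
        Self {
            handler: Box::new(move |log: &mut Vec<&'static str>| {
                log.push(name);
                inner(log);
            }),
        }
    }

    /// Runs one simulated request and returns the execution order.
    fn call(&self) -> Vec<&'static str> {
        let mut log = Vec::new();
        (self.handler)(&mut log);
        log
    }
}

/// Innermost layer: the endpoint handler itself.
fn base_handler() -> Handler {
    Box::new(|log: &mut Vec<&'static str>| log.push("handler"))
}
```

Registering "rate_limit" and then "auth" yields the order auth, rate_limit, handler. Swapping the two registrations reproduces the original bug, where rate limiting runs before any claims exist.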

Request Extensions

Pattern for passing data between middleware and handlers:

// In middleware: insert claims
req.extensions_mut().insert(claims);

// In handler: extract claims
let claims = req.extensions().get::<TokenClaims>().cloned();
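Conceptually, request extensions behave like a map keyed by type. The following is a simplified model of that idea, not the framework's implementation: the Extensions struct here is a hypothetical stand-in, and TokenClaims is reduced to two fields for illustration.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

/// Reduced version of the claims type (the real one has more fields, such as iat).
#[derive(Clone, Debug, PartialEq)]
struct TokenClaims {
    sub: String,
    scopes: Vec<String>,
}

/// Hypothetical type-keyed store: at most one value per concrete type.
#[derive(Default)]
struct Extensions {
    map: HashMap<TypeId, Box<dyn Any>>,
}

impl Extensions {
    /// Stores `value`, replacing any previous value of the same type.
    fn insert<T: Any>(&mut self, value: T) {
        self.map.insert(TypeId::of::<T>(), Box::new(value));
    }

    /// Looks up the value of type `T`, if one was inserted.
    fn get<T: Any>(&self) -> Option<&T> {
        self.map
            .get(&TypeId::of::<T>())
            .and_then(|boxed| boxed.downcast_ref::<T>())
    }
}
```

Because the lookup is by type, a handler asking for TokenClaims gets None when no middleware inserted them, which is why the handlers treat missing claims as an authentication failure.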

Error Handling

Consistent error types with HTTP status mapping:

  • AuthenticationFailed → 401 Unauthorized
  • AuthorizationFailed → 403 Forbidden
  • RateLimitExceeded → 429 Too Many Requests
  • BadRequest → 400 Bad Request

Security Model

Three-Layer Security:

  1. Authentication (JWT middleware)

    • Verifies bearer token
    • Extracts and validates claims
    • Inserts TokenClaims into request extensions
  2. Rate Limiting (per-DID middleware)

    • Prevents abuse and resource exhaustion
    • Fair allocation across DIDs
    • Configurable limits per deployment
  3. Authorization (handler-level)

    • Scope-based access control
    • Fine-grained permissions
    • Prevents privilege escalation

Execution Flow:

Request → JWT Auth → Rate Limiting → Authorization → Handler
             ↓            ↓              ↓
         Insert       Check DID      Check Scope
         Claims       Limit          Requirement
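The flow above short-circuits at the first failing layer. A minimal sketch of that behaviour, under loud assumptions: the handle function, the Claims struct, and the fixed per-DID budget of 2 requests are hypothetical stand-ins for JWT verification and the token bucket, and the error enum is reduced to unit variants.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum GatewayError {
    AuthenticationFailed,
    AuthorizationFailed,
    RateLimitExceeded,
}

impl GatewayError {
    /// HTTP status for each error, as listed in the Error Handling section.
    fn status(&self) -> u16 {
        match self {
            GatewayError::AuthenticationFailed => 401,
            GatewayError::AuthorizationFailed => 403,
            GatewayError::RateLimitExceeded => 429,
        }
    }
}

/// Simplified claims: the authenticated DID and its granted scopes.
struct Claims {
    sub: String,
    scopes: Vec<String>,
}

/// Runs the three layers in order; `?` and early returns stop at the first failure.
/// `claims` is None when the bearer token is missing or invalid.
fn handle(
    claims: Option<&Claims>,
    budget: &mut HashMap<String, u32>,
    required_scope: &str,
) -> Result<&'static str, GatewayError> {
    // Layer 1: authentication.
    let claims = claims.ok_or(GatewayError::AuthenticationFailed)?;

    // Layer 2: rate limiting keyed by DID (a plain counter stands in for the bucket).
    let left = budget.entry(claims.sub.clone()).or_insert(2);
    if *left == 0 {
        return Err(GatewayError::RateLimitExceeded);
    }
    *left -= 1;

    // Layer 3: scope-based authorization.
    if !claims.scopes.iter().any(|s| s == required_scope) {
        return Err(GatewayError::AuthorizationFailed);
    }

    Ok("handled")
}
```

One consequence of this ordering is that a request rejected with 403 has already consumed rate-limit budget, since the limiter runs before the scope check.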

Production Readiness

Abuse Prevention:

  • ✅ Rate limiting prevents API flooding
  • ✅ Scope checking prevents privilege escalation
  • ✅ Token expiration (1 hour TTL)
  • ✅ Challenge expiration (5 minutes TTL)

Scalability:

  • Token bucket algorithm: O(1) per request
  • Per-DID isolation prevents noisy neighbor problem
  • Automatic cleanup prevents memory growth
  • Arc/RwLock enables multi-threaded access

Observability:

  • HTTP status codes follow standards (401, 403, 429)
  • Error messages include context
  • Rate limit errors include DID for debugging

Evolution Path:

  • /v1 namespace enables backward-compatible changes
  • Scope system allows adding new permissions
  • Rate limit config allows per-deployment tuning

Remaining Work (Deferred)

WebSocket Improvements (deferred until pilot selection):

  • Reconnection handling
  • Event backfill for missed events

TypeScript SDK (deferred until pilot selection):

  • @icn/client npm package
  • Don't build speculatively - build what pilots need

Reference Application (deferred until pilot selection):

  • Timebank or other pilot-specific app

Lessons Learned

  1. Middleware order matters - Cost us a critical bug that completely bypassed rate limiting
  2. Floating-point tests need ranges - Exact equality fails due to timing variations
  3. Test coverage reveals bugs - Authorization failure tests exposed missing validation
  4. Phase incrementally - Each feature added separately with full testing
  5. Use existing patterns - TokenClaims in request extensions works well

Next Steps

Track C1: Pilot Community Selection & Deployment

  • Select pilot community for initial deployment
  • Build TypeScript SDK for their specific workflows
  • Deploy gateway with pilot-specific configuration
  • Run weekly learning loop to gather feedback

Philosophy: The substrate is ready. Now we listen to communities and build what they need.

Files Modified

  • icn/crates/icn-gateway/src/server.rs - API versioning and middleware ordering
  • icn/crates/icn-gateway/src/rate_limit.rs - NEW file with rate limiting
  • icn/crates/icn-gateway/src/error.rs - Added RateLimitExceeded error
  • icn/crates/icn-gateway/src/lib.rs - Exported rate_limit module
  • icn/crates/icn-gateway/src/middleware.rs - Added require_scope helper
  • icn/crates/icn-gateway/src/api/ledger.rs - Added scope checks to all handlers
  • icn/crates/icn-gateway/src/api/coops.rs - Added scope checks and DID extraction
  • CHANGELOG.md - Documented all Phase 14 improvements
  • ROADMAP.md - Updated Phase 14 status

Commits

  • db1bfc2 - feat(gateway): Add API versioning and per-DID rate limiting
  • 87cacf5 - docs: Update CHANGELOG and ROADMAP
  • 9a12f76 - fix(gateway): Correct middleware execution order
  • 0119a06 - feat(gateway): Add scope-based authorization enforcement
  • 1ac1ed2 - fix(gateway): Use authenticated DID as cooperative owner

Conclusion

Phase 14 production hardening is complete. The gateway is now production-ready with:

  • ✅ API versioning for evolution
  • ✅ Rate limiting for abuse prevention
  • ✅ Authorization for access control
  • ✅ Authenticated ownership for security

All 38 tests passing. Ready for pilot deployment.