Phase 16A: Scheduler Evolution Foundation

Date: 2025-11-23 Phase: 16A - Resource Profiles & Matching Status: Complete ✅ Duration: 1 session (~4 hours)

Overview

Implemented the foundation for ICN's evolution from reactive task claiming (Phase 15) to intelligent, trust-governed distributed scheduling. Phase 16A introduces concrete resource requirements, capacity tracking, and a pluggable placement policy framework.

This is the first of five phases transforming ICN compute into a multi-tier cooperative fabric supporting edge/community/regional workloads with actor migration and per-coop policies.

Motivation

Current State (Phase 15):

Tasks broadcast via gossip, first executor with spare capacity claims
Only fuel limits specified, no CPU/RAM/GPU requirements
No awareness of data locality or network topology
Stateless execution (CCL runs once, returns result)

Gap for Vision: ICN's long-term compute architecture requires:

Multi-tier fabric: Unified edge/community/regional scheduling
Resource awareness: Match tasks to appropriate executors (GPU for ML, etc.)
Locality optimization: Minimize data transfer for batch workloads
Actor migration: Long-running stateful programs that survive failures
Cooperative policies: Per-coop rules enforced at runtime

Phase 16A provides the foundation types for this evolution.

Implementation

1. Scheduler Module (`icn-compute/src/scheduler.rs` - 700+ lines)

Core Types

ResourceProfile: Concrete task requirements

pub struct ResourceProfile {
    pub cpu_cores: Option<f64>,      // 0.5 = half a core
    pub memory_mb: Option<u64>,
    pub storage_mb: Option<u64>,
    pub network_mbps: Option<f64>,
    pub gpu_spec: Option<GpuSpec>,
    pub duration_estimate: Option<Duration>,
}

// Convenience constructors
ResourceProfile::minimal()                    // 0.1 CPU, 128 MB
ResourceProfile::compute_heavy(4.0, 8192)     // 4 cores, 8 GB
ResourceProfile::gpu(24, "sm_70".into())      // 24 GB GPU

NodeCapacity: Track and reserve resources

pub struct NodeCapacity {
    pub cpu_cores_available: f64,
    pub memory_mb_available: u64,
    pub storage_mb_available: u64,
    pub network_mbps: f64,
    pub gpu_devices: Vec<GpuDevice>,
    pub updated_at: u64,
}

impl NodeCapacity {
    pub fn can_fit(&self, profile: &ResourceProfile) -> bool;
    pub fn reserve(&mut self, profile: &ResourceProfile) -> Result<()>;
    pub fn release(&mut self, profile: &ResourceProfile);
    pub fn available_ratio(&self) -> f64;  // 0.0 = full, 1.0 = empty
}

GpuSpec/GpuDevice: GPU-specific matching

pub struct GpuSpec {
    pub memory_gb: u64,
    pub compute_capability: String,  // "sm_70", "sm_80", etc.
    pub device_count: usize,
}

pub struct GpuDevice {
    pub device_id: String,
    pub memory_gb: u64,
    pub compute_capability: String,
    pub device_name: String,           // "NVIDIA A100"
    pub available: bool,
}

GPU matching compares compute capabilities lexicographically (sm_80 >= sm_70).

Placement Policy Framework

PlacementPolicy Trait: Pluggable scoring algorithms

pub trait PlacementPolicy: Send + Sync {
    fn score_task(
        &self,
        task_hash: &[u8; 32],
        profile: &ResourceProfile,
        submitter: &str,
        node_state: &NodeState,
        trust_score: f64,
    ) -> Option<PlacementOffer>;
}

DefaultPlacementPolicy: Multi-factor scoring

impl PlacementPolicy for DefaultPlacementPolicy {
    fn score_task(...) -> Option<PlacementOffer> {
        // 1. Trust gate (MIN_TRUST_EXECUTE = 0.3)
        if trust_score < self.min_trust { return None; }

        // 2. Capacity check
        if !node_state.capacity.can_fit(profile) { return None; }

        // 3. Compute score (0.0 - 1.0)
        let mut score = 0.0;
        score += (trust_score * 0.4).min(0.4);          // Trust (0.4)
        score += node_state.capacity.available_ratio() * 0.3;  // Capacity (0.3)
        score += (1.0 - queue_penalty) * 0.2;          // Queue (0.2)
        score += rand::thread_rng().gen::<f64>() * 0.1; // Jitter (0.1)

        Some(PlacementOffer { executor, score, cost, estimated_start })
    }
}

Scoring factors:

Trust (40%): Higher trust = higher score (respects ICN's trust-first philosophy)
Capacity (30%): More available resources = higher score
Queue depth (20%): Shorter queue = higher score
Random jitter (10%): Break ties, prevent thundering herd

Future Protocol Types

PlacementRequest/PlacementOffer: Gossip-based negotiation (Phase 16B)

pub struct PlacementRequest {
    pub task_hash: [u8; 32],
    pub resource_profile: ResourceProfile,
    pub locality_hints: Vec<LocalityHint>,
    pub max_cost: Option<u64>,
    pub requested_at: u64,
}

pub struct PlacementOffer {
    pub executor: String,
    pub score: f64,              // 0.0 - 1.0
    pub cost: u64,               // Credits per 1000 fuel
    pub estimated_start: u64,
    pub offered_at: u64,
}

LocalityHint: Placement preferences (Phase 16C)

pub enum LocalityHint {
    PreferDid(String),              // Prefer specific executor
    PreferRegion(String),           // Geographic preference
    DataLocality(Vec<[u8; 32]>),   // Near data blobs
    AvoidDid(String),               // Blacklist
    ColocateWith([u8; 32]),         // Same node as another task
}

2. Integration

Module exported in icn-compute/src/lib.rs:

pub use scheduler::{
    DefaultPlacementPolicy, GpuDevice, GpuSpec, LocalityHint,
    NodeCapacity, NodeState, PlacementOffer, PlacementPolicy,
    PlacementRequest, ResourceProfile,
};

Backward Compatibility:

Existing Phase 15 tasks work unchanged
New tasks can optionally specify ResourceProfile
Executors without capacity tracking use legacy logic
Migration is opt-in, not forced

3. Working Example (`examples/scheduler_demo.rs`)

Demonstrates ML training job placement:

Scenario: GPU task (24GB, sm_70+) across 3 executors

Executors:

A: NVIDIA A100 (40GB, sm_80), trust 0.85, queue 5 (busy)
B: RTX 2080 Ti (24GB, sm_75), trust 0.65, queue 1 (available)
C: No GPU, trust 0.50, queue 0 (rejected)

Placement Outcome:

executor-a: score = 0.564 (high trust, but busy)
executor-b: score = 0.680 (balanced, available) ← WINNER
executor-c: REJECTED (no GPU)

The system correctly picks the available executor with adequate resources over the more powerful but busy one.

Run with:

cargo run --example scheduler_demo

Testing

7 new tests, all passing:

test_resource_profile_minimal - Default profile construction
test_resource_profile_validation - Input validation (CPU/RAM limits)
test_node_capacity_can_fit - Capacity matching logic
test_node_capacity_reserve_release - Reservation mechanics
test_default_placement_policy - Scoring algorithm
test_gpu_capacity_matching - GPU compute capability matching
Integration with existing 40 Phase 15 tests

Total: 47 tests passing in icn-compute

Test Coverage:

Resource profile validation (bounds checking)
Capacity matching (CPU, RAM, storage, GPU)
Reservation/release (prevents double-allocation)
Scoring algorithm (trust gates, multi-factor)
GPU-specific allocation (compute capability comparison)

Documentation

Comprehensive Plan (docs/scheduler-evolution-plan.md - 8,800+ words):

Vision Restatement: Trust-governed, multi-tier cooperative fabric
5-Phase Roadmap:
- 16A: Resource profiles (COMPLETE)
- 16B: Placement scoring with deliberation windows (2-3 weeks)
- 16C: Locality awareness and topology (3-4 weeks)
- 16D: Actor state and migration (4-6 weeks)
- 16E: Cooperative scheduling policies (3-4 weeks)
Integration Examples: End-to-end ML training job flow
Testing Strategy: Per-phase test requirements
Metrics & Observability: Prometheus metrics plan
Open Questions: Technical, economic, social trade-offs

Updated Project Docs:

ROADMAP.md: Added Phase 16 with 5-phase breakdown
CHANGELOG.md: Added Phase 16A completion entry
This dev journal entry

Key Design Decisions

1. Incremental Evolution

Each phase builds on the previous without disruption. Phase 16A is 100% backward compatible—existing tasks continue to work.

Rationale: Respects ICN's pilot-driven philosophy. Scheduler evolution shouldn't block pilot deployment.

2. Trust-First Gating

Trust score remains the primary gate. Resource matching is a secondary filter.

Rationale: ICN is built for trust-first communities, not anonymous markets. Resource awareness shouldn't override trust-based access control.

3. Gossip-Based Negotiation

No central scheduler. Executors independently score tasks and claim highest-score wins.

Rationale: Avoids single point of failure, aligns with ICN's distributed architecture.

4. Multi-Factor Scoring

Balances trust (40%), capacity (30%), queue (20%), jitter (10%).

Rationale:

Trust dominates (ICN's core principle)
Capacity prevents overcommitment
Queue optimizes latency
Jitter prevents thundering herd

5. Optional GPU Support

GPU matching via compute capability (sm_70, sm_80, etc.).

Rationale: ML/batch workloads are a key use case for cooperative compute. GPU support enables ML training, rendering, scientific computing.

6. Pluggable Policies

PlacementPolicy trait allows custom scoring algorithms.

Rationale: Different cooperatives have different priorities (latency vs cost, local vs distributed, etc.). Trait enables experimentation without forking core.

Performance Characteristics

Overhead:

Capacity checking: O(1) per executor
Scoring: O(1) computation per executor
No network overhead yet (Phase 16B adds gossip messages)

Memory:

ResourceProfile: ~80 bytes
NodeCapacity: ~200 bytes + GPU devices
Per-executor overhead: ~1KB

Scalability:

Phase 16A adds no new gossip traffic
Scales with existing Phase 15 performance
Phase 16B adds PlacementRequest/PlacementOffer messages (estimated 500 bytes each)

Next Steps (Phase 16B - 2-3 weeks)

Implementation Tasks

Gossip Message Types:
- Add PlacementRequest and PlacementOffer to ComputeMessage enum
- Implement serialization (bincode)
Deliberation Window:
- 500ms delay after receiving PlacementRequest
- Collect competing offers, pick highest score
- Prevent race conditions (multiple executors claiming)
Actor Integration:
- Extend ComputeActor with placement_policy: Box<dyn PlacementPolicy>
- Add on_placement_request handler
- Add on_placement_offer handler
Metrics:
- icn_compute_placement_offers_sent_total
- icn_compute_placement_wins_total
- icn_compute_placement_losses_total
- icn_compute_placement_score (histogram)
Integration Test:
- Spawn 5 executors with different capacities/trust
- Submit task with resource profile
- Verify highest-score executor wins
- Check no double-claims (deliberation prevents races)

Design Questions for Phase 16B

Deliberation Duration: 500ms balance between latency and coordination?
Offer Expiration: Should offers expire if not claimed within N seconds?
Conflict Resolution: What if two executors have identical scores?
Cost Adjustment: Should cost increase with queue depth? (current: yes, 1.5x multiplier)

Migration Path

Dual Protocol Support:

Phase 15 TaskSubmitted → instant claim (legacy)
Phase 16B PlacementRequest → deliberation → claim (new)

Submitters can choose protocol. After 6 months, deprecate TaskSubmitted.

Lessons Learned

What Went Well

Clear Vision Document: Starting with comprehensive vision statement aligned implementation
Incremental Design: 5-phase plan makes complexity manageable
Backward Compatibility: No disruption to existing system
Working Demo: scheduler_demo.rs validates design early
GPU Support: Compute capability matching was straightforward

Challenges

Trait Design: Took iterations to get PlacementPolicy trait signature right
Scoring Weights: Arbitrary weights (0.4/0.3/0.2/0.1) need empirical validation
Random Jitter: Using rand feels hacky, but necessary for tie-breaking
GPU Enumeration: Real GPU discovery (CUDA/OpenCL) deferred to OS integration

Future Improvements

Adaptive Scoring: Learn optimal weights from historical data
Simulator: Benchmark scoring algorithms against synthetic workloads
GPU Abstraction: Support AMD ROCm, Apple Metal, not just NVIDIA
Capacity Discovery: Auto-detect system resources (sysinfo crate)

Impact Assessment

Substrate Readiness

Before Phase 16A:

✅ Trust-gated task execution
✅ Priority levels
✅ Payment settlement
❌ Resource awareness
❌ Intelligent placement
❌ Locality optimization

After Phase 16A:

✅ Resource profiles (CPU/RAM/GPU/storage)
✅ Capacity tracking and reservation
✅ Pluggable placement policies
✅ Foundation for intelligent scoring
⏳ Gossip protocol (Phase 16B)
⏳ Locality awareness (Phase 16C)

Path to Vision

Completed: 20% of scheduler evolution (Phase 16A)

Remaining:

Phase 16B: Placement scoring (15%)
Phase 16C: Locality awareness (20%)
Phase 16D: Actor migration (30%)
Phase 16E: Cooperative policies (15%)

Timeline: 3-6 months to complete vision (Phases 16B-E)

Pilot Implications

Can Deploy Now:

Phase 16A doesn't block pilot deployment
Backward compatible with Phase 15 tasks
Provides foundation for future optimizations

Pilot Feedback Loop:

Phase 16B-C: Deploy during pilot, measure impact
Phase 16D-E: Build only if actor use cases emerge

Philosophy: Let pilot reveal whether advanced scheduling matters in practice.

Conclusion

Phase 16A successfully establishes the foundation for ICN's scheduler evolution. The implementation is clean, well-tested, and backward compatible. The 5-phase roadmap provides a clear path from reactive claiming to intelligent, trust-governed distributed scheduling.

Key Achievement: Proved that ICN can evolve toward sophisticated scheduling without disrupting existing functionality. This incremental approach respects the pilot-driven development philosophy while laying groundwork for the long-term vision of a multi-tier cooperative fabric.

Next Milestone: Phase 16B (Placement Scoring) - 2-3 weeks to production-ready gossip-based task placement with deliberation windows.

Files Modified:

icn/crates/icn-compute/src/scheduler.rs (new, 700+ lines)
icn/crates/icn-compute/src/lib.rs (exports)
icn/crates/icn-compute/Cargo.toml (added rand dependency)
icn/crates/icn-compute/examples/scheduler_demo.rs (new demo)
docs/scheduler-evolution-plan.md (new, 8,800+ words)
ROADMAP.md (added Phase 16, updated last modified)
CHANGELOG.md (added Phase 16A entry)
docs/dev-journal/2025-11-23-phase-16a-scheduler-foundation.md (this file)

Test Results:

$ cargo test -p icn-compute
test result: ok. 47 passed; 0 failed; 0 ignored

Commits Recommended:

git add icn/crates/icn-compute/
git add docs/scheduler-evolution-plan.md
git add ROADMAP.md CHANGELOG.md
git add docs/dev-journal/2025-11-23-phase-16a-scheduler-foundation.md
git commit -m "feat(compute): Phase 16A - scheduler evolution foundation

- Add ResourceProfile type for concrete resource requirements
- Implement NodeCapacity tracking and reservation system
- Create PlacementPolicy trait for pluggable scoring
- Add DefaultPlacementPolicy with multi-factor scoring
- Support GPU matching via compute capability
- 7 new tests, all passing (47 total)
- Comprehensive design doc (8,800+ words)
- Working demo: scheduler_demo.rs
- Backward compatible with Phase 15

Phase 16A establishes foundation for intelligent, trust-governed
distributed scheduling. First of 5 phases evolving ICN compute into
multi-tier cooperative fabric (edge/community/regional) with actor
migration and per-coop policies.

Next: Phase 16B (Placement Scoring) - 2-3 weeks

🤖 Generated with Claude Code"