Internal Testing Infrastructure: Pre-Pilot Validation

Overview

Following the completion of Phase 18 (Byzantine Fault Detection), we built internal testing infrastructure to validate multi-node behavior before pilot deployment. Its purpose is to confirm that the system behaves correctly under realistic network conditions across governance, ledger, compute, and Byzantine scenarios.

Motivation

Phase 18 added Byzantine fault detection at the code level, but we needed:

Multi-node validation in realistic deployment
Governance testing with actual voting across nodes
Performance baselines under load
Operational procedure validation
Clear Go/No-Go criteria before external pilot

What We Built

1. Docker-Based Test Environment

Files Created:

Dockerfile - Multi-stage build (rust:slim → debian:trixie-slim)
docker-compose.test.yml - 6-service orchestration
config/node1-4.toml - Per-node TOML configuration

Architecture:

┌─────────────────────────────────────────────────────────────┐
│                    Docker Network (icn_test)                │
├─────────────────────────────────────────────────────────────┤
│  node1:5001     node2:5002     node3:5003     node4:5004   │
│  (honest)       (honest)       (honest)       (byzantine)   │
│  metrics:9091   metrics:9092   metrics:9093   metrics:9094  │
├─────────────────────────────────────────────────────────────┤
│  prometheus:9095              grafana:3000                  │
│  (scrapes all nodes)          (visualization)               │
└─────────────────────────────────────────────────────────────┘

Key Configuration Decisions:

Config files over CLI args: Each node reads /config/nodeX.toml for maintainability
Metrics on port 9100: The ICN daemon default inside each container, mapped to host ports 9091-9094
Prometheus on 9095: Avoids a port conflict with VS Code on 9090
Byzantine profile: node4 starts only when --profile byzantine is passed
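
As a rough illustration of the profile and port decisions, the helper below assembles the docker compose invocation for the test network. The function names compose_up_cmd and metrics_host_port are hypothetical and not part of the repository; the compose file name, profile name, and port range come from this document, and --profile is the standard Docker Compose option for enabling profile-gated services. The helper only prints the command so it can be inspected before running.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the repository).
COMPOSE_FILE="docker-compose.test.yml"

# Print the docker compose command for the test network.
# $1 = "byzantine" to include node4; anything else leaves node4 out.
compose_up_cmd() {
    if [ "${1:-}" = "byzantine" ]; then
        echo "docker compose -f $COMPOSE_FILE --profile byzantine up -d"
    else
        echo "docker compose -f $COMPOSE_FILE up -d"
    fi
}

# Host metrics port for node N, following the 9091-9094 mapping
# (container port 9100 is the same for every node).
metrics_host_port() {
    echo $((9090 + $1))
}

compose_up_cmd            # node4 is skipped without the profile
compose_up_cmd byzantine  # same network, plus node4
```

Keeping node4 behind a profile means the baseline tests can run against honest nodes only, with the Byzantine node added later without editing the compose file.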

2. Comprehensive Test Plan (38 Scenarios)

File: docs/INTERNAL_TESTING_PLAN.md (1,000+ lines)

Test Categories:

Category	Count	Key Scenarios
Baseline Functionality	10	Network formation, gossip, ledger, governance
Byzantine Detection	10	Invalid signatures, replay, forks, governance attacks
Performance & Load	6	Throughput benchmarks, 24-hour soak test
Resilience	5	Crash recovery, partition healing
Operational	4	Backup/restore, upgrades, incidents

Governance Scenarios (9 total):

Domain creation & gossip sync
Proposal lifecycle (majority passes)
Proposal lifecycle (quorum fails)
WebSocket event delivery
Vote manipulation attack (non-member voting)
Double voting attack
Proposal spam resilience
Conflicting outcomes under partition
Load test (100 proposals, 300 votes)

3. Monitoring Stack

Prometheus Configuration: monitoring/prometheus.yml

Scrapes all 4 nodes at port 9100
15-second interval
Node role labels (honest vs byzantine)
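
A scrape configuration matching these three points might look roughly like the output of the sketch below. It is produced by a small shell function (hypothetical, named emit_scrape_config) only to keep the node list and role labels in one place. The job name, the role label key, and the assumption that the service names node1 to node4 resolve on the Docker network are illustrative and not taken from the actual monitoring/prometheus.yml.

```bash
# Hypothetical generator for a prometheus.yml-style scrape config.
emit_scrape_config() {
    echo "global:"
    echo "  scrape_interval: 15s"
    echo "scrape_configs:"
    echo "  - job_name: icn-nodes"
    echo "    static_configs:"
    for n in 1 2 3 4; do
        # node4 is the Byzantine node in the test topology; the rest are honest.
        if [ "$n" -eq 4 ]; then role="byzantine"; else role="honest"; fi
        echo "      - targets: ['node$n:9100']"
        echo "        labels:"
        echo "          role: $role"
    done
}

emit_scrape_config
```

Giving each target its own labels entry lets alert rules and dashboard panels filter by role without hardcoding node names.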

Alert Rules: monitoring/alert_rules.yml (25 rules)

Byzantine detection (4 rules)
Network health (4 rules)
Ledger consistency (3 rules)
Gossip performance (2 rules)
Compute layer (3 rules)
Governance (3 rules)
System resources (3 rules)
Monitoring system (2 rules)

Grafana Dashboard: monitoring/grafana-dashboard.json

Auto-provisioned datasource
Network, Byzantine, Governance panels
Real-time metrics visualization

4. Configuration Validation Script

File: scripts/validate-test-config.sh (170 lines)

Validates:

11 required files exist
Node config files (metrics_port, listen_addr)
Docker Compose port mappings
Dockerfile EXPOSE and HEALTHCHECK
Prometheus scrape targets
Documentation port references
Prerequisites (Docker 24+, Docker Compose 2.20+)

Usage:

./scripts/validate-test-config.sh
# Output: ✓ All checks passed! or list of errors/warnings
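
The error-collection approach behind this output can be sketched as follows. This is a minimal illustration, not the actual script: the function names and the two-file list in the example call are made up. Each check records a failure and keeps going, which is why set -e (which aborts on the first failing command) does not fit this pattern.

```bash
#!/usr/bin/env bash
# Sketch of an error-collecting validator (illustrative, not the real script).
# Deliberately no `set -e`: every check must run even if an earlier one fails.
errors=0

# Report a missing file and count it instead of aborting.
check_file() {
    if [ -f "$1" ]; then
        echo "✓ $1"
    else
        echo "✗ missing: $1"
        errors=$((errors + 1))
    fi
}

# Check every path given as an argument, then summarise.
validate_files() {
    errors=0
    for f in "$@"; do
        check_file "$f"
    done
    if [ "$errors" -eq 0 ]; then
        echo "✓ All checks passed!"
        return 0
    fi
    echo "$errors error(s) found"
    return 1
}

validate_files Dockerfile docker-compose.test.yml || echo "Fix the errors above before building."
```

Reporting every problem in one run is more useful here than failing fast, since several of the configuration issues found during development were independent of each other.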

5. Documentation Suite

File	Lines	Purpose
INTERNAL_TESTING_PLAN.md	1,000+	Complete test scenarios with success criteria
TESTING_QUICKSTART.md	500+	Step-by-step manual test procedures
DEPLOY_TEST_NETWORK.md	400+	Host system deployment guide

Configuration Issues Resolved

During development, we discovered and fixed 8 Docker configuration issues:

Issue	Fix	Commit
COPY paths wrong for build context	Use relative paths	`1982498`
Missing keystore passphrase	Add ICN_PASSPHRASE env var	`007f043`
Unsupported env vars (ICN_DATA_DIR etc.)	Use CLI arguments	`139d62e`
Unsupported --bind argument	Remove, use config file	`73ad13b`
Kubernetes files cluttering git	Add to .gitignore	`25ee80b`
Node4 inconsistent (no config)	Create node4.toml	`992f517`
Wrong metrics port (9090 vs 9100)	Fix Dockerfile, docs	`992f517`
set -e incompatible with error collection	Remove set -e	`5d101db`

Go/No-Go Criteria

Before proceeding to pilot deployment, all criteria must pass:

Must Pass (8 criteria)

All 38 test scenarios pass
No crashes/panics in 24-hour soak test
Byzantine nodes detected within the 1-minute SLA
Governance voting works correctly (no vote loss)
No false positives (honest nodes never quarantined)
Ledger consistency maintained (no undetected forks)
Network recovers from partitions in under 2 minutes
Stable memory usage (<2 GB/node, no leaks)
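
Several of these criteria are numeric thresholds, so part of the decision can be mechanised. The sketch below is one possible shape for such a gate, not something the project ships: the function name go_no_go and its three inputs (detection time in seconds, partition recovery time in seconds, peak memory in MB) are assumptions, and the thresholds are the ones listed above, with 2 GB taken as 2048 MB.

```bash
# Hypothetical gate over three of the measurable criteria above.
# Args: detection_secs recovery_secs peak_mem_mb (all integers).
go_no_go() {
    failures=0
    # Byzantine detection must complete within the 1-minute SLA.
    [ "$1" -le 60 ] || { echo "FAIL: detection took ${1}s (limit 60s)"; failures=$((failures + 1)); }
    # Partition recovery must take less than 2 minutes.
    [ "$2" -lt 120 ] || { echo "FAIL: recovery took ${2}s (limit <120s)"; failures=$((failures + 1)); }
    # Memory must stay below 2 GB per node.
    [ "$3" -lt 2048 ] || { echo "FAIL: peak memory ${3} MB (limit <2048 MB)"; failures=$((failures + 1)); }
    if [ "$failures" -eq 0 ]; then
        echo "GO"
    else
        echo "NO-GO ($failures failed)"
        return 1
    fi
}

go_no_go 45 90 1500 || true
```

The remaining criteria (scenario pass counts, no panics, no false positives, no vote loss) would still need evidence from the test runs themselves; this gate only covers the thresholds that reduce to a single number.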

Performance Targets

Gossip throughput: 1000 msg/sec
Ledger transactions: 50 tx/sec
Compute task queue: 500 concurrent tasks
Governance load: 100 proposals and 300 votes processed in under 5 minutes

Testing Timeline

Week 1 (Days 1-5): Baseline

Build Docker image
Deploy 3-node network
Run 10 baseline tests
Establish performance baselines

Week 2 (Days 6-12): Comprehensive

Byzantine detection tests (10)
Governance tests (9)
Performance tests (6)
Resilience tests (5)
Operational tests (4)
24-hour soak test
Go/No-Go Decision

Security Considerations

Test Environment:

Hardcoded passphrase: test-passphrase-insecure-do-not-use-in-production
Prominent warnings in all documentation
Isolated Docker network
Non-root container users
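
One way to back the written warnings with a mechanical check is to refuse the known test passphrase anywhere outside the test environment. The guard below is a sketch only: ICN_PASSPHRASE is the variable named in this document, while DEPLOY_ENV, its values, and the function name passphrase_allowed are hypothetical.

```bash
# The hardcoded test passphrase from the test environment.
TEST_PASSPHRASE="test-passphrase-insecure-do-not-use-in-production"

# $1 = deployment environment (hypothetical values: "test", "production")
# $2 = passphrase to check
passphrase_allowed() {
    if [ "$2" = "$TEST_PASSPHRASE" ] && [ "$1" != "test" ]; then
        echo "refusing insecure test passphrase in '$1' environment" >&2
        return 1
    fi
    return 0
}

# Defaults to the test environment when DEPLOY_ENV is unset (an assumption).
passphrase_allowed "${DEPLOY_ENV:-test}" "${ICN_PASSPHRASE:-$TEST_PASSPHRASE}" && echo "passphrase check ok"
```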

Production Differences:

Secure secrets management (Vault, K8s secrets)
TLS termination via reverse proxy
Authentication on Grafana
Long-term metrics retention
HTTPS everywhere

Files Changed

Created (12 files)

Dockerfile
docker-compose.test.yml
config/node1.toml, node2.toml, node3.toml, node4.toml
monitoring/prometheus.yml, alert_rules.yml, grafana-datasource.yml
scripts/validate-test-config.sh
docs/INTERNAL_TESTING_PLAN.md, TESTING_QUICKSTART.md
DEPLOY_TEST_NETWORK.md

Modified (4 files)

CLAUDE.md (added testing infrastructure section)
.gitignore (exclude deploy/k8s/)
monitoring/grafana-dashboard.json (simplified)
Various documentation (port references)

Lessons Learned

Config files > CLI args: Complex services need declarative configuration
Verify actual ports: Don't assume - check what the binary actually uses
Test environment security: Hardcoded credentials need prominent warnings
Error collection pattern: set -e aborts on the first failure, so it can't be combined with error counting
Documentation consistency: Port references must match across all files

Next Steps

Deploy on host system: Run validation script, build image, start network
Execute Week 1 tests: Baseline functionality and performance
Execute Week 2 tests: Byzantine, governance, resilience
Go/No-Go decision: All criteria must pass
Proceed to Track C1: Pilot community selection

Impact

Before: Risk of deploying untested multi-node system to pilot community
After: Comprehensive validation with clear success criteria

The internal testing infrastructure bridges the gap between Phase 18 completion and Track C1 pilot deployment, ensuring we don't expose pilot communities to untested code paths.

Status: ✅ Complete
Next Action: Deploy on host system and begin 2-week testing timeline
Blocks: Track C1 (Pilot Community Selection) pending Go/No-Go decision