ICN Service Level Objectives (SLOs)

Last Updated: 2026-01-08
Review Cadence: Quarterly

Overview

This document defines Service Level Objectives for ICN daemon deployments. SLOs provide measurable reliability targets that guide operations and incident response.

SLO Summary

Service	Metric	Target	Error Budget (30d)
Gateway API	Availability	99.9%	43.2 min
Gossip Network	Message Delivery	99.5%	3.6 hours
Ledger Sync	Consistency	99.9%	43.2 min
API Latency	p99 Response Time	< 500ms	N/A
Recovery	RTO	< 15 min	N/A
Recovery	RPO	< 1 min	N/A
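
The Error Budget column follows from the target and the 30-day window: the budget is the share of the window that is allowed to be unreliable. A minimal Python sketch of that arithmetic, where the function name `error_budget_minutes` is illustrative and not part of ICN:

```python
def error_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unreliability for a percentage target over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - target_pct / 100) * window_minutes


# Targets taken from the SLO Summary table.
for service, target in [("Gateway API", 99.9), ("Gossip Network", 99.5), ("Ledger Sync", 99.9)]:
    minutes = error_budget_minutes(target)
    print(f"{service}: {minutes:.1f} min ({minutes / 60:.2f} h)")
```

A 99.9% target yields 43.2 minutes and a 99.5% target yields 216 minutes (3.6 hours), matching the table.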

Availability SLOs

Gateway API Availability

Target: 99.9% uptime (monthly)

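Definition: Percentage of time the Gateway API responds to health checks successfully. The measurement below approximates this with the ratio of non-5xx responses to all Gateway requests.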

Measurement:

# Availability over 30 days
1 - (
  sum(rate(icn_gateway_requests_total{status=~"5.."}[30d])) /
  sum(rate(icn_gateway_requests_total[30d]))
)

Error Budget: 43.2 minutes per month

Alert Thresholds:

Severity	Condition
Warning	< 99.95% (7-day rolling)
Critical	< 99.9% (7-day rolling)
Page	< 99.5% (1-hour rolling)

Gossip Message Delivery

Target: 99.5% message delivery

Definition: Percentage of gossip messages successfully delivered to at least one peer.

Measurement:

sum(rate(icn_gossip_messages_delivered_total[30d])) /
sum(rate(icn_gossip_messages_sent_total[30d]))

Error Budget: 0.5% of messages per month (equivalent to 3.6 hours of total message loss)

Alert Thresholds:

Severity	Condition
Warning	Delivery < 99% (1-hour)
Critical	Delivery < 95% (15-min)

Ledger Consistency

Target: 99.9% consistency

Definition: Percentage of time ledger state matches across connected peers.

Measurement:

# Fraction of connected peers with matching entry counts
# (assumes the gauge is 1 on a match and 0 otherwise)
sum(icn_ledger_peer_sync_match) / count(icn_ledger_peer_sync_match)

Error Budget: 43.2 minutes of inconsistency per month

Latency SLOs

API Response Time

Target: p99 < 500ms

Definition: 99th percentile response time for Gateway API requests.

Measurement:

histogram_quantile(0.99,
  sum(rate(icn_gateway_request_duration_seconds_bucket[5m])) by (le)
)
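
To make the p99 query easier to reason about, the sketch below shows how a quantile can be estimated from cumulative histogram buckets using linear interpolation inside the bucket that contains the target rank. It is a simplified illustration of the idea behind `histogram_quantile`, not the Prometheus implementation; the function name `estimate_quantile` and the bucket counts are made up for the example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by bound,
    ending with the overflow bucket (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket counts, for illustration only.
example_buckets = [(0.1, 900), (0.25, 980), (0.5, 995), (1.0, 1000), (math.inf, 1000)]
p99 = estimate_quantile(0.99, example_buckets)
print(f"estimated p99: {p99 * 1000:.0f} ms")
```

With these sample counts the estimate lands at roughly 417 ms, inside the 250 to 500 ms bucket. Because the value is interpolated, its accuracy depends on having bucket boundaries close to the thresholds that matter (400ms, 500ms, and 1s here).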

Alert Thresholds:

Severity	Condition
Warning	p99 > 400ms (5-min)
Critical	p99 > 500ms (5-min)
Page	p99 > 1s (5-min)

Gossip Propagation

Target: p95 < 2 seconds

Definition: Time for a message to reach 95% of subscribed peers.

Measurement:

histogram_quantile(0.95,
  sum(rate(icn_gossip_propagation_seconds_bucket[5m])) by (le)
)

Trust Computation

Target: p99 < 100ms

Definition: Time to compute transitive trust score for a DID.

Measurement:

histogram_quantile(0.99,
  sum(rate(icn_trust_compute_duration_seconds_bucket[5m])) by (le)
)

Ledger Entry Creation

Target: p95 < 1 second

Definition: Time from entry submission to local persistence.

Measurement:

histogram_quantile(0.95,
  sum(rate(icn_ledger_entry_duration_seconds_bucket[5m])) by (le)
)

Data SLOs

Ledger Eventual Consistency

Target: Convergence within 30 seconds

Definition: Maximum time for all peers to reflect a committed entry.

Measurement:

max(icn_ledger_sync_lag_seconds)

Alert Threshold: Lag > 60 seconds

Trust Score Freshness

Target: < 5 minute staleness

Definition: Maximum age of cached trust scores.

Measurement:

max(icn_trust_cache_age_seconds)

No Data Loss

Target: 0 committed entries lost

Definition: Once an entry is acknowledged, it must never be lost.

Measurement: Compare entry hashes across backups and peers.

Alert: Any discrepancy is Critical.
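
The hash comparison described above can be sketched as a set difference across sources. This is only an illustration: the source names, the hash strings, and the function `find_missing_entries` are hypothetical, and how entry hashes are exported from peers and backups is not specified in this document.

```python
def find_missing_entries(hashes_by_source):
    """Return, per source, the entry hashes seen elsewhere but absent from that source."""
    all_hashes = set().union(*hashes_by_source.values())
    missing = {}
    for source, hashes in hashes_by_source.items():
        absent = all_hashes - hashes
        if absent:
            missing[source] = absent
    return missing


# Hypothetical inputs: acknowledged entry hashes collected from two peers and a backup.
observed = {
    "peer-a": {"h1", "h2", "h3"},
    "peer-b": {"h1", "h2"},
    "backup": {"h1", "h2", "h3"},
}
for source, absent in find_missing_entries(observed).items():
    print(f"{source} is missing {sorted(absent)}")
```

An empty result means every source holds every known entry. A check like this would likely need to ignore entries newer than the 30-second convergence target, since a peer that is still syncing is not evidence of loss.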

Recovery SLOs

Recovery Time Objective (RTO)

Target: < 15 minutes

Definition: Maximum time from failure detection to service restoration.

Includes:

Detection time (< 2 min with alerting)
Response time (< 5 min for on-call)
Recovery time (< 8 min for restart/restore)

Recovery Point Objective (RPO)

Target: < 1 minute

Definition: Maximum data loss in a failure scenario.

Achieved via:

Gossip replication (real-time)
State snapshots (configurable interval)
Backups (recommended: hourly)

Incident Severity Mapping

Severity	SLO Impact	Response
P1 Critical	Multiple SLOs breached	Immediate page, all hands
P2 High	Single SLO breached	Page on-call
P3 Medium	More than 50% of error budget consumed	Respond within 4 hours
P4 Low	More than 25% of error budget consumed	Respond within 24 hours
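
The mapping above can be expressed as a small decision function. This is a sketch under assumptions: `breached_slos` (how many SLOs are currently breached) and `budget_consumed` (the fraction of the 30-day error budget already used) are hypothetical inputs, and SLO breaches are assumed to take precedence over budget depletion.

```python
from typing import Optional


def incident_severity(breached_slos: int, budget_consumed: float) -> Optional[str]:
    """Map SLO impact to a severity label, checking the most severe condition first."""
    if breached_slos > 1:
        return "P1 Critical"
    if breached_slos == 1:
        return "P2 High"
    if budget_consumed > 0.50:
        return "P3 Medium"
    if budget_consumed > 0.25:
        return "P4 Low"
    return None  # No incident under this mapping.


print(incident_severity(breached_slos=0, budget_consumed=0.30))
```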

Error Budget Policy

When Error Budget is Healthy (> 50% remaining)

Normal development velocity
Deploy at will during business hours
Experiment with new features

When Error Budget is Low (25-50% remaining)

Prioritize reliability work
Require rollback plans for deploys
No non-critical experiments

When Error Budget is Exhausted (< 25% remaining)

Freeze non-critical changes
Focus exclusively on reliability
Incident review required before resuming
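
As an illustration of how the three tiers could gate deployments, the sketch below maps the remaining budget to a tier name. The function `budget_policy_tier` is hypothetical, and treating exactly 50% and exactly 25% as "low" is an assumption, since the policy does not say which tier owns the boundary values.

```python
def budget_policy_tier(remaining: float) -> str:
    """Return the policy tier for a remaining error budget given as a fraction (0.0 to 1.0)."""
    if remaining > 0.50:
        return "healthy"    # Normal velocity, deploy at will during business hours.
    if remaining >= 0.25:
        return "low"        # Prioritize reliability, require rollback plans.
    return "exhausted"      # Freeze non-critical changes.


print(budget_policy_tier(0.40))
```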

SLO Dashboard

The Grafana dashboard should display:

Current SLO Status - Green/Yellow/Red for each SLO
Error Budget Remaining - Percentage and time remaining
Burn Rate - How fast error budget is depleting
SLO Trends - 7-day and 30-day compliance
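
The burn rate panel can be read as the observed error ratio divided by the error ratio the SLO allows. A burn rate of 1 uses up the budget exactly at the end of the window, and higher values exhaust it proportionally sooner. The sketch below shows the arithmetic; the function names are illustrative, and it assumes the burn rate stays constant.

```python
def burn_rate(error_ratio: float, target: float) -> float:
    """Observed error ratio relative to the ratio allowed by the target (both as fractions)."""
    return error_ratio / (1 - target)


def hours_until_exhausted(remaining_fraction: float, rate: float, window_days: int = 30) -> float:
    """Hours until the remaining budget is gone if the burn rate holds steady."""
    if rate <= 0:
        return float("inf")
    return remaining_fraction * window_days * 24 / rate


# Example: 0.5% of Gateway requests failing against the 99.9% availability target.
rate = burn_rate(error_ratio=0.005, target=0.999)
print(f"burn rate: {rate:.1f}x, budget gone in {hours_until_exhausted(1.0, rate):.0f} h")
```

At a 0.5% error ratio the Gateway budget burns about five times faster than allowed, so a full 30-day budget would last roughly 144 hours (6 days).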

Review Process

Monthly Review

Calculate actual vs target for each SLO
Review incidents that impacted SLOs
Adjust thresholds if needed
Update error budgets

Quarterly Review

Assess if SLO targets are appropriate
Review new capabilities that need SLOs
Update documentation
Communicate changes to stakeholders

Appendix: Prometheus Alerts

# Example alert rules for SLOs
groups:
  - name: slo-alerts
    rules:
      - alert: GatewayAvailabilityLow
        # 7-day window matches the Critical threshold in the Gateway API Alert Thresholds table
        expr: |
          (1 - sum(rate(icn_gateway_requests_total{status=~"5.."}[7d])) /
               sum(rate(icn_gateway_requests_total[7d]))) < 0.999
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Gateway availability below 99.9% (7-day rolling)

      - alert: APILatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(icn_gateway_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: API p99 latency above 500ms

These SLOs are targets for pilot deployment. Production deployments may have stricter requirements.