ICN Service Level Objectives (SLOs)
Last Updated: 2026-01-08
Review Cadence: Quarterly
Overview
This document defines Service Level Objectives for ICN daemon deployments. SLOs provide measurable reliability targets that guide operations and incident response.
SLO Summary
| Service | Metric | Target | Error Budget (30d) |
|---|---|---|---|
| Gateway API | Availability | 99.9% | 43.2 min |
| Gossip Network | Message Delivery | 99.5% | 3.6 hours |
| Ledger Sync | Consistency | 99.9% | 43.2 min |
| API Latency | p99 Response Time | < 500ms | N/A |
| Recovery | RTO | < 15 min | N/A |
| Recovery | RPO | < 1 min | N/A |
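The error budget column follows directly from each target: the budget is the fraction of the 30-day window that is allowed to be "bad". A minimal sketch of that arithmetic, assuming a fixed 30-day window (the function name `error_budget_minutes` is illustrative and not part of ICN):

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window for a target given as a fraction.

    Example: target=0.999 means a 99.9% objective.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the range (0, 1]")
    window_minutes = window_days * 24 * 60  # 43,200 minutes for 30 days
    return (1.0 - target) * window_minutes


if __name__ == "__main__":
    for name, target in [("Gateway API", 0.999), ("Gossip Network", 0.995)]:
        minutes = error_budget_minutes(target)
        print(f"{name}: {minutes:.1f} min ({minutes / 60:.2f} h)")
```

For 99.9% this gives 43.2 minutes, and for 99.5% it gives 216 minutes, which is the 3.6 hours shown in the table.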
Availability SLOs
Gateway API Availability
Target: 99.9% uptime (monthly)
Definition: Percentage of time the Gateway API responds successfully. The measurement below approximates this with the share of requests that do not return a 5xx status.
Measurement:
# Availability over 30 days
1 - (
  sum(rate(icn_gateway_requests_total{status=~"5.."}[30d])) /
  sum(rate(icn_gateway_requests_total[30d]))
)
Error Budget: 43.2 minutes per month
Alert Thresholds:
| Severity | Condition |
|---|---|
| Warning | < 99.95% (7-day rolling) |
| Critical | < 99.9% (7-day rolling) |
| Page | < 99.5% (1-hour rolling) |
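The threshold table can be read as a small decision function. The sketch below assumes availability values are already computed as fractions for the 7-day and 1-hour windows, and that Page outranks Critical, which outranks Warning. The function name and the ordering are assumptions made for illustration.

```python
from typing import Optional


def gateway_alert_severity(avail_7d: float, avail_1h: float) -> Optional[str]:
    """Return the most severe threshold crossed, or None if all are healthy.

    avail_7d: availability over the 7-day rolling window (0.0 to 1.0)
    avail_1h: availability over the 1-hour rolling window (0.0 to 1.0)
    """
    if avail_1h < 0.995:   # Page: < 99.5% (1-hour rolling)
        return "page"
    if avail_7d < 0.999:   # Critical: < 99.9% (7-day rolling)
        return "critical"
    if avail_7d < 0.9995:  # Warning: < 99.95% (7-day rolling)
        return "warning"
    return None
```

For example, `gateway_alert_severity(0.9993, 0.9999)` returns `"warning"` because only the 99.95% threshold is crossed.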
Gossip Message Delivery
Target: 99.5% message delivery
Definition: Percentage of gossip messages successfully delivered to at least one peer.
Measurement:
sum(rate(icn_gossip_messages_delivered_total[30d])) /
sum(rate(icn_gossip_messages_sent_total[30d]))
Error Budget: 3.6 hours of message loss per month
Alert Thresholds:
| Severity | Condition |
|---|---|
| Warning | Delivery < 99% (1-hour) |
| Critical | Delivery < 95% (15-min) |
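Because delivery is a ratio of counters, it can help to express the budget as a number of messages as well as a duration. The following is a rough sketch, assuming `sent` and `delivered` are the increases of the two counters over the window. The function name and the returned keys are hypothetical.

```python
def gossip_delivery_report(sent: int, delivered: int, target: float = 0.995) -> dict:
    """Summarize delivery ratio and error budget use for one window."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= delivered <= sent:
        raise ValueError("delivered must be between 0 and sent")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in the range (0, 1)")

    lost = sent - delivered
    allowed_lost = (1.0 - target) * sent  # messages the budget permits losing
    return {
        "delivery_ratio": delivered / sent,
        "lost": lost,
        "allowed_lost": allowed_lost,
        "budget_used": lost / allowed_lost,  # 1.0 means the budget is fully spent
    }
```

With 1,000,000 messages sent and 997,000 delivered, the ratio is 99.7%, about 5,000 losses are allowed, and roughly 60% of the budget is used.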
Ledger Consistency
Target: 99.9% consistency
Definition: Percentage of time ledger state matches across connected peers.
Measurement:
# Fraction of connected peers with matching entry counts (assumes the gauge is 1 on match, 0 otherwise)
sum(icn_ledger_peer_sync_match) / count(icn_ledger_peer_sync_match)
Error Budget: 43.2 minutes of inconsistency per month
Latency SLOs
API Response Time
Target: p99 < 500ms
Definition: 99th percentile response time for Gateway API requests.
Measurement:
histogram_quantile(0.99,
  sum(rate(icn_gateway_request_duration_seconds_bucket[5m])) by (le)
)
Alert Thresholds:
| Severity | Condition |
|---|---|
| Warning | p99 > 400ms (5-min) |
| Critical | p99 > 500ms (5-min) |
| Page | p99 > 1s (5-min) |
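The p99 value above is an estimate: `histogram_quantile` locates the bucket that contains the requested rank and interpolates linearly inside it, so precision depends on the bucket boundaries. The sketch below is a simplified Python approximation of that idea, not the Prometheus implementation. It assumes classic cumulative buckets with a lower bound of 0 for the first bucket, and the function name is illustrative.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs, including the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last with le=+Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    upper, cum = buckets[index]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev_cum = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = cum - prev_cum
    if in_bucket == 0:
        return lower
    # Linear interpolation within the selected bucket.
    return lower + (upper - lower) * (rank - prev_cum) / in_bucket
```

If the rank lands in a wide bucket such as 0.5 s to 1 s, the estimate can sit well away from the true p99, so bucket boundaries near the 400 ms and 500 ms thresholds make these alerts more reliable.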
Gossip Propagation
Target: p95 < 2 seconds
Definition: Time for a message to reach its subscribed peers, evaluated at the 95th percentile of observed propagation times.
Measurement:
histogram_quantile(0.95,
  sum(rate(icn_gossip_propagation_seconds_bucket[5m])) by (le)
)
Trust Computation
Target: p99 < 100ms
Definition: Time to compute transitive trust score for a DID.
Measurement:
histogram_quantile(0.99,
  sum(rate(icn_trust_compute_duration_seconds_bucket[5m])) by (le)
)
Ledger Entry Creation
Target: p95 < 1 second
Definition: Time from entry submission to local persistence.
Measurement:
histogram_quantile(0.95,
  sum(rate(icn_ledger_entry_duration_seconds_bucket[5m])) by (le)
)
Data SLOs
Ledger Eventual Consistency
Target: Convergence within 30 seconds
Definition: Maximum time for all peers to reflect a committed entry.
Measurement:
max(icn_ledger_sync_lag_seconds)
Alert Threshold: Lag > 60 seconds
Trust Score Freshness
Target: < 5 minute staleness
Definition: Maximum age of cached trust scores.
Measurement:
max(icn_trust_cache_age_seconds)
No Data Loss
Target: 0 committed entries lost
Definition: Once an entry is acknowledged, it must never be lost.
Measurement: Compare entry hashes across backups and peers.
Alert: Any discrepancy is Critical.
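One way to frame the hash comparison is as a set difference between the acknowledged entries and what each peer or backup holds. The sketch below assumes entry hashes are available as strings. How they are exported from the daemon or from backups is not specified here, and the function name is hypothetical.

```python
from typing import Dict, Set


def find_missing_entries(
    acknowledged: Set[str], replicas: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Map each replica (peer or backup) to the acknowledged hashes it lacks.

    An empty result means every replica holds every acknowledged entry.
    """
    discrepancies = {}
    for name, hashes in replicas.items():
        missing = acknowledged - hashes
        if missing:
            discrepancies[name] = missing
    return discrepancies
```

Per the alert rule above, any non-empty result would be raised as Critical.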
Recovery SLOs
Recovery Time Objective (RTO)
Target: < 15 minutes
Definition: Maximum time from failure detection to service restoration.
Includes:
- Detection time (< 2 min with alerting)
- Response time (< 5 min for on-call)
- Recovery time (< 8 min for restart/restore)
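The three phase budgets sum to exactly 15 minutes, so an overrun in any phase has to be recovered in another for the RTO to hold. A small sketch for checking a measured incident timeline against these budgets follows. The constant and function names are illustrative, and durations are assumed to be in minutes.

```python
# Phase budgets in minutes, taken from the RTO breakdown.
RTO_PHASE_BUDGETS = {"detection": 2.0, "response": 5.0, "recovery": 8.0}
RTO_TARGET_MIN = 15.0


def review_recovery(actual_minutes: dict) -> dict:
    """Compare measured phase durations against the per-phase budgets."""
    missing = set(RTO_PHASE_BUDGETS) - set(actual_minutes)
    if missing:
        raise ValueError(f"missing phases: {sorted(missing)}")

    total = sum(actual_minutes[phase] for phase in RTO_PHASE_BUDGETS)
    # Budgets are strict upper bounds ("<"), so reaching a budget counts as over.
    over = sorted(
        phase
        for phase, budget in RTO_PHASE_BUDGETS.items()
        if actual_minutes[phase] >= budget
    )
    return {
        "total_minutes": total,
        "rto_met": total < RTO_TARGET_MIN,
        "phases_over_budget": over,
    }
```

An incident with 3 minutes of detection, 4 of response, and 7 of recovery still meets the RTO at 14 minutes, but the result flags detection as over budget.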
Recovery Point Objective (RPO)
Target: < 1 minute
Definition: Maximum data loss in a failure scenario.
Achieved via:
- Gossip replication (real-time)
- State snapshots (configurable interval)
- Backups (recommended: hourly)
Incident Severity Mapping
| Severity | SLO Impact | Response |
|---|---|---|
| P1 Critical | Multiple SLOs breached | Immediate page, all hands |
| P2 High | Single SLO breached | Page on-call |
| P3 Medium | More than 50% of error budget consumed | Respond within 4 hours |
| P4 Low | More than 25% of error budget consumed | Respond within 24 hours |
Error Budget Policy
When Error Budget is Healthy (> 50% remaining)
- Normal development velocity
- Deploy at will during business hours
- Experiment with new features
When Error Budget is Low (25-50% remaining)
- Prioritize reliability work
- Require rollback plans for deploys
- No non-critical experiments
When Error Budget is Exhausted (< 25% remaining)
- Freeze non-critical changes
- Focus exclusively on reliability
- Incident review required before resuming
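The three tiers can be expressed as a lookup on the fraction of error budget remaining. This sketch assumes the percentages in the tier headings refer to budget remaining, and that the boundary values of exactly 50% and 25% fall into the Low tier. The function name is hypothetical.

```python
def budget_policy_tier(remaining_fraction: float) -> str:
    """Map the remaining error budget (0.0 to 1.0) to a policy tier."""
    if not 0.0 <= remaining_fraction <= 1.0:
        raise ValueError("remaining_fraction must be between 0 and 1")
    if remaining_fraction > 0.50:
        return "healthy"    # normal development velocity
    if remaining_fraction >= 0.25:
        return "low"        # prioritize reliability, require rollback plans
    return "exhausted"      # freeze non-critical changes
```

A deployment pipeline could call this before a release and block non-critical changes when the tier is `"exhausted"`.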
SLO Dashboard
The Grafana dashboard should display:
- Current SLO Status - Green/Yellow/Red for each SLO
- Error Budget Remaining - Percentage and time remaining
- Burn Rate - How fast error budget is depleting
- SLO Trends - 7-day and 30-day compliance
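Burn rate is commonly defined as the observed error ratio divided by the error ratio the SLO allows. A value of 1 spends the budget exactly over the full window, and higher values exhaust it proportionally sooner. A minimal sketch under that definition, assuming a 30-day window (both function names are illustrative):

```python
import math


def burn_rate(error_ratio: float, target: float) -> float:
    """Observed error ratio relative to the ratio the SLO allows."""
    allowed_ratio = 1.0 - target
    if allowed_ratio <= 0.0:
        raise ValueError("target must be below 1.0")
    return error_ratio / allowed_ratio


def hours_to_exhaustion(
    rate: float, remaining_fraction: float = 1.0, window_days: int = 30
) -> float:
    """Hours until the remaining budget is spent at a constant burn rate."""
    if rate <= 0.0:
        return math.inf  # no errors, so the budget is never spent
    return remaining_fraction * window_days * 24 / rate
```

For the Gateway API target of 99.9%, an error ratio of 0.2% is a burn rate of about 2, which would spend a full budget in 360 hours (15 days).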
Review Process
Monthly Review
- Calculate actual vs target for each SLO
- Review incidents that impacted SLOs
- Adjust thresholds if needed
- Update error budgets
Quarterly Review
- Assess if SLO targets are appropriate
- Review new capabilities that need SLOs
- Update documentation
- Communicate changes to stakeholders
Appendix: Prometheus Alerts
# Example alert rules for SLOs
groups:
  - name: slo-alerts
    rules:
      # Critical threshold from the Gateway API table: < 99.9% over 7 days
      - alert: GatewayAvailabilityLow
        expr: |
          (1 - sum(rate(icn_gateway_requests_total{status=~"5.."}[7d])) /
          sum(rate(icn_gateway_requests_total[7d]))) < 0.999
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Gateway availability below 99.9% (7-day rolling)
      # Critical threshold from the API Response Time table: p99 > 500ms
      - alert: APILatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(icn_gateway_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: API p99 latency above 500ms
These SLOs are targets for pilot deployment. Production deployments may have stricter requirements.