In the vast ocean of available metrics, what should you actually monitor? This question has plagued operations teams since the dawn of distributed systems.
Monitor too little and you'll miss critical issues. Monitor too much and you'll drown in data without meaningful insight.
What are the Golden Signals?
The golden signals are four key metrics identified by Google's SRE practice as the most important indicators of service health. Together they answer fundamental questions about your service.
The Four Signals
| Signal | Question It Answers |
|---|---|
| Latency | How long does it take to serve requests? |
| Traffic | How much demand is on the system? |
| Errors | What percentage of requests fail? |
| Saturation | How full is the service? |
Latency
Latency measures how long it takes to service requests. This includes both successful and failed requests.
```yaml
# Latency metrics to track
latency_metrics:
  - name: request_duration_seconds
    type: histogram
    percentiles:
      - p50  # Typical experience
      - p95  # Common bad experience
      - p99  # Worst experience
    labels:
      - endpoint
      - status_code
```
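The percentile labels above map to simple order statistics. As an illustration only (Prometheus estimates quantiles from histogram buckets rather than raw samples), a nearest-rank percentile over a list of durations might look like this:

```python
# Illustration only: nearest-rank percentile over raw duration samples.
# Prometheus instead estimates quantiles from histogram buckets.

def percentile(samples, p):
    """Return the nearest-rank p-th percentile (0 < p <= 100) of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

durations = [0.03, 0.05, 0.04, 0.12, 0.06, 0.05, 0.90, 0.04, 0.05, 0.07]
print(percentile(durations, 50))  # 0.05 - typical experience
print(percentile(durations, 99))  # 0.9 - worst experience, dominated by the outlier
```

Note how a single slow request barely moves the median but dominates p99, which is why averages alone hide bad tail experiences.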
Traffic
Traffic quantifies demand on your system. The specific metric depends on service type:
- Web services - HTTP requests per second
- Databases - Transactions per minute
- Message queues - Messages processed per second
- APIs - API calls per hour
Traffic trends reveal usage patterns and correlate with resource consumption.
Errors
Errors track the rate of failed requests. This includes:
- Explicit failures - HTTP 500 responses, exceptions
- Implicit failures - Slow requests missing SLO targets
- Wrong content - Incorrect responses that passed validation
```yaml
# Error rate calculation
error_rate:
  formula: "failed_requests / total_requests * 100"
  good_threshold: "<1%"
  alert_threshold: ">5%"
```
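The formula above is straightforward to express in code. A minimal Python sketch, using made-up example counts:

```python
# Sketch of the error-rate formula above, with the stated thresholds.
def error_rate(failed_requests, total_requests):
    """Failed requests as a percentage of total requests."""
    if total_requests == 0:
        return 0.0  # no traffic means no measurable error rate
    return failed_requests / total_requests * 100

rate = error_rate(failed_requests=12, total_requests=4000)
print(round(rate, 2))  # 0.3 - within the <1% "good" threshold
print(rate > 5.0)      # False - below the >5% alert threshold
```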
Error rates directly indicate whether the service is meeting its purpose.
Saturation
Saturation measures how full your service is. It relates to constrained resources:
- CPU utilization
- Memory usage
- Network bandwidth
- Storage capacity
- Connection pools
- Thread pools
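Saturation for each of these resources reduces to used-over-capacity. A hedged sketch (resource names, readings, and the 80% threshold are illustrative, not recommendations):

```python
# Sketch: report saturation for several constrained resources and flag
# any that cross a warning threshold. Names and numbers are examples.
def check_saturation(resources, warn_at=80.0):
    """Return (name, percent) for each resource at or above warn_at percent."""
    alerts = []
    for name, (used, capacity) in resources.items():
        pct = used / capacity * 100
        if pct >= warn_at:
            alerts.append((name, pct))
    return alerts

resources = {
    "cpu":            (6.5, 10),   # cores busy / cores total
    "memory_gb":      (28, 32),
    "db_connections": (95, 100),
}
print(check_saturation(resources))  # flags memory_gb and db_connections
```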
Why Golden Signals Monitoring Matters
Golden signals monitoring provides advantages over unfocused metric collection.
Coverage Completeness
The four signals touch every dimension relevant to users:
- Latency and errors measure quality of completed requests
- Traffic provides demand context
- Saturation reveals capacity constraints
Monitoring all four ensures no critical dimension is overlooked.
Simplicity Enables Focus
Rather than dashboards with hundreds of metrics, golden signals provide a manageable set:
```
┌──────────────────────────────────────────┐
│              SERVICE HEALTH              │
├──────────┬──────────┬──────────┬────────┤
│ Latency  │ Traffic  │  Errors  │ Satur. │
│   45ms   │  1.2k/s  │   0.1%   │  65%   │
│   p99    │ current  │   rate   │  CPU   │
└──────────┴──────────┴──────────┴────────┘
```
This focus improves incident detection and reduces cognitive load.
Universal Applicability
The framework works across service types:
- Web applications
- Message queues
- Databases
- Microservices
- API gateways
This consistency simplifies training and enables pattern recognition.
User Experience Alignment
Golden signals directly reflect what customers care about. Unlike internal metrics that may or may not correlate with impact, these signals measure user-facing aspects.
Root Cause Correlation
Most issues manifest in one or more signals before becoming outages. Learning to interpret patterns accelerates diagnosis:
| Pattern | Likely Cause |
|---|---|
| High saturation then high latency | Resource exhaustion |
| High errors with normal latency | Application bug |
| Traffic spike then all signals degrade | Capacity exceeded |
| Latency increase without saturation | Dependency issue |
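The table above can be read as a rough decision procedure. A hedged Python sketch encoding those four patterns (real triage needs far more context than four booleans; this only illustrates how signals combine):

```python
# Sketch: the diagnosis patterns from the table, as a simple lookup
# over boolean signal states. Illustrative, not a real triage engine.
def likely_cause(high_latency, high_errors, high_saturation, traffic_spike):
    if traffic_spike and high_latency and high_saturation:
        return "capacity exceeded"
    if high_saturation and high_latency:
        return "resource exhaustion"
    if high_errors and not high_latency:
        return "application bug"
    if high_latency and not high_saturation:
        return "dependency issue"
    return "unclear: inspect dashboards"

print(likely_cause(high_latency=True, high_errors=False,
                   high_saturation=True, traffic_spike=False))
# resource exhaustion
```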
How to Implement Golden Signals Monitoring
Implementing golden signals requires thoughtful metric selection and instrumentation.
Latency Monitoring
Measure request duration at the service boundary:
```promql
# Prometheus queries for latency

# Median latency
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 99th percentile
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```
Key practices:
- Capture distributions, not just averages
- Separate latency by request type when profiles differ
- Track percentiles to understand typical and worst experiences
Traffic Measurement
Count meaningful units of work:
```promql
# Requests per second
rate(http_requests_total[5m])

# By endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
```
Compare current traffic to historical baselines to identify anomalies.
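One simple form of that comparison is a relative-deviation check. A sketch, assuming baseline and current values in requests per second (the 50% tolerance is an example, not a recommendation):

```python
# Sketch: flag a traffic anomaly when current req/s deviates too far
# from a historical baseline (e.g. the same hour last week).
def traffic_anomaly(current_rps, baseline_rps, tolerance=0.5):
    """True if current traffic deviates more than `tolerance` from baseline."""
    if baseline_rps == 0:
        return current_rps > 0  # any traffic is anomalous against a zero baseline
    deviation = abs(current_rps - baseline_rps) / baseline_rps
    return deviation > tolerance

print(traffic_anomaly(1200, 1100))  # False - within 50% of baseline
print(traffic_anomaly(3000, 1100))  # True - spike well above baseline
```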
Error Monitoring
Capture all failure modes:
```promql
# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
```
Categorize errors by type for targeted investigation.
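At its simplest, categorization means bucketing responses by status class. A minimal sketch (the bucket names are illustrative):

```python
# Sketch: bucket failed requests by HTTP status class so each category
# can be investigated separately.
from collections import Counter

def categorize_errors(status_codes):
    buckets = Counter()
    for code in status_codes:
        if 500 <= code < 600:
            buckets["server_error"] += 1
        elif 400 <= code < 500:
            buckets["client_error"] += 1
    return dict(buckets)

print(categorize_errors([200, 500, 503, 404, 200, 502]))
# {'server_error': 3, 'client_error': 1}
```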
Saturation Monitoring
Focus on constrained resources:
```promql
# CPU saturation
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory saturation
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes * 100

# Connection pool saturation
pg_stat_activity_count / pg_settings_max_connections * 100
```
Alert well before reaching limits. Monitor rate of change for unpredictable workloads.
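The rate-of-change idea can be sketched as the average delta across recent readings (the 3-points-per-minute alert threshold here is a made-up example):

```python
# Sketch: alert on saturation growth rate, which reacts earlier than an
# absolute threshold when workloads are unpredictable.
def growth_per_minute(samples):
    """Average change per reading from a series of (minute, percent) pairs."""
    deltas = [b[1] - a[1] for a, b in zip(samples, samples[1:])]
    return sum(deltas) / len(deltas)

# Memory saturation readings taken one minute apart
readings = [(0, 60.0), (1, 64.0), (2, 69.0), (3, 75.0)]
rate = growth_per_minute(readings)
print(rate)        # 5.0 percentage points per minute
print(rate > 3.0)  # True - headroom will vanish quickly, alert early
```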
Building Dashboards
Display all four signals together:
```yaml
# Dashboard layout
dashboard:
  title: "Service Golden Signals"
  rows:
    - panels:
        - title: "Latency (p50, p95, p99)"
          type: graph
        - title: "Traffic (req/s)"
          type: graph
    - panels:
        - title: "Error Rate (%)"
          type: graph
        - title: "Saturation (CPU, Memory)"
          type: graph
```
When investigating issues, correlation between signals often reveals root cause.
Golden Signals Best Practices
Successful implementations follow proven practices.
Instrument at the Right Layer
Measure latency and errors as close to the user as possible:
- Load balancers
- API gateways
- Edge proxies
What happens inside your service matters less than what users experience.
Set Meaningful Thresholds Based on SLOs
Connect alerting to business requirements:
```yaml
slo:
  latency:
    target: "95% of requests < 200ms"
    alert_when: "p95 > 200ms for 5 minutes"
  errors:
    target: "99.9% success rate"
    alert_when: "error_rate > 0.1% for 5 minutes"
  availability:
    target: "99.95% uptime"
    alert_when: "service unavailable for 30 seconds"
```
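A "for 5 minutes" condition means the signal must breach for the whole evaluation window, not just one sample. A hedged sketch of that logic, assuming one p95 reading per minute:

```python
# Sketch: fire the latency SLO alert only when p95 stays above target for
# the full window, matching "p95 > 200ms for 5 minutes". One sample/minute.
def slo_breached(p95_samples_ms, target_ms=200, window=5):
    recent = p95_samples_ms[-window:]
    return len(recent) == window and all(s > target_ms for s in recent)

print(slo_breached([180, 210, 230, 250, 240, 260]))  # True - 5 straight breaches
print(slo_breached([180, 210, 190, 250, 240, 260]))  # False - dipped below target
```

Requiring a sustained breach avoids paging on a single noisy sample while still catching genuine degradation quickly.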
Use Appropriate Aggregation Windows
Balance sensitivity with noise reduction:
| Window | Use Case |
|---|---|
| 1 minute | Real-time dashboards |
| 5 minutes | Standard alerting |
| 15 minutes | Trend analysis |
| 1 hour | Capacity planning |
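The trade-off in the table is easy to see with a rolling mean over the same noisy series; the numbers below are made-up samples:

```python
# Sketch: the same raw series smoothed over different windows. Longer
# windows trade sensitivity for noise reduction.
def rolling_mean(series, window):
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

per_minute = [100, 300, 110, 105, 290, 95]  # noisy 1-minute samples
print(rolling_mean(per_minute, 1))  # raw values: reacts to every spike
print(rolling_mean(per_minute, 3))  # smoother trend, better for alerting
```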
Implement Anomaly Detection
Normal traffic varies by time of day, day of week, and season:
```yaml
anomaly_detection:
  baseline: "same hour last week"
  threshold: "2 standard deviations"
  applies_to:
    - traffic
    - latency
```
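The two-standard-deviation rule above can be sketched with the standard library; the baseline samples here are made-up req/s readings from the comparison window:

```python
# Sketch of the 2-sigma rule: compare the current value against the
# distribution of the baseline window (e.g. the same hour last week).
from statistics import mean, stdev

def is_anomalous(current, baseline_samples, n_sigma=2.0):
    mu = mean(baseline_samples)
    sigma = stdev(baseline_samples)
    return abs(current - mu) > n_sigma * sigma

baseline = [1000, 1050, 980, 1020, 990, 1010]  # req/s, same hour last week
print(is_anomalous(1030, baseline))  # False - within normal variation
print(is_anomalous(1400, baseline))  # True - well outside 2 sigma
```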
Document Metric Meaning
When an alert fires, responders shouldn't need to research:
```yaml
metric_documentation:
  name: http_request_duration_seconds
  description: "Time from request received to response sent"
  unit: seconds
  normal_range: "p99 < 500ms"
  runbook: "https://wiki.example.com/runbooks/latency"
```
Review and Tune Regularly
As services evolve, appropriate thresholds may change. Schedule periodic reviews to ensure monitoring remains aligned with current behavior.
Conclusion
The golden signals provide a proven framework for focusing monitoring on what matters most. By measuring latency, traffic, errors, and saturation consistently, you gain comprehensive visibility into service health.
Getting Started
- Select appropriate metrics for each signal
- Instrument at user-facing boundaries
- Set thresholds based on SLOs
- Build dashboards displaying all four signals
- Train your team to interpret patterns