In the vast ocean of available metrics, what should you actually monitor? This question has plagued operations teams since the dawn of distributed systems.
Monitor too little and you'll miss critical issues. Monitor too much and you'll drown in data without meaningful insight.
What are the Golden Signals?
The golden signals are four key metrics identified by Google's SRE practice as the most important indicators of service health. Together they answer fundamental questions about your service.
The Four Signals
| Signal | Question It Answers |
|---|---|
| Latency | How long does it take to serve requests? |
| Traffic | How much demand is on the system? |
| Errors | What percentage of requests fail? |
| Saturation | How full is the service? |
Latency
Latency measures how long it takes to service requests. This includes both successful and failed requests.
```yaml
# Latency metrics to track
latency_metrics:
  - name: request_duration_seconds
    type: histogram
    percentiles:
      - p50  # Typical experience
      - p95  # Common bad experience
      - p99  # Worst experience
    labels:
      - endpoint
      - status_code
```
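The percentile labels above map to simple order statistics. As an illustration only (Prometheus estimates quantiles from histogram buckets rather than raw samples), a nearest-rank percentile over a list of durations might look like this:

```python
# Illustration only: nearest-rank percentile over raw duration samples.
# Prometheus instead estimates quantiles from histogram buckets.

def percentile(samples, p):
    """Return the nearest-rank p-th percentile (0 < p <= 100) of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

durations = [0.03, 0.05, 0.04, 0.12, 0.06, 0.05, 0.90, 0.04, 0.05, 0.07]
print(percentile(durations, 50))  # 0.05 - typical experience
print(percentile(durations, 99))  # 0.9 - worst experience, dominated by the outlier
```

Note how a single slow request barely moves the median but dominates p99, which is why averages alone hide bad tail experiences.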
Traffic
Traffic quantifies demand on your system. The specific metric depends on service type:
- Web services - HTTP requests per second
- Databases - Transactions per minute
- Message queues - Messages processed per second
- APIs - API calls per hour
Traffic trends reveal usage patterns and correlate with resource consumption.
Errors
Errors track the rate of failed requests. This includes:
- Explicit failures - HTTP 500 responses, exceptions
- Implicit failures - Slow requests missing SLO targets
- Wrong content - Incorrect responses that passed validation
```yaml
# Error rate calculation
error_rate:
  formula: "failed_requests / total_requests * 100"
  good_threshold: "<1%"
  alert_threshold: ">5%"
```
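The formula above is straightforward to express in code. A minimal Python sketch, using made-up example counts:

```python
# Sketch of the error-rate formula above, with the stated thresholds.
def error_rate(failed_requests, total_requests):
    """Failed requests as a percentage of total requests."""
    if total_requests == 0:
        return 0.0  # no traffic means no measurable error rate
    return failed_requests / total_requests * 100

rate = error_rate(failed_requests=12, total_requests=4000)
print(round(rate, 2))  # 0.3 - within the <1% "good" threshold
print(rate > 5.0)      # False - below the >5% alert threshold
```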
Error rates directly indicate whether the service is meeting its purpose.
Saturation
Saturation measures how full your service is. It relates to constrained resources:
- CPU utilization
- Memory usage
- Network bandwidth
- Storage capacity
- Connection pools
- Thread pools
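Saturation for each of these resources reduces to used-over-capacity. A hedged sketch (resource names, readings, and the 80% threshold are illustrative, not recommendations):

```python
# Sketch: report saturation for several constrained resources and flag
# any that cross a warning threshold. Names and numbers are examples.
def check_saturation(resources, warn_at=80.0):
    """Return (name, percent) for each resource at or above warn_at percent."""
    alerts = []
    for name, (used, capacity) in resources.items():
        pct = used / capacity * 100
        if pct >= warn_at:
            alerts.append((name, pct))
    return alerts

resources = {
    "cpu":            (6.5, 10),   # cores busy / cores total
    "memory_gb":      (28, 32),
    "db_connections": (95, 100),
}
print(check_saturation(resources))  # flags memory_gb and db_connections
```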
Why Golden Signals Monitoring Matters
Golden signals monitoring provides advantages over unfocused metric collection.
Coverage Completeness
The four signals touch every dimension relevant to users:
- Latency and errors measure quality of completed requests
- Traffic provides demand context
- Saturation reveals capacity constraints
Monitoring all four ensures no critical dimension is overlooked.
Simplicity Enables Focus
Rather than dashboards with hundreds of metrics, golden signals provide a manageable set:
```
┌──────────────────────────────────────────┐
│              SERVICE HEALTH              │
├──────────┬──────────┬──────────┬────────┤
│ Latency  │ Traffic  │  Errors  │ Satur. │
│   45ms   │  1.2k/s  │   0.1%   │  65%   │
│   p99    │ current  │   rate   │  CPU   │
└──────────┴──────────┴──────────┴────────┘
```
This focus improves incident detection and reduces cognitive load.
Universal Applicability
The framework works across service types:
- Web applications
- Message queues
- Databases
- Microservices
- API gateways
This consistency simplifies training and enables pattern recognition.
User Experience Alignment
Golden signals directly reflect what customers care about. Unlike internal metrics that may or may not correlate with impact, these signals measure user-facing aspects.
Root Cause Correlation
Most issues manifest in one or more signals before becoming outages. Learning to interpret patterns accelerates diagnosis:
| Pattern | Likely Cause |
|---|---|
| High saturation then high latency | Resource exhaustion |
| High errors with normal latency | Application bug |
| Traffic spike then all signals degrade | Capacity exceeded |
| Latency increase without saturation | Dependency issue |
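The table above can be read as a rough decision procedure. A hedged Python sketch encoding those four patterns (real triage needs far more context than four booleans; this only illustrates how signals combine):

```python
# Sketch: the diagnosis patterns from the table, as a simple lookup
# over boolean signal states. Illustrative, not a real triage engine.
def likely_cause(high_latency, high_errors, high_saturation, traffic_spike):
    if traffic_spike and high_latency and high_saturation:
        return "capacity exceeded"
    if high_saturation and high_latency:
        return "resource exhaustion"
    if high_errors and not high_latency:
        return "application bug"
    if high_latency and not high_saturation:
        return "dependency issue"
    return "unclear: inspect dashboards"

print(likely_cause(high_latency=True, high_errors=False,
                   high_saturation=True, traffic_spike=False))
# resource exhaustion
```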
How to Implement Golden Signals Monitoring
Implementing golden signals requires thoughtful metric selection and instrumentation.
Latency Monitoring
Measure request duration at the service boundary:
```promql
# Prometheus queries for latency

# Median latency
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 99th percentile
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```
Key practices:
- Capture distributions, not just averages
- Separate latency by request type when profiles differ
- Track percentiles to understand typical and worst experiences
Traffic Measurement
Count meaningful units of work:
```promql
# Requests per second
rate(http_requests_total[5m])

# By endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
```
Compare current traffic to historical baselines to identify anomalies.
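One simple form of that comparison is a relative-deviation check. A sketch, assuming baseline and current values in requests per second (the 50% tolerance is an example, not a recommendation):

```python
# Sketch: flag a traffic anomaly when current req/s deviates too far
# from a historical baseline (e.g. the same hour last week).
def traffic_anomaly(current_rps, baseline_rps, tolerance=0.5):
    """True if current traffic deviates more than `tolerance` from baseline."""
    if baseline_rps == 0:
        return current_rps > 0  # any traffic is anomalous against a zero baseline
    deviation = abs(current_rps - baseline_rps) / baseline_rps
    return deviation > tolerance

print(traffic_anomaly(1200, 1100))  # False - within 50% of baseline
print(traffic_anomaly(3000, 1100))  # True - spike well above baseline
```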
Error Monitoring
Capture all failure modes:
```promql
# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
```
Categorize errors by type for targeted investigation.
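At its simplest, categorization means bucketing responses by status class. A minimal sketch (the bucket names are illustrative):

```python
# Sketch: bucket failed requests by HTTP status class so each category
# can be investigated separately.
from collections import Counter

def categorize_errors(status_codes):
    buckets = Counter()
    for code in status_codes:
        if 500 <= code < 600:
            buckets["server_error"] += 1
        elif 400 <= code < 500:
            buckets["client_error"] += 1
    return dict(buckets)

print(categorize_errors([200, 500, 503, 404, 200, 502]))
# {'server_error': 3, 'client_error': 1}
```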
Saturation Monitoring
Focus on constrained resources:
```promql
# CPU saturation
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory saturation
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes * 100

# Connection pool saturation
pg_stat_activity_count / pg_settings_max_connections * 100
```
Alert well before reaching limits. Monitor rate of change for unpredictable workloads.
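The rate-of-change idea can be sketched as the average delta across recent readings (the 3-points-per-minute alert threshold here is a made-up example):

```python
# Sketch: alert on saturation growth rate, which reacts earlier than an
# absolute threshold when workloads are unpredictable.
def growth_per_minute(samples):
    """Average change per reading from a series of (minute, percent) pairs."""
    deltas = [b[1] - a[1] for a, b in zip(samples, samples[1:])]
    return sum(deltas) / len(deltas)

# Memory saturation readings taken one minute apart
readings = [(0, 60.0), (1, 64.0), (2, 69.0), (3, 75.0)]
rate = growth_per_minute(readings)
print(rate)        # 5.0 percentage points per minute
print(rate > 3.0)  # True - headroom will vanish quickly, alert early
```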
Building Dashboards
Display all four signals together:
```yaml
# Dashboard layout
dashboard:
  title: "Service Golden Signals"
  rows:
    - panels:
        - title: "Latency (p50, p95, p99)"
          type: graph
        - title: "Traffic (req/s)"
          type: graph
    - panels:
        - title: "Error Rate (%)"
          type: graph
        - title: "Saturation (CPU, Memory)"
          type: graph
```
When investigating issues, correlation between signals often reveals root cause.
Golden Signals Best Practices
Successful implementations follow proven practices.
Instrument at the Right Layer
Measure latency and errors as close to the user as possible:
- Load balancers
- API gateways
- Edge proxies
What happens inside your service matters less than what users experience.
Set Meaningful Thresholds Based on SLOs
Connect alerting to business requirements:
```yaml
slo:
  latency:
    target: "95% of requests < 200ms"
    alert_when: "p95 > 200ms for 5 minutes"
  errors:
    target: "99.9% success rate"
    alert_when: "error_rate > 0.1% for 5 minutes"
  availability:
    target: "99.95% uptime"
    alert_when: "service unavailable for 30 seconds"
```
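A "for 5 minutes" condition means the signal must breach for the whole evaluation window, not just one sample. A hedged sketch of that logic, assuming one p95 reading per minute:

```python
# Sketch: fire the latency SLO alert only when p95 stays above target for
# the full window, matching "p95 > 200ms for 5 minutes". One sample/minute.
def slo_breached(p95_samples_ms, target_ms=200, window=5):
    recent = p95_samples_ms[-window:]
    return len(recent) == window and all(s > target_ms for s in recent)

print(slo_breached([180, 210, 230, 250, 240, 260]))  # True - 5 straight breaches
print(slo_breached([180, 210, 190, 250, 240, 260]))  # False - dipped below target
```

Requiring a sustained breach avoids paging on a single noisy sample while still catching genuine degradation quickly.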
Use Appropriate Aggregation Windows
Balance sensitivity with noise reduction:
| Window | Use Case |
|---|---|
| 1 minute | Real-time dashboards |
| 5 minutes | Standard alerting |
| 15 minutes | Trend analysis |
| 1 hour | Capacity planning |
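The trade-off in the table is easy to see with a rolling mean over the same noisy series; the numbers below are made-up samples:

```python
# Sketch: the same raw series smoothed over different windows. Longer
# windows trade sensitivity for noise reduction.
def rolling_mean(series, window):
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

per_minute = [100, 300, 110, 105, 290, 95]  # noisy 1-minute samples
print(rolling_mean(per_minute, 1))  # raw values: reacts to every spike
print(rolling_mean(per_minute, 3))  # smoother trend, better for alerting
```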
Implement Anomaly Detection
Normal traffic varies by time of day, day of week, and season:
```yaml
anomaly_detection:
  baseline: "same hour last week"
  threshold: "2 standard deviations"
  applies_to:
    - traffic
    - latency
```
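The two-standard-deviation rule above can be sketched with the standard library; the baseline samples here are made-up req/s readings from the comparison window:

```python
# Sketch of the 2-sigma rule: compare the current value against the
# distribution of the baseline window (e.g. the same hour last week).
from statistics import mean, stdev

def is_anomalous(current, baseline_samples, n_sigma=2.0):
    mu = mean(baseline_samples)
    sigma = stdev(baseline_samples)
    return abs(current - mu) > n_sigma * sigma

baseline = [1000, 1050, 980, 1020, 990, 1010]  # req/s, same hour last week
print(is_anomalous(1030, baseline))  # False - within normal variation
print(is_anomalous(1400, baseline))  # True - well outside 2 sigma
```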
Document Metric Meaning
When an alert fires, responders shouldn't need to research:
```yaml
metric_documentation:
  name: http_request_duration_seconds
  description: "Time from request received to response sent"
  unit: seconds
  normal_range: "p99 < 500ms"
  runbook: "https://wiki.example.com/runbooks/latency"
```
Review and Tune Regularly
As services evolve, appropriate thresholds may change. Schedule periodic reviews to ensure monitoring remains aligned with current behavior.
Conclusion
The golden signals provide a proven framework for focusing monitoring on what matters most. By measuring latency, traffic, errors, and saturation consistently, you gain comprehensive visibility into service health.
Getting Started
- Select appropriate metrics for each signal
- Instrument at user-facing boundaries
- Set thresholds based on SLOs
- Build dashboards displaying all four signals
- Train your team to interpret patterns