DevOps · January 9, 2026 · 11 min read

SRE Golden Signals Monitoring: The Essential Guide

Master the four golden signals of SRE monitoring. Learn how to measure latency, traffic, errors, and saturation for reliable service operations.

WizStatus Team
Author

In the vast ocean of available metrics, what should you actually monitor? This question has plagued operations teams since the dawn of distributed systems.

Monitor too little and you'll miss critical issues. Monitor too much and you'll drown in data without meaningful insight.

The Site Reliability Engineering discipline provides an elegant answer: the four golden signals. Master these signals and you'll detect most issues before they significantly impact customers.

What are the Golden Signals?

The golden signals are four key metrics identified by Google's SRE practice as the most important indicators of service health. Together they answer fundamental questions about your service.

The Four Signals

Signal       Question It Answers
Latency      How long does it take to serve requests?
Traffic      How much demand is on the system?
Errors       What percentage of requests fail?
Saturation   How full is the service?

Latency

Latency measures how long it takes to service requests. This includes both successful and failed requests.

# Latency metrics to track
latency_metrics:
  - name: request_duration_seconds
    type: histogram
    percentiles:
      - p50  # Typical experience
      - p95  # Common bad experience
      - p99  # Worst experience
    labels:
      - endpoint
      - status_code

High latency degrades user experience even when services are technically functioning. Track error latency separately, as it often differs significantly from success latency.
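To see why distributions matter more than averages, here is a minimal plain-Python sketch (no monitoring library assumed) that computes nearest-rank percentiles from recorded request durations:

```python
def percentile(durations, p):
    """Return the p-th percentile (0-100) of durations using nearest-rank
    on the sorted samples."""
    ranked = sorted(durations)
    idx = min(len(ranked) - 1, int(round(p / 100 * (len(ranked) - 1))))
    return ranked[idx]

# 95 fast requests plus 5 slow outliers: the mean (~0.15 s) hides the 2 s tail.
samples = [0.05] * 95 + [2.0] * 5
p50 = percentile(samples, 50)  # typical experience: 0.05
p99 = percentile(samples, 99)  # worst experience: 2.0
```

The median looks healthy while p99 exposes the slow tail, which is exactly the information an average would erase.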

Traffic

Traffic quantifies demand on your system. The specific metric depends on service type:

  • Web services - HTTP requests per second
  • Databases - Transactions per minute
  • Message queues - Messages processed per second
  • APIs - API calls per hour

Traffic trends reveal usage patterns and correlate with resource consumption.
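A rate over a monotonically increasing request counter reduces to a delta over the sampling window; a simplified sketch (ignoring counter resets, which real systems must handle):

```python
def request_rate(count_start, count_end, interval_seconds):
    """Approximate requests/second from two samples of a cumulative
    request counter taken interval_seconds apart."""
    return (count_end - count_start) / interval_seconds

# Counter sampled 5 minutes apart: 360,000 new requests over 300 s.
rate = request_rate(1_000_000, 1_360_000, 300)  # 1200.0 req/s
```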

Errors

Errors track the rate of failed requests. This includes:

  • Explicit failures - HTTP 500 responses, exceptions
  • Implicit failures - Slow requests missing SLO targets
  • Wrong content - Incorrect responses that passed validation

# Error rate calculation
error_rate:
  formula: "failed_requests / total_requests * 100"
  good_threshold: "<1%"
  alert_threshold: ">5%"

Error rates directly indicate whether the service is meeting its purpose.
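The formula above can be expressed as a small helper; this is a sketch, and the guard for zero traffic is an added assumption to avoid division by zero on idle services:

```python
def error_rate_percent(failed, total):
    """Error rate as a percentage of total requests; 0 when there is
    no traffic at all."""
    return 0.0 if total == 0 else failed / total * 100

rate = error_rate_percent(12, 4800)  # 0.25 -> below the 1% "good" threshold
```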

Saturation

Saturation measures how full your service is. It relates to constrained resources:

  • CPU utilization
  • Memory usage
  • Network bandwidth
  • Storage capacity
  • Connection pools
  • Thread pools

Unlike the other signals, which measure current behavior, saturation indicates future problems. Resources approaching capacity will soon affect performance or availability.

Why Golden Signals Monitoring Matters

Golden signals monitoring provides advantages over unfocused metric collection.

Coverage Completeness

The four signals touch every dimension relevant to users:

  • Latency and errors measure quality of completed requests
  • Traffic provides demand context
  • Saturation reveals capacity constraints

Monitoring all four ensures no critical dimension is overlooked.

Simplicity Enables Focus

Rather than dashboards with hundreds of metrics, golden signals provide a manageable set:

┌─────────────────────────────────────────┐
│           SERVICE HEALTH                │
├──────────┬──────────┬──────────┬───────┤
│ Latency  │ Traffic  │ Errors   │ Satur.│
│  45ms    │  1.2k/s  │  0.1%    │  65%  │
│  p99     │  current │  rate    │  CPU  │
└──────────┴──────────┴──────────┴───────┘

This focus improves incident detection and reduces cognitive load.

Universal Applicability

The framework works across service types:

  • Web applications
  • Message queues
  • Databases
  • Microservices
  • API gateways

This consistency simplifies training and enables pattern recognition.

User Experience Alignment

Golden signals directly reflect what customers care about. Unlike internal metrics that may or may not correlate with impact, these signals measure user-facing aspects.

Root Cause Correlation

Most issues manifest in one or more signals before becoming outages. Learning to interpret patterns accelerates diagnosis:

Pattern                                  Likely Cause
High saturation then high latency        Resource exhaustion
High errors with normal latency          Application bug
Traffic spike then all signals degrade   Capacity exceeded
Latency increase without saturation      Dependency issue
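As a toy illustration, the patterns in the table above could be encoded as a first-pass triage rule; this is an assumption-laden sketch for intuition, not a diagnostic tool:

```python
def likely_cause(latency_high, errors_high, saturation_high, traffic_spike):
    """Map golden-signal states to a first hypothesis, mirroring the
    pattern table; order matters, most specific pattern first."""
    if traffic_spike and latency_high and errors_high and saturation_high:
        return "capacity exceeded"
    if saturation_high and latency_high:
        return "resource exhaustion"
    if errors_high and not latency_high:
        return "application bug"
    if latency_high and not saturation_high:
        return "dependency issue"
    return "no clear pattern"
```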

How to Implement Golden Signals Monitoring

Implementing golden signals requires thoughtful metric selection and instrumentation.

Latency Monitoring

Measure request duration at the service boundary:

# Prometheus queries for latency
# Median latency
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 99th percentile
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

Key practices:

  • Capture distributions, not just averages
  • Separate latency by request type when profiles differ
  • Track percentiles to understand typical and worst experiences
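Under the hood, histogram-based quantile estimation works from cumulative bucket counts, interpolating linearly inside the bucket that contains the target rank. A simplified Python sketch of the idea (the bucket layout is hypothetical):

```python
def bucket_quantile(q, buckets):
    """Estimate quantile q (0-1) from cumulative histogram buckets,
    given as (upper_bound, cumulative_count) pairs sorted by bound,
    with linear interpolation inside the matching bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# 0-0.1 s holds 800 requests, 0.1-0.5 s holds 150 more, 0.5-1 s holds 50.
buckets = [(0.1, 800), (0.5, 950), (1.0, 1000)]
p95 = bucket_quantile(0.95, buckets)  # falls at the top of the 0.1-0.5 bucket
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket containing the target percentile.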

Traffic Measurement

Count meaningful units of work:

# Requests per second
rate(http_requests_total[5m])

# By endpoint
sum by (endpoint) (rate(http_requests_total[5m]))

Compare current traffic to historical baselines to identify anomalies.

Error Monitoring

Capture all failure modes:

# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

Calculate error rates as percentages of total traffic. Count-based thresholds don't scale with traffic changes.

Categorize errors by type for targeted investigation.

Saturation Monitoring

Focus on constrained resources:

# CPU saturation
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory saturation
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes * 100

# Connection pool saturation
pg_stat_activity_count / pg_settings_max_connections * 100

Alert well before reaching limits. Monitor rate of change for unpredictable workloads.
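Monitoring rate of change can be made concrete with a time-to-exhaustion estimate; this is a rough linear extrapolation (real growth is rarely linear, so treat it as a sketch):

```python
def seconds_until_full(current_percent, growth_percent_per_min):
    """Linear extrapolation: seconds until a resource reaches 100%."""
    if growth_percent_per_min <= 0:
        return float("inf")  # flat or shrinking usage never fills
    return (100 - current_percent) / growth_percent_per_min * 60

# Disk at 70% growing 0.5 percentage points per minute: one hour left.
eta = seconds_until_full(70, 0.5)  # 3600.0 seconds
```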

Building Dashboards

Display all four signals together:

# Dashboard layout
dashboard:
  title: "Service Golden Signals"
  rows:
    - panels:
        - title: "Latency (p50, p95, p99)"
          type: graph
        - title: "Traffic (req/s)"
          type: graph
    - panels:
        - title: "Error Rate (%)"
          type: graph
        - title: "Saturation (CPU, Memory)"
          type: graph

When investigating issues, correlation between signals often reveals root cause.

Golden Signals Best Practices

Successful implementations follow proven practices.

Instrument at the Right Layer

Measure latency and errors as close to the user as possible:

  • Load balancers
  • API gateways
  • Edge proxies

What happens inside your service matters less than what users experience.

Set Meaningful Thresholds Based on SLOs

Connect alerting to business requirements:

slo:
  latency:
    target: "95% of requests < 200ms"
    alert_when: "p95 > 200ms for 5 minutes"

  errors:
    target: "99.9% success rate"
    alert_when: "error_rate > 0.1% for 5 minutes"

  availability:
    target: "99.95% uptime"
    alert_when: "service unavailable for 30 seconds"
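An availability target like the 99.95% above implies a concrete error budget; a quick back-of-envelope helper to see how much downtime that allows (illustrative arithmetic only):

```python
def error_budget_minutes(availability_target, days=30):
    """Allowed downtime per window, in minutes, for an availability
    SLO expressed as a fraction (e.g. 0.9995 for 99.95%)."""
    return (1 - availability_target) * days * 24 * 60

budget = error_budget_minutes(0.9995)  # ~21.6 minutes per 30 days
```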

Use Appropriate Aggregation Windows

Balance sensitivity with noise reduction:

Window       Use Case
1 minute     Real-time dashboards
5 minutes    Standard alerting
15 minutes   Trend analysis
1 hour       Capacity planning

Implement Anomaly Detection

Normal traffic varies by time of day, day of week, and season:

anomaly_detection:
  baseline: "same hour last week"
  threshold: "2 standard deviations"
  applies_to:
    - traffic
    - latency
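The two-standard-deviation rule above can be sketched in a few lines of Python; the baseline samples here are hypothetical, where a production system would pull the same hour from stored history:

```python
from statistics import mean, stdev

def is_anomalous(current, baseline_samples, n_stdevs=2.0):
    """Flag a value more than n standard deviations from the baseline,
    e.g. samples taken at the same hour in previous weeks."""
    mu = mean(baseline_samples)
    sigma = stdev(baseline_samples)
    return abs(current - mu) > n_stdevs * sigma

# Baseline traffic hovers around 1000 req/s with modest variation.
baseline = [980, 1010, 995, 1005, 1000]
spike = is_anomalous(1500, baseline)  # well outside 2 sigma
```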

Document Metric Meaning

When an alert fires, responders shouldn't need to research:

metric_documentation:
  name: http_request_duration_seconds
  description: "Time from request received to response sent"
  unit: seconds
  normal_range: "p99 < 500ms"
  runbook: "https://wiki.example.com/runbooks/latency"

Review and Tune Regularly

As services evolve, appropriate thresholds may change. Schedule periodic reviews to ensure monitoring remains aligned with current behavior.

Conclusion

The golden signals provide a proven framework for focusing monitoring on what matters most. By measuring latency, traffic, errors, and saturation consistently, you gain comprehensive visibility into service health.

Getting Started

  1. Select appropriate metrics for each signal
  2. Instrument at user-facing boundaries
  3. Set thresholds based on SLOs
  4. Build dashboards displaying all four signals
  5. Train your team to interpret patterns

Golden signals are a starting point, not an endpoint. They provide the foundation for service health visibility. Use them for detection and initial triage, then drill down into detailed metrics as investigation requires.
