Incidents are inevitable in complex systems. What separates high-performing organizations from the rest is not the absence of incidents but what they learn from each one.
Effective postmortems transform painful experiences into organizational knowledge that prevents recurrence and improves resilience.
What is a Postmortem?
A postmortem, also called an incident review or retrospective, is a structured analysis conducted after an incident. It examines what happened, why it happened, and how to prevent similar incidents.
Postmortem Components
A complete postmortem includes:
- Timeline - Sequence of events from trigger through resolution
- Root cause analysis - Technical and procedural factors
- Impact assessment - Effect on customers, revenue, operations
- Contributing factors - Systemic issues increasing likelihood or severity
- Action items - Specific improvements to prevent recurrence
Blameless Analysis
Blameless analysis examines the systemic conditions that allowed an incident rather than assigning individual fault. The goal is creating an environment where people feel safe sharing what actually happened, including their own mistakes.
Scaling Formality
Postmortems vary based on incident severity:
| Severity | Postmortem Format |
|---|---|
| Minor | Brief written summary, team discussion |
| Moderate | Standard template, team review meeting |
| Major | Extensive analysis, cross-team involvement |
| Critical | Formal review, leadership participation |
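If your incident tooling selects the postmortem format automatically, the table above can be encoded as a simple lookup. A minimal Python sketch; the severity labels and function name are illustrative assumptions, not a standard:

```python
# Illustrative mapping of incident severity to postmortem format,
# mirroring the table above. Labels are assumptions, not a standard.
POSTMORTEM_FORMAT = {
    "minor": "Brief written summary, team discussion",
    "moderate": "Standard template, team review meeting",
    "major": "Extensive analysis, cross-team involvement",
    "critical": "Formal review, leadership participation",
}

def postmortem_format(severity: str) -> str:
    """Return the expected postmortem format for a given severity."""
    return POSTMORTEM_FORMAT[severity.lower()]
```

Raising a `KeyError` on an unknown severity is deliberate here: an unmapped severity should fail loudly rather than silently default to the lightest format.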
Why Postmortems Matter
Postmortems serve multiple critical functions that justify the investment of time and attention.
Preventing Recurrence
When teams understand why something failed and implement improvements, they reduce the likelihood of the same issue recurring.
Distributing Knowledge
Incident response often concentrates expertise in whoever was involved. Postmortems spread that learning across the team:
- Team members who weren't present learn from the experience
- Written documents become searchable organizational memory
- Future incidents can reference past solutions
Building System Understanding
Complex systems often fail in unexpected ways. Postmortem investigation uncovers hidden dynamics:
- Unknown interactions between components
- Undocumented dependencies
- Gaps between expected and actual behavior
Cultural Benefits
Blameless postmortems normalize honest discussion of failure and reward learning over finger-pointing. Over time, this shifts organizations from reactive to proactive.
Driving Accountability
Without structured follow-up, good intentions after incidents rarely translate into actual change. Postmortems create documented commitments that can be tracked.
How to Conduct Effective Postmortems
Effective postmortems follow a structured process that balances thoroughness with practicality.
Step 1: Schedule Promptly
Hold the review within 1-3 days of resolution:
- Memories are still fresh
- Participants have had time to decompress
- Momentum for improvement remains high
Designate a facilitator who wasn't directly involved to maintain objectivity.
Step 2: Gather Information
Before the meeting, collect factual data:
```yaml
pre_meeting_preparation:
  collect:
    - logs_and_metrics
    - alerting_timeline
    - chat_transcripts
    - deployment_history
  build:
    - preliminary_timeline
  identify:
    - all_incident_participants
    - relevant_stakeholders
```
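The collected events can then be merged into the preliminary timeline. A minimal sketch, assuming each source yields dicts with `time`, `event`, and `source` keys (an illustrative schema, not a fixed one):

```python
from datetime import datetime

def build_timeline(*sources):
    """Merge timestamped event lists from several sources, sorted chronologically."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e["time"])

# Illustrative events from two of the sources listed above
alerts = [{"time": datetime(2026, 2, 12, 9, 48),
           "event": "PagerDuty alert fired", "source": "alerting"}]
deploys = [{"time": datetime(2026, 2, 12, 9, 32),
            "event": "v4.12.0 deployed", "source": "deploy_log"}]

timeline = build_timeline(alerts, deploys)  # the deploy sorts first
```

Sorting everything by timestamp before the meeting means the group spends its time interpreting events, not reconstructing their order.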
Step 3: Structure the Meeting
Focus discussion on key areas:
- Walk through the timeline - Build shared understanding
- Discuss detection - How did we learn about the issue?
- Explore response - What worked? What didn't?
- Identify contributing factors - No blame, focus on systems
- Highlight what went well - Reinforce good practices
- Brainstorm improvements - Generate action items
Step 4: Facilitate Blameless Discussion
Replace blame-focused questions with system-focused alternatives:
| Instead of... | Ask... |
|---|---|
| "Why didn't you notice the alert?" | "What about our alerting made this easy to miss?" |
| "Who approved this deployment?" | "What gaps in our deployment process allowed this?" |
| "Why was this code merged?" | "How could our review process catch this?" |
Step 5: Document Findings
Use a consistent template:
```markdown
# Postmortem: [Incident Title]

## Summary
[2-3 sentence overview]

## Impact
- Duration: [X hours]
- Users affected: [X%]
- Revenue impact: [$X]

## Timeline
| Time | Event |
|------|-------|
| 10:00 | Deployment started |
| 10:15 | Errors began appearing |
| 10:20 | Alert fired |
| 10:25 | Incident declared |
| 11:00 | Root cause identified |
| 11:30 | Rollback completed |

## Root Cause
[Technical explanation]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## What Went Well
- [Positive 1]
- [Positive 2]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add monitoring for X | @alice | 2026-01-15 | Open |
| Update runbook for Y | @bob | 2026-01-20 | Open |

## Lessons Learned
- [Lesson 1]
- [Lesson 2]
```
Complete Postmortem Template
Here is a comprehensive template you can copy directly into your incident management system:
```markdown
# Postmortem: [Incident Title]
# Date: [YYYY-MM-DD]

## Metadata
- **Severity**: [SEV1 / SEV2 / SEV3 / SEV4]
- **Duration**: [Start time] — [End time] ([X]h [Y]min)
- **Author**: [Name]
- **Postmortem date**: [YYYY-MM-DD]
- **Status**: [Draft / In Review / Final]
- **Participants**: [List of names]

## Executive Summary
[2-3 sentences: what happened, how long, who was affected, how it was resolved.]

## Impact
- **Users affected**: [Number or percentage]
- **Failed requests / transactions**: [Count]
- **Estimated revenue loss**: [$X]
- **SLA/SLO breached**: [Yes/No — specify which]
- **Support tickets opened**: [Count]
- **Public communication sent**: [Yes/No]

## Timeline (UTC)
| Time | Event | Source |
|------|-------|--------|
| HH:MM | [First sign of trouble] | [Monitoring / Customer report / Deploy log] |
| HH:MM | [Alert fired] | [PagerDuty / Grafana / Datadog] |
| HH:MM | [Incident declared — IC assigned] | [Slack / Incident tool] |
| HH:MM | [Investigation started] | [Responder] |
| HH:MM | [Root cause identified] | [Responder] |
| HH:MM | [Mitigation applied] | [Responder] |
| HH:MM | [Service restored] | [Monitoring] |
| HH:MM | [Incident closed] | [IC] |

## Detection
- **How was the incident detected?** [Alert / Customer report / Internal observation]
- **Time to detect (TTD)**: [X minutes]
- **Were existing alerts effective?** [Yes/No — explain]

## Root Cause
[Detailed technical explanation of what caused the incident. Be specific about the
failure mechanism, not just the symptom.]

## Contributing Factors
1. [Factor 1 — e.g., missing test coverage for edge case]
2. [Factor 2 — e.g., no circuit breaker on downstream dependency]
3. [Factor 3 — e.g., runbook was outdated]

## What Went Well
- [Positive 1 — e.g., alert fired within 2 minutes]
- [Positive 2 — e.g., team coordinated effectively in incident channel]
- [Positive 3 — e.g., rollback procedure worked as documented]

## What Went Poorly
- [Negative 1 — e.g., took 30 min to identify root cause]
- [Negative 2 — e.g., no runbook for this failure mode]

## Action Items
| ID | Action | Owner | Due Date | Priority | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [Specific, measurable action] | @owner | YYYY-MM-DD | P1 | Open |
| 2 | [Specific, measurable action] | @owner | YYYY-MM-DD | P2 | Open |
| 3 | [Specific, measurable action] | @owner | YYYY-MM-DD | P2 | Open |

## Lessons Learned
- [Key insight 1]
- [Key insight 2]

## Appendix
- [Link to Grafana dashboard snapshot]
- [Link to relevant Slack thread]
- [Link to deploy log]
```
Real-World Example: Database Connection Pool Exhaustion
Below is a filled-in postmortem based on a realistic (but fictional) production incident. Use it as a reference when writing your own.
```markdown
# Postmortem: Primary Database Connection Pool Exhaustion
# Date: 2026-02-12

## Metadata
- **Severity**: SEV1
- **Duration**: 09:47 — 11:32 UTC (1h 45min)
- **Author**: Sarah Chen (SRE)
- **Postmortem date**: 2026-02-14
- **Status**: Final
- **Participants**: Sarah Chen, Marco Rossi, Priya Patel, James O'Brien

## Executive Summary
A slow query introduced in deploy v4.12.0 caused the primary PostgreSQL
connection pool to saturate within 20 minutes of rollout. All API endpoints
returned 503 errors for 1 hour and 45 minutes, affecting 100% of users.
Service was restored by rolling back the deployment and manually
terminating the stuck queries.

## Impact
- **Users affected**: 42,000 (100% of active users)
- **Failed requests**: 1.2 million API calls returned 503
- **Estimated revenue loss**: $38,000
- **SLA/SLO breached**: Yes — monthly uptime SLO (99.9%) consumed
- **Support tickets opened**: 312
- **Public communication sent**: Yes — status page updated at 10:05 UTC

## Timeline (UTC)
| Time | Event | Source |
|-------|-------|--------|
| 09:32 | Deploy v4.12.0 rolled out to production | ArgoCD |
| 09:47 | Connection pool usage crosses 80% threshold | Grafana alert |
| 09:48 | PagerDuty alert fires — Marco Rossi acknowledges | PagerDuty |
| 09:52 | First 503 errors reported by external monitoring | WizStatus |
| 09:55 | Incident declared SEV1 — Sarah Chen assigned as IC | Slack |
| 10:05 | Status page updated — investigating | StatusPage |
| 10:12 | Correlation with v4.12.0 deploy identified | Investigation |
| 10:18 | Slow query identified: new endpoint scans 4M-row table | pg_stat |
| 10:25 | Rollback to v4.11.3 initiated | ArgoCD |
| 10:41 | Rollback complete — pool still draining (long-running queries) | Grafana |
| 10:55 | DBA terminates stuck queries manually | psql |
| 11:02 | Connection pool returns to normal levels | Grafana |
| 11:15 | Full API health confirmed across all regions | WizStatus |
| 11:32 | Incident closed | Slack |

## Detection
- **How was the incident detected?** Automated alert on connection pool > 80%
- **Time to detect (TTD)**: 15 minutes after deploy
- **Were existing alerts effective?** Partially — pool alert fired but there
  was no alert on query duration p99, which would have triggered 5 min earlier

## Root Cause
The v4.12.0 release introduced a new `/api/reports/export` endpoint that
executed a full table scan on the `transactions` table (4.2M rows) without
an index on the `(account_id, created_at)` filter columns. Each request
held a connection for 12-45 seconds instead of the typical 50ms. With
normal traffic of ~200 RPM to this endpoint, the 50-connection pool was
exhausted within 20 minutes.

## Contributing Factors
1. **No query review in CI** — The ORM-generated query was not analyzed for
   missing indexes before merge
2. **No per-query timeout** — Database connections had no statement timeout,
   allowing 45-second queries to hold connections indefinitely
3. **Staging data volume mismatch** — Staging had only 12,000 rows in
   `transactions`, making the scan appear fast in pre-production testing
4. **Rollback delayed by 15 min** — Team initially investigated before
   deciding to roll back

## What Went Well
- Connection pool alert fired within 15 min of deploy
- Incident commander was assigned quickly (3 min after first alert)
- Status page was updated within 10 min of incident declaration
- Team identified the root cause query within 23 min

## What Went Poorly
- 15 minutes spent investigating before attempting rollback
- No query performance gate in CI to catch full table scans
- Stuck long-running queries required manual DBA intervention after rollback
- Staging environment did not surface the issue due to small dataset

## Action Items
| ID | Action | Owner | Due Date | Priority | Status |
|----|--------|-------|----------|----------|--------|
| 1 | Add `statement_timeout = 5s` to application DB config | @marco | 2026-02-19 | P1 | Done |
| 2 | Add CI check for queries without index on tables > 100K rows | @priya | 2026-02-26 | P1 | In Progress |
| 3 | Seed staging `transactions` with production-scale data (4M rows) | @james | 2026-03-05 | P2 | Open |
| 4 | Add p99 query duration alert (> 2s triggers warning) | @sarah | 2026-02-21 | P1 | Done |
| 5 | Document "rollback first" policy for connection pool incidents | @sarah | 2026-02-21 | P2 | Done |
| 6 | Add connection pool queue depth metric and alert | @marco | 2026-03-05 | P2 | Open |

## Lessons Learned
- Rollback should be the default first action for connection-pool incidents;
  investigating while the pool is saturated extends the outage for all users
- ORM-generated queries can hide performance problems that only appear at
  production data volumes; CI-level query analysis is needed
- Statement timeouts are a critical safety net — without them, a single slow
  query type can cascade into a full outage

## Appendix
- [Grafana connection pool dashboard](https://grafana.internal/d/pg-pool)
- [Slack incident thread](https://slack.com/archives/C0123/p1707731700)
- [ArgoCD deploy log v4.12.0](https://argocd.internal/app/api/v4.12.0)
```
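The saturation described in the example's root cause can be sanity-checked with Little's law (average concurrent connections ≈ arrival rate × average hold time). A back-of-envelope sketch using the numbers from the postmortem:

```python
# Little's law check: concurrent connections ~= arrival_rate * hold_time
requests_per_minute = 200      # traffic to the new endpoint
avg_hold_seconds = 30          # midpoint of the observed 12-45s range
pool_size = 50

arrival_rate = requests_per_minute / 60        # ~3.3 requests/second
concurrent = arrival_rate * avg_hold_seconds   # ~100 connections demanded

print(f"Demand: {concurrent:.0f} connections vs pool of {pool_size}")
```

A demand of roughly 100 concurrent connections against a 50-connection pool shows exhaustion was inevitable at normal traffic, which is why "rollback first" was the right call.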
Step 6: Track Action Items
Ensure follow-through:
- Assign owners and due dates to every action
- Review open items regularly
- Escalate when progress stalls
- Measure implementation rates
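Even a small script over the action-item table makes follow-through measurable. A sketch, assuming an illustrative data shape based on the template's columns:

```python
from datetime import date

# Illustrative action items mirroring the template's columns
action_items = [
    {"action": "Add monitoring for X", "owner": "alice",
     "due": date(2026, 1, 15), "status": "Done"},
    {"action": "Update runbook for Y", "owner": "bob",
     "due": date(2026, 1, 20), "status": "Open"},
]

def completion_rate(items):
    """Fraction of action items marked Done."""
    return sum(1 for i in items if i["status"] == "Done") / len(items)

def overdue(items, today):
    """Items past due and not yet Done -- candidates for escalation."""
    return [i for i in items if i["status"] != "Done" and i["due"] < today]

rate = completion_rate(action_items)            # 0.5
late = overdue(action_items, date(2026, 2, 1))  # the runbook item
```

Reporting these two numbers in a regular review is often enough to keep items from quietly going stale.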
Postmortem Format Comparison
Different organizations have popularized different postmortem formats. Here is how they compare:
| Aspect | Google SRE | PagerDuty | Atlassian | Custom (Our Template) |
|---|---|---|---|---|
| Origin | Google's SRE book (2016) | PagerDuty incident response docs | Atlassian Incident Handbook | Adapted from industry best practices |
| Blameless emphasis | Core principle — explicit blamelessness section | Strong — built into process | Moderate — mentioned but not central | Strong — integrated throughout |
| Timeline format | Detailed chronological log | Structured with detection/response phases | Simplified timeline | Detailed with source attribution |
| Root cause method | "What happened" narrative | 5 Whys encouraged | Fishbone / Ishikawa diagrams | Flexible — 5 Whys or narrative |
| Impact metrics | User-minutes, error budget consumption | Business impact focus | SLA/SLO tracking | Revenue, users, SLA combined |
| Action tracking | Bug/ticket references | Integrated with PagerDuty workflows | Jira issue linking | Owner + due date + priority table |
| Review meeting | Required — formal review with stakeholders | Recommended — async-friendly | Required — team retrospective | Recommended within 48h |
| Severity threshold | User-facing impact or error budget burn | Any P1/P2 incident | Configurable per team | Any incident > 15 min or > 1% users |
| Strengths | Battle-tested at massive scale, excellent narrative structure | Well-integrated with alerting workflows, async-friendly | Familiar to Jira-using teams, template library | Comprehensive, copy-paste ready, adaptable |
| Weaknesses | Can be heavyweight for small incidents | Tied to PagerDuty ecosystem | Less prescriptive on blamelessness | Requires discipline to fill completely |
| Best for | Large SRE organizations | Teams already using PagerDuty | Atlassian-ecosystem shops | Teams wanting a standalone, complete template |
Postmortem Best Practices
Organizations with mature practices follow several principles.
Enforce Blamelessness Rigorously
This requires active effort:
- No disciplinary action for honest mistakes
- No public criticism of individuals
- Active redirection when discussion moves toward blame
- Leadership must model this behavior consistently
Use Standardized Templates
Templates ensure comprehensive coverage:
- Required sections prevent overlooking important elements
- Consistent format reduces creation effort
- Standard structure makes documents searchable
Involve All Relevant Participants
Include diverse perspectives:
- Responders who handled the incident
- Subject matter experts
- Stakeholders affected by the incident
- Anyone whose insight might reveal important factors
Set Realistic Timelines
Overly ambitious commitments lead to delays:
```yaml
# Good: achievable commitments
action_items:
  - action: "Add alerting for connection pool exhaustion"
    timeline: "2 weeks"
    complexity: "low"
  - action: "Refactor database connection handling"
    timeline: "6 weeks"
    complexity: "high"
```
Better to deliver achievable improvements than to promise transformational changes that never happen.
Track Patterns Across Postmortems
Individual incidents reveal specific failures. Patterns reveal systemic issues:
```yaml
quarterly_review:
  common_factors:
    - "Deployment without adequate testing": 5 incidents
    - "Missing monitoring": 4 incidents
    - "Unclear runbooks": 3 incidents
  recommended_investments:
    - "Improve staging environment parity"
    - "Expand monitoring coverage"
```
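A quarterly roll-up like this can be computed directly from postmortem documents. A sketch using Python's `collections.Counter`; the postmortem records and factor strings are illustrative:

```python
from collections import Counter

# Illustrative postmortem records, each listing its contributing factors
postmortems = [
    {"id": "PM-101", "factors": ["Deployment without adequate testing",
                                 "Missing monitoring"]},
    {"id": "PM-102", "factors": ["Missing monitoring"]},
    {"id": "PM-103", "factors": ["Deployment without adequate testing",
                                 "Unclear runbooks"]},
]

# Count how often each factor appears across all postmortems
factor_counts = Counter(f for pm in postmortems for f in pm["factors"])
for factor, count in factor_counts.most_common():
    print(f"{factor}: {count} incidents")
```

This only works if contributing factors are recorded with consistent wording, which is another argument for standardized templates.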
Celebrate Postmortem Quality
Recognition reinforces desired behavior:
- Highlight exemplary postmortems as learning resources
- Recognize thorough analysis and meaningful improvements
- Track and share postmortem completion rates
Conclusion
Postmortems transform incidents from painful experiences into organizational assets. By conducting blameless analysis, documenting findings consistently, and following through on action items, teams build collective knowledge.
Getting Started
- Establish a postmortem culture that values honesty over blame
- Implement templates that make postmortems efficient
- Track action items rigorously
- Review patterns across postmortems periodically
Related Guides
- Incident Management Best Practices — Establish the incident process that feeds into your postmortems
- On-Call Rotation Setup — Define the on-call structure that first responds to incidents
- Runbook Automation Guide — Automate remediation steps identified in postmortem action items
- Website Downtime Cost Calculator — Quantify the business impact documented in your postmortems