This is a sample postmortem report template. Use it to review the chronology of an incident, document how the issue was mitigated, and record how the problem was resolved during the incident period.
Title
- YYYY-MM-DD Issue Name.
eg:
2020-09-01 Failed to Replicate Database Slave in Node-2.
Issue Summary
- Summary of the issue that describes the full chronology.
eg:
We had an issue with replication on the slave database server in node-2. The issue began at 07:00 because the slave server's DNS could not connect to the master DNS server. As a result, some microservices that use the slave server for read queries were unable to connect to the database.
List of microservices impacted:
- Microservice 1: Auth
- Microservice 2: OTP
Impact
- List of microservices or other infrastructure resources impacted by this issue.
eg:
Impacted microservices:
- Microservice 1: Auth
- Microservice 2: OTP
Impacted infra:
- DNS slave
Trigger
- List of triggers for the issue.
eg:
- The cloud provider ran maintenance from 2020-09-01 02:00 to 2020-09-01 03:00 GMT+7.
- Some DNS entries changed as a result of the maintenance.
Detection
- List of how the issue was detected.
eg:
- Detected in metrics showing failed replication (with screenshot)
- Detected in logs showing DNS changes (with screenshot)
Root Cause
- List of root causes of the issue.
eg:
- The slave database server in node-2 could not run because it could not connect to the master DNS server.
- The master DNS server had been moved to a different address due to the cloud provider maintenance.
Timeline
- List the timeline of the issue from beginning to end (resolved).
eg:
2020-09-01 07:00 Metrics show failed replication on the slave database server in node-2
2020-09-01 07:10 Alert raised at P3 escalation
2020-09-01 07:12 On-call acknowledged the issue
2020-09-01 07:15 Started manual replication for the slave server
2020-09-01 07:30 All replication restored
2020-09-01 07:35 Monitoring phase for replication (about 10-15 minutes)
2020-09-01 08:00 Slave database server in node-2 back to normal operation
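The timestamps in a timeline like this one can be turned into response intervals (time to alert, time to acknowledge, time to restore). The snippet below is a minimal sketch using the example timestamps; the event labels and the `minutes_between` helper are illustrative names, not part of the template.

```python
from datetime import datetime

# Timestamp format used in the example timeline.
FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the example timeline; the keys are illustrative labels.
timeline = {
    "detected": "2020-09-01 07:00",
    "alerted": "2020-09-01 07:10",
    "acknowledged": "2020-09-01 07:12",
    "restored": "2020-09-01 07:30",
    "back_to_normal": "2020-09-01 08:00",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


time_to_alert = minutes_between(timeline["detected"], timeline["alerted"])
time_to_ack = minutes_between(timeline["alerted"], timeline["acknowledged"])
time_to_restore = minutes_between(timeline["detected"], timeline["restored"])
total_duration = minutes_between(timeline["detected"], timeline["back_to_normal"])

print(f"Time to alert:   {time_to_alert} min")
print(f"Time to ack:     {time_to_ack} min")
print(f"Time to restore: {time_to_restore} min")
print(f"Total duration:  {total_duration} min")
```

Recording these intervals next to the timeline makes it easier to compare incidents and to judge whether the alerting thresholds need to change.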
Resolution & Recovery
- List of resolution and recovery actions.
eg:
- Manual replication for the slave server
- Repointed DNS for slave node-2 to the new DNS master
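After a repointing step like the one in the example, it can help to confirm that the name the slave uses now resolves to the expected address. The sketch below is one possible check, not a prescribed procedure: the hostname, the IP address, and the `resolves_to` helper are hypothetical, and the resolver is injectable so the example runs without touching a real DNS server.

```python
import socket
from typing import Callable


def resolves_to(
    hostname: str,
    expected_ip: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if hostname resolves to expected_ip, False otherwise.

    Resolution failures are treated as a mismatch rather than raised.
    """
    try:
        return resolver(hostname) == expected_ip
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


# Hypothetical records standing in for a real DNS server.
FAKE_RECORDS = {"db-master.example.internal": "10.0.0.5"}


def stub_resolver(hostname: str) -> str:
    """Resolve from FAKE_RECORDS, failing the same way a real lookup would."""
    if hostname not in FAKE_RECORDS:
        raise socket.gaierror(f"cannot resolve {hostname}")
    return FAKE_RECORDS[hostname]


ok = resolves_to("db-master.example.internal", "10.0.0.5", resolver=stub_resolver)
stale = resolves_to("db-master.example.internal", "10.0.0.9", resolver=stub_resolver)
missing = resolves_to("unknown.example.internal", "10.0.0.5", resolver=stub_resolver)
print(ok, stale, missing)
```

In real use, the default `socket.gethostbyname` resolver would be used on the slave host itself, so the check reflects the DNS configuration that the replication process sees.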
Corrective and Preventive Measures
- List of action items / procedures for correction and prevention (as mitigation).
eg:
- Update the alerting metric thresholds and raise the escalation level to P2.
- Open a ticket with the cloud provider about the impact of the DNS move.
Financial Impact
Product Impacted | Start DateTime – End DateTime | Impact Type (Outage, Error Rates, Latency Spike) | Monitoring Links | Log Links |
---|---|---|---|---|
- Details of the financial impact.
Division / Team Name
List of the divisions / teams impacted in this postmortem
Related Documents
Related documentation for this issue (JIRA / Confluence)