[RFC] Postmortem Report

This is sample postmortem reporting to review chronologies, provide the mitigation from the issue and solving the problem during period time

Title

  • YYYY-MM-DD Issue Name.
    eg:
    2020-09-01 Failed to Replicate Database Slave in Node-2.

Issue Summary

  • Summary of issue that describe all chronologies.
    eg:
    We had issue in replication slave server database in node-2. This issue running at 07:00 due to can’t connect the slave server DNS to DNS server master. Impacted to unable connected for some of microservices that using slave server as pointing reading / query read to database.

    List of microservices impacted:
    • Microservices 1: Auth
    • Microservices 2: OTP

Impact

  • List of microservices or other infrastructure resources impacted for this issue.
    eg:
    Impacted microservices:
    • Microservices 1: Auth
    • Microservices 2: OTP

Impacted infra:
DNS slave

Trigger

  • List of trigger issue.
    eg:
    • Cloud provider running on maintenance starting at 2020-09-01 02:00 GMT+7 and end at 2020-09-01 03:00.
    • Some of DNS changed as the impacted of maintenance.

Detection

  • List of detection issue.
    eg:
    • Detect on Metrics for failed replication (with snapshot picture)
    • Detect on Log for dns changes (with snapshot picture)

Root Cause

  • List of root cause for the issue.
    eg:
    • Slave server database in node-2 can’t running due to can’t connect to DNS server master.
    • DNS server master had been moved to other pointing address due to cloud provider maintenance.

Timeline

  • List timeline issue from beginning until end (resolved).
    eg:
    2020-09-01 07:00 Metrics show failed to replicate the slave server database in node-2
    2020-09-01 07:10 Raise the alert on P3 Escalation
    2020-09-01 07:12 Oncall ack the issue
    2020-09-01 07:15 Taking action for manual replication slave server
    2020-09-01 07:30 All Replication had been restored
    2020-09-01 07:35 Monitoring phase replication (for about 10-15 minutes)
    2020-09-01 08:00 Operation slave server database in node-2 is back to normal

Resolution & Recovery

  • List of resolution & recovery action
    eg:
    • Manual replication for slave server
    • Repointing DNS slave node-2 to new DNS master

Corrective and Preventive Measurements

  • List of action item / procedure to make correction & prevention (as mitigation)
    eg:
    • Update threshold metrics for alerting, raise to P2 for escalation level.
    • Raise open ticket for cloud provider dns issue moving impact.

Financial Impact

Product Impacted Start DateTime – End DateTime Impact Type
(Outage, Error Rates, Latency Spike)
Monitoring Links Log Links
         
         
  • Detail of Financial Impact

Division / Team Name

List of division / team which impacted for this postmortem

Related documentation for this issue (JIRA / Confluences)