[RFC] Postmortem Report

This is a sample postmortem report template. Use it to review the chronology of an incident, document the mitigation applied, and record how the problem was resolved within the incident period.

Title

  • YYYY-MM-DD Issue Name.
    eg:
    2020-09-01 Failed to Replicate Database Slave in Node-2.

Issue Summary

  • Summary of the issue that describes the full chronology.
    eg:
    We had an issue with replication on the slave database server in node-2. The issue started at 07:00 because the slave server DNS could not connect to the master DNS server. As a result, some microservices that point their read queries at the slave server were unable to connect to the database.

    List of microservices impacted:
    • Microservices 1: Auth
    • Microservices 2: OTP

Impact

  • List of microservices or other infrastructure resources impacted by this issue.
    eg:
    Impacted microservices:
    • Microservices 1: Auth
    • Microservices 2: OTP

    Impacted infra:
    • DNS slave

Trigger

  • List of triggers for the issue.
    eg:
    • The cloud provider ran maintenance from 2020-09-01 02:00 to 2020-09-01 03:00 (GMT+7).
    • Some DNS records changed as a result of the maintenance.

Detection

  • List of how the issue was detected.
    eg:
    • Detected on metrics for failed replication (with snapshot picture)
    • Detected in logs for DNS changes (with snapshot picture)

Root Cause

  • List of root causes for the issue.
    eg:
    • The slave database server in node-2 could not run because it could not connect to the master DNS server.
    • The master DNS server had been moved to a different address due to the cloud provider maintenance.

Timeline

  • List the timeline of the issue from beginning until end (resolved).
    eg:
    2020-09-01 07:00 Metrics show failed replication on the slave database server in node-2
    2020-09-01 07:10 Alert raised at P3 escalation
    2020-09-01 07:12 On-call acknowledges the issue
    2020-09-01 07:15 Manual replication of the slave server started
    2020-09-01 07:30 All replication restored
    2020-09-01 07:35 Replication monitoring phase (for about 10-15 minutes)
    2020-09-01 08:00 Operation of the slave database server in node-2 is back to normal
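
The intervals between timeline entries (time to alert, time to acknowledge, time to resolve) are often worth stating explicitly in a postmortem. The following is a minimal sketch of that arithmetic, using the timestamps from the example above. The event names (`detected`, `alerted`, and so on) and the helper `minutes_between` are hypothetical labels chosen for illustration; they are not part of the template.

```python
from datetime import datetime

# Timestamps copied from the example timeline; the event names are
# illustrative labels, not part of the template.
TIMELINE = [
    ("2020-09-01 07:00", "detected"),      # metrics show failed replication
    ("2020-09-01 07:10", "alerted"),       # P3 alert raised
    ("2020-09-01 07:12", "acknowledged"),  # on-call acknowledges
    ("2020-09-01 07:30", "restored"),      # replication restored
    ("2020-09-01 08:00", "resolved"),      # back to normal
]


def minutes_between(timeline, start_event, end_event):
    """Return the whole minutes elapsed between two named events."""
    times = {
        event: datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        for stamp, event in timeline
    }
    delta = times[end_event] - times[start_event]
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("time to alert (min):", minutes_between(TIMELINE, "detected", "alerted"))
    print("time to acknowledge (min):", minutes_between(TIMELINE, "detected", "acknowledged"))
    print("time to resolve (min):", minutes_between(TIMELINE, "detected", "resolved"))
```

All timestamps are assumed to share one time zone, as they do in the example.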

Resolution & Recovery

  • List of resolution & recovery actions.
    eg:
    • Manual replication for the slave server
    • Repointing the DNS slave in node-2 to the new DNS master

Corrective and Preventive Measures

  • List of action items / procedures for correction & prevention (as mitigation).
    eg:
    • Update the alerting metric thresholds and raise the escalation level to P2.
    • Open a ticket with the cloud provider about the impact of the DNS move.

Financial Impact

Product Impacted | Start DateTime – End DateTime | Impact Type (Outage, Error Rates, Latency Spike) | Monitoring Links | Log Links
                 |                               |                                                  |                  |
                 |                               |                                                  |                  |

  • Detail of Financial Impact

Division / Team Name

List of divisions / teams impacted by this postmortem

Related documentation for this issue (JIRA / Confluence)

[RFC] Performance Testing K6

Monitoring Dashboard

  • Monitoring Dashboard URL

Logging

  • Logging Dashboard URL

Operations (Executors)

PIC Name              | Department
DevOps Engineer – 1   | DevOps
DevOps Engineer – 2   | DevOps
QA Engineer – 1       | QA
QA Engineer – 2       | QA
Software Engineer – 1 | Engineering
Software Engineer – 2 | Engineering

Supervisors

Supervisor Name    | Department  | Remark
@zeroc0d3-devops   | DevOps      |
@zeroc0d3-engineer | Engineering |
@zeroc0d3-iot      | IoT         |
@zeroc0d3-data     | Data        |

HelmChart

Deployments | Request CPU (mi) | Request Mem (mb) | Limit CPU (mi) | Limit Mem (mb)
            |                  |                  |                |

Performance Test Report

Cycle | Virtual User (vus) | Duration (seconds) | Date / Time (Start, End) | Service | Component | Before (CPU (mi), Mem (mb)) | Inprogress (CPU (mi), Mem (mb)) | After (CPU (mi), Mem (mb)) | Jenkins Link | Monitoring Link (Performance Test Process, After Performance Test Process) | Remark (Logs)
1         EKS <deployment-name>                    
RDS <rds-name>/<db-name>             N/A     N/A

References:

K6 Website:    https://k6.io/
K6 SourceCode: https://github.com/grafana/k6

[RFC] Helm Template

A. HelmChart Template

Prerequisites

  • Helm
### Linux ###
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh

### MacOS ###
$ brew install helm
  • Helmfile
$ wget https://github.com/roboll/helmfile/releases/download/v0.139.7/helmfile_linux_amd64
$ chmod +x helmfile_linux_amd64
$ sudo mv helmfile_linux_amd64 /usr/local/bin/helmfile
  • Helm Plugins
$ helm plugin install https://github.com/databus23/helm-diff
$ helm plugin install https://github.com/hypnoglow/helm-s3.git
  • Add the Mandatory Repository
$ helm repo add stable https://charts.helm.sh/stable
$ helm repo update

Helm Repository

  • Check the Helm Repositories
$ helm repo list
----
NAME            URL
stable          https://charts.helm.sh/stable
  • Add the Helm Repositories
### LAB ###
AWS_REGION=ap-southeast-1 helm repo add devopscorner-lab s3://devopscorner-charts/lab

### STAGING ###
AWS_REGION=ap-southeast-1 helm repo add devopscorner-staging s3://devopscorner-charts/staging

### PRODUCTION ###
AWS_REGION=ap-southeast-1 helm repo add devopscorner s3://devopscorner-charts/prod

helm repo update

Creating HelmChart Template

$ helm create [helmchart_name]
---
eg: 
$ helm create myhelm

$ tree myhelm
myhelm
├── Chart.yaml
├── charts
├── templates
│   ├── NOTES.txt
│   ├── _helpers.tpl
│   ├── deployment.yaml
│   ├── hpa.yaml
│   ├── ingress.yaml
│   ├── service.yaml
│   ├── serviceaccount.yaml
│   └── tests
│       └── test-connection.yaml
└── values.yaml

3 directories, 10 files

Structure HelmChart Template (Multi Environment)

.
├── template
│   ├── lab
│   │   ├── api
│   │   │   ├── Chart.yaml
│   │   │   ├── api.yaml
│   │   │   ├── templates
│   │   │   │   ├── _helpers.tpl
│   │   │   │   └── serviceaccount.yaml
│   │   │   └── values.yaml
│   │   ├── backend
│   │   │   ├── Chart.yaml
│   │   │   ├── backend.yaml
│   │   │   ├── templates
│   │   │   │   ├── _helpers.tpl
│   │   │   │   └── serviceaccount.yaml
│   │   │   └── values.yaml
│   │   ├── frontend
│   │   │   ├── Chart.yaml
│   │   │   ├── frontend.yaml
│   │   │   ├── templates
│   │   │   │   ├── _helpers.tpl
│   │   │   │   └── serviceaccount.yaml
│   │   │   └── values.yaml
│   │   └── svcrole
│   │       ├── Chart.yaml
│   │       ├── templates
│   │       │   ├── _helpers.tpl
│   │       │   ├── clusterrole.yaml
│   │       │   ├── rolebinding.yaml
│   │       │   └── serviceaccount.yaml
│   │       └── values.yaml
... ... ...   
│   ├── [environment]
│   │   ├── api
│   │   │   └── values.yaml
│   │   ├── backend
│   │   │   └── values.yaml
│   │   ├── frontend
│   │   │   └── values.yaml
│   │   └── svcrole
│   │       └── values.yaml
└── test
    ├── lab
    │   ├── helmfile.yaml
    │   └── values
    │       ├── api
    │       │   └── api.yaml
    │       ├── backend
    │       │   └── backend.yaml
    │       ├── frontend
    │       │   └── frontend.yaml
    │       └── svcrole
    │           ├── account.yaml
    │           ├── api.yaml
    │           ├── backend.yaml
    │           └── frontend.yaml
    └── staging
        ├── helmfile.yaml
        └── values
            ├── api
            │   └── api.yaml
            ├── backend
            │   └── backend.yaml
            ├── frontend
            │   └── frontend.yaml
            └── svcrole
                ├── account.yaml
                ├── api.yaml
                ├── backend.yaml
                └── frontend.yaml

HelmChart In Repository

  • Structure on services repository
_infra/
   dev/
      helmfile.yaml
      values/
            api/values.yaml
            backend/values.yaml
            svcrole/values.yaml
            frontend/values.yaml

Testing Helm

  • Testing the Chart Template
helm template ./api -f values/api/values.yaml
helm template ./backend -f values/backend/values.yaml
helm template ./svcrole -f values/svcrole/values.yaml
helm template ./frontend -f values/frontend/values.yaml

Packing HelmChart

  • Create a packaged archive (.tgz) of each HelmChart
helm package api
helm package backend
helm package svcrole
helm package frontend

Update HelmChart Template

  • Push chart into private repository
### LAB ###
helm s3 push api-[version].tgz devopscorner-lab --force
helm s3 push backend-[version].tgz devopscorner-lab --force
helm s3 push frontend-[version].tgz devopscorner-lab --force
helm s3 push svcrole-[version].tgz devopscorner-lab --force
---
### STAGING ###
helm s3 push api-[version].tgz devopscorner-staging --force
helm s3 push backend-[version].tgz devopscorner-staging --force
helm s3 push frontend-[version].tgz devopscorner-staging --force
helm s3 push svcrole-[version].tgz devopscorner-staging --force
---
### PRODUCTION ###
helm s3 push api-[version].tgz devopscorner --force
helm s3 push backend-[version].tgz devopscorner --force
helm s3 push frontend-[version].tgz devopscorner --force
helm s3 push svcrole-[version].tgz devopscorner --force

B. Versioning HelmChart

  • Change Version HelmChart
$ vi api/Chart.yaml
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: "1.1.0-rc"
  • Repackage the HelmChart template
  • Push the HelmChart to the private repository again
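
Incrementing the chart version by hand is easy to get wrong, so a small helper can be useful. The sketch below parses a version string such as "1.0.0-rc" and bumps one part of it. The function name `bump` and the choice to keep the pre-release suffix unchanged are assumptions made for illustration; the RFC does not prescribe a bumping rule.

```python
import re

# MAJOR.MINOR.PATCH with an optional pre-release suffix such as "-rc".
SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")


def bump(version, part="minor"):
    """Return `version` with the given part incremented.

    Assumption: the pre-release suffix (e.g. "rc") is carried over as-is.
    """
    match = SEMVER.match(version)
    if match is None:
        raise ValueError(f"not a semantic version: {version!r}")
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    suffix = match.group(4)

    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unknown part: {part!r}")

    bumped = f"{major}.{minor}.{patch}"
    return f"{bumped}-{suffix}" if suffix else bumped


if __name__ == "__main__":
    print(bump("1.0.0-rc"))  # minor bump, suffix kept
```

The result would then be written to the `version:` field in Chart.yaml before repackaging.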

C. Using Versioning HelmChart in helmfile.yaml

  • Repository Lab
---
repositories:
  - name: devopscorner-lab
    url: s3://devopscorner-charts/lab

templates:
  default: &default
    namespace: devopscorner
    version: "1.0.0-rc"

releases:
  - name: devopscorner-api
    chart: devopscorner-lab/api
    values:
      - ./values/api/values.yaml
    <<: *default

  - name: devopscorner-backend
    chart: devopscorner-lab/backend
    values:
      - ./values/backend/values.yaml
    <<: *default

  - name: devopscorner-frontend
    chart: devopscorner-lab/frontend
    values:
      - ./values/frontend/values.yaml
    <<: *default

  - name: devopscorner-svcaccount
    chart: devopscorner-lab/svcrole
    values:
      - ./values/svcrole/account.yaml
    <<: *default

[RFC] Logging

A. Concepts

Standardize the exported log path and name

eg:
----
/var/log/[microservice-name]/[microservice-name]-error.log   # error only
/var/log/[microservice-name]/[microservice-name].log         # info, warning & debug
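
To keep services from drifting away from the naming convention above, the two paths can be derived from the service name in one place. This is a minimal sketch; the helper name `log_paths` and the dictionary keys are hypothetical.

```python
from pathlib import PurePosixPath


def log_paths(service, root="/var/log"):
    """Build the error and general log paths for one microservice."""
    base = PurePosixPath(root) / service
    return {
        "error": str(base / f"{service}-error.log"),  # error only
        "general": str(base / f"{service}.log"),      # info, warning & debug
    }


if __name__ == "__main__":
    print(log_paths("auth"))
```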

Logs use JSON format

Severity levels & log formatting

eg: INFO
---
{
  "datetime": "2020-10-10 20:01:59TZ+0700",
  "severity": "info",
  "message": "yes, this is info"
}

eg: WARNING
---
{
  "datetime": "2020-10-10 20:01:59TZ+0700",
  "severity": "warning",
  "message": "this is warning"
}

eg: ERROR
---
{
  "datetime": "2020-10-10 20:01:59TZ+0700",
  "severity": "error",
  "code": 404,
  "message": "not found"
}

eg: DEBUG (optional)
---
{
  "datetime": "2020-10-10 20:01:59TZ+0700",
  "severity": "debug",
  "code": 100,
  "message": "describe debug information (criteria by number) "
}
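
One way to produce entries in this shape from Python is a custom formatter for the standard `logging` module. The sketch below is an illustration under stated assumptions: the class name `JsonFormatter` and helper `build_logger` are hypothetical, the optional `code` field is passed through `extra`, and the timestamp is written in ISO 8601 (UTC) rather than the exact datetime layout shown in the examples.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            # Assumption: ISO 8601 in UTC instead of the sample layout.
            "datetime": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "severity": record.levelname.lower(),
        }
        # "code" is optional and supplied via logger.error(..., extra={"code": 404}).
        code = getattr(record, "code", None)
        if code is not None:
            entry["code"] = code
        entry["message"] = record.getMessage()
        return json.dumps(entry)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


if __name__ == "__main__":
    log = build_logger("example-service")
    log.info("yes, this is info")
    log.error("not found", extra={"code": 404})
```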

Logrotation & compression

# /etc/logrotate.d/[microservice-name]
---
/var/log/[microservice-name]/[microservice-name].log {
        rotate 12
        weekly
        missingok
        notifempty
        compress
        delaycompress
        size 50M
        sharedscripts
        postrotate
           /usr/bin/killall -HUP [microservice-name]
        endscript
}

/var/log/[microservice-name]/[microservice-name]-error.log {
        rotate 12
        weekly
        missingok
        notifempty
        compress
        delaycompress
        size 50M
        sharedscripts
        postrotate
           /usr/bin/killall -HUP [microservice-name]
        endscript
}

Log4j (JAVA)

# log4j.properties
---
log4j.rootLogger=INFO, fileLogger
log4j.appender.fileLogger=org.apache.log4j.RollingFileAppender
log4j.appender.fileLogger.layout=org.apache.log4j.PatternLayout
log4j.appender.fileLogger.layout.ConversionPattern=%d [%t] %-5p (%F:%L) - %m%n
log4j.appender.fileLogger.File=example.log
log4j.appender.fileLogger.MaxFileSize=50MB
log4j.appender.fileLogger.MaxBackupIndex=12
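
For Python services, a roughly equivalent setup can be sketched with `RotatingFileHandler` from the standard library, reusing the same 50 MB size limit and 12 backups. The function name `build_rotating_logger` is hypothetical, and the format string is only an approximation of the Log4j pattern above. Unlike the logrotate configuration, this handler does not compress rotated files.

```python
import logging
from logging.handlers import RotatingFileHandler

MAX_BYTES = 50 * 1024 * 1024  # 50 MB, matching MaxFileSize above
BACKUP_COUNT = 12             # matching MaxBackupIndex / rotate 12

# Approximation of the Log4j ConversionPattern above.
LOG_FORMAT = (
    "%(asctime)s [%(threadName)s] %(levelname)-5s "
    "(%(filename)s:%(lineno)d) - %(message)s"
)


def build_rotating_logger(name, path, max_bytes=MAX_BYTES, backup_count=BACKUP_COUNT):
    """Return a logger that rotates `path` once it reaches `max_bytes`."""
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_rotating_logger("example", "example.log")
    log.info("rotating logger ready")
```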

Schedule logging (log exporter)

  • Schedule with cron (crontab)
/etc/cron.d/[microservice-name]
  • Schedule with systemd
/etc/systemd/system/[microservice-name].service
/etc/systemd/system/[microservice-name].timer

B. Tools

  • GO

https://github.com/sirupsen/logrus

  • Python
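from datetime import datetime
import json
import logging


def main():
    print("--- Starting Log Exporter Agent ---")
    # Replace [microservice-name] with the real service name; the directory must already exist.
    logging.basicConfig(level=logging.INFO, filename="/var/log/[microservice-name]/[microservice-name].log", format="%(message)s")
    # Write one JSON-formatted entry so the agent start is visible in the log.
    logging.info(json.dumps({"datetime": datetime.now().astimezone().isoformat(), "severity": "info", "message": "log exporter agent started"}))

if __name__ == '__main__':
    main()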

[RFC] Load Testing

A. Overview

Load Testing / Stress Testing is a mechanism for flooding the network with a defined number of requests over a set period of time.

To run a successful load test we need:

  • Preparation
  • How many requests will be used?
  • How long will the load test run?
  • How many microservices are involved?
  • Which environment will the load test impact (on-prem, staging, production)?
  • What is the estimated cost of this load testing event?

B. Technical Documentation

Define your technical documentation for this load test (the estimates beforehand and the results after the load testing event)

  • Environment
  • Number of Microservices
  • Number of Request
  • Period Time (Begin – End Snapshot Datetime)
  • Target Metrics
  • Snapshot Result (Metrics Dashboard & Logs)
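
The figures recorded after the event can be reduced to a few numbers that are easy to compare against the target metrics. The sketch below summarizes a list of per-request latencies into throughput, error rate, and latency percentiles. The function names, the choice of p50/p95, and the nearest-rank percentile method are assumptions for illustration; the RFC does not fix which metrics to report.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, errors, duration_s):
    """Summarize one load test run.

    latencies_ms: one latency per request sent (milliseconds)
    errors:       number of those requests that failed
    duration_s:   length of the run in seconds
    """
    total = len(latencies_ms)
    return {
        "requests": total,
        "rps": total / duration_s,
        "error_rate": errors / total,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
    }


if __name__ == "__main__":
    # Synthetic data for illustration only: 100 requests over 10 seconds.
    sample = list(range(1, 101))
    print(summarize(sample, errors=2, duration_s=10))
```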

[RFC] Container Application

Technical documentation for standardizing application image builds.

Define a Core Image as the reference for sub / child containers

  • Ubuntu
# ubuntu-base
FROM ubuntu:20.04
  • Alpine
# alpine-base
FROM alpine:3.15
  • Debian
# debian-base
FROM debian:buster

Docker Meta Data

ARG BUILD_DATE
ARG BUILD_VERSION
ARG GIT_COMMIT
ARG GIT_URL

ENV VENDOR="DevOpsCornerID"
ENV AUTHOR="devopscornerid <support@devopscorner.id>"
ENV IMG_NAME="core-ubuntu"
ENV IMG_VERSION="20.04"
ENV IMG_DESC="core-ubuntu image"
ENV IMG_ARCH="amd64/x86_64"

## Simple Description ##
LABEL maintainer="$AUTHOR" \
      architecture="$IMG_ARCH" \
      ubuntu-version="$IMG_VERSION" \
      org.label-schema.build-date="$BUILD_DATE" \
      org.label-schema.name="$IMG_NAME" \
      org.label-schema.description="$IMG_DESC" \
      org.label-schema.vcs-ref="$GIT_COMMIT" \
      org.label-schema.vcs-url="$GIT_URL" \
      org.label-schema.vendor="$VENDOR" \
      org.label-schema.version="$BUILD_VERSION" \
      org.label-schema.schema-version="$IMG_VERSION" \
## Additional Detail Description ##
      org.opencontainers.image.authors="$AUTHOR" \
      org.opencontainers.image.description="$IMG_DESC" \
      org.opencontainers.image.vendor="$VENDOR" \
      org.opencontainers.image.version="$IMG_VERSION" \
      org.opencontainers.image.revision="$GIT_COMMIT" \
      org.opencontainers.image.created="$BUILD_DATE" \
      fr.hbis.docker.base.build-date="$BUILD_DATE" \
      fr.hbis.docker.base.name="$IMG_NAME" \
      fr.hbis.docker.base.vendor="$VENDOR" \
      fr.hbis.docker.base.version="$BUILD_VERSION"
    
-----
eg (detail):
LABEL maintainer="$AUTHOR" \
      architecture="$IMG_ARCH" \
      ubuntu-version="$IMG_VERSION" \
      org.label-schema.build-date="$BUILD_DATE" \
      org.label-schema.name="$IMG_NAME" \
      org.label-schema.description="$IMG_DESC" \
      org.label-schema.vcs-ref="$GIT_COMMIT" \
      org.label-schema.vcs-url="$GIT_URL" \
      org.label-schema.vendor="$VENDOR" \
      org.label-schema.version="$BUILD_VERSION" \
      org.label-schema.schema-version="$IMG_VERSION" \
      org.opencontainers.image.authors="$AUTHOR" \
      org.opencontainers.image.description="$IMG_DESC" \
      org.opencontainers.image.vendor="$VENDOR" \
      org.opencontainers.image.version="$IMG_VERSION" \
      org.opencontainers.image.revision="$GIT_COMMIT" \
      org.opencontainers.image.created="$BUILD_DATE" \
      fr.hbis.docker.base.build-date="$BUILD_DATE" \
      fr.hbis.docker.base.name="$IMG_NAME" \
      fr.hbis.docker.base.vendor="$VENDOR" \
      fr.hbis.docker.base.version="$BUILD_VERSION"
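
The four `ARG` values above have to be supplied at build time with `--build-arg`. The sketch below assembles such a `docker build` command as an argument list. The function name `docker_build_command`, the image name, and the sample commit and URL values are hypothetical; only the `ARG` names come from the Dockerfile fragment.

```python
from datetime import datetime, timezone


def docker_build_command(image, version, git_commit, git_url, build_date=None):
    """Return a `docker build` argument list that fills the metadata ARGs."""
    if build_date is None:
        # Default to the current UTC time in ISO 8601 form.
        build_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    build_args = {
        "BUILD_DATE": build_date,
        "BUILD_VERSION": version,
        "GIT_COMMIT": git_commit,
        "GIT_URL": git_url,
    }
    command = ["docker", "build"]
    for key, value in build_args.items():
        command += ["--build-arg", f"{key}={value}"]
    command += ["-t", f"{image}:{version}", "."]
    return command


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(" ".join(docker_build_command(
        "core-ubuntu", "20.04", "abc1234", "https://example.com/repo.git"
    )))
```

The list form can be passed directly to `subprocess.run` without shell quoting concerns.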

Install packages with no cache

  • Ubuntu / Debian
RUN apt-get -o APT::Sandbox::User=root update && \
    apt-get install -y [packages] && \
    rm -rf /var/lib/apt/lists/*
  • Alpine
RUN apk add --no-cache [packages]

Remove unused packages that the sub / child container never uses

  • Ubuntu / Debian
RUN apt-get remove -y [packages] && \
    apt-get clean
  • Alpine
RUN rm -rf /var/cache/apk/*

Use a sub / child Docker container for installing libraries, and use cascade (multistage) builder images for a smaller image size

  • From Core Image or Parent
FROM ubuntu-core:latest
FROM alpine-core:latest
FROM debian-core:latest
  • Cascade (Multistage) Builder
### Builder ###
FROM golang:1.17-alpine3.15 as builder

WORKDIR /go/src/app
ENV GIN_MODE=release
ENV GOPATH=/go

RUN apk add --no-cache \
        build-base \
        git \
        curl \
        make \
        bash

RUN git clone https://github.com/zeroc0d3/go-bookstore.git /go/src/app

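# WORKDIR is already /go/src/app, so the build runs inside the cloned source tree.
# The environment variables must prefix `go build` itself to take effect.
RUN GOOS=linux GOARCH=amd64 CGO_ENABLED=0 \
        go build -mod=readonly -ldflags="-s -w" -o goapp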

### Binary ###
# The runtime stage only needs the compiled binary, so a plain Alpine base keeps the image small.
FROM alpine:3.15

ENV GIN_MODE=release
COPY --from=builder /go/src/app/goapp /usr/local/bin/goapp

ENTRYPOINT ["/usr/local/bin/goapp"]
EXPOSE 8080