As an integral part of DevOps culture, cost monitoring and optimization is a key element of managing infrastructure usage, especially in today’s cloud computing era. This event covers strategies for monitoring and optimizing infrastructure costs when running Kubernetes (EKS) on AWS.
The session covers provisioning cost estimation, autoscaling, scheduled downscaling, and alerting that sends notifications when spending approaches budget limits.
Don’t miss ZX Talk – Infrastructure Kubernetes (EKS) Cost Monitoring & Optimization which will be held on:
Date: Thursday, 23 June 2022
Time: 14.00 – 15.30 (2 – 3.30 pm), Jakarta time
Place: Virtual Meet
By default, API Gateway limits the steady-state requests per second (rps) across all APIs within an AWS account, per Region. It also limits the burst (that is, the maximum bucket size) across all APIs within an AWS account, per Region. In API Gateway, the burst limit corresponds to the maximum number of concurrent request submissions that API Gateway can fulfill at any moment without returning 429 Too Many Requests error responses. For more information on throttling quotas, see Amazon API Gateway quotas and important notes.
To help understand these throttling limits, here are a few examples, given a burst limit of 5,000 and an account-level rate limit of 10,000 requests per second in the Region:
If a caller submits 10,000 requests in a one-second period evenly (for example, 10 requests every millisecond), API Gateway processes all requests without dropping any.
If the caller sends 10,000 requests in the first millisecond, API Gateway serves 5,000 of those requests and throttles the rest in the one-second period.
If the caller submits 5,000 requests in the first millisecond and then evenly spreads another 5,000 requests through the remaining 999 milliseconds (for example, about 5 requests every millisecond), API Gateway processes all 10,000 requests in the one-second period without returning 429 Too Many Requests error responses.
If the caller submits 5,000 requests in the first millisecond and waits until the 101st millisecond to submit another 5,000 requests, API Gateway processes 6,000 requests and throttles the rest in the one-second period. This is because at the rate of 10,000 rps, API Gateway has served 1,000 requests after the first 100 milliseconds and thus emptied the bucket by the same amount. Of the next spike of 5,000 requests, 1,000 fill the bucket and are queued to be processed. The other 4,000 exceed the bucket capacity and are discarded.
If the caller submits 5,000 requests in the first millisecond, submits 1,000 requests at the 101st millisecond, and then evenly spreads another 4,000 requests through the remaining 899 milliseconds, API Gateway processes all 10,000 requests in the one-second period without throttling.
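The examples above can be reproduced with a small token bucket simulation. This is only a simplified sketch of the model described in the text, not API Gateway's actual implementation; the function name `simulate`, the millisecond granularity, and the dictionary input format are assumptions made for illustration.

```python
def simulate(arrivals, burst=5000, rate_per_ms=10):
    """Count how many requests a simple token bucket would serve.

    arrivals maps a millisecond index (0 = first millisecond) to the
    number of requests submitted in that millisecond. The bucket starts
    full and refills at rate_per_ms tokens per millisecond (10,000 rps).
    """
    tokens = burst
    last_ms = 0
    served = 0
    for ms in sorted(arrivals):
        # Refill for the elapsed time, never exceeding the burst capacity.
        tokens = min(burst, tokens + (ms - last_ms) * rate_per_ms)
        last_ms = ms
        # Serve as many requests as there are tokens; the rest are throttled.
        accepted = min(tokens, arrivals[ms])
        tokens -= accepted
        served += accepted
    return served


even = {ms: 10 for ms in range(1000)}   # 10 requests every millisecond
spike = {0: 10_000}                     # everything in the first millisecond
two_spikes = {0: 5_000, 100: 5_000}     # second spike at the 101st millisecond

print(simulate(even), simulate(spike), simulate(two_spikes))
```

Under these assumptions the three scenarios yield 10,000, 5,000, and 6,000 served requests, matching the first, second, and fourth examples.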
All of the API endpoints are rate limited. Once you exceed a certain number of requests in a specific period, Datadog returns an error.
If you are rate limited, you will see a 429 in the response code. Datadog recommends either waiting the time designated by X-RateLimit-Period before making calls again, or switching to making calls at a rate slightly below X-RateLimit-Limit requests per X-RateLimit-Period seconds.
Datadog does not rate limit data point/metric submission (see the metrics section for more information on how the metric submission rate is handled). The limits you encounter depend on the quantity of custom metrics in your agreement.
The rate limit for metric retrieval is 100 per hour per organization.
The rate limit for event submission is 500,000 events per hour per organization.
The rate limit for event aggregation is 1000 per aggregate per day per organization. An aggregate is a group of similar events.
The rate limit for the Query a Timeseries API call is 1600 per hour per organization. This can be extended on demand.
The rate limit for the Log Query API call is 300 per hour per organization. This can be extended on demand.
The rate limit for the Graph a Snapshot API call is 60 per hour per organization. This can be extended on demand.
The rate limit for the Log Configuration API is 6000 per minute per organization. This can be extended on demand.
Rate limit headers and their descriptions:
X-RateLimit-Limit: number of requests allowed in a time period.
X-RateLimit-Period: length of time in seconds for resets (calendar aligned).
X-RateLimit-Remaining: number of allowed requests left in the current time period.
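As a rough illustration of how a client might use these headers, the sketch below computes how long to pause after a 429 and how far apart to space calls. The function names, the `default_wait` fallback, and the `safety_factor` are hypothetical; the code also assumes the headers arrive in a plain dictionary with exactly the capitalization shown above.

```python
def seconds_to_wait(status_code, headers, default_wait=60):
    """Return how long to pause before retrying, in seconds.

    Waits a full X-RateLimit-Period after a 429, which is conservative.
    default_wait is an assumed fallback when the header is missing.
    """
    if status_code != 429:
        return 0
    period = headers.get("X-RateLimit-Period")
    return int(period) if period is not None else default_wait


def min_call_interval(headers, safety_factor=1.1):
    """Return a spacing between calls (seconds) that stays under the limit.

    period / limit is the tightest even spacing; safety_factor > 1
    makes the spacing slightly longer than that.
    """
    limit = int(headers["X-RateLimit-Limit"])
    period = int(headers["X-RateLimit-Period"])
    return safety_factor * period / limit


example_headers = {"X-RateLimit-Limit": "100", "X-RateLimit-Period": "3600"}
print(seconds_to_wait(429, example_headers))
print(min_call_interval(example_headers))
```

With a limit of 100 calls per 3,600 seconds, even spacing works out to one call every 36 seconds, so the sketch suggests a slightly longer gap.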
This is a sample postmortem report template for reviewing the chronology of an incident, documenting the mitigation, and describing how the problem was resolved.
Title
YYYY-MM-DD Issue Name. e.g.: 2020-09-01 Failed to Replicate Database Slave in Node-2.
Issue Summary
Summary of the issue, covering the full chronology. e.g.: We had a replication issue on the slave database server in node-2. The issue started at 07:00 because the slave server's DNS could not connect to the master DNS server. As a result, some microservices that use the slave server for read queries were unable to connect to the database.
List of microservices impacted:
Microservices 1: Auth
Microservices 2: OTP
Impact
List of microservices or other infrastructure resources impacted by this issue. e.g.: Impacted microservices:
Microservices 1: Auth
Microservices 2: OTP
Impacted infra: DNS slave
Trigger
List of issue triggers. e.g.:
The cloud provider ran maintenance from 2020-09-01 02:00 GMT+7 to 2020-09-01 03:00.
Some DNS settings changed as a result of the maintenance.
Detection
List of ways the issue was detected. e.g.:
Detected in metrics for failed replication (with screenshot)
Detected in logs for DNS changes (with screenshot)
Root Cause
List of root causes of the issue. e.g.:
The slave database server in node-2 could not run because it could not connect to the master DNS server.
The master DNS server had been moved to a different address due to cloud provider maintenance.
Timeline
Timeline of the issue from beginning to end (resolved). e.g.:
2020-09-01 07:00 Metrics show failed replication on the slave database server in node-2
2020-09-01 07:10 Alert raised at P3 escalation
2020-09-01 07:12 On-call acknowledges the issue
2020-09-01 07:15 Manual replication of the slave server started
2020-09-01 07:30 All replication restored
2020-09-01 07:35 Replication monitoring phase (about 10-15 minutes)
2020-09-01 08:00 Slave database server in node-2 is back to normal operation
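A timeline recorded in this format also makes it easy to derive response durations. The sketch below is one possible way to do that; the event labels (`detected`, `acknowledged`, `resolved`) and the helper `minutes_between` are hypothetical names, and the timestamps are taken from the example timeline.

```python
from datetime import datetime

# Key moments from the example timeline; the labels are assumed for illustration.
timeline = [
    ("2020-09-01 07:00", "detected"),
    ("2020-09-01 07:12", "acknowledged"),
    ("2020-09-01 08:00", "resolved"),
]


def minutes_between(timeline, start_label, end_label):
    """Return the whole minutes elapsed between two labeled timeline entries."""
    times = {
        label: datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        for stamp, label in timeline
    }
    delta = times[end_label] - times[start_label]
    return int(delta.total_seconds() // 60)


time_to_ack = minutes_between(timeline, "detected", "acknowledged")
time_to_resolve = minutes_between(timeline, "detected", "resolved")
print(time_to_ack, time_to_resolve)
```

For the example incident this gives 12 minutes to acknowledgement and 60 minutes to resolution, figures that could be reported alongside the timeline.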
Resolution & Recovery
List of resolution and recovery actions. e.g.:
Manual replication for the slave server
Repointing the node-2 DNS slave to the new DNS master
Corrective and Preventive Measures
List of action items / procedures for correction and prevention (as mitigation). e.g.:
Update the alerting metric thresholds and raise the escalation level to P2.
Open a ticket with the cloud provider about the impact of the DNS move.
Financial Impact
Product Impacted
Start DateTime – End DateTime
Impact Type (Outage, Error Rates, Latency Spike)
Monitoring Links
Log Links
Detail of Financial Impact
Division / Team Name
List of divisions / teams impacted by this incident
Related Documents
Related documentation for this issue (JIRA / Confluence)