Amazon CloudWatch

CloudWatch is the AWS monitoring service: it collects metrics, logs, and events from AWS resources and applications, and can raise alarms on them.

Region scope
Fault tolerant
Durable
Push, not pull

Main entities:

Log groups
Log streams

Functionalities

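Collects and tracks time-ordered data points, called metrics. Standard-resolution metrics have 1-minute granularity (EC2 basic monitoring publishes every 5 minutes, detailed monitoring every minute).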
A dimension is a name/value pair that helps identify a metric (max 30 per metric).
(example: InstanceId=i-1234567 / InstanceType=m1.large)
A statistic is an aggregation of metric values over time
(sum, minimum, maximum, average, sampleCount, pNN.NN = percentile)
A namespace is a container for a collection of related metrics
(example: AWS/EC2, which contains metrics such as CPUUtilization)
Enables you to create alarms, based on metrics, and send notifications to targets
Can trigger changes in capacity, based on rules that you set
An Event describes changes in AWS resources
A Rule matches incoming events and routes them to targets
A Target processes events (Lambda, Kinesis, SNS topics, SQS queues…)
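To make the statistic definition above concrete, here is a minimal sketch of how raw data points could be grouped into periods and aggregated. It is not how CloudWatch is implemented internally: the helper names (aggregate, bucket_by_period) are hypothetical, and the nearest-rank percentile is an assumption chosen for simplicity.

```python
import math
from statistics import mean


def aggregate(values, percentile=None):
    """Return CloudWatch-style statistics for the values of one period."""
    if not values:
        raise ValueError("no data points in this period")
    stats = {
        "SampleCount": len(values),
        "Sum": sum(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Average": mean(values),
    }
    if percentile is not None:
        # Nearest-rank percentile: an assumption made for this sketch only.
        ordered = sorted(values)
        rank = max(1, math.ceil(percentile / 100 * len(ordered)))
        stats[f"p{percentile}"] = ordered[rank - 1]
    return stats


def bucket_by_period(points, period_seconds=60):
    """Group (epoch_seconds, value) pairs into period-aligned buckets."""
    buckets = {}
    for ts, value in points:
        start = ts - ts % period_seconds
        buckets.setdefault(start, []).append(value)
    return buckets


# Hypothetical CPU readings as (timestamp, value) pairs.
points = [(0, 10.0), (20, 30.0), (59, 20.0), (60, 50.0), (90, 70.0)]
per_period = {
    start: aggregate(values, percentile=90)
    for start, values in sorted(bucket_by_period(points).items())
}
```

With a 60-second period, the first three readings fall into the bucket starting at 0 and the last two into the bucket starting at 60, so each bucket yields one data point per statistic.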

CloudWatch alarms

Period
Length of time over which the metric or expression is evaluated to produce each individual data point
Evaluation periods
Number of most recent data points to evaluate (e.g. 10)
Datapoints to alarm
How many data points within the evaluation periods must be breaching to trigger the ALARM state (e.g. 5)
Evaluation range
How many data points CloudWatch retrieves for alarm evaluation (can be greater than the evaluation periods)

Every datapoint can be NotBreaching, Breaching or Missing.

Missing data points can be treated as:
- missing -> not counted; if every data point in the evaluation range is missing, the alarm goes to INSUFFICIENT_DATA
- notBreaching -> as if it were within the threshold
- breaching -> as if it were breaching the threshold
- ignore -> the current alarm state is maintained
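The interaction between evaluation periods, datapoints to alarm, and missing-data treatment can be sketched as a small simulation. This is a simplified model under stated assumptions, not the real CloudWatch algorithm: it ignores the evaluation range, uses a "greater than threshold" comparison, and the function name evaluate_alarm is hypothetical. Missing data points are represented as None.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm,
                   treat_missing="missing", current_state="OK"):
    """Return the alarm state for the most recent `evaluation_periods` points."""
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for v in window if v is not None and v > threshold)
    missing = sum(1 for v in window if v is None)

    if treat_missing == "breaching":
        # Missing points count as if they had breached the threshold.
        breaching += missing
        missing = 0
    elif treat_missing == "notBreaching":
        # Missing points count as if they were within the threshold.
        missing = 0

    if breaching >= datapoints_to_alarm:
        return "ALARM"
    if treat_missing == "ignore" and missing > 0:
        # Simplification: any gap in the window holds the current state.
        return current_state
    if treat_missing == "missing" and missing == len(window):
        return "INSUFFICIENT_DATA"
    return "OK"


# Hypothetical CPU series: 3 of the last 5 points must exceed 90.
series = [50, None, 95, None, 60]
as_breaching = evaluate_alarm(series, 90, 5, 3, treat_missing="breaching")
as_not_breaching = evaluate_alarm(series, 90, 5, 3, treat_missing="notBreaching")
held = evaluate_alarm(series, 90, 5, 3, treat_missing="ignore", current_state="ALARM")
```

In this model the same series ends in ALARM when gaps are treated as breaching, in OK when they are treated as notBreaching, and keeps its previous state under ignore.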

Alarm Actions (remediation)

SNS topics (email/sms/lambda/etc)
EC2 actions (stop/reboot/terminate/recover)
Auto Scaling actions
- Reaction time of several minutes
- Truly elastic (increase/decrease)
Systems Manager (SSM) OpsItem

Composite alarms

A common pattern is to define several alarms with no actions, then a single composite alarm that performs the action. Its rule expression combines the states of the other alarms, for example: ALARM("ONE") OR ALARM("two")
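As an illustration of how such a rule expression resolves to true or false, here is a minimal evaluator sketch. It is a toy interpreter, not AWS code: the function names are hypothetical, the operator precedence (NOT, then AND, then OR) is the conventional one and is an assumption here, and an alarm name missing from the state map is assumed to be in INSUFFICIENT_DATA.

```python
import re

# One token: a state function such as ALARM("name"), or an operator/parenthesis.
TOKEN = re.compile(
    r'\s*(?:(ALARM|OK|INSUFFICIENT_DATA)\(\s*"([^"]+)"\s*\)'
    r'|(AND|OR|NOT|TRUE|FALSE|\(|\)))'
)


def tokenize(rule):
    pos, tokens = 0, []
    rule = rule.rstrip()
    while pos < len(rule):
        m = TOKEN.match(rule, pos)
        if not m:
            raise ValueError(f"unexpected text at position {pos}")
        tokens.append(("STATE", m.group(1), m.group(2)) if m.group(1) else (m.group(3),))
        pos = m.end()
    return tokens


def evaluate_rule(rule, states):
    """Evaluate a composite-alarm style rule against {alarm_name: state}."""
    tokens, pos = tokenize(rule), 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            take()
            rhs = parse_and()  # parse first so tokens are always consumed
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "NOT":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        if peek() is None:
            raise ValueError("unexpected end of rule")
        tok = take()
        if tok[0] == "STATE":
            # Assumption: unknown alarms are treated as INSUFFICIENT_DATA.
            return states.get(tok[2], "INSUFFICIENT_DATA") == tok[1]
        if tok[0] in ("TRUE", "FALSE"):
            return tok[0] == "TRUE"
        if tok[0] == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return value
        raise ValueError(f"unexpected token {tok[0]}")

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in rule")
    return result


states = {"ONE": "OK", "two": "ALARM"}
fires = evaluate_rule('ALARM("ONE") OR ALARM("two")', states)
```

Here the composite rule is satisfied because the alarm named two is in ALARM, even though ONE is OK.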

CloudWatch Agent

Enables you to do the following:

Collect internal system-level metrics from EC2 instances across operating systems:
- cpu_time_xxx
- disk_xxx
- mem_xxx
- net_xxx
- processes_xxx
Collect system-level metrics from on-premises servers
Retrieve custom metrics from your applications or services using:
- StatsD (supported on both Linux and Windows Server)
- collectd (supported only on Linux servers)
Collect logs from Amazon EC2 instances and on-premises servers, running either Linux or Windows Server

Can also be installed via SSM.
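The agent reads its settings from a JSON configuration file. The sketch below builds a small configuration as a Python dictionary and serializes it, covering system metrics, StatsD, and one log file. The file path, log group name, and chosen measurements are hypothetical placeholders; check the agent documentation for the full schema before relying on it.

```python
import json

# Hypothetical values: adjust the path, group name, and measurements to your setup.
agent_config = {
    "agent": {"metrics_collection_interval": 60},
    "metrics": {
        "metrics_collected": {
            "cpu": {"measurement": ["cpu_usage_idle", "cpu_usage_user"], "totalcpu": True},
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            # StatsD listener for custom application metrics.
            "statsd": {"service_address": ":8125"},
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/myapp/app.log",
                        "log_group_name": "myapp-logs",
                        "log_stream_name": "{instance_id}",
                    }
                ]
            }
        }
    },
}

# The JSON text is what would be written to the agent's configuration file.
config_json = json.dumps(agent_config, indent=2)
```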

CloudWatch metric filters

Match everything -> " "
Single term -> ERROR (NB: case sensitive)
Include/exclude terms -> ERROR -permissions
Multiple terms using AND -> ERROR memory exception
Multiple terms using OR -> ?ERROR ?WARN

Space delimited metric filter examples

[ip, id, user, timestamp, request, status_code = 4*, size]
-> Match all 4XX codes
[ip, id, user, timestamp, request, status_code, size > 1000]
-> Match response sizes > 1000 bytes
[ip, id, user, timestamp, request, status_code != 3*, size]
-> Ignore all redirect responses
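To show how these patterns behave, here is a rough local matcher for both the term patterns and the space-delimited patterns above. It only approximates CloudWatch's behavior: terms are checked as case-sensitive substrings, quoted multi-word phrases are not supported, wildcards are handled with fnmatch, and the function names and the sample access-log line are hypothetical.

```python
import operator
import re
from fnmatch import fnmatchcase

FIELD = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')
CONDITION = re.compile(r'^(\w+)\s*(=|!=|>=|<=|>|<)\s*(\S+)$')
NUMERIC = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def matches_terms(pattern, message):
    """Term patterns: plain terms are ANDed, ?terms are ORed, -terms exclude."""
    tokens = [t.replace('"', "") for t in pattern.split()]
    tokens = [t for t in tokens if t]
    required = [t for t in tokens if not t.startswith(("?", "-"))]
    optional = [t[1:] for t in tokens if t.startswith("?")]
    excluded = [t[1:] for t in tokens if t.startswith("-")]
    if any(term in message for term in excluded):
        return False
    if not all(term in message for term in required):
        return False
    return not optional or any(term in message for term in optional)


def split_fields(message):
    """Split on spaces, keeping [...] and "..." groups as single fields."""
    return [f.strip('[]"') for f in FIELD.findall(message)]


def matches_fields(pattern, message):
    """Space-delimited patterns such as [ip, ..., status_code = 4*, size]."""
    specs = [s.strip() for s in pattern.strip()[1:-1].split(",")]
    fields = split_fields(message)
    if len(specs) != len(fields):
        return False
    for spec, value in zip(specs, fields):
        m = CONDITION.match(spec)
        if not m:
            continue  # bare field name: no constraint on this field
        _, op, expected = m.groups()
        if op in ("=", "!="):
            if fnmatchcase(value, expected) != (op == "="):
                return False
        else:
            try:
                if not NUMERIC[op](float(value), float(expected)):
                    return False
            except ValueError:
                return False
    return True


# Hypothetical access-log line with seven space-delimited fields.
line = '127.0.0.1 - frank [10/Oct/2000:13:25:15 -0700] "GET /index.html HTTP/1.0" 404 1534'
is_4xx = matches_fields("[ip, id, user, timestamp, request, status_code = 4*, size]", line)
is_large = matches_fields("[ip, id, user, timestamp, request, status_code, size > 1000]", line)
```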

CloudWatch Logs Insights

Can be used to analyze your logs in seconds, with fast and interactive queries and visualizations.

Commands (AWS CLI, aws logs)

start-query / stop-query / get-query-results
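The same three operations exist in the SDKs: a query is started, polled until it finishes, and stopped if it takes too long. The sketch below shows that flow with a boto3-style client passed in as a parameter. The helper run_insights_query, the log group name, and the FakeLogsClient stand-in are hypothetical and only make the example runnable without AWS access; in real use the client would typically be boto3.client("logs").

```python
import time

FINAL_STATES = ("Complete", "Failed", "Cancelled", "Timeout")


def run_insights_query(logs_client, log_group, query, start_time, end_time,
                       poll_seconds=1.0, max_polls=30):
    """Start a Logs Insights query and poll until it reaches a final state."""
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_time,  # epoch seconds
        endTime=end_time,
        queryString=query,
    )
    query_id = started["queryId"]
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] in FINAL_STATES:
            return response
        time.sleep(poll_seconds)
    # Give up: cancel the query so it does not keep running.
    logs_client.stop_query(queryId=query_id)
    raise TimeoutError(f"query {query_id} did not finish in time")


class FakeLogsClient:
    """Stand-in for a real logs client; reports Complete on the third poll."""

    def __init__(self):
        self.polls = 0
        self.stopped = False
        self.request = None

    def start_query(self, **kwargs):
        self.request = kwargs
        return {"queryId": "fake-query-id"}

    def get_query_results(self, queryId):
        self.polls += 1
        if self.polls < 3:
            return {"status": "Running", "results": []}
        return {
            "status": "Complete",
            "results": [[{"field": "@message", "value": "ERROR boom"}]],
        }

    def stop_query(self, queryId):
        self.stopped = True
        return {"success": True}


client = FakeLogsClient()
response = run_insights_query(
    client,
    "/hypothetical/app",
    "fields @timestamp, @message | filter @message like /ERROR/ | limit 20",
    start_time=0,
    end_time=3600,
    poll_seconds=0,
)
```

Each result row is a list of field/value pairs, so the caller usually converts rows to dictionaries before further processing.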
