Amazon CloudWatch

CloudWatch is the monitoring service for AWS resources and the applications running on them: it collects metrics and logs, and drives alarms and events.

  • Regional scope
  • Fault tolerant
  • Durable
  • Push, not pull (resources and agents send data to CloudWatch; it does not poll them)

Main entities:

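  • Log groups (containers of log streams that share retention and access settings)
  • Log streams (sequences of log events from the same source)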

Functionalities

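  • Collects and tracks data points over time, called metrics. Standard resolution is 1 minute;
    EC2 basic monitoring publishes every 5 minutes (1 minute with detailed monitoring), and
    custom metrics can use high resolution down to 1 second.
  • A dimension is a name/value pair that helps identify a metric (up to 30 per metric).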
    (example: InstanceId=1-234567 / InstanceType=m1.large)
  • A statistic is an aggregation of metric values over time
    (sum, minimum, maximum, average, sampleCount, pNN.NN = percentile)
  • A namespace is a container for a collection of related metrics
    (example: AWS/EC2, which contains the CPUUtilization metric)
  • Enables you to create alarms, based on metrics, and send notifications to targets
  • Can trigger changes in capacity, based on rules that you set
  • An event describes a change in an AWS resource (CloudWatch Events, now Amazon EventBridge)
  • A rule matches incoming events and routes them to targets
  • A target processes events (Lambda, Kinesis, SNS topics, SQS queues…)
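
To make the metric / period / statistic vocabulary concrete, here is a minimal Python sketch that groups raw data points into periods and computes the basic statistics for each one. It is only an illustration of the idea, not how CloudWatch is implemented; the function name `aggregate` and the (timestamp, value) input shape are assumptions made for this example.

```python
from collections import defaultdict


def aggregate(datapoints, period_seconds=60):
    """Group (unix_timestamp, value) pairs into periods and compute statistics.

    Hypothetical helper for illustration only: it mimics the idea of a
    statistic being an aggregation of metric values over a period.
    """
    buckets = defaultdict(list)
    for timestamp, value in datapoints:
        # Align each data point to the start of its period.
        period_start = timestamp - timestamp % period_seconds
        buckets[period_start].append(value)

    return {
        start: {
            "SampleCount": len(values),
            "Sum": sum(values),
            "Minimum": min(values),
            "Maximum": max(values),
            "Average": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }
```

With a 60-second period, data points at t=0 and t=30 land in the same bucket and a point at t=60 starts a new one.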

CloudWatch alarms

  • Period
    Length of time over which the metric or expression is evaluated to produce one datapoint
  • Evaluation Periods
    Number of most recent datapoints to evaluate (e.g. 10)
  • Datapoints to Alarm
    How many datapoints within the Evaluation Periods must be breaching to trigger the ALARM state (e.g. 5)
  • Evaluation range
    How many datapoints CloudWatch retrieves for alarm evaluation (may be greater than Evaluation Periods)

Every datapoint can be NotBreaching, Breaching or Missing.

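The alarm's missing-data treatment (TreatMissingData) decides how Missing datapoints are handled:
- missing -> not considered; if every datapoint is missing, the alarm goes to INSUFFICIENT_DATA
- notBreaching -> treated as if it was within the threshold
- breaching -> treated as if it was breaching the threshold
- ignore -> the current alarm state is maintained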
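
The "M out of N" evaluation and the four missing-data treatments can be sketched in a few lines of Python. This is a deliberately simplified model, not CloudWatch's actual algorithm: it ignores the evaluation range, assumes a "greater than threshold" comparison, and the function name `evaluate_alarm` is invented for this example.

```python
def evaluate_alarm(datapoints, threshold, datapoints_to_alarm,
                   treat_missing="missing", current_state="OK"):
    """Return "ALARM", "OK" or "INSUFFICIENT_DATA" for one evaluation.

    datapoints: the most recent N values (N = Evaluation Periods);
                None stands for a Missing datapoint.
    Simplified sketch: the real service also looks at the evaluation range.
    """
    breaching = 0
    present = 0
    for value in datapoints:
        if value is None:
            # Only the "breaching" treatment counts a missing point as a breach.
            if treat_missing == "breaching":
                breaching += 1
            continue
        present += 1
        if value > threshold:
            breaching += 1

    if breaching >= datapoints_to_alarm:
        return "ALARM"
    if present == 0:
        if treat_missing == "missing":
            return "INSUFFICIENT_DATA"
        if treat_missing == "ignore":
            return current_state
    return "OK"
```

For example, with a threshold of 80 and 3 datapoints to alarm, the window [90, None, 95, 50, 99] contains three breaching values and evaluates to ALARM regardless of how the missing point is treated.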

Alarm Actions (remediation)

  • SNS topics (email/SMS/Lambda/etc.)
  • EC2 actions (stop/reboot/terminate/recover)
  • Auto Scaling actions
    • Reaction time of several minutes
    • Truly elastic (increase/decrease)
  • Systems Manager (SSM) OpsItem
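
As a sketch of how the alarm settings and an alarm action fit together, the following Python helper builds the parameters for the CloudWatch PutMetricAlarm API. The parameter names are the ones that API uses; the alarm name, the 80% threshold, the instance ID and the SNS topic ARN are placeholders chosen for this example, and `build_alarm_params` is a hypothetical helper.

```python
def build_alarm_params(instance_id, topic_arn):
    """Build keyword arguments for PutMetricAlarm (values are examples only)."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,                 # seconds per datapoint
        "EvaluationPeriods": 10,      # N: datapoints to evaluate
        "DatapointsToAlarm": 5,       # M: breaching datapoints needed
        "Threshold": 80.0,            # assumed threshold for this example
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic to notify
    }


# With boto3 installed and credentials configured, the dict could be used as:
#   boto3.client("cloudwatch").put_metric_alarm(**build_alarm_params(...))
```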

Composite alarms

A common pattern is to create several alarms without actions, then define a single composite alarm that performs the action. Its rule combines the states of the other alarms, with a syntax like: ALARM("one") OR ALARM("two")
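
A small Python sketch of building such a rule expression and the matching PutCompositeAlarm parameters. `alarm_rule` and `build_composite_params` are hypothetical helpers, and the alarm names and action ARN are placeholders; only the AlarmName, AlarmRule and AlarmActions parameter names come from the API.

```python
def alarm_rule(alarm_names, operator="OR"):
    """Join child alarm names into a composite alarm rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    # The rule syntax expects straight double quotes around alarm names.
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def build_composite_params(name, child_alarms, action_arn):
    """Build keyword arguments for PutCompositeAlarm (values are examples)."""
    return {
        "AlarmName": name,
        "AlarmRule": alarm_rule(child_alarms),
        "AlarmActions": [action_arn],
    }


# With boto3 available this could be passed as:
#   boto3.client("cloudwatch").put_composite_alarm(**build_composite_params(...))
```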

CloudWatch Agent

Enables you to do the following:

  • Collect internal system-level metrics from EC2 instances across operating systems:
    • cpu_time_xxx
    • disk_xxx
    • mem_xxx
    • net_xxx
    • processes_xxx
  • Collect system-level metrics from on-premises servers
  • Retrieve custom metrics from your applications or services using:
    • StatsD (supported on both Linux and Windows Server) 
    • collectd (supported only on Linux servers)
  • Collect logs from Amazon EC2 instances and on-premises servers, running either Linux or Windows Server

The agent can be installed manually or via Systems Manager (SSM).
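
The agent is driven by a JSON configuration file. The Python sketch below generates a minimal one covering system metrics, StatsD and one log file. The field names reflect the agent's JSON schema as I understand it and should be checked against the agent configuration reference; the file path, log group name and the helper name `agent_config` are assumptions for this example.

```python
import json


def agent_config(log_file, log_group):
    """Return a minimal CloudWatch agent configuration as a dict (sketch)."""
    return {
        "agent": {"metrics_collection_interval": 60},
        "metrics": {
            "metrics_collected": {
                "cpu": {"measurement": ["cpu_usage_idle"], "totalcpu": True},
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["*"]},
                # StatsD listener for custom application metrics.
                "statsd": {"service_address": ":8125"},
            }
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file,
                            "log_group_name": log_group,
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


def agent_config_json(log_file, log_group):
    """Serialize the configuration so it can be written to the agent's file."""
    return json.dumps(agent_config(log_file, log_group), indent=2)
```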

CloudWatch metric filters

  • Match everything -> " "
  • Single term -> ERROR (NB: case sensitive)
  • Include/exclude terms -> ERROR -permissions
  • Multiple terms using AND -> ERROR memory exception
    (in double quotes, "ERROR memory exception" matches only that exact phrase)
  • Multiple terms using OR -> ?ERROR ?WARN
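
The term semantics above can be approximated in Python to get a feel for which log lines a pattern selects. This is a rough sketch, not the CloudWatch matcher: it uses plain substring checks, does not handle quoted phrases, and does not model how the service treats patterns that mix ?terms with required terms. The function name `matches` is invented for this example.

```python
def matches(pattern, message):
    """Approximate an unstructured metric filter pattern (case sensitive)."""
    # An empty or blank-quoted pattern matches every log event.
    if pattern.strip() in ("", '""', '" "'):
        return True

    terms = pattern.split()
    any_of = [t[1:] for t in terms if t.startswith("?")]     # OR terms
    excluded = [t[1:] for t in terms if t.startswith("-")]   # excluded terms
    required = [t for t in terms if not t.startswith(("?", "-"))]  # AND terms

    if any(term in message for term in excluded):
        return False
    if not all(term in message for term in required):
        return False
    # If OR terms are present, at least one of them must appear.
    return not any_of or any(term in message for term in any_of)
```

For instance, the pattern `ERROR -permissions` accepts "ERROR disk full" but rejects "ERROR permissions denied".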

Space delimited metric filter examples

  • [ip, id, user, timestamp, request, status_code = 4*, size]
    -> Match all 4XX codes
  • [ip, id, user, timestamp, request, status_code, size > 1000]
    -> Match response sizes > 1000 bytes
  • [ip, id, user, timestamp, request, status_code != 3*, size]
    -> Ignore all redirect responses
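
To see what these space-delimited patterns select, here is a Python sketch that splits an access-log line into the same seven named fields and applies the three conditions. It only imitates the behavior for illustration; the tokenizer regex, the field handling and the function names are assumptions of this example, not CloudWatch internals.

```python
import re
from fnmatch import fnmatchcase

FIELDS = ["ip", "id", "user", "timestamp", "request", "status_code", "size"]
# A field is a [bracketed] block, a "quoted" string, or a run of non-spaces.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def parse(line):
    """Split a log line into named fields, or return None if the count is off."""
    tokens = TOKEN.findall(line)
    if len(tokens) != len(FIELDS):
        return None
    return dict(zip(FIELDS, tokens))


def is_client_error(line):
    """Mimics: status_code = 4*"""
    fields = parse(line)
    return fields is not None and fnmatchcase(fields["status_code"], "4*")


def is_large_response(line):
    """Mimics: size > 1000"""
    fields = parse(line)
    return (fields is not None and fields["size"].isdigit()
            and int(fields["size"]) > 1000)


def is_not_redirect(line):
    """Mimics: status_code != 3*"""
    fields = parse(line)
    return fields is not None and not fnmatchcase(fields["status_code"], "3*")
```

A line such as `127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 404 2326` yields seven fields, with status_code 404 and size 2326.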

CloudWatch Logs Insights

Lets you search and analyze log groups interactively, using a purpose-built query language, and visualize the results.

Commands

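aws logs start-query / stop-query / get-query-results

Queries run asynchronously: start-query returns a query ID, which get-query-results and stop-query take as input.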
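
The same start / poll / stop flow can be sketched in Python. The function below expects a client object exposing `start_query`, `get_query_results` and `stop_query` with the keyword arguments of the boto3 CloudWatch Logs client (for example `boto3.client("logs")`); the helper name `run_query`, the polling interval and the poll limit are assumptions made for this example.

```python
import time

TERMINAL_STATUSES = {"Complete", "Failed", "Cancelled", "Timeout"}


def run_query(logs_client, log_group, query, start, end,
              poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query, poll until it finishes, return the rows.

    start / end are epoch seconds. Each returned row is a dict of
    field name -> value.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
    else:
        # Give up: cancel the query so it does not keep running.
        logs_client.stop_query(queryId=query_id)
        raise TimeoutError(f"query {query_id} did not finish in time")

    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")

    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]
```

A typical query string would be something like `fields @timestamp, @message | sort @timestamp desc | limit 20`.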
