Amazon CloudWatch
CloudWatch is a robust monitoring solution for AWS resources.
- Region scope
- Fault tolerant
- Durable
- Push, not pull
Main entities:
- Log groups
- Log streams
Functionalities
- Collects and tracks data points in time, called metrics. Standard-resolution metrics have 1-minute granularity; EC2 basic monitoring publishes every 5 minutes
- A dimension is a name/value pair that helps identify a metric (max 30 per metric; older material lists 10)
  (example: InstanceId=1-234567 / InstanceType=m1.large)
- A statistic is an aggregation of metric values over time
  (sum, minimum, maximum, average, sampleCount, pNN.NN = percentile)
- A namespace is a container for a collection of related metrics
  (example: AWS/EC2, which contains the CPUUtilization metric)
- Enables you to create alarms, based on metrics, and send notifications to targets
- Can trigger changes in capacity, based on rules that you set
- An Event describes changes in AWS resources
- A Rule matches incoming events and routes them to targets
- A Target processes events (Lambda, Kinesis, SNS topics, SQS queues…)
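The Event / Rule / Target relationship can be sketched in a few lines of plain Python. This is only an illustrative model, not the service's implementation: `matches_pattern` and `route` are hypothetical helper names, targets are modelled as plain callables, and only the simplest pattern form (a list of accepted values per field, with nested objects) is handled.

```python
def matches_pattern(pattern, event):
    """Return True if every key in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches_pattern(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of accepted values.
            return False
    return True


def route(event, rules):
    """Deliver the event to the targets of every matching rule; return the delivery count."""
    delivered = 0
    for pattern, targets in rules:
        if matches_pattern(pattern, event):
            for target in targets:
                target(event)
                delivered += 1
    return delivered


# Example event shaped like an EC2 state-change notification (values are made up).
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

received = []  # stands in for a Lambda function, SNS topic, SQS queue...
rules = [
    ({"source": ["aws.ec2"], "detail": {"state": ["stopped", "terminated"]}}, [received.append]),
    ({"source": ["aws.s3"]}, [received.append]),
]
count = route(event, rules)
```

Only the first rule matches here, so the event is delivered once.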
CloudWatch alarms
- Period
  Length of time over which the metric or expression is evaluated to create an individual data point
- Evaluation periods
  Number of most recent data points to evaluate (e.g. 10)
- Datapoints to alarm
  How many data points within the evaluation periods must be breaching to cause the ALARM state (e.g. 5)
- Evaluation range
  How many data points CloudWatch retrieves for alarm evaluation (can be greater than the evaluation periods)
Every datapoint can be NotBreaching, Breaching or Missing.
Missing data points can be treated as:
- missing -> not considered
- notBreaching -> as if it were within the threshold
- breaching -> as if it were breaching the threshold
- ignore -> the current alarm state is maintained
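The interaction between "datapoints to alarm" and the missing-data treatment can be illustrated with a deliberately simplified model. It assumes a greater-than-threshold comparison, represents a missing data point as `None`, and does not model the evaluation range look-back that the real service performs, so its results can differ from CloudWatch in edge cases. `evaluate_alarm` is a hypothetical name.

```python
def evaluate_alarm(datapoints, threshold, datapoints_to_alarm,
                   treat_missing="missing", current_state="OK"):
    """Simplified M-out-of-N evaluation; `datapoints` holds the last N values (None = missing)."""
    breaching = 0
    evaluated = 0
    for value in datapoints:
        if value is None:
            if treat_missing == "breaching":
                breaching += 1
                evaluated += 1
            elif treat_missing == "notBreaching":
                evaluated += 1
            # "missing" and "ignore": the data point is skipped.
            continue
        evaluated += 1
        if value > threshold:
            breaching += 1

    if breaching >= datapoints_to_alarm:
        return "ALARM"
    if evaluated == 0:
        # Nothing to evaluate: "ignore" keeps the current state.
        return current_state if treat_missing == "ignore" else "INSUFFICIENT_DATA"
    return "OK"


# CPU utilisation over 5 periods, alarm when 3 out of 5 exceed 80.
cpu = [90, None, 95, 20, 99]
state = evaluate_alarm(cpu, threshold=80, datapoints_to_alarm=3)
```

With the sample above, three real data points exceed the threshold, so the model reports ALARM regardless of how the missing point is treated.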
Alarm Actions (remediation)
- SNS topics (email/sms/lambda/etc)
- EC2 actions (stop/reboot/terminate/recover)
- Auto Scaling actions
  - Reaction time of several minutes
  - Truly elastic (increase/decrease)
- Systems Manager (SSM) OpsItem
Composite alarms
A common pattern is to define several alarms with no actions of their own, and then a single composite alarm that performs the action by evaluating a rule expression such as: ALARM("one") OR ALARM("two")
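As a small sketch of how such a rule expression could be assembled and reasoned about, the helpers below build the string from a list of alarm names and evaluate it against a dictionary of child alarm states. Both function names are hypothetical, and only a flat AND/OR of ALARM() terms is covered; the resulting string has the form that would be passed as the alarm rule when creating a composite alarm.

```python
def composite_rule(alarm_names, operator="OR"):
    """Build a flat rule expression such as ALARM("one") OR ALARM("two")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_in_alarm(states, alarm_names, operator="OR"):
    """Evaluate the same flat rule against a mapping of alarm name -> state."""
    checks = [states.get(name) == "ALARM" for name in alarm_names]
    return all(checks) if operator == "AND" else any(checks)


rule = composite_rule(["one", "two"])
fires = composite_in_alarm({"one": "OK", "two": "ALARM"}, ["one", "two"])
```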
CloudWatch Agent
Enables you to do the following:
- Collect internal system-level metrics from EC2 instances across operating systems:
  - cpu_time_xxx
  - disk_xxx
  - mem_xxx
  - net_xxx
  - processes_xxx
- Collect system-level metrics from on-premises servers
- Retrieve custom metrics from your applications or services using:
  - StatsD (supported on both Linux and Windows Server)
  - collectd (supported only on Linux servers)
- Collect logs from Amazon EC2 instances and on-premises servers, running either Linux or Windows Server
Can also be installed via SSM.
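To make the StatsD path concrete, here is a minimal sketch of an application emitting a custom metric over UDP in the StatsD line format (`name:value|type`). It assumes the agent has its StatsD listener enabled and reachable on 127.0.0.1:8125; the metric name and both helper functions are hypothetical.

```python
import socket


def statsd_line(name, value, metric_type="c"):
    """Format one StatsD datagram: c = counter, g = gauge, ms = timer."""
    return f"{name}:{value}|{metric_type}"


def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Send a single metric over UDP (fire-and-forget, no delivery guarantee)."""
    payload = statsd_line(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


line = statsd_line("checkout.orders", 1)          # counter increment
gauge = statsd_line("checkout.queue_depth", 42, "g")
```

Calling `send_metric("checkout.orders", 1)` from the application would then hand the value to the agent, which forwards it to CloudWatch as a custom metric.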
CloudWatch metric filters
- Match everything -> " "
- Single term -> ERROR (NB: case sensitive)
- Include/exclude terms -> ERROR -permissions
- Multiple terms using AND -> ERROR memory exception
- Multiple terms using OR -> ?ERROR ?WARN
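The term patterns above can be approximated locally with a short matcher, which is handy for checking a pattern against sample log lines before creating a filter. This is a simplified sketch, not the service's matcher: `matches_terms` is a hypothetical name, terms are tested as case-sensitive substrings (an assumption), and mixing `?` terms with plain terms is not modelled faithfully.

```python
import shlex


def matches_terms(pattern, message):
    """Approximate an unstructured metric filter pattern against one log message."""
    # shlex keeps a quoted phrase together as a single term.
    tokens = [t for t in shlex.split(pattern) if t.strip()]
    if not tokens:
        return True  # blank pattern such as " " matches everything

    optional = [t[1:] for t in tokens if t.startswith("?")]
    excluded = [t[1:] for t in tokens if t.startswith("-")]
    required = [t for t in tokens if not t.startswith(("?", "-"))]

    if any(term in message for term in excluded):
        return False
    if not all(term in message for term in required):
        return False
    if optional and not any(term in message for term in optional):
        return False
    return True


hit = matches_terms("ERROR -permissions", "ERROR: out of memory exception")
miss = matches_terms("ERROR -permissions", "ERROR: permissions denied")
```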
Space delimited metric filter examples
- [ip, id, user, timestamp, request, status_code = 4*, size]
  -> Match all 4XX codes
- [ip, id, user, timestamp, request, status_code, size > 1000]
  -> Match response sizes > 1000 bytes
- [ip, id, user, timestamp, request, status_code != 3*, size]
  -> Ignore all redirect responses
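The space-delimited examples can be tried out against a sample access-log line with the rough emulation below. It is a sketch under assumptions: `matches_fields` is a hypothetical helper, fields are split on spaces with `[...]` and `"..."` kept as single fields, `=` and `!=` use shell-style wildcards, and numeric comparisons convert both sides to floats. The log line is invented for illustration.

```python
import re
from fnmatch import fnmatchcase

FIELD_RE = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')
COND_RE = re.compile(r'^(\w+)\s*(?:(!=|>=|<=|=|>|<)\s*(\S+))?$')


def matches_fields(pattern, line):
    """Check one log line against a pattern like [ip, ..., status_code = 4*, size]."""
    specs = [s.strip() for s in pattern.strip()[1:-1].split(",")]
    fields = FIELD_RE.findall(line)
    if len(fields) != len(specs):
        return False
    for spec, value in zip(specs, fields):
        parsed = COND_RE.match(spec)
        if parsed is None:
            raise ValueError(f"unsupported field spec: {spec!r}")
        _name, op, expected = parsed.groups()
        if op is None:
            continue  # field is only named, no condition
        if op in ("=", "!="):
            if fnmatchcase(value, expected) != (op == "="):
                return False
            continue
        try:
            number, limit = float(value), float(expected)
        except ValueError:
            return False  # non-numeric field cannot satisfy a numeric condition
        ok = {">": number > limit, "<": number < limit,
              ">=": number >= limit, "<=": number <= limit}[op]
        if not ok:
            return False
    return True


line = '127.0.0.1 - frank [10/Oct/2000:13:25:15 -0700] "GET /index.html HTTP/1.0" 404 1534'
is_4xx = matches_fields("[ip, id, user, timestamp, request, status_code = 4*, size]", line)
```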
CloudWatch Logs Insights
Can be used to analyze your logs in seconds, with fast and interactive queries and visualizations.
CLI commands (aws logs)
- start-query / stop-query
- get-query-results
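The start-query / get-query-results flow is asynchronous: a query is started, then polled until it reaches a terminal status. The sketch below shows that loop against any client object exposing `start_query` and `get_query_results` with the parameter names used by the CloudWatch Logs API. To keep it runnable without AWS access, it uses a stand-in `FakeLogsClient`; that class, the log group name and the sample result are all invented for illustration.

```python
import time

TERMINAL_STATUSES = {"Complete", "Failed", "Cancelled", "Timeout"}

QUERY = "fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20"


def run_query(client, log_group, query, start, end, poll_seconds=1.0):
    """Start a Logs Insights query and poll until it finishes; times are epoch seconds."""
    query_id = client.start_query(
        logGroupName=log_group, startTime=start, endTime=end, queryString=query,
    )["queryId"]
    while True:
        response = client.get_query_results(queryId=query_id)
        if response["status"] in TERMINAL_STATUSES:
            # Each result row is a list of {"field": ..., "value": ...} cells.
            rows = [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
            return response["status"], rows
        time.sleep(poll_seconds)


class FakeLogsClient:
    """Stand-in for a real logs client: reports Running once, then Complete."""

    def __init__(self):
        self.polls = 0

    def start_query(self, **kwargs):
        return {"queryId": "q-1"}

    def get_query_results(self, queryId):
        self.polls += 1
        if self.polls < 2:
            return {"status": "Running", "results": []}
        return {"status": "Complete",
                "results": [[{"field": "@message", "value": "ERROR disk full"}]]}


status, rows = run_query(FakeLogsClient(), "/app/demo", QUERY, 0, 60, poll_seconds=0)
```

With a real SDK client in place of the stand-in, the same function would return the matching log events once the query completes.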