A Way to Terminate Idle AWS Resources

Mar 21, 2021

Interested in how to terminate idle EMR clusters in an AWS account using a SignalFx integration? Read on. By the end of this post, you will have a better understanding of how to make that happen.

Background

Most applications are built on cloud infrastructure and containers, along with third-party tools. Monitoring cloud-based applications requires analyzing large amounts of data from hundreds of different sources. Within cloud operations, determining which alerting system best fits the platform’s architecture can be a challenging and complex problem. One of the most important aspects of an alerting system is controlling the number of messages sent for a particular alert, because alert fatigue leads people to lose confidence in the system. Traditional approaches to alerting in cloud-based environments were not flexible enough to meet our needs. SignalFx provides several capabilities that allow us to solve complex real-time monitoring and alerting issues.

As technology evolves, observability tools and processes must evolve as well. We use SignalFx as our observability solution. Based on our use case, we automated the creation of dashboards and detectors using the signal-analog library. Our team offers AWS Elastic MapReduce (EMR) backed Software-as-a-Service (SaaS) to its customers. Customers often forget to terminate their EMR clusters, which leaves the clusters running idle and adds unnecessary cost.

To reduce the costs incurred by idle EMR clusters, we implemented closed-loop automation with SignalFx that terminates idle EMR clusters. The integration between SignalFx and Amazon EventBridge uses real-time issue detection to perform remediation actions when an event is triggered. This integration lets the team operate applications and infrastructure with closed-loop automation, which improves SLA compliance and reduces the time needed to respond to and resolve an issue. Closed-loop automation also reduces how often on-call staff are pulled away from their current sprint tasks to fix production issues.

We estimate that the daily cost of idle EMR clusters ranges from around 40 dollars to more than 700 dollars. Every day, we spin up nearly 2000 EMR clusters in the AWS account. With the closed-loop SignalFx automation approach, we are cutting costs by an estimated 8% month-over-month and, if this trend continues, dropping our average annual cost by 2%.

In this post, we will go through how we monitor idle AWS EMR clusters and terminate them based on the SLA (Service-Level Agreement) set by the team.

The Architecture of Closed-Loop Automation

Figure: Architecture of the idle EMR termination mechanism.

Role of Each Component

Amazon Elastic MapReduce

Amazon EMR is a big data platform that processes petabytes of data, performs heavy transformations and joins, and runs Spark jobs.
When an AWS EMR cluster is created, it executes several steps. Steps are a set of actions that the cluster needs to perform. One of these steps installs a script and schedules it to run every five minutes through a cron expression.
Whenever an EMR cluster is active and executing jobs, a custom metric called emr-running is set to 1, whereas for inactive clusters, emr-running is set to 0.
SSH, Yarn, Zeppelin, and Presto metrics are taken into consideration when deciding on the value for the emr-running metric.
The SSH metric checks for any active SSH connections to the master node.
Yarn is the resource manager of the Hadoop ecosystem. The Yarn metric checks for any running jobs in Yarn.
Interactive data exploration can be performed using Apache Zeppelin. The Zeppelin metric checks for any running jobs in the Zeppelin notebook.
Presto is a distributed query engine that runs interactive queries. The Presto metric checks for any running Presto queries.
If none of the SSH, Yarn, Zeppelin, and Presto checks finds activity, we set the emr-running metric to 0. We push the emr-running, SSH-running, Yarn-running, Zeppelin-running, and Presto-running custom metrics, which are set on the AWS EMR cluster, to SignalFx.
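The way emr-running is derived from the four checks can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the script that runs on our clusters: the `ActivityChecks` class and both function names are hypothetical, and how each individual check is performed on the master node is left out.

```python
from dataclasses import dataclass


@dataclass
class ActivityChecks:
    """Results of the four activity checks (1 = activity found, 0 = idle)."""
    ssh_running: int
    yarn_running: int
    zeppelin_running: int
    presto_running: int


def compute_emr_running(checks: ActivityChecks) -> int:
    """Return 1 if any check found activity, otherwise 0."""
    signals = (
        checks.ssh_running,
        checks.yarn_running,
        checks.zeppelin_running,
        checks.presto_running,
    )
    return 1 if any(signals) else 0


def build_metrics(checks: ActivityChecks) -> dict:
    """Collect the five custom metrics named in this post."""
    return {
        "emr-running": compute_emr_running(checks),
        "SSH-running": checks.ssh_running,
        "Yarn-running": checks.yarn_running,
        "Zeppelin-running": checks.zeppelin_running,
        "Presto-running": checks.presto_running,
    }
```

A script like this would be the one scheduled by the five-minute cron entry, for example with the expression `*/5 * * * *`.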

AWS CloudWatch Metrics and SignalFx API

Metrics on the performance of systems across AWS can be accessed from CloudWatch.
Apart from the default metrics, you can also publish custom metrics to CloudWatch using the AWS CLI or API.
In our first approach, we used the AWS API to push custom metrics to CloudWatch.
However, the CloudWatch service was throttled while SignalFx was syncing CloudWatch metrics from AWS, so we took an alternate route and started pushing our custom metrics directly to SignalFx via the SignalFx API.
Alongside these custom metrics, the default AWS CloudWatch metrics still reach SignalFx through SignalFx’s integration with AWS.
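Pushing a custom metric straight to SignalFx could look roughly like the sketch below. It assumes the public SignalFx ingest endpoint (`/v2/datapoint`, authenticated with the `X-SF-Token` header) as we understand it, so check the current API documentation before relying on it. The `cluster_id` dimension name, the default realm `us0`, and both function names are assumptions made for illustration.

```python
import json
import urllib.request


def build_datapoint_payload(metrics: dict, cluster_id: str) -> dict:
    """Wrap metric name/value pairs as gauge datapoints for one cluster."""
    return {
        "gauge": [
            {
                "metric": name,
                "value": value,
                # Hypothetical dimension name used to tell clusters apart.
                "dimensions": {"cluster_id": cluster_id},
            }
            for name, value in metrics.items()
        ]
    }


def send_datapoints(payload: dict, token: str, realm: str = "us0") -> int:
    """POST the payload to the SignalFx ingest API and return the HTTP status."""
    url = f"https://ingest.{realm}.signalfx.com/v2/datapoint"
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-SF-Token": token},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Sending every metric for a cluster in a single request keeps the number of API calls per five-minute run small.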

SignalFx Metrics Finder

Using the SignalFx Metrics Finder, you can find both the metrics sent by CloudWatch and the custom metrics sent through the SignalFx API.
This helps validate that data is reaching SignalFx before you move on to building dashboards, charts, and detectors.

SignalFx Detectors

A SignalFx Detector is an alerting mechanism that can trigger a wide range of actions, from sending alerts to integrating with systems like Amazon EventBridge, Slack, and PagerDuty.
SignalFx Detectors perform actions based on the alerts they raise. Based on SLAs, different thresholds can be set, each with its own alert and follow-up steps. In our scenario, when emr-running has reported a value of 0 for 15 minutes, the detector sends a warning alert to a Slack channel. When emr-running has reported a value of 0 for 60 minutes, the detector sends an event to Amazon EventBridge and an EMR termination alert to a Slack channel.
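The two thresholds can be illustrated with a small plain-Python simulation. This is not SignalFlow and not how the detector is actually defined. It only mirrors the 15-minute and 60-minute rules, assuming one emr-running sample every five minutes with the newest sample last. The function name and the action labels are hypothetical.

```python
def decide_action(samples: list, interval_minutes: int = 5) -> str:
    """Map the trailing run of zero samples to 'none', 'warn', or 'terminate'."""
    trailing_zeros = 0
    # Walk backwards from the newest sample until activity is seen.
    for value in reversed(samples):
        if value != 0:
            break
        trailing_zeros += 1

    idle_minutes = trailing_zeros * interval_minutes
    if idle_minutes >= 60:
        return "terminate"  # event to EventBridge plus a termination alert to Slack
    if idle_minutes >= 15:
        return "warn"  # warning alert to Slack
    return "none"
```

For example, `decide_action([1, 0, 0, 0])` returns `"warn"`, and twelve consecutive zero samples return `"terminate"`.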

Amazon EventBridge

Amazon EventBridge delivers a stream of real-time events from SaaS applications, like SignalFx, and it routes that data to targets like AWS Lambda.
Event-driven applications can be built using EventBridge, as it takes care of security, authorization, event ingestion and delivery, and handling errors. We are using EventBridge to send SignalFx Detector events to Lambda to handle the termination of underutilized EMR clusters.

AWS Lambda

AWS Lambda can run code for any type of application. Lambda executes arbitrary code with a minimal amount of infrastructure setup time.
We run Lambda functions written in Python to terminate AWS EMR clusters. Rules are configured in the SignalFx Detector based on the SLAs defined by our team.
Whenever a rule tied to a particular SLA fires, the SignalFx Detector sends an event to EventBridge.
Once the event arrives from EventBridge, a Lambda function written in Python, which has access to the EMR clusters and to RDS, terminates the cluster and writes a record to RDS for audit purposes, noting who spun up the cluster and when it was terminated.
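A minimal sketch of such a Lambda function is shown below. The shape of the incoming event (`detail.dimensions.cluster_id`) is an assumption, since the real SignalFx event layout is not covered in this post, and the RDS write is replaced by an injected `write_audit_record` callable. Looking up who spun up the cluster is omitted. The only AWS call used is boto3's EMR `terminate_job_flows`.

```python
from datetime import datetime, timezone


def extract_cluster_id(event: dict) -> str:
    """Read the cluster ID from the event (assumed, hypothetical layout)."""
    return event["detail"]["dimensions"]["cluster_id"]


def terminate_idle_cluster(event: dict, emr_client, write_audit_record) -> dict:
    """Terminate the cluster named in the event and record the termination."""
    cluster_id = extract_cluster_id(event)
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])

    record = {
        "cluster_id": cluster_id,
        "terminated_at": datetime.now(timezone.utc).isoformat(),
    }
    # In our setup this would be an insert into RDS; here it is any callable.
    write_audit_record(record)
    return record


def lambda_handler(event, context):
    """Lambda entry point wiring in the real EMR client."""
    # Imported here so the helpers above can be exercised without AWS access.
    import boto3

    emr_client = boto3.client("emr")
    # print stands in for the RDS audit writer in this sketch.
    return terminate_idle_cluster(event, emr_client, write_audit_record=print)
```

Passing the EMR client and the audit writer in as arguments keeps the termination logic easy to test with simple stand-ins.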