How FINRA established real-time operational observability for Amazon EMR big data workloads on Amazon EC2 with Prometheus and Grafana


This is a guest post by FINRA (Financial Industry Regulatory Authority). FINRA is dedicated to protecting investors and safeguarding market integrity in a manner that facilitates vibrant capital markets.

FINRA performs big data processing with large volumes of data and workloads with varying instance sizes and types on Amazon EMR. Amazon EMR is a cloud-based big data environment designed to process large amounts of data using open source tools such as Hadoop, Spark, HBase, Flink, Hudi, and Presto.

Monitoring EMR clusters is essential for detecting critical issues with applications, infrastructure, or data in real time. A well-tuned monitoring system helps quickly identify root causes, automate bug fixes, minimize manual actions, and increase productivity. Additionally, observing cluster performance and usage over time helps operations and engineering teams find potential performance bottlenecks and optimization opportunities to scale their clusters, thereby reducing manual actions and improving compliance with service level agreements.

In this post, we talk about our challenges and show how we built an observability framework to provide operational metrics insights for big data processing workloads on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) clusters.

Challenge

In today’s data-driven world, organizations strive to extract valuable insights from large amounts of data. The challenge we faced was finding an efficient way to monitor and observe big data workloads on Amazon EMR due to its complexity. Monitoring and observability for Amazon EMR solutions come with various challenges:

  • Complexity and scale – EMR clusters often process massive volumes of data across numerous nodes. Monitoring such a complex, distributed system requires handling high data throughput and achieving minimal performance impact. Managing and interpreting the large volume of monitoring data generated by EMR clusters can be overwhelming, making it difficult to identify and troubleshoot issues in a timely manner.
  • Dynamic environments – EMR clusters are often ephemeral, created and shut down based on workload demands. This dynamism makes it challenging to consistently monitor, collect metrics, and maintain observability over time.
  • Data variety – Monitoring cluster health and having visibility into clusters to detect bottlenecks, unexpected behavior during processing, data skew, and job performance issues is crucial. Detailed observability into long-running clusters, nodes, tasks, stuck tasks, performance issues, and job-level metrics (such as Spark and JVM) is equally critical. Achieving comprehensive observability across these varied data types was difficult.
  • Resource utilization – EMR clusters consist of various components and services working together, making it challenging to effectively monitor all aspects of the system. Monitoring resource utilization (CPU, memory, disk I/O) across multiple nodes to prevent bottlenecks and inefficiencies is essential but complex, especially in a distributed environment.
  • Latency and performance metrics – Capturing and analyzing latency and comprehensive performance metrics in real time to identify and resolve issues promptly is critical, but it’s challenging due to the distributed nature of Amazon EMR.
  • Centralized observability dashboards – Having a single pane of glass for all aspects of EMR cluster metrics, including cluster health, resource utilization, job execution, logs, and security, in order to provide a complete picture of the system’s performance and health, was a challenge.
  • Alerting and incident management – Setting up effective centralized alerting and notification systems was challenging. Configuring alerts for critical events or performance thresholds requires careful consideration to avoid alert fatigue while making sure important issues are addressed promptly. Without a proper alerting mechanism in place, detecting and remediating performance slowdowns or disruptions takes significant time and effort.
  • Cost management – Lastly, optimizing costs while maintaining effective monitoring is an ongoing challenge. Balancing the need for comprehensive monitoring with cost constraints requires careful planning and optimization strategies to avoid unnecessary expenses while still providing adequate monitoring coverage.

Effective observability for Amazon EMR requires a combination of the right tools, practices, and strategies to address these challenges and provide reliable, efficient, and cost-effective big data processing.

The Ganglia system on Amazon EMR is designed to monitor the health of the entire cluster and each of its nodes, exposing metrics for Hadoop, Spark, and the JVM. Viewing the Ganglia web UI in a browser gives an overview of the EMR cluster’s performance, with graphs detailing the cluster’s load, memory usage, CPU utilization, and network traffic. However, with AWS announcing Ganglia’s deprecation in newer Amazon EMR releases, it became important for FINRA to build this solution.

Solution overview

Insights drawn from the post Monitor and Optimize Analytic Workloads on Amazon EMR with Prometheus and Grafana inspired our approach. The post demonstrated how to set up a monitoring system using Amazon Managed Service for Prometheus and Amazon Managed Grafana to effectively monitor an EMR cluster and use Grafana dashboards to view metrics to troubleshoot and optimize performance issues.

Based on these insights, we completed a successful proof of concept. Next, we built our enterprise central monitoring solution with Managed Prometheus and Managed Grafana to mimic Ganglia-like metrics at FINRA. Managed Prometheus allows for real-time high-volume data collection, which scales the ingestion, storage, and querying of operational metrics as workloads increase or decrease. These metrics are fed to the Managed Grafana workspace for visualizations.

Our solution includes a data ingestion layer for every cluster, with metrics collection configured through a custom-built script stored in Amazon Simple Storage Service (Amazon S3). A bootstrap script installs Prometheus on the EMR cluster’s EC2 instances at startup. Additionally, application-specific tags are defined in the configuration file to scope which metrics are included and collected.
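The following is a minimal sketch of launching an EMR cluster with such a bootstrap action using boto3. The bucket name, script path, instance types, workspace ID, and tag values are illustrative placeholders, not FINRA’s actual configuration.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # Bootstrap action that installs the Prometheus agent and exporters on
    # every node at startup; the script itself lives in S3 (hypothetical path).
    BootstrapActions=[
        {
            "Name": "install-prometheus-agents",
            "ScriptBootstrapAction": {
                "Path": "s3://example-bucket/bootstrap/install_prometheus.sh",
                "Args": ["--remote-write-workspace", "example-workspace-id"],
            },
        }
    ],
    # Application-specific tags referenced by the metrics configuration to
    # scope which metrics are collected and how they are labeled.
    Tags=[{"Key": "Application", "Value": "example-app"}],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```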

After Prometheus (installed on the EMR clusters) collects the metrics, they are sent via remote write to a Managed Prometheus workspace. A Managed Prometheus workspace is a logical, isolated environment dedicated to storing and querying a specific set of metrics. It also provides access control for authorizing who or what can send metrics to and read metrics from that workspace. You can create one or more workspaces per account or per application depending on your needs, which facilitates better management.
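As a hedged illustration, a per-application workspace can be created with boto3 and its remote-write endpoint retrieved for the on-cluster Prometheus agents; the alias and tag values here are hypothetical.

```python
import boto3

amp = boto3.client("amp", region_name="us-east-1")

# Create a workspace dedicated to one application's metrics (illustrative alias).
workspace = amp.create_workspace(
    alias="example-app-metrics",
    tags={"Application": "example-app", "Environment": "dev"},
)

# Look up the Prometheus endpoint that the agents remote-write to.
details = amp.describe_workspace(workspaceId=workspace["workspaceId"])
print(details["workspace"]["prometheusEndpoint"])
```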

After metrics are collected, they are rendered on Managed Grafana dashboards that are consumed through a workspace endpoint. We customized the dashboards for task-level, node-level, and cluster-level metrics so they can be promoted from lower environments to higher environments. We also built several templated dashboards that display OS-level metrics (CPU, memory, network, disk I/O), HDFS metrics, YARN metrics, Spark metrics, and job-level metrics (Spark and JVM), with automated metric aggregation in each account.
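To give a sense of what these templated panels contain, the following is a hedged sketch of a node-level CPU panel definition. The datasource UID, the $cluster_id dashboard variable, and the cluster_id label are assumptions for illustration; the PromQL expression relies on standard node_exporter counters.

```python
# Illustrative Grafana panel definition (not FINRA's actual dashboard JSON).
cpu_panel = {
    "title": "CPU utilization by node",
    "type": "timeseries",
    "datasource": {"type": "prometheus", "uid": "example-amp-datasource"},
    "targets": [
        {
            # Percent CPU busy per instance, derived from node_exporter metrics.
            "expr": (
                '100 - (avg by (instance) '
                '(rate(node_cpu_seconds_total{mode="idle", '
                'cluster_id="$cluster_id"}[5m])) * 100)'
            ),
            "legendFormat": "{{instance}}",
        }
    ],
}
```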

We chose SAML-based authentication, which allowed us to integrate with existing Active Directory (AD) groups and minimized the work needed to manage user access to the Grafana dashboards. We mapped three main groups (admins, editors, and viewers) to Grafana roles based on user responsibilities.
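A minimal sketch of mapping AD groups to Grafana roles on an Amazon Managed Grafana workspace follows, assuming a SAML assertion attribute that carries group membership. The workspace ID, IdP metadata URL, attribute names, and group names are all illustrative, not FINRA’s actual values.

```python
import boto3

grafana = boto3.client("grafana", region_name="us-east-1")

grafana.update_workspace_authentication(
    workspaceId="g-exampleworkspace",       # hypothetical workspace ID
    authenticationProviders=["SAML"],
    samlConfiguration={
        "idpMetadata": {"url": "https://idp.example.com/metadata.xml"},
        # Map SAML assertion attributes (e.g., an AD group claim) to Grafana fields.
        "assertionAttributes": {
            "role": "ADGroups",
            "login": "mail",
            "name": "displayName",
        },
        # AD groups granted admin and editor roles; users matching neither
        # group sign in as viewers by default.
        "roleValues": {
            "admin": ["grafana-admins"],
            "editor": ["grafana-editors"],
        },
    },
)
```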

Through monitoring automation, selected metrics are also pushed to Amazon CloudWatch. We use CloudWatch alarms to alert when a metric exceeds its desired threshold.
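The following hedged sketch shows the shape of that alerting hook: publishing an aggregated metric to CloudWatch and alarming on a threshold. The namespace, metric name, dimensions, threshold, and SNS topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish an aggregated metric (for example, available YARN memory) from the
# monitoring automation; namespace and values are illustrative.
cloudwatch.put_metric_data(
    Namespace="EMR/Observability",
    MetricData=[{
        "MetricName": "YarnMemoryAvailablePercentage",
        "Dimensions": [{"Name": "ClusterId", "Value": "j-EXAMPLE"}],
        "Value": 12.5,
        "Unit": "Percent",
    }],
)

# Alarm when available YARN memory drops below the desired threshold.
cloudwatch.put_metric_alarm(
    AlarmName="emr-yarn-memory-low-j-EXAMPLE",
    Namespace="EMR/Observability",
    MetricName="YarnMemoryAvailablePercentage",
    Dimensions=[{"Name": "ClusterId", "Value": "j-EXAMPLE"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=10.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:example-ops-alerts"],
)
```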

The following diagram illustrates the solution architecture.

Sample dashboards

The following screenshots showcase example dashboards.

Conclusion

In this post, we shared how FINRA enhanced data-driven decision-making with comprehensive EMR workload observability to optimize performance, maintain reliability, and gain critical insights into big data operations, leading to operational excellence.

FINRA’s solution enabled the operations and engineering teams to use a single pane of glass for monitoring big data workloads and quickly detecting any operational issues. The scalable solution significantly reduced time to resolution and enhanced our overall operational stance. The solution empowered the operations and engineering teams with comprehensive insights into various Amazon EMR metrics, such as OS-level, Spark, JMX, HDFS, and YARN metrics, all consolidated in one place. We also extended the solution to use cases such as Amazon Elastic Kubernetes Service (Amazon EKS) clusters, including EMR on EKS clusters and other applications, establishing it as a one-stop system for monitoring metrics across our infrastructure and applications.


About the Authors

Sumalatha Bachu is Senior Director, Technology at FINRA. She manages Big Data Operations, which includes managing petabyte-scale data and complex workload processing in the cloud. Additionally, she is an expert in developing Enterprise Application Monitoring and Observability Solutions, Operational Data Analytics, and Machine Learning Model Governance workflows. Outside of work, she enjoys doing yoga, practicing singing, and teaching in her free time.

PremKiran Bejjam is Lead Engineer Consultant at FINRA, specializing in developing resilient and scalable systems. With a keen focus on designing monitoring solutions to enhance infrastructure reliability, he is dedicated to optimizing system performance. Beyond work, he enjoys quality family time and continually seeks out new learning opportunities.

Akhil Chalamalasetty is Director, Market Regulation Technology at FINRA. He is a Big Data subject matter expert specializing in building cutting edge solutions at scale along with optimizing workloads, data, and its processing capabilities. Akhil enjoys sim racing and Formula 1 in his free time.

