Hybrid big data analytics with Amazon EMR on AWS Outposts

Businesses require powerful and flexible tools to manage and analyze vast amounts of information. Amazon EMR has long been the leading solution for processing big data in the cloud. Amazon EMR is the industry-leading big data solution for petabyte-scale data processing, interactive analytics, and machine learning using over 20 open source frameworks such as Apache Hadoop, Hive, and Apache Spark. However, data residency requirements, latency issues, and hybrid architecture needs often challenge purely cloud-based solutions.

Enter Amazon EMR on AWS Outposts—a groundbreaking extension that brings the power of Amazon EMR directly to your on-premises environments. This innovative service merges the scalability, performance (the Amazon EMR runtime for Apache Spark is 4.5 times more performant than Apache Spark 3.5.1), and ease of Amazon EMR with the control and proximity of your data center, empowering enterprises to meet stringent regulatory and operational requirements while unlocking new data processing possibilities.

In this post, we dive into the transformative features of EMR on Outposts, showcasing its flexibility as a native hybrid data analytics service that allows seamless data access and processing both on premises and in the cloud. We also explore how it integrates smoothly with your existing IT infrastructure, providing the flexibility to keep your data where it best fits your needs while performing computations entirely on premises. We examine a hybrid setup where sensitive data remains locally in Amazon S3 on Outposts and public data in an AWS Regional Amazon Simple Storage Service bucket. This configuration allows you to augment your sensitive on-premises data with cloud data while making sure all data processing and compute runs on-premises in AWS Outposts Racks.

Solution overview

Consider a fictional company named Oktank Finance. Oktank aims to build a centralized data lake to store vast amounts of structured and unstructured data, enabling unified access and supporting advanced analytics and big data processing for data-driven insights and innovation. Additionally, Oktank must comply with data residency requirements, making sure that confidential data is stored and processed strictly on premises. Oktank also needs to enrich their datasets with non-confidential and public market data stored in the cloud on Amazon S3, which means they should be able to join datasets across their on-premises and cloud data stores.

Traditionally, Oktank’s big data platforms tightly coupled compute and storage resources, creating an inflexible system where decommissioning compute nodes could lead to data loss. To avoid this situation, Oktank aims to decouple compute from storage, allowing them to scale down compute nodes and repurpose them for other workloads without compromising data integrity and accessibility.

To meet these requirements, Oktank decides to adopt Amazon EMR on Outposts as their big data analytics platform and Amazon S3 on Outposts as their on-premises data store for their data lake. With EMR on Outposts, Oktank can make sure that all compute occurs on premises within their Outposts rack while still being able to query and join the public data stored in Amazon S3 with their confidential data stored in S3 on Outposts, using the same unified data APIs. For data processing, Oktank can choose from a wide variety of applications available on Amazon EMR. In this post, we use Spark as the data processing framework.

This approach makes sure that all data processing and analytics are performed locally within their on-premises environment, allowing Oktank to maintain compliance with data privacy and regulatory requirements. Simultaneously, by avoiding the need to replicate public data to their on-premises data centers, Oktank reduces storage costs and simplifies their end-to-end data pipelines by eliminating additional data movement jobs.

The following diagram illustrates the high-level solution architecture.

As explained earlier, the S3 on Outposts bucket in the architecture holds Oktank’s sensitive data, which stays on the Outpost in Oktank’s data center while the Regional S3 bucket holds the non-sensitive data.

In this post, to achieve high network performance from the Outpost to the Regional S3 bucket and vice-versa, we also use AWS Direct Connect with a virtual private gateway. This is especially beneficial when you need higher query throughput to the Regional S3 bucket by making sure the traffic is routed through your own dedicated network channel to AWS.

The solution involves deploying an EMR cluster on an Outposts rack. A service link connects AWS Outposts to a Region. The service link is a necessary connection between your Outposts and the Region (or home Region). It allows for the management of the Outposts and the exchange of traffic to and from the Region.

You can also access Regional S3 buckets using this service link. However, in this post, we employ an alternate option to enable the EMR cluster to privately access the Regional S3 bucket through the local gateway. This helps optimize data access from the Regional S3 bucket as traffic is routed through Direct Connect.

To enable the EMR cluster to access Amazon S3 privately over Direct Connect, a route is configured in the Outposts subnet (marked as 2 in the architecture diagram) to direct Amazon S3 traffic through the local gateway. Upon reaching the local gateway, the traffic is routed over Direct Connect (private virtual interface) to a virtual private gateway in the Region. The second VPC (5 in diagram), which includes the S3 interface endpoint, is connected to this virtual private gateway. A route is then added to make sure that traffic can return to the EMR cluster. This setup provides more efficient, higher-bandwidth communication between the EMR cluster and Regional S3 buckets.

For big data processing, we use Amazon EMR. Amazon EMR supports access to local S3 on Outposts with the Apache Hadoop S3A connector from Amazon EMR version 7.0.0 onwards. EMR File System (EMRFS) with S3 on Outposts is not supported. We use EMR Studio notebooks for running interactive queries on the data. We also submit Spark jobs as a step on the EMR cluster. We also use the AWS Glue Data Catalog as the external Hive compatible metastore, which serves as the central technical metadata catalog. The Data Catalog is a centralized metadata repository for all your data assets across various data sources. It provides a unified interface to store and query information about data formats, schemas, and sources. Additionally, we use AWS Lake Formation for access controls on the AWS Glue table. You still need to control the raw files access on the S3 on Outposts bucket with AWS Identity and Access Management (IAM) permissions in this architecture. At the time of writing, Lake Formation can’t directly manage access to data on the S3 on Outposts bucket. Access to the actual data files stored in the S3 on Outposts bucket is managed with IAM permissions.

In the following sections, you will implement this architecture for Oktank. We focus on a specific use case for Oktank Finance, where they maintain sensitive customer stockholding data in a local S3 on Outposts bucket. Additionally, they have publicly available stock details stored in a Regional S3 bucket. Their goal is to explore both the datasets within their on-premises Outpost setup. Additionally, they need to enrich the customer stock holdings data by combining it with the publicly available stock details data.

First, we explore how to access both datasets using an EMR cluster. Then, we demonstrate the process of performing joins between the local and public data. We also demonstrate how to use Lake Formation to effectively manage permissions for these tables. We explore two primary scenarios throughout this walkthrough. In the interactive use case, we demonstrate how users can connect to the EMR cluster and run queries interactively using EMR Studio notebooks. This approach allows for real-time data exploration and analysis. Additionally, we show you how to submit batch jobs to Amazon EMR using EMR steps for automated, scheduled data processing. This method is ideal for recurring tasks or large-scale data transformations.

Prerequisites

Complete the following prerequisite steps:

Have an AWS account and a role with administrator access. If you don’t have an account, you can create one.
Have an Outposts rack installed and running.
Create an EC2 key pair. This allows you to connect to the EMR cluster nodes even if Regional connectivity is lost.
Set up Direct Connect. This is required only if you want to deploy the second AWS CloudFormation template as explained in the following section.

Deploy the CloudFormation stacks

In this post, we’ve divided the setup into four CloudFormation templates, each responsible for provisioning a specific component of the architecture. The templates come with default parameters, which you may need to adjust based on your specific configuration requirements.

Stack1 provisions the network infrastructure on Outposts. It also creates the S3 on Outposts bucket and Regional S3 bucket. It copies the sample data to the buckets to simulate the data setup for Oktank. Confidential data for customer stock holdings is copied to the S3 on Outposts bucket, and non-confidential data for stock details is copied to the Regional S3 bucket.

Stack2 provisions the infrastructure to connect to the Regional S3 bucket privately using Direct Connect. It establishes a VPC with private connectivity to both the regional S3 bucket and the Outposts subnet. It also creates an Amazon S3 VPC interface endpoint to allow private access to Amazon S3. It establishes a virtual private gateway for connectivity between the VPC and Outposts subnet. Lastly, it configures a private Amazon Route 53 hosted zone for Amazon S3, enabling private DNS resolution for S3 endpoints within the VPC. You can skip deploying this stack if you don’t need to route traffic using Direct Connect.

Stack3 provisions the EMR cluster infrastructure, AWS Glue database, and AWS Glue tables. The stack creates an AWS Glue database named oktank_outpostblog_temp and three tables under it: stock_details, stockholdings_info, and stockholdings_info_detailed. The table stock_details contains public information for the stocks, and the data location of this table points to the Regional S3 bucket. The tables stockholdings_info and stockholdings_info_detailed contain confidential information, and their data location is in the S3 on Outposts bucket. It also creates a runtime role named outpostblog-runtimeRole1. A runtime role is an IAM role that you associate with an EMR step, and jobs use this role to access AWS resources. With runtime roles for EMR steps, you can specify different IAM roles for the Spark and the Hive jobs, thereby scoping down access at a job level. This allows you to simplify access controls on a single EMR cluster that is shared between multiple tenants, wherein each tenant can be isolated using IAM roles. This stack also grants the required permissions on the runtime role to grant access on the Regional S3 bucket and the S3 on Outposts bucket. The EMR cluster uses a bootstrap action that runs a script to copy sample data to the S3 on Outposts bucket and the Regional S3 bucket for the two tables.

Stack4 provisions the EMR Studio. We will connect to EMR Studio notebook and interact with the data stored across S3 on Outposts and the Regional S3 bucket. This stack outputs the EMR Studio URL, which you can use to connect to EMR Studio.

Run the preceding CloudFormation stacks in sequence with an admin role to create the solution resources.

Access the data and join tables

To verify the solution, complete the following steps:

On the AWS CloudFormation console, navigate to the Outputs tab of Stack4, which deployed the EMR Studio, and choose the EMR Studio URL.

This will open EMR Studio in a new window.

Create a workspace and use the default options.

The workspace will launch in a new tab.

Connect to the EMR cluster using the runtime role (outpostblog-runtimeRole1).

You are now connected to the EMR cluster.

Choose the File Browser tab and open the notebook while choosing the kernel as PySpark.
Run the following query in the notebook to read from the stock details table. This table points to public data stored in the Regional S3 bucket.
```
spark.sql("select * from oktank_outpostblog_temp.stock_details").show(5)
```
Run the following query to read from the confidential data stored in the local S3 on Outposts bucket:
```
spark.sql("select * from oktank_outpostblog_temp.stockholdings_info").show(5)
```

As highlighted earlier, one of the requirements for Oktank is to enrich the preceding data with data from the Regional S3 bucket.

Run the following query to join the preceding two tables:

spark.sql("select customerid,sharesheld,purchasedate, a.stockid, b.stockname,b.category,b.currentprice from oktank_outpostblog_temp.stockholdings_info a inner join oktank_outpostblog_temp.stock_details b on a.stockid=b.stockid order by customerid").show(10)

Control access to tables using Lake Formation

In this post, we also showcase how you can control access to the tables using Lake Formation. To demonstrate, let’s block access to RuntimeRole1 on the stockholdings_info table.

On the Lake Formation console, choose Tables in the navigation pane.
Select the table stockholdings_info and on the Actions menu, choose View to view the current access permissions on this table.
Select IAMAllowedPrincipals from the list of principals and choose Revoke to revoke the permission.
Go back to the EMR Studio notebook and rerun the earlier query.

Oktank’s data access query fails because Lake Formation has denied permission to the runtime role; you will need to adjust the permissions.

To resolve this issue, return to the Lake Formation console, select the stockholdings_info table, and on the Actions menu, choose Grant.
Assign the necessary permissions to the runtime role to make sure it can access the table.
Select IAM users and roles and choose the runtime role (outpostblog-runtimeRole1).
Choose the table stockholdings_info from the list of tables and for Table permissions, select Select.
Select All data access and choose Grant.
Go back to the notebook and rerun the query.

The query now succeeds because we granted access to the runtime role connected to the EMR cluster through the EMR Studio notebook. This demonstrates how Lake Formation allows you to manage permissions on your Data Catalog tables.

The previous steps only restrict access to the table in the catalog, not to the actual data files stored in the S3 on Outposts bucket. To control access to these data files, you need to use IAM permissions. As mentioned earlier, Stack3 in this post handles the IAM permissions for the data. For access control on the Regional S3 bucket with Lake Formation, you don’t need to specifically provide IAM permissions on the specific S3 bucket to the roles. Lake Formation manages the Regional S3 bucket access controls for runtime roles. Refer to Introducing runtime roles for Amazon EMR steps: Use IAM roles and AWS Lake Formation for access control with Amazon EMR for detailed guidance on managing access to a Regional S3 bucket with Lake Formation and EMR runtime roles.

Submit a batch job

Next, let’s submit a batch job as an EMR step on the EMR cluster. Before we do that, let’s confirm there is currently no data in the table stockholdings_info_detailed. Run the following query in the notebook:

spark.sql("select * from oktank_outpostblog_temp.stockholdings_info_detailed").show(10)

You will not see any data in this table. You can now detach the notebook from the cluster.
You will now insert data in this table using a batch job submitted as an EMR step.

On the EMR console, navigate to the cluster EMROutpostBlog and submit a step.
Choose Spark Application for Type.
Select the py script from the scripts folder in your S3 bucket created by the CloudFormation template.
For Permissions, choose the runtime role (outpostblog-RuntimeRole1).
Choose Add step to submit the job.

Wait for the job to complete. The job inserted data into the stockholdings_info_detailed table. You can rerun the earlier query in the notebook to verify the data:

spark.sql("select * from oktank_outpostblog_temp.stockholdings_info_detailed").show(10)

Clean up

To avoid incurring further charges, delete the CloudFormation stacks.

Before deleting Stack4, run the following shell command (with the %%sh magic command) in the EMR Studio notebook to delete the objects from the S3 on Outposts bucket:

aws s3api delete-objects --bucket <replace with value of key S3OutpostBucketAccessPointAlias1 from stack 3 output> --delete "$(aws s3api list-object-versions --bucket <replace with value of key S3OutpostBucketAccessPointAlias1 from stack 3 output> --output=json | jq '{Objects: [.Versions[]|{Key:.Key,VersionId:.VersionId}], Quiet: true}')"

Next, manually delete the EMR workspace from the EMR Studio.
You can now delete the stacks, starting with Stack4, Stack3, Stack2, and finally Stack1.

Conclusion

In this post, we demonstrated how to use Amazon EMR on Outposts as a managed big data processing service in your on-premises setup. We explored how you can set up the cluster to access data stored in an S3 on Outposts bucket on premises and also efficiently access data in the Regional S3 bucket with private networking. We also explored Glue Data Catalog as a serverless external Hive metastore and managed access control to the catalog tables using Lake Formation. We accessed the data interactively using EMR Studio notebooks and processed it as a batch job using EMR steps.

To learn more, visit Amazon EMR on AWS Outposts.

For further reading, refer to the following resources:

About the Authors

Shoukat Ghouse is a Senior Big Data Specialist Solutions Architect at AWS. He helps customers around the world build robust, efficient and scalable data platforms on AWS leveraging AWS analytics services like AWS Glue, AWS Lake Formation, Amazon Athena and Amazon EMR.

Fernando Galves is an Outpost Solutions Architect at AWS, specializing in networking, security, and hybrid cloud architectures. He helps customers design and implement secure hybrid environments using AWS Outposts, focusing on complex networking solutions and seamless integration between on-premises and cloud infrastructure.