In today’s data-driven world, organizations are constantly seeking efficient ways to process and analyze vast amounts of information across data lakes and warehouses.
Enter Amazon SageMaker Lakehouse, which you can use to unify all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg-compatible tools and engines. This opens up exciting possibilities for open source Apache Spark users who want to use SageMaker Lakehouse capabilities. Further, you can secure your data in SageMaker Lakehouse by defining fine-grained permissions, which are enforced across all analytics and ML tools and engines.
In this post, we will explore how to harness the power of open source Apache Spark and configure a third-party engine to work with the AWS Glue Iceberg REST Catalog. The post includes details on how to perform read and write operations against Iceberg tables on Amazon S3, with AWS Lake Formation managing metadata and underlying data access using temporary credential vending.
Solution overview
In this post, the customer uses the AWS Glue Data Catalog to centrally manage technical metadata for structured and semi-structured datasets in their organization and wants to enable their data team to use Apache Spark for data processing. The customer will create an AWS Glue database and configure Apache Spark to interact with the Data Catalog using the Iceberg REST API to write and read Iceberg data on Amazon S3 under Lake Formation permission control.
We will start by running an extract, transform, and load (ETL) script using Apache Spark to create an Iceberg table on Amazon S3 and access the table using the Glue Iceberg REST Catalog. The ETL script will add data to the Iceberg table and then read it back using Spark SQL. This post will also showcase how this data can be queried by other data teams using Amazon Athena.
Prerequisites
Access to an AWS Identity and Access Management (IAM) role that is a Lake Formation data lake administrator in the account that has the Data Catalog. For instructions, see Create a data lake administrator.
- Verify that you have Python version 3.7 or later installed. Check that pip3 version 22.2.2 or later is installed.
- Install or update the latest AWS Command Line Interface (AWS CLI). For instructions, see Installing or updating the latest version of the AWS CLI. Run aws configure using the AWS CLI to point to your AWS account.
- Create an S3 bucket to store the customer Iceberg table. For this post, we will use the us-east-2 AWS Region and name the bucket ossblog-customer-datalake.
- Create an IAM role that will be used in OSS Spark for data access using an AWS Glue Iceberg REST catalog endpoint. Make sure that the role has AWS Glue and Lake Formation policies as defined in Data engineer permissions. For this post, we will use an IAM role named spark_role.
Enable Lake Formation permissions for third-party access
In this section, you will register the S3 bucket with Lake Formation. This step allows Lake Formation to act as a centralized permissions management system for metadata and data stored in Amazon S3, enabling more efficient and secure data governance in data lake environments.
- Create a user-defined IAM role following the instructions in Requirements for roles used to register locations. For this post, we will use the IAM role LFRegisterRole.
- Register the S3 bucket ossblog-customer-datalake using the IAM role LFRegisterRole by running the following command:
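The registration can also be scripted. The following is a minimal sketch that uses boto3 rather than the AWS CLI. It assumes boto3 is installed, that the caller has data lake administrator credentials, and that 111122223333 stands in for your account ID. The helper names are hypothetical.

```python
def build_register_resource_request(bucket_name, account_id, role_name):
    """Return keyword arguments for the Lake Formation RegisterResource call."""
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket_name}",
        "RoleArn": f"arn:aws:iam::{account_id}:role/{role_name}",
        # Register with the user-defined role instead of the service-linked role.
        "UseServiceLinkedRole": False,
    }


def register_bucket(request, region="us-east-2"):
    """Send the request. Requires boto3 and data lake administrator credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("lakeformation", region_name=region)
    return client.register_resource(**request)


request = build_register_resource_request(
    bucket_name="ossblog-customer-datalake",
    account_id="111122223333",  # placeholder account ID (assumption)
    role_name="LFRegisterRole",
)
print(request["ResourceArn"])
```

Calling register_bucket(request) performs the registration. The builder is kept separate so the request can be inspected before anything is sent.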
Alternatively, you can use the AWS Management Console for Lake Formation.
- Navigate to the Lake Formation console, choose Administration in the navigation pane, choose Data lake locations, and provide the following values:
  - For Amazon S3 path, select s3://ossblog-customer-datalake.
  - For IAM role, select LFRegisterRole.
  - For Permission mode, choose Lake Formation.
- Choose Register location.
- In Lake Formation, enable full table access for external engines to access data:
  - Sign in as an admin user and choose Administration in the navigation pane.
  - Choose Application integration settings and select Allow external engines to access data in Amazon S3 locations with full table access.
  - Choose Save.
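The same setting can be changed programmatically. The sketch below assumes that the console option corresponds to the AllowFullTableExternalDataAccess field of the Lake Formation data lake settings. Because PutDataLakeSettings replaces the whole settings object, the sketch reads the current settings first and changes only that one field. The function names and the sample administrator ARN are hypothetical.

```python
import copy


def with_full_table_external_access(settings):
    """Return a copy of the data lake settings with the external-engine flag on."""
    updated = copy.deepcopy(settings)
    updated["AllowFullTableExternalDataAccess"] = True
    return updated


def apply_to_account(region="us-east-2"):
    """Read, modify, and write back the settings. Requires boto3 and admin credentials."""
    import boto3  # lazy import: only needed when talking to AWS

    client = boto3.client("lakeformation", region_name=region)
    current = client.get_data_lake_settings()["DataLakeSettings"]
    client.put_data_lake_settings(
        DataLakeSettings=with_full_table_external_access(current)
    )


# Illustration with a made-up settings object; no AWS call is made here.
sample = {
    "DataLakeAdmins": [
        {"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/Admin"}
    ]
}
updated = with_full_table_external_access(sample)
print(updated["AllowFullTableExternalDataAccess"])
```

Copying the existing settings keeps the administrator list and other fields intact when the object is written back.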
Set up resource access for the OSS Spark role:
- Create an AWS Glue database called ossblogdb in the default catalog by going to the Lake Formation console and choosing Databases in the navigation pane.
- Select the database, choose Edit, and clear the checkbox for Use only IAM access control for new tables in this database.
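If you prefer to script the database setup, the following sketch builds the equivalent AWS Glue CreateDatabase request. It assumes that clearing the Use only IAM access control checkbox corresponds to an empty CreateTableDefaultPermissions list on the database. The helper names are hypothetical.

```python
def build_create_database_request(database_name):
    """Return keyword arguments for the AWS Glue CreateDatabase call."""
    return {
        "DatabaseInput": {
            "Name": database_name,
            # Assumption: an empty list means new tables get no IAM-only
            # default grant, so Lake Formation permissions apply instead.
            "CreateTableDefaultPermissions": [],
        }
    }


def create_database(request, region="us-east-2"):
    """Send the request. Requires boto3 and suitable credentials."""
    import boto3  # lazy import: only needed when talking to AWS

    return boto3.client("glue", region_name=region).create_database(**request)


request = build_create_database_request("ossblogdb")
print(request["DatabaseInput"]["Name"])
```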
Grant resource permissions to the OSS Spark role:
To enable OSS Spark to create and populate the dataset in the ossblogdb database, you will use the IAM role (spark_role) that you created for the Apache Spark instance in the prerequisites section. Apache Spark will assume this role to create an Iceberg table, add records to it, and read from it. To enable this functionality, grant full table access to spark_role and provide data location permission to the S3 bucket where the table data can be stored.
Grant create table permission to spark_role:
Sign in as the data lake administrator and run the following command using the AWS CLI:
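As an illustration of what this grant contains, here is a minimal boto3 sketch of the same operation. It mirrors the console values listed in this section (DESCRIBE and CREATE TABLE on ossblogdb). The account ID is a placeholder and the helper names are hypothetical.

```python
def build_database_grant(account_id, role_name, database_name):
    """Return keyword arguments for the Lake Formation GrantPermissions call."""
    return {
        "Principal": {
            "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:role/{role_name}"
        },
        "Resource": {"Database": {"CatalogId": account_id, "Name": database_name}},
        "Permissions": ["DESCRIBE", "CREATE_TABLE"],
    }


def grant(request, region="us-east-2"):
    """Send the grant. Requires boto3 and data lake administrator credentials."""
    import boto3  # lazy import: only needed when talking to AWS

    return boto3.client("lakeformation", region_name=region).grant_permissions(**request)


db_grant = build_database_grant("111122223333", "spark_role", "ossblogdb")
print(db_grant["Permissions"])
```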
Alternatively, on the console:
- In the Lake Formation console navigation pane, choose Data lake permissions, and then choose Grant.
- In the Principals section, for IAM users and roles, select spark_role.
- In the LF-Tags or catalog resources section, select Named Data Catalog resources:
  - Select <accountid> for Catalogs.
  - Select ossblogdb for Databases.
- Select DESCRIBE and CREATE TABLE for Database permissions.
- Choose Grant.
Grant data location permission to spark_role:
Sign in as the data lake administrator and run the following command using the AWS CLI:
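A data location grant uses a different resource type from a database grant. The sketch below shows the request shape with boto3, assuming the DATA_LOCATION_ACCESS permission on the bucket registered earlier. The account ID is a placeholder and the helper name is hypothetical.

```python
def build_data_location_grant(account_id, role_name, bucket_name):
    """Return keyword arguments for GrantPermissions on an S3 data location."""
    return {
        "Principal": {
            "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:role/{role_name}"
        },
        "Resource": {
            "DataLocation": {
                "CatalogId": account_id,
                "ResourceArn": f"arn:aws:s3:::{bucket_name}",
            }
        },
        "Permissions": ["DATA_LOCATION_ACCESS"],
    }


location_grant = build_data_location_grant(
    "111122223333", "spark_role", "ossblog-customer-datalake"
)
print(location_grant["Resource"]["DataLocation"]["ResourceArn"])
```

Passing the result to the grant_permissions method of a boto3 lakeformation client, as location_grant keyword arguments, applies the grant.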
Alternatively, on the console:
- In the Lake Formation console navigation pane, choose Data Locations, and then choose Grant.
- For IAM users and roles, select spark_role.
- For Storage locations, select the bucket name (ossblog-customer-datalake in this post).
- Choose Grant.
Set up a Spark script to use an AWS Glue Iceberg REST catalog endpoint:
Create a file named oss_spark_customer_etl.py in your environment with the following content:
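The following is an illustrative sketch of such a script, not the original listing. The catalog name, package versions, table columns, and sample rows are assumptions; adjust the Iceberg runtime package to match your Spark and Scala versions. The catalog properties follow the Iceberg REST catalog settings for SigV4 signing against the AWS Glue endpoint.

```python
ACCOUNT_ID = "111122223333"  # placeholder; use your AWS account ID
REGION = "us-east-2"
CATALOG = "spark_catalog"    # assumed catalog name
# Assumed versions; pick the artifacts that match your Spark and Scala build.
PACKAGES = (
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,"
    "org.apache.iceberg:iceberg-aws-bundle:1.6.1"
)


def iceberg_rest_catalog_conf(catalog, account_id, region):
    """Spark properties that point an Iceberg catalog at the Glue REST endpoint."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.defaultCatalog": catalog,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        f"{prefix}.warehouse": account_id,
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "glue",
        f"{prefix}.rest.signing-region": region,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


def run_etl(conf, packages):
    """Create the Iceberg table, insert rows, and read them back with Spark SQL."""
    from pyspark.sql import SparkSession  # lazy import: only needed to run the job

    builder = SparkSession.builder.appName("oss_spark_customer_etl")
    builder = builder.config("spark.jars.packages", packages)
    for key, value in conf.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    spark.sql(
        "CREATE TABLE IF NOT EXISTS ossblogdb.customer "
        "(customer_id BIGINT, name STRING, city STRING) "  # hypothetical columns
        "USING iceberg "
        "LOCATION 's3://ossblog-customer-datalake/customer/'"
    )
    # Made-up sample rows, for illustration only.
    spark.sql(
        "INSERT INTO ossblogdb.customer VALUES "
        "(1, 'Ana', 'Columbus'), (2, 'Li', 'Dayton')"
    )
    spark.sql("SELECT * FROM ossblogdb.customer ORDER BY customer_id").show()
    spark.stop()


conf = iceberg_rest_catalog_conf(CATALOG, ACCOUNT_ID, REGION)
print(conf[f"spark.sql.catalog.{CATALOG}.uri"])
```

As written, the sketch only builds and prints the configuration. To execute the job, add a call to run_etl(conf, PACKAGES) at the end of the file once credentials for spark_role are available in the environment.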
Launch PySpark locally and validate read/write to the Iceberg table on Amazon S3
Run pip install pyspark. Save the script locally and set the environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN) with temporary credentials for the spark_role IAM role.
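One way to obtain those temporary credentials is to assume the role with AWS STS. The sketch below maps an AssumeRole response onto the three environment variables. It assumes your current identity is allowed to assume spark_role. The session name, account ID, and helper names are hypothetical, and the sample values are fake.

```python
def credentials_to_env(credentials):
    """Map the Credentials part of an STS AssumeRole response to env variables."""
    return {
        "AWS_ACCESS_KEY_ID": credentials["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": credentials["SecretAccessKey"],
        "AWS_SESSION_TOKEN": credentials["SessionToken"],
    }


def assume_spark_role(account_id, role_name="spark_role", session_name="oss-spark-etl"):
    """Call STS AssumeRole. Requires boto3 and permission to assume the role."""
    import boto3  # lazy import: only needed when talking to AWS

    response = boto3.client("sts").assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{role_name}",
        RoleSessionName=session_name,
    )
    return response["Credentials"]


# Illustration with fake values; no AWS call is made here.
fake = {
    "AccessKeyId": "EXAMPLEKEYID",
    "SecretAccessKey": "example-secret",
    "SessionToken": "example-token",
}
env = credentials_to_env(fake)
print(sorted(env))
```

In a real run, os.environ.update(credentials_to_env(assume_spark_role("111122223333"))) would set the variables for the current Python process before the Spark session is created.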
Run python /path/to/oss_spark_customer_etl.py
You can also use Athena to view the data in the Iceberg table:
To enable the other data team to view the content, provide read access to the data team IAM role using the Lake Formation console:
- In the Lake Formation console navigation pane, choose Data lake permissions, and then choose Grant.
- In the Principals section, for IAM users and roles, choose <iam_role>.
- In the LF-Tags or catalog resources section, select Named Data Catalog resources:
  - Select <accountid> for Catalogs.
  - Select ossblogdb for Databases.
  - Select customer for Tables.
- Select DESCRIBE and SELECT for Table permissions.
- Choose Grant.
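The console steps above translate into a table-level grant. As a sketch, the equivalent boto3 request could look like the following, where data_team_role is a hypothetical name standing in for <iam_role> and the account ID is a placeholder.

```python
def build_table_read_grant(account_id, role_name, database_name, table_name):
    """Return keyword arguments for a read-only GrantPermissions call on a table."""
    return {
        "Principal": {
            "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:role/{role_name}"
        },
        "Resource": {
            "Table": {
                "CatalogId": account_id,
                "DatabaseName": database_name,
                "Name": table_name,
            }
        },
        "Permissions": ["DESCRIBE", "SELECT"],
    }


read_grant = build_table_read_grant(
    "111122223333", "data_team_role", "ossblogdb", "customer"  # hypothetical role name
)
print(read_grant["Resource"]["Table"]["Name"])
```

The resulting dictionary can be passed as keyword arguments to the grant_permissions method of a boto3 lakeformation client.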
Sign in as the IAM role and run the command:
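A simple way to check access is to select a few rows from the table. The sketch below builds a request for the Athena StartQueryExecution API with boto3. The query text, row limit, and results location are assumptions, and the helper names are hypothetical; the same SQL can be pasted into the Athena query editor.

```python
def build_athena_query_request(database, table, output_location, limit=10):
    """Return keyword arguments for the Athena StartQueryExecution call."""
    return {
        "QueryString": f'SELECT * FROM "{database}"."{table}" LIMIT {limit}',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(request, region="us-east-2"):
    """Submit the query. Requires boto3 and the data team role's credentials."""
    import boto3  # lazy import: only needed when talking to AWS

    client = boto3.client("athena", region_name=region)
    return client.start_query_execution(**request)["QueryExecutionId"]


query = build_athena_query_request(
    "ossblogdb",
    "customer",
    "s3://amzn-s3-demo-bucket/athena-results/",  # hypothetical results location
)
print(query["QueryString"])
```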
Clean up
To clean up your resources, complete the following steps:
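One possible ordering, inferred from the resources created in this post rather than taken from an official procedure, is to drop the table and database and then deregister the S3 location. The sketch below expresses that plan with boto3 operation names. The helper names are hypothetical.

```python
def build_cleanup_plan(database, table, bucket):
    """Return (service, operation, kwargs) tuples in the order they should run."""
    return [
        ("glue", "delete_table", {"DatabaseName": database, "Name": table}),
        ("glue", "delete_database", {"Name": database}),
        (
            "lakeformation",
            "deregister_resource",
            {"ResourceArn": f"arn:aws:s3:::{bucket}"},
        ),
    ]


def run_cleanup(plan, region="us-east-2"):
    """Execute the plan. Requires boto3 and data lake administrator credentials."""
    import boto3  # lazy import: only needed when talking to AWS

    for service, operation, kwargs in plan:
        client = boto3.client(service, region_name=region)
        getattr(client, operation)(**kwargs)


plan = build_cleanup_plan("ossblogdb", "customer", "ossblog-customer-datalake")
for service, operation, _ in plan:
    print(service, operation)
```

Deleting the catalog entries does not remove the data files, so the S3 bucket contents and the IAM roles created for this post (spark_role and LFRegisterRole) still need to be removed separately.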
Conclusion
In this post, we’ve walked through the integration between Apache Spark and the AWS Glue Iceberg REST Catalog for accessing Iceberg tables in Amazon S3, demonstrating how to perform read and write operations using the Iceberg REST API. The beauty of this solution lies in its flexibility: whether you’re running Spark on bare metal servers in your data center, in a Kubernetes cluster, or in any other environment, this architecture can be adapted to suit your needs.
About the Authors
Raj Ramasubbu is a Sr. Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 20 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.
Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platforms. She enjoys building data mesh solutions and sharing them with the community.
Pratik Das is a Senior Product Manager with AWS Lake Formation. He is passionate about all things data and works with customers to understand their requirements and build delightful experiences. He has a background in building data-driven solutions and machine learning systems in production.