In February 2024, we announced the release of the Data Solutions Framework (DSF), an opinionated open source framework for building data solutions on AWS. DSF is built using the AWS Cloud Development Kit (AWS CDK) to package infrastructure components into L3 AWS CDK constructs on top of AWS services. L3 constructs are implementations of common technical patterns and create multiple resources that are configured to work with each other.
In this post, we demonstrate how to use the AWS CDK and DSF to create a multi-data warehouse platform based on Amazon Redshift Serverless. DSF simplifies the provisioning of Redshift Serverless, initialization and cataloging of data, and data sharing between different data warehouse deployments. Using a programmatic approach with the AWS CDK and DSF allows you to apply GitOps principles to your analytics workloads and realize the following benefits:
- You can deploy using continuous integration and delivery (CI/CD) pipelines, including the definitions of Redshift objects (databases, tables, shares, and so on)
- You can roll out changes consistently across multiple environments
- You can bootstrap data warehouses (table creation, ingestion of data, and so on) using code and use version control to simplify the setup of testing environments
- You can test changes before deployment using AWS CDK built-in testing capabilities
In addition, DSF’s Redshift Serverless L3 constructs provide a number of built-in capabilities that can accelerate development while helping you follow best practices. For example:
- Running extract, transform, and load (ETL) jobs to and from Amazon Redshift is more straightforward because an AWS Glue connection resource is automatically created and configured. This means data engineers don’t have to configure this resource and can use it right away with their AWS Glue ETL jobs.
- Similarly, with discovery of data inside Amazon Redshift, DSF provides a convenient method to configure an AWS Glue crawler to populate the AWS Glue Data Catalog for ease of discovery as well as ease of referencing tables when creating ETL jobs. The configured AWS Glue crawler uses an AWS Identity and Access Management (IAM) role that follows least privilege.
- Sharing data between Redshift data warehouses is a common approach to improve collaboration between lines of business without duplicating data. DSF provides convenient methods for the end-to-end flow for both data producer and consumer.
Solution overview
The solution demonstrates a common pattern where a data warehouse is used as a serving layer for business intelligence (BI) workloads on top of data lake data. The source data is stored in Amazon Simple Storage Service (Amazon S3) buckets, then ingested into a Redshift producer data warehouse to create materialized views and aggregate data, and finally shared with a Redshift consumer running BI queries from the end-users. The following diagram illustrates the high-level architecture.
In the post, we use Python for the example code. DSF also supports TypeScript.
Prerequisites
Because we’re using the AWS CDK, complete the steps in Getting Started with the AWS CDK before you implement the solution.
Initialize the project and provision a Redshift Serverless namespace and workgroup
Let’s start with initializing the project and including DSF as a dependency. You can run this code in your local terminal, or you can use AWS Cloud9:
Open the project folder in your IDE and complete the following steps:
- Open the
app.py
file. - In this file, make sure to uncomment the first
env
This configures the AWS CDK environment depending on the AWS profile used during the deployment. - Add a configuration flag in the
cdk.context.json
file at the root of the project (if it doesn’t exist, create the file):
Setting the @data-solutions-framework-on-aws/removeDataOnDestroy
configuration flag to true makes sure resources that have the removal_policy
parameter set to RemovalPolicy.DESTROY
are destroyed when the AWS CDK stack is deleted. This is a guardrail DSF uses to prevent accidentally deleting data.
Now that the project is configured, you can start adding resources to the stack.
- Navigate to the
dsf_redshift_blog
folder and open thedsf_redshift_blog_stack.py
file.
This is where we configure the resources to be deployed.
- To get started building the end-to-end demo, add the following import statements at the top of the file, which allows you to start defining the resources from both the AWS CDK core library as well as DSF:
We use several DSF-specific constructs to build the demo:
- DataLakeStorage – This creates three S3 buckets, named
Bronze
,Silver
, andGold
, to represent the different data layers. - S3DataCopy – This manages the copying of data from one bucket to another bucket.
- RedshiftServerlessNamespace – This creates a Redshift Serverless namespace where database objects and users are stored.
- RedshiftServerlessWorkgroup – This creates a Redshift Serverless workgroup that contains compute- and network-related configurations for the data warehouse. This is also the entry point for several convenient functionalities that DSF provides, such as cataloging of Redshift tables, running SQL statements as part of the AWS CDK (such as creating tables, data ingestion, merging of tables, and more), and sharing datasets across different Redshift clusters without moving data.
- Now that you have imported the libraries, create a set of S3 buckets following the medallion architecture best practices with bronze, silver, and gold data layers.
The high-level definitions of each layer are as follows:
- Bronze represents raw data; this is where data from various source systems lands. No schema is needed.
- Silver is cleaned and potentially augmented data. The schema is enforced in this layer.
- Gold is data that’s further refined and aggregated to serve a specific business need.
Using the DataLakeStorage construct, you can create these three S3 buckets with the following best practices:
- Encryption at rest through AWS Key Management Service (AWS KMS) is turned on
- SSL is enforced
- The use of S3 bucket keys is turned on
- There’s a default S3 lifecycle rule defined to delete incomplete multipart uploads after 1 day
- After you create the S3 buckets, copy over the data using the S3DataCopy For this demo, we land the data in the
Silver
bucket because it’s already cleaned: - In order for Amazon Redshift to ingest the data in Amazon S3, it needs an IAM role with the right permissions. This role will be associated with the Redshift Serverless namespace that you create next.
- To provision Redshift Serverless, configure two resources: a namespace and a workgroup. DSF provides L3 constructs for both:
Both constructs follow security best practices, including:
- The default virtual private cloud (VPC) uses private subnets (with public access disabled).
- Data is encrypted at rest through AWS KMS with automatic key rotation.
- Admin credentials are stored in AWS Secrets Manager with automatic rotation managed by Amazon Redshift.
- A default AWS Glue connection is automatically created using private connectivity. This can be used by AWS Glue crawlers as well as AWS Glue ETL jobs to connect to Amazon Redshift.
The
RedshiftServerlessWorkgroup
construct is the main entry point for other capabilities, such as integration with the AWS Glue Data Catalog, Redshift Data API, and Data Sharing API.- In the following example, use the defaults provided by the construct and associate the IAM role that you created earlier to give Amazon Redshift access to the data lake for data ingestion:
Create tables and ingest data
To create a table, you can use the runCustomSQL
method in the RedshiftServerlessWorkgroup
construct. This method allows you to run arbitrary SQL statements when the resource is being created (such as create table
or create materialized view
) and when it’s being deleted (such as drop table
or drop materialized view
).
Add the following code after the RedshiftServerlessWorkgroup
instantiation:
Given the asynchronous nature of some of the resource creation, we also enforce dependencies between some resources; otherwise, the AWS CDK would try to create them in parallel to accelerate the deployment. The preceding dependency statements establish the following:
- Before you load the data, the S3 data copy is complete, so the data exists in the source bucket of the ingestion
- Before you load the data, the target table has been created in the Redshift namespace
Bootstrapping example (materialized views)
The workgroup.run_custom_sql()
method provides flexibility in how you can bootstrap your Redshift data warehouse using the AWS CDK. For example, you can create a materialized view to improve the queries’ performance by pre-aggregating data from the Amazon reviews:
materialized_view = workgroup.run_custom_sql('MvProductAnalysis',
database_name="defaultdb",
sql=f'''CREATE MATERIALIZED VIEW mv_product_analysis AS SELECT review_date, product_title, COUNT(1) AS review_total, SUM(star_rating) AS rating FROM amazon_reviews WHERE marketplace="US" GROUP BY 1,2;''',
delete_sql="drop materialized view mv_product_analysis")
materialized_view.node.add_dependency(load_amazon_reviews_data)
Catalog tables in Amazon Redshift
The deployment of RedshiftServerlessWorkgroup
automatically creates an AWS Glue connection resource that can be used by AWS Glue crawlers and AWS Glue ETL jobs. This is directly exposed from the workgroup construct through the glue_connection property. Using this connection, the workgroup construct exposes a convenient method to catalog the tables inside the associated Redshift Serverless namespace. The following an example code:
This single line of code creates a database in the Data Catalog named mv_product_analysis
and the associated crawler with the IAM role and network configuration already configured. By default, it crawls all the tables inside the public schema in the default database indicated when the Redshift Serverless namespace was created. To override this, the third parameter in the catalogTables
method allows you to define a pattern on what to crawl (see the JDBC data store in the include path).
You can run the crawler using the AWS Glue console or invoke it using the SDK, AWS Command Line Interface (AWS CLI), or AWS CDK using AwsCustomResource.
Data sharing
DSF supports Redshift data sharing for both sides (producers and consumers) as well as same account and cross-account scenarios. Let’s create another Redshift Serverless namespace and workgroup to demonstrate the interaction:
For producers
For producers, complete the following steps:
- Create the new share and populate the share with the schema or tables:
data_share = workgroup.create_share('DataSharing', 'defaultdb', 'defaultdbshare', 'public', ['mv_product_analysis']) data_share.new_share_custom_resource.node.add_dependency(materialized_view)
- Create access grants:
-
- To grant to a cluster in the same account:
share_grant = workgroup.grant_access_to_share("GrantToSameAccount", data_share, namespace2.namespace_id) share_grant.resource.node.add_dependency(data_share.new_share_custom_resource) share_grant.resource.node.add_dependency(namespace2)
- To grant to a different account:
- To grant to a cluster in the same account:
The last parameter in the grant_access_to_share
method allows to automatically authorize the cross-account access on the data share. Omitting this parameter would default to no authorization; which means a Redshift administrator needs to authorize the cross-account share either using the AWS CLI, SDK, or Amazon Redshift console.
For consumers
For the same account share, to create the database from the share, use the following code:
For cross-account grants, the syntax is similar, but you need to indicate the producer account ID:
To see the full working example, follow the instructions in the accompanying GitHub repository.
Deploy the resources using the AWS CDK
To deploy the resources, run the following code:
You can review the resources created, as shown in the following screenshot.
Confirm the changes for the deployment to start. Wait a few minutes for the project to be deployed; you can keep track of the deployment using the AWS CLI or the AWS CloudFormation console.
When the deployment is complete, you should see two Redshift workgroups (one producer and one consumer).
Using Amazon Redshift Query Editor v2, you can log in to the producer Redshift workgroup using Secrets Manager, as shown in the following screenshot.
After you log in, you can see the tables and views that you created using DSF in the defaultdb
database.
Log in to the consumer Redshift workgroup to see the shared dataset from the producer Redshift workgroup under the marketing database.
Clean up
You can run cdk destroy
in your local terminal to delete the stack. Because you marked the constructs with a RemovalPolicy.DESTROY
and configured DSF to remove data on destroy, running cdk destroy
or deleting the stack from the AWS CloudFormation console will clean up the provisioned resources.
Conclusion
In this post, we demonstrated how to use the AWS CDK along with the DSF to manage Redshift Serverless as code. Codifying the deployment of resources helps provide consistency across multiple environments. Aside from infrastructure, DSF also provides capabilities to bootstrap (table creation, ingestion of data, and more) Amazon Redshift and manage objects, all from the AWS CDK. This means that changes can be version controlled, reviewed, and even unit tested.
In addition to Redshift Serverless, DSF supports other AWS services, such as Amazon Athena, Amazon EMR, and many more. Our roadmap is publicly available, and we look forward to your feature requests, contributions, and feedback.
You can get started using DSF by following our quick start guide.
About the authors
Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.