
In the ever-evolving landscape of cloud computing and data management, AWS has consistently been at the forefront of innovation. One of the groundbreaking developments in recent years is zero-ETL integration, a set of fully managed integrations by AWS that minimizes the need to build extract, transform, and load (ETL) data pipelines. This post explores a brief history of zero-ETL, its importance for customers, and introduces an exciting new feature: history mode for the Amazon Aurora PostgreSQL-Compatible Edition, Amazon Aurora MySQL-Compatible Edition, Amazon Relational Database Service (Amazon RDS) for MySQL, and Amazon DynamoDB zero-ETL integrations with Amazon Redshift.
A brief history of zero-ETL integrations
The concept of zero-ETL integrations emerged as a response to the growing complexities and inefficiencies in traditional ETL processes, which are time-consuming and complex to develop, maintain, and scale. Although not every use case can be replaced with zero-ETL, it simplifies replication and allows you to apply transformations post-replication, eliminating the need for additional ETL technology between the source database and Amazon Redshift. We at AWS recognized the need for a more streamlined approach to data integration, particularly between operational databases and cloud data warehouses. The journey of zero-ETL began in late 2022 when we introduced the feature for Aurora MySQL with Amazon Redshift. This feature marked a pivotal moment in streamlining complex data workflows, enabling near real-time data replication and analysis while eliminating the need for ETL processes.
Building on the success of our first zero-ETL integration, we’ve made continuous strides in this space by working backward from our customers’ needs and launching features like data filtering, auto and incremental refresh of materialized views, refresh interval, and more. Furthermore, we broadened the sources to include Aurora PostgreSQL, DynamoDB, and Amazon RDS for MySQL integrations with Amazon Redshift, solidifying our commitment to making it seamless for you to run analytics on your data. The introduction of zero-ETL was not just a technological advancement; it represented a paradigm shift in how organizations could approach their data strategies. By removing the need for intermediate data processing steps, we opened up new possibilities for near real-time analytics and decision-making.
Introducing history mode: A new frontier in data analysis
Zero-ETL has already simplified data integration, and we’re excited to announce a new feature that takes it a step further: history mode with Amazon Redshift. Using history mode with zero-ETL integrations, you can streamline your historical data analysis by maintaining full change data capture (CDC) from the source in Amazon Redshift. History mode enables you to unlock the full potential of your data by seamlessly capturing and retaining historical versions of records across your zero-ETL data sources. You can perform advanced historical analysis, build lookback reports, perform trend analysis, and create Type 2 slowly changing dimension (SCD) tables on Amazon Redshift. This allows you to consolidate your core analytical assets and derive insights across multiple applications, gaining cost savings and operational efficiencies. History mode also enables organizations to comply with regulatory requirements for maintaining historical records, facilitating comprehensive data governance and informed decision-making.
Zero-ETL integrations provide a current view of records in near real time, meaning only the latest changes from source databases are retained in Amazon Redshift. With history mode, Amazon Redshift introduces a revolutionary approach to historical data analysis. You can now configure your zero-ETL integrations to track every version of your records in source tables directly in Amazon Redshift, along with a source timestamp on every record version indicating when each record was inserted, modified, or deleted. Because data changes are tracked and retained by Amazon Redshift, this can help you meet compliance requirements without maintaining duplicate copies in data sources. You also no longer have to maintain and manage partitioned tables in your source databases just to version records and keep older data intact.
In a data warehouse, one of the most common dimensional modeling techniques is the star schema, where a fact table sits at the center surrounded by a number of associated dimension tables. A dimension is a structure that categorizes facts and measures to enable users to answer business questions. For example, in a typical sales domain, customer, time, and product are dimensions, and sales transactions is a fact. A slowly changing dimension (SCD) contains relatively static data that can change slowly over a period of time. There are three major types of SCDs maintained in data warehousing: Type 1 (no history), Type 2 (full history), and Type 3 (limited history). CDC is a characteristic of a database that provides the ability to identify the data that changed between two database loads, so that an action can be performed on the changed data.
In this post, we demonstrate how to enable history mode for tables in a zero-ETL integration and capture the full historical data changes as SCD2.
Solution overview
In this use case, we explore how a fictional nationwide retail chain, AnyCompany, uses AWS services to gain valuable insights into their customer base. With multiple locations across the country, AnyCompany aims to enhance their understanding of customer behavior and improve their marketing strategies through two key initiatives:
- Customer migration analysis – AnyCompany seeks to track and analyze customer relocation patterns, focusing on how geographical moves impact purchasing behavior. By monitoring these changes, the company can adapt its inventory, services, and local marketing efforts to better serve customers in their new locations.
- Marketing campaign effectiveness – The retailer wants to evaluate the impact of targeted marketing campaigns based on customer demographics at the time of campaign execution. This analysis can help AnyCompany refine its marketing strategies, optimize resource allocation, and improve overall campaign performance.
By closely tracking changes in customer profiles for both geographic movement and marketing responsiveness, AnyCompany is positioning itself to make more informed, data-driven decisions.
In this demonstration, we begin by loading a sample dataset into the source table, customer, in Aurora PostgreSQL-Compatible. To maintain historical records, we enable history mode on the customer table, which automatically tracks changes in Amazon Redshift.
When history mode is turned on, the following columns are automatically added to the target table, customer, in Amazon Redshift to keep track of changes in the source.
| Column name | Data type | Description |
| --- | --- | --- |
| `_record_is_active` | Boolean | Indicates whether a record in the target is currently active in the source. True indicates the record is active. |
| `_record_create_time` | Timestamp | Starting time (UTC) when the source record is active. |
| `_record_delete_time` | Timestamp | Ending time (UTC) when the source record is updated or deleted. |
Next, we create a dimension table, `customer_dim`, in Amazon Redshift with an additional surrogate key column to show an example of creating an SCD table. To optimize query performance for different queries, some of which might analyze only active or inactive records while others might analyze data as of a certain date, we defined a sort key consisting of the `_record_is_active`, `_record_create_time`, and `_record_delete_time` attributes in the `customer_dim` table.
The following figure provides the schema of the source table in Aurora PostgreSQL-Compatible, and the target table and target customer dimension table in Amazon Redshift.
To streamline the data population process, we developed a stored procedure named `SP_Customer_Type2_SCD()`. This procedure populates incremental data into the `customer_dim` table from the replicated `customer` table. It handles various data changes in the source table, including updates, inserts, and deletes, implementing an SCD2 approach.
Prerequisites
Before you get started, complete the following steps:
- Configure your Aurora DB cluster and your Redshift data warehouse with the required parameters and permissions. For instructions, refer to Getting started with Aurora zero-ETL integrations with Amazon Redshift.
- Create an Aurora zero-ETL integration with Amazon Redshift.
- From an Amazon Elastic Compute Cloud (Amazon EC2) terminal or AWS CloudShell, run the following commands to install psql so you can connect to the Aurora PostgreSQL cluster:
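For example, on Amazon Linux 2023 (an assumption; adjust the package manager and package name for your distribution):

```bash
# Install the PostgreSQL client tools (provides psql) on Amazon Linux 2023
sudo dnf install -y postgresql15
```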
- Load the sample source data (the consolidated psql session after this list sketches these commands):
  - Download the TPC-DS sample dataset for the `customer` table onto the machine running psql.
  - From the EC2 terminal, connect to the Aurora PostgreSQL DB using the default super user `postgres`.
  - Create the database `zetl`.
  - Change the connection to the newly created database.
  - Create the `customer` table (the example creates it in the public schema).
  - Load customer data from the downloaded dataset after changing the location of the dataset to your directory path.
  - Run a query to validate the successful creation of the table and loading of sample data.
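The following consolidated psql session is a minimal sketch of these steps. The cluster endpoint, file path, and abridged column list are illustrative assumptions; the full TPC-DS customer table has additional columns, and zero-ETL integrations require a primary key on source tables.

```sql
-- Connect from the EC2 terminal (shell):
--   psql -h <aurora-cluster-endpoint> -U postgres -d postgres

-- Create the source database and switch the connection to it
CREATE DATABASE zetl;
\c zetl

-- Create the customer table in the public schema (abridged column list)
CREATE TABLE public.customer (
    c_customer_sk     integer  NOT NULL,
    c_customer_id     char(16) NOT NULL,
    c_current_addr_sk integer,
    c_first_name      char(20),
    c_last_name       char(30),
    c_email_address   char(50),
    PRIMARY KEY (c_customer_sk)
);

-- Load the downloaded dataset (TPC-DS .dat files are pipe-delimited;
-- strip any trailing '|' per line first, and adjust the path)
\copy public.customer FROM '/home/ec2-user/customer.dat' WITH (FORMAT csv, DELIMITER '|');

-- Validate the table creation and data load
SELECT COUNT(*) FROM public.customer;
```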
The SQL output should confirm that the table was created and the sample data loaded successfully.
Create a target database in Amazon Redshift
To replicate data from your source into Amazon Redshift, you must create a target database from your integration in Amazon Redshift. For this post, we have already created a source database called `zetl` in Aurora PostgreSQL-Compatible as part of the prerequisites. Complete the following steps to create the target database:
- On the Amazon Redshift console, choose Query editor v2 in the navigation pane.
- Run the following commands to create a database called `postgres` in Amazon Redshift using the zero-ETL `integration_id`, with history mode turned on:
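The following is a sketch of these commands, assuming the `CREATE DATABASE ... FROM INTEGRATION` syntax with the `SET HISTORY_MODE` clause from the Amazon Redshift zero-ETL documentation; verify the exact syntax for your environment:

```sql
-- Look up the integration_id of your zero-ETL integration
SELECT integration_id FROM svv_integration;

-- Create the target database from the integration, with history mode
-- turned on ('<integration_id>' is a placeholder for the value above)
CREATE DATABASE postgres FROM INTEGRATION '<integration_id>' DATABASE zetl
SET HISTORY_MODE = TRUE;
```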
Turning on history mode at the time of target database creation enables history mode for all existing tables and for tables created in the future.
- Run the following query to validate the successful replication of the initial data from the source into Amazon Redshift:
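For example, assuming the `SVV_INTEGRATION_TABLE_STATE` system view exposes the state and history mode columns referenced below:

```sql
SELECT table_name, table_state, is_history_mode
FROM svv_integration_table_state
WHERE table_name = 'customer';
```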
The `customer` table should show `table_state` as `Synced` with `is_history_mode` as `true`.
Enable history mode for existing zero-ETL integrations
History mode can be enabled for your existing zero-ETL integrations using either the Amazon Redshift console or SQL commands. Based on your use case, you can turn on history mode at the database, schema, or table level. To use the Amazon Redshift console, complete the following steps:
- On the Amazon Redshift console, choose Zero-ETL integrations in the navigation pane.
- Choose your desired integration.
- Choose Manage history mode.
On this page, you can either enable or disable history mode for all tables or a subset of tables.
- Select Manage history mode for individual tables and select Turn on for history mode for the `customer` table.
- Choose Save changes.
- To confirm the changes, choose Table statistics and make sure History mode is On for the `customer` table.
- Optionally, you can run the following SQL command in Amazon Redshift to enable history mode for the `customer` table:
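A sketch of the table-level statement, assuming the `ALTER DATABASE ... INTEGRATION SET HISTORY_MODE` pattern from the zero-ETL documentation; verify the exact syntax for your Redshift version:

```sql
ALTER DATABASE postgres INTEGRATION SET HISTORY_MODE = TRUE
FOR TABLE public.customer;
```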
- Optionally, you can enable history mode for all current tables and tables created in the future in the database:
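Using the same assumed syntax, the database-level variant omits the table clause:

```sql
ALTER DATABASE postgres INTEGRATION SET HISTORY_MODE = TRUE;
```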
- Optionally, you can enable history mode for all current tables and tables created in the future in one or more schemas. The following query enables history mode for all current and future tables in the `public` schema:
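Again with the same assumed syntax, scoped to a schema:

```sql
ALTER DATABASE postgres INTEGRATION SET HISTORY_MODE = TRUE
FOR SCHEMA public;
```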
- Run the following query to validate that the `customer` table has successfully changed to history mode, with the `is_history_mode` column as `true`, so that it can begin tracking every version (including updates and deletes) of all records changed in the source:
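The same `SVV_INTEGRATION_TABLE_STATE` query shown earlier applies here:

```sql
SELECT table_name, table_state, is_history_mode
FROM svv_integration_table_state
WHERE table_name = 'customer';
```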
Initially, the table will be in the `ResyncInitiated` state before changing to `Synced`.
- Run the following query in the `zetl` database of Aurora PostgreSQL-Compatible to modify a source record and observe the behavior of history mode in the Amazon Redshift target:
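A hypothetical change for illustration (the specific row and address key are assumptions):

```sql
-- Simulate a customer relocation by pointing to a different address key
UPDATE public.customer
SET c_current_addr_sk = 100
WHERE c_customer_sk = 1;
```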
- Now run the following query in the `postgres` database of Amazon Redshift to see all versions of the same record:
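For the hypothetical customer updated above:

```sql
SELECT c_customer_sk, c_current_addr_sk, _record_is_active,
       _record_create_time, _record_delete_time
FROM public.customer
WHERE c_customer_sk = 1
ORDER BY _record_create_time;
```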
The zero-ETL integration with history mode has inactivated the old record, setting its `_record_is_active` column value to `false`, and created a new record with `_record_is_active` as `true`. You can also see how it maintains the `_record_create_time` and `_record_delete_time` column values for both records. The inactive record has a delete timestamp that matches the active record’s create timestamp.
Load incremental data in an SCD2 table
Complete the following steps to create an SCD2 table and implement an incremental data load process in a regular database of Amazon Redshift, in this case `dev`:
- Create an empty customer SCD2 table called `customer_dim` with SCD fields. The table also has DISTSTYLE `AUTO` and SORTKEY columns `_record_is_active`, `_record_create_time`, and `_record_delete_time`. When you define a sort key on a table, Amazon Redshift can skip reading entire blocks of data for that column, because it tracks the minimum and maximum column values stored on each block and can skip blocks that don’t apply to the predicate range.
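An illustrative version of this DDL, reusing the abridged column list from the source table sketch (`customer_dim_key` is the surrogate key):

```sql
CREATE TABLE public.customer_dim (
    customer_dim_key    bigint IDENTITY(1,1), -- surrogate key
    c_customer_sk       integer,
    c_customer_id       char(16),
    c_current_addr_sk   integer,
    c_first_name        char(20),
    c_last_name         char(30),
    c_email_address     char(50),
    _record_is_active   boolean,
    _record_create_time timestamp,
    _record_delete_time timestamp
)
DISTSTYLE AUTO
SORTKEY (_record_is_active, _record_create_time, _record_delete_time);
```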
Next, you create a stored procedure called `SP_Customer_Type2_SCD()` to populate incremental data in the `customer_dim` SCD2 table created in the preceding step. The stored procedure contains the following components:
- First, it fetches the max `_record_create_time` and max `_record_delete_time` for each `customer_id`.
- Then, it compares the output of the preceding step with the ongoing zero-ETL integration replicated table, looking for records created after the max creation time in the dimension table, or records in the replicated table with `_record_delete_time` after the max `_record_delete_time` in the dimension table, for each `customer_id`.
- The output of the preceding step captures the changed data between the replicated `customer` table and the target `customer_dim` dimension table. The interim data is staged to a `customer_stg` table, which is ready to be merged with the target table.
- During the merge process, records that need to be deleted are marked with `_record_delete_time` and `_record_is_active` is set to `false`, whereas newly created records are inserted into the target table `customer_dim` with `_record_is_active` as `true`.
- Create the stored procedure with the following code:
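The following is a minimal sketch consistent with the components described above, not the authors’ exact procedure. It assumes the abridged column list from the earlier DDL, that the replicated table is reachable through the cross-database path `postgres.public.customer`, and at most one change per customer between runs:

```sql
CREATE OR REPLACE PROCEDURE sp_customer_type2_scd()
AS $$
BEGIN
    -- 1. Stage source record versions the dimension hasn't seen yet
    DROP TABLE IF EXISTS customer_stg;
    CREATE TEMP TABLE customer_stg AS
    SELECT c.*
    FROM postgres.public.customer c
    LEFT JOIN (
        SELECT c_customer_sk,
               MAX(_record_create_time) AS max_create_time,
               MAX(_record_delete_time) AS max_delete_time
        FROM public.customer_dim
        GROUP BY c_customer_sk
    ) d ON c.c_customer_sk = d.c_customer_sk
    WHERE d.c_customer_sk IS NULL
       OR c._record_create_time > d.max_create_time
       OR (c._record_delete_time IS NOT NULL
           AND c._record_delete_time > COALESCE(d.max_delete_time, '1900-01-01'::timestamp));

    -- 2. Close out active dimension rows whose source version was updated or deleted
    UPDATE customer_dim
       SET _record_is_active = FALSE,
           _record_delete_time = s._record_delete_time
      FROM customer_stg s
     WHERE customer_dim.c_customer_sk = s.c_customer_sk
       AND customer_dim._record_is_active = TRUE
       AND s._record_delete_time IS NOT NULL;

    -- 3. Insert new active record versions (the surrogate key is auto-generated)
    INSERT INTO customer_dim (c_customer_sk, c_customer_id, c_current_addr_sk,
        c_first_name, c_last_name, c_email_address,
        _record_is_active, _record_create_time, _record_delete_time)
    SELECT c_customer_sk, c_customer_id, c_current_addr_sk,
           c_first_name, c_last_name, c_email_address,
           TRUE, _record_create_time, NULL
    FROM customer_stg
    WHERE _record_is_active = TRUE;
END;
$$ LANGUAGE plpgsql;
```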
- Run and schedule the stored procedure to load the initial and ongoing incremental data into the `customer_dim` SCD2 table:
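Run it manually as follows, or schedule it (for example, with the query editor v2 scheduler):

```sql
CALL sp_customer_type2_scd();
```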
- Validate the data in the `customer_dim` table for the same customer with a changed address:
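For the hypothetical customer modified earlier:

```sql
SELECT customer_dim_key, c_customer_sk, c_current_addr_sk,
       _record_is_active, _record_create_time, _record_delete_time
FROM public.customer_dim
WHERE c_customer_sk = 1
ORDER BY _record_create_time;
```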
You have successfully implemented an incremental load strategy for the customer SCD2 table. Going forward, all changes to `customer` will be tracked and maintained in this customer dimension table by running the stored procedure. This enables you to analyze customer data at a desired point in time for varying use cases: for example, performing customer migration analysis to see how geographical moves impact purchasing behavior, or evaluating marketing campaign effectiveness based on customer demographics at the time of campaign execution.
Industry use cases for history mode
The following are other industry use cases enabled by history mode between operational data stores and Amazon Redshift:
- Financial auditing or regulatory compliance – Track changes in financial records over time to support compliance and audit requirements. History mode allows auditors to reconstruct the state of financial data at any point in time, which is crucial for investigations and regulatory reporting.
- Customer journey analysis – Understand how customer data evolves to gain insights into behavior patterns and preferences. Marketers can analyze how customer profiles change over time, informing personalization strategies and lifetime value calculations.
- Supply chain optimization – Analyze historical inventory and order data to identify trends and optimize stock levels. Supply chain managers can review how demand patterns have shifted over time, improving forecasting accuracy.
- HR analytics – Track employee data changes over time for better workforce planning and performance analysis. HR professionals can analyze career progression, salary changes, and skill development trends across the organization.
- Machine learning model auditing – Data scientists can use historical data to train models, compare predictions vs. actuals to improve accuracy, and help explain model behavior and identify potential biases over time.
- Hospitality and airline industry use cases – For example:
- Customer service – Access historical reservation data to swiftly address customer queries, enhancing service quality and customer satisfaction.
- Crew scheduling – Track crew schedule changes to help comply with union contracts, maintaining positive labor relations and optimizing workforce management.
- Data science applications – Use historical data to train models on multiple scenarios from different time periods. Compare predictions against actuals to improve model accuracy for key operations such as airport gate management, flight prioritization, and crew scheduling optimization.
Best practices
If your requirement is to separate active and inactive records, you can use `_record_is_active` as the first sort key. For other patterns where you want to analyze data as of a specific date in the past, irrespective of whether the data is active or inactive, `_record_create_time` and `_record_delete_time` can be added as sort keys.
History mode retains record versions, which increases the table size in Amazon Redshift and could impact query performance. Therefore, periodically perform DML deletes for outdated record versions (delete data beyond a certain timeframe if it’s not needed for analysis). When executing these deletions, maintain data integrity by deleting across all related tables. Vacuuming also becomes necessary after you perform DML deletes on records whose versioning is no longer required; note that Amazon Redshift auto vacuum delete is more efficient when operating on bulk deletes. You can monitor vacuum progress using the `SYS_VACUUM_HISTORY` view.
Clean up
Complete the following steps to clean up your resources:
- Delete the zero-ETL integration.
- Delete the Aurora PostgreSQL DB cluster.
- Delete the Redshift data warehouse (the provisioned cluster or serverless workgroup and namespace).
- Terminate the EC2 instance, if you created one for this walkthrough.
Conclusion
Zero-ETL integrations have already made significant strides in simplifying data integration and enabling near real-time analytics. With the addition of history mode, AWS continues to innovate, providing you with even more powerful tools to derive value from your data.
As businesses increasingly rely on data-driven decision-making, zero-ETL with history mode will be crucial in maintaining a competitive edge in the digital economy. These advancements not only streamline data processes but also open up new avenues for analysis and insight generation.
To learn more about zero-ETL integration with history mode, refer to Zero-ETL integrations and Limitations. Get started with zero-ETL on AWS by creating a free account today!
About the Authors
Raks Khare is a Senior Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers across varying industries and regions architect data analytics solutions at scale on the AWS platform. Outside of work, he likes exploring new travel and food destinations and spending quality time with his family.
Jyoti Aggarwal is a Product Management Lead for AWS zero-ETL. She leads the product and business strategy, including driving initiatives around performance, customer experience, and security. She brings expertise in cloud compute, data pipelines, analytics, artificial intelligence (AI), and data services, including databases, data warehouses, and data lakes.
Gopal Paliwal is a Principal Engineer for Amazon Redshift, leading the software development of ZeroETL initiatives for Amazon Redshift.
Harman Nagra is a Principal Solutions Architect at AWS, based in San Francisco. He works with global financial services organizations to design, develop, and optimize their workloads on AWS.
Sumanth Punyamurthula is a Senior Data and Analytics Architect at Amazon Web Services with more than 20 years of experience in leading large analytical initiatives, including analytics, data warehouse, data lakes, data governance, security, and cloud infrastructure across travel, hospitality, financial, and healthcare industries.