
In the ever-evolving landscape of cloud computing and data management, AWS has consistently been at the forefront of innovation. One of the groundbreaking developments in recent years is zero-ETL integration, a set of fully managed integrations by AWS that minimizes the need to build extract, transform, and load (ETL) data pipelines. This post explores a brief history of zero-ETL, its importance for customers, and introduces an exciting new feature: history mode for the Amazon Aurora PostgreSQL-Compatible Edition, Amazon Aurora MySQL-Compatible Edition, Amazon Relational Database Service (Amazon RDS) for MySQL, and Amazon DynamoDB zero-ETL integrations with Amazon Redshift.
A brief history of zero-ETL integrations
The concept of zero-ETL integrations emerged as a response to the growing complexities and inefficiencies in traditional ETL processes, which are time-consuming and complex to develop, maintain, and scale. Although not every use case can be replaced with zero-ETL, it simplifies replication and allows you to apply transformations post-replication, eliminating the need for additional ETL technology between the source database and Amazon Redshift. We at AWS recognized the need for a more streamlined approach to data integration, particularly between operational databases and cloud data warehouses. The journey of zero-ETL began in late 2022 when we introduced the feature for Aurora MySQL with Amazon Redshift. This feature marked a pivotal moment in streamlining complex data workflows, enabling near real-time data replication and analysis while eliminating the need for ETL processes.
Building on the success of our first zero-ETL integration, we’ve made continuous strides in this space by working backward from our customers’ needs and launching features like data filtering, auto and incremental refresh of materialized views, refresh interval, and more. Furthermore, we broadened the sources to include Aurora PostgreSQL, DynamoDB, and Amazon RDS for MySQL integrations with Amazon Redshift, solidifying our commitment to making it seamless for you to run analytics on your data. The introduction of zero-ETL was not just a technological advancement; it represented a paradigm shift in how organizations could approach their data strategies. By removing the need for intermediate data processing steps, we opened up new possibilities for near real-time analytics and decision-making.
Introducing history mode: A new frontier in data analysis
Zero-ETL has already simplified data integration, and we’re excited to announce a new feature that takes it a step further: history mode with Amazon Redshift. Using history mode with zero-ETL integrations, you can streamline your historical data analysis by maintaining full change data capture (CDC) from the source in Amazon Redshift. History mode enables you to unlock the full potential of your data by seamlessly capturing and retaining historical versions of records across your zero-ETL data sources. You can perform advanced historical analysis, build lookback reports, perform trend analysis, and create Type 2 slowly changing dimension (SCD) tables on Amazon Redshift. This allows you to consolidate your core analytical assets and derive insights across multiple applications, gaining cost savings and operational efficiencies. History mode also enables organizations to comply with regulatory requirements for maintaining historical records, facilitating comprehensive data governance and informed decision-making.
Zero-ETL integrations provide a current view of records in near real time, meaning only the latest changes from source databases are retained in Amazon Redshift. With history mode, Amazon Redshift introduces a revolutionary approach to historical data analysis. You can now configure your zero-ETL integrations to track every version of your records in source tables directly in Amazon Redshift, along with a source timestamp on every record version indicating when each record was inserted, modified, or deleted. Because data changes are tracked and retained by Amazon Redshift, this can help you meet compliance requirements without maintaining duplicate copies in data sources. You also no longer have to maintain and manage partitioned tables in your source databases just to version records and keep older data intact.
In a data warehouse, one of the most common dimensional modeling techniques is the star schema, where a fact table sits at the center surrounded by a number of associated dimension tables. A dimension is a structure that categorizes facts and measures to enable users to answer business questions. For example, in a typical sales domain, customer, time, and product are dimensions, and sales transactions is a fact. A slowly changing dimension (SCD) contains relatively static data that can change slowly over a period of time. There are three major types of SCDs maintained in data warehousing: Type 1 (no history), Type 2 (full history), and Type 3 (limited history). CDC is a characteristic of a database that provides the ability to identify the data that changed between two database loads, so that an action can be performed on the changed data.
In this post, we demonstrate how to enable history mode for tables in a zero-ETL integration and capture the full historical data changes as SCD2.
Solution overview
In this use case, we explore how a fictional nationwide retail chain, AnyCompany, uses AWS services to gain valuable insights into their customer base. With multiple locations across the country, AnyCompany aims to enhance their understanding of customer behavior and improve their marketing strategies through two key initiatives:
- Customer migration analysis – AnyCompany seeks to track and analyze customer relocation patterns, focusing on how geographical moves impact purchasing behavior. By monitoring these changes, the company can adapt its inventory, services, and local marketing efforts to better serve customers in their new locations.
- Marketing campaign effectiveness – The retailer wants to evaluate the impact of targeted marketing campaigns based on customer demographics at the time of campaign execution. This analysis can help AnyCompany refine its marketing strategies, optimize resource allocation, and improve overall campaign performance.
By closely tracking changes in customer profiles for both geographic movement and marketing responsiveness, AnyCompany is positioning itself to make more informed, data-driven decisions.
In this demonstration, we begin by loading a sample dataset into the source table, customer, in Aurora PostgreSQL-Compatible. To maintain historical records, we enable history mode on the customer table, which automatically tracks changes in Amazon Redshift.
When history mode is turned on, the following columns are automatically added to the target table, customer, in Amazon Redshift to keep track of changes in the source.
| Column name | Data type | Description |
| --- | --- | --- |
| `_record_is_active` | Boolean | Indicates whether a record in the target is currently active in the source. True indicates the record is active. |
| `_record_create_time` | Timestamp | Starting time (UTC) when the source record is active. |
| `_record_delete_time` | Timestamp | Ending time (UTC) when the source record is updated or deleted. |
Next, we create a dimension table, `customer_dim`, in Amazon Redshift with an additional surrogate key column to show an example of creating an SCD table. To optimize query performance for different queries, some of which might analyze only active or inactive records while others might analyze data as of a certain date, we defined a sort key consisting of the `_record_is_active`, `_record_create_time`, and `_record_delete_time` attributes in the `customer_dim` table.
The following figure provides the schema of the source table in Aurora PostgreSQL-Compatible, and the target table and target customer dimension table in Amazon Redshift.
To streamline the data population process, we developed a stored procedure named `SP_Customer_Type2_SCD()`. This procedure populates incremental data into the `customer_dim` table from the replicated `customer` table. It handles various data changes in the source table, including updates, inserts, and deletes, implementing an SCD2 approach.
Prerequisites
Before you get started, complete the following steps:
- Configure your Aurora DB cluster and your Redshift data warehouse with the required parameters and permissions. For instructions, refer to Getting started with Aurora zero-ETL integrations with Amazon Redshift.
- Create an Aurora zero-ETL integration with Amazon Redshift.
- From an Amazon Elastic Compute Cloud (Amazon EC2) terminal or AWS CloudShell, run the following commands to install psql so you can connect to the Aurora PostgreSQL cluster:
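For example, on Amazon Linux 2023 (an assumption; adjust the package manager and package name for your distribution):

```bash
# Install the PostgreSQL client tools (provides psql) on Amazon Linux 2023
sudo dnf install -y postgresql15
```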
- Load the sample source data (the consolidated psql session after this list sketches these commands):
  - Download the TPC-DS sample dataset for the `customer` table onto the machine running psql.
  - From the EC2 terminal, connect to the Aurora PostgreSQL DB using the default super user `postgres`.
  - Create the database `zetl`.
  - Change the connection to the newly created database.
  - Create the `customer` table (the example creates it in the public schema).
  - Load customer data from the downloaded dataset after changing the location of the dataset to your directory path.
  - Run a query to validate the successful creation of the table and loading of sample data.
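The following consolidated psql session is a minimal sketch of these steps. The cluster endpoint, file path, and abridged column list are illustrative assumptions; the full TPC-DS customer table has additional columns, and zero-ETL integrations require a primary key on source tables.

```sql
-- Connect from the EC2 terminal (shell):
--   psql -h <aurora-cluster-endpoint> -U postgres -d postgres

-- Create the source database and switch the connection to it
CREATE DATABASE zetl;
\c zetl

-- Create the customer table in the public schema (abridged column list)
CREATE TABLE public.customer (
    c_customer_sk     integer  NOT NULL,
    c_customer_id     char(16) NOT NULL,
    c_current_addr_sk integer,
    c_first_name      char(20),
    c_last_name       char(30),
    c_email_address   char(50),
    PRIMARY KEY (c_customer_sk)
);

-- Load the downloaded dataset (TPC-DS .dat files are pipe-delimited;
-- strip any trailing '|' per line first, and adjust the path)
\copy public.customer FROM '/home/ec2-user/customer.dat' WITH (FORMAT csv, DELIMITER '|');

-- Validate the table creation and data load
SELECT COUNT(*) FROM public.customer;
```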
The SQL output should confirm that the table was created and the sample data loaded successfully.
Create a target database in Amazon Redshift
To replicate data from your source into Amazon Redshift, you must create a target database from your integration in Amazon Redshift. For this post, we have already created a source database called `zetl` in Aurora PostgreSQL-Compatible as part of the prerequisites. Complete the following steps to create the target database:
- On the Amazon Redshift console, choose Query editor v2 in the navigation pane.
- Run the following commands to create a database called `postgres` in Amazon Redshift using the zero-ETL `integration_id`, with history mode turned on:
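The following is a sketch of these commands, assuming the `CREATE DATABASE ... FROM INTEGRATION` syntax with the `SET HISTORY_MODE` clause from the Amazon Redshift zero-ETL documentation; verify the exact syntax for your environment:

```sql
-- Look up the integration_id of your zero-ETL integration
SELECT integration_id FROM svv_integration;

-- Create the target database from the integration, with history mode
-- turned on ('<integration_id>' is a placeholder for the value above)
CREATE DATABASE postgres FROM INTEGRATION '<integration_id>' DATABASE zetl
SET HISTORY_MODE = TRUE;
```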
Turning on history mode at the time of target database creation enables history mode for all existing tables and for tables created in the future.
- Run the following query to validate the successful replication of the initial data from the source into Amazon Redshift:
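For example, assuming the `SVV_INTEGRATION_TABLE_STATE` system view exposes the state and history mode columns referenced below:

```sql
SELECT table_name, table_state, is_history_mode
FROM svv_integration_table_state
WHERE table_name = 'customer';
```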
The `customer` table should show `table_state` as `Synced` with `is_history_mode` as `true`.
Enable history mode for existing zero-ETL integrations
History mode can be enabled for your existing zero-ETL integrations using either the Amazon Redshift console or SQL commands. Based on your use case, you can turn on history mode at the database, schema, or table level. To use the Amazon Redshift console, complete the following steps:
- On the Amazon Redshift console, choose Zero-ETL integrations in the navigation pane.
- Choose your desired integration.
- Choose Manage history mode.
On this page, you can either enable or disable history mode for all tables or a subset of tables.
- Select Manage history mode for individual tables and select Turn on for history mode for the `customer` table.
- Choose Save changes.
- To confirm the changes, choose Table statistics and make sure History mode is On for the `customer` table.
- Optionally, you can run the following SQL command in Amazon Redshift to enable history mode for the `customer` table:
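A sketch of the table-level statement, assuming the `ALTER DATABASE ... INTEGRATION SET HISTORY_MODE` pattern from the zero-ETL documentation; verify the exact syntax for your Redshift version:

```sql
ALTER DATABASE postgres INTEGRATION SET HISTORY_MODE = TRUE
FOR TABLE public.customer;
```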
- Optionally, you can enable history mode for all current tables and tables created in the future in the database:
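Using the same assumed syntax, the database-level variant omits the table clause:

```sql
ALTER DATABASE postgres INTEGRATION SET HISTORY_MODE = TRUE;
```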
- Optionally, you can enable history mode for all current tables and tables created in the future in one or more schemas. The following query enables history mode for all current and future tables in the `public` schema:
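Again with the same assumed syntax, scoped to a schema:

```sql
ALTER DATABASE postgres INTEGRATION SET HISTORY_MODE = TRUE
FOR SCHEMA public;
```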
- Run the following query to validate that the `customer` table has successfully changed to history mode, with the `is_history_mode` column as `true`, so that it can begin tracking every version (including updates and deletes) of all records changed in the source:
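The same `SVV_INTEGRATION_TABLE_STATE` query shown earlier applies here:

```sql
SELECT table_name, table_state, is_history_mode
FROM svv_integration_table_state
WHERE table_name = 'customer';
```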
Initially, the table will be in the `ResyncInitiated` state before changing to `Synced`.
- Run the following query in the `zetl` database of Aurora PostgreSQL-Compatible to modify a source record and observe the behavior of history mode in the Amazon Redshift target:
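A hypothetical change for illustration (the specific row and address key are assumptions):

```sql
-- Simulate a customer relocation by pointing to a different address key
UPDATE public.customer
SET c_current_addr_sk = 100
WHERE c_customer_sk = 1;
```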
- Now run the following query in the `postgres` database of Amazon Redshift to see all versions of the same record:
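For the hypothetical customer updated above:

```sql
SELECT c_customer_sk, c_current_addr_sk, _record_is_active,
       _record_create_time, _record_delete_time
FROM public.customer
WHERE c_customer_sk = 1
ORDER BY _record_create_time;
```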
The zero-ETL integration with history mode has inactivated the old record, setting its `_record_is_active` column value to `false`, and created a new record with `_record_is_active` as `true`. You can also see how it maintains the `_record_create_time` and `_record_delete_time` column values for both records. The inactive record has a delete timestamp that matches the active record’s create timestamp.
Load incremental data in an SCD2 table
Complete the following steps to create an SCD2 table and implement an incremental data load process in a regular database of Amazon Redshift, in this case `dev`:
- Create an empty customer SCD2 table called `customer_dim` with SCD fields. The table also has DISTSTYLE `AUTO` and SORTKEY columns `_record_is_active`, `_record_create_time`, and `_record_delete_time`. When you define a sort key on a table, Amazon Redshift can skip reading entire blocks of data for that column, because it tracks the minimum and maximum column values stored on each block and can skip blocks that don’t apply to the predicate range.
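An illustrative version of this DDL, reusing the abridged column list from the source table sketch (`customer_dim_key` is the surrogate key):

```sql
CREATE TABLE public.customer_dim (
    customer_dim_key    bigint IDENTITY(1,1), -- surrogate key
    c_customer_sk       integer,
    c_customer_id       char(16),
    c_current_addr_sk   integer,
    c_first_name        char(20),
    c_last_name         char(30),
    c_email_address     char(50),
    _record_is_active   boolean,
    _record_create_time timestamp,
    _record_delete_time timestamp
)
DISTSTYLE AUTO
SORTKEY (_record_is_active, _record_create_time, _record_delete_time);
```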
Next, you create a stored procedure called `SP_Customer_Type2_SCD()` to populate incremental data in the `customer_dim` SCD2 table created in the preceding step. The stored procedure contains the following components:
- First, it fetches the max `_record_create_time` and max `_record_delete_time` for each `customer_id`.
- Then, it compares the output of the preceding step with the ongoing zero-ETL integration replicated table, looking for records created after the max creation time in the dimension table, or records in the replicated table with `_record_delete_time` after the max `_record_delete_time` in the dimension table, for each `customer_id`.
- The output of the preceding step captures the changed data between the replicated `customer` table and the target `customer_dim` dimension table. The interim data is staged to a `customer_stg` table, which is ready to be merged with the target table.
- During the merge process, records that need to be deleted are marked with `_record_delete_time` and `_record_is_active` is set to `false`, whereas newly created records are inserted into the target table `customer_dim` with `_record_is_active` as `true`.
- Create the stored procedure with the following code:
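The following is a minimal sketch consistent with the components described above, not the authors’ exact procedure. It assumes the abridged column list from the earlier DDL, that the replicated table is reachable through the cross-database path `postgres.public.customer`, and at most one change per customer between runs:

```sql
CREATE OR REPLACE PROCEDURE sp_customer_type2_scd()
AS $$
BEGIN
    -- 1. Stage source record versions the dimension hasn't seen yet
    DROP TABLE IF EXISTS customer_stg;
    CREATE TEMP TABLE customer_stg AS
    SELECT c.*
    FROM postgres.public.customer c
    LEFT JOIN (
        SELECT c_customer_sk,
               MAX(_record_create_time) AS max_create_time,
               MAX(_record_delete_time) AS max_delete_time
        FROM public.customer_dim
        GROUP BY c_customer_sk
    ) d ON c.c_customer_sk = d.c_customer_sk
    WHERE d.c_customer_sk IS NULL
       OR c._record_create_time > d.max_create_time
       OR (c._record_delete_time IS NOT NULL
           AND c._record_delete_time > COALESCE(d.max_delete_time, '1900-01-01'::timestamp));

    -- 2. Close out active dimension rows whose source version was updated or deleted
    UPDATE customer_dim
       SET _record_is_active = FALSE,
           _record_delete_time = s._record_delete_time
      FROM customer_stg s
     WHERE customer_dim.c_customer_sk = s.c_customer_sk
       AND customer_dim._record_is_active = TRUE
       AND s._record_delete_time IS NOT NULL;

    -- 3. Insert new active record versions (the surrogate key is auto-generated)
    INSERT INTO customer_dim (c_customer_sk, c_customer_id, c_current_addr_sk,
        c_first_name, c_last_name, c_email_address,
        _record_is_active, _record_create_time, _record_delete_time)
    SELECT c_customer_sk, c_customer_id, c_current_addr_sk,
           c_first_name, c_last_name, c_email_address,
           TRUE, _record_create_time, NULL
    FROM customer_stg
    WHERE _record_is_active = TRUE;
END;
$$ LANGUAGE plpgsql;
```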
- Run and schedule the stored procedure to load the initial and ongoing incremental data into the `customer_dim` SCD2 table:
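Run it manually as follows, or schedule it (for example, with the query editor v2 scheduler):

```sql
CALL sp_customer_type2_scd();
```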
- Validate the data in the `customer_dim` table for the same customer with a changed address:
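For the hypothetical customer modified earlier:

```sql
SELECT customer_dim_key, c_customer_sk, c_current_addr_sk,
       _record_is_active, _record_create_time, _record_delete_time
FROM public.customer_dim
WHERE c_customer_sk = 1
ORDER BY _record_create_time;
```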
You have successfully implemented an incremental load strategy for the customer SCD2 table. Going forward, all changes to `customer` will be tracked and maintained in this customer dimension table by running the stored procedure. This enables you to analyze customer data at a desired point in time for varying use cases: for example, performing customer migration analysis to see how geographical moves impact purchasing behavior, or evaluating marketing campaign effectiveness based on customer demographics at the time of campaign execution.
Industry use cases for history mode
The following are other industry use cases enabled by history mode between operational data stores and Amazon Redshift:
- Financial auditing or regulatory compliance – Track changes in financial records over time to support compliance and audit requirements. History mode allows auditors to reconstruct the state of financial data at any point in time, which is crucial for investigations and regulatory reporting.
- Customer journey analysis – Understand how customer data evolves to gain insights into behavior patterns and preferences. Marketers can analyze how customer profiles change over time, informing personalization strategies and lifetime value calculations.
- Supply chain optimization – Analyze historical inventory and order data to identify trends and optimize stock levels. Supply chain managers can review how demand patterns have shifted over time, improving forecasting accuracy.
- HR analytics – Track employee data changes over time for better workforce planning and performance analysis. HR professionals can analyze career progression, salary changes, and skill development trends across the organization.
- Machine learning model auditing – Data scientists can use historical data to train models, compare predictions vs. actuals to improve accuracy, and help explain model behavior and identify potential biases over time.
- Hospitality and airline industry use cases – For example:
- Customer service – Access historical reservation data to swiftly address customer queries, enhancing service quality and customer satisfaction.
- Crew scheduling – Track crew schedule changes to help comply with union contracts, maintaining positive labor relations and optimizing workforce management.
- Data science applications – Use historical data to train models on multiple scenarios from different time periods. Compare predictions against actuals to improve model accuracy for key operations such as airport gate management, flight prioritization, and crew scheduling optimization.
Best practices
If your requirement is to separate active and inactive records, you can use `_record_is_active` as the first sort key. For other patterns where you want to analyze data as of a specific date in the past, irrespective of whether the data is active or inactive, `_record_create_time` and `_record_delete_time` can be added as sort keys.
History mode retains record versions, which increases the table size in Amazon Redshift and could impact query performance. Therefore, periodically perform DML deletes for outdated record versions (delete data beyond a certain timeframe if it’s not needed for analysis). When executing these deletions, maintain data integrity by deleting across all related tables. Vacuuming also becomes necessary after you perform DML deletes on records whose versioning is no longer required; note that Amazon Redshift auto vacuum delete is more efficient when operating on bulk deletes. You can monitor vacuum progress using the `SYS_VACUUM_HISTORY` view.
Clean up
Complete the following steps to clean up your resources:
- Delete the zero-ETL integration.
- Delete the Aurora PostgreSQL DB cluster.
- Delete the Redshift data warehouse (the provisioned cluster or serverless workgroup and namespace).
- Terminate the EC2 instance, if you created one for this walkthrough.
Conclusion
Zero-ETL integrations have already made significant strides in simplifying data integration and enabling near real-time analytics. With the addition of history mode, AWS continues to innovate, providing you with even more powerful tools to derive value from your data.
As businesses increasingly rely on data-driven decision-making, zero-ETL with history mode will be crucial in maintaining a competitive edge in the digital economy. These advancements not only streamline data processes but also open up new avenues for analysis and insight generation.
To learn more about zero-ETL integration with history mode, refer to Zero-ETL integrations and Limitations. Get started with zero-ETL on AWS by creating a free account today!
About the Authors
Raks Khare is a Senior Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers across varying industries and regions architect data analytics solutions at scale on the AWS platform. Outside of work, he likes exploring new travel and food destinations and spending quality time with his family.
Jyoti Aggarwal is a Product Management Lead for AWS zero-ETL. She leads the product and business strategy, including driving initiatives around performance, customer experience, and security. She brings expertise in cloud compute, data pipelines, analytics, artificial intelligence (AI), and data services, including databases, data warehouses, and data lakes.
Gopal Paliwal is a Principal Engineer for Amazon Redshift, leading the software development of ZeroETL initiatives for Amazon Redshift.
Harman Nagra is a Principal Solutions Architect at AWS, based in San Francisco. He works with global financial services organizations to design, develop, and optimize their workloads on AWS.
Sumanth Punyamurthula is a Senior Data and Analytics Architect at Amazon Web Services with more than 20 years of experience in leading large analytical initiatives, including analytics, data warehouse, data lakes, data governance, security, and cloud infrastructure across travel, hospitality, financial, and healthcare industries.