Evaluating a sample Amazon Redshift data sharing architecture using Redshift Test Drive and advanced SQL analysis


With the launch of Amazon Redshift Serverless and the various provisioned instance deployment options, customers are looking for tools that help them determine the optimal data warehouse configuration to support their Amazon Redshift workloads.

Amazon Redshift is a widely used, fully managed, petabyte-scale data warehouse service. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.

Redshift Test Drive is a tool, hosted in a GitHub repository, that lets customers evaluate which data warehouse configuration options are best suited for their workload. The Test Drive Workload Replicator utility consists of scripts that can be used to extract the workload queries from your source warehouse audit logs and replay them on a target warehouse you launch. The Test Drive Configuration Comparison utility automates this process by deploying target Amazon Redshift warehouses and orchestrating the replay of the source workload through a combination of AWS CloudFormation and AWS Step Functions.

Both utilities unload the performance metrics from the replay of the source workload on the target configuration(s) to Amazon Simple Storage Service (Amazon S3). Although the Replay Analysis UI and the Configuration Comparison utility can provide a preliminary performance comparison, many customers want to dig deeper by analyzing the raw data themselves.

This walkthrough illustrates an example workload replayed on a single Amazon Redshift data warehouse and on a data sharing architecture using the Workload Replicator utility; the output is then used to evaluate the performance of the workload.

Use case overview

For the sample use case, we assume we have an existing 2-node ra3.4xlarge provisioned data warehouse that currently runs extract, transform, and load (ETL), ad hoc, and business intelligence (BI) queries. We're interested in breaking these workloads apart using data sharing into a 32 base Redshift Processing Unit (RPU) Serverless producer running the ETL workload and a 64 base RPU Serverless consumer running the BI workload. We used Workload Replicator to replay the workload on a replica baseline of the source and on the target data sharing configuration as specified in the tutorial. The following image shows the process flow.

Generating and accessing Test Drive metrics

The results of Amazon Redshift Test Drive can be accessed using an external schema for analysis of a replay. Refer to the Workload Replicator README and the Configuration Comparison README for more detailed instructions to execute a replay using the respective tool.

The external schema for analysis is automatically created with the Configuration Comparison utility, in which case you can proceed directly to the SQL analysis in the Deploy the QEv2 SQL Notebook and analyze workload section. If you use Workload Replicator, however, the external schema is not created automatically and therefore needs to be configured as a prerequisite to the SQL analysis. In the following walkthrough, we demonstrate how the external schema can be set up, using a sample analysis of the data sharing use case.

Executing Test Drive Workload Replicator for data sharing

To execute Workload Replicator, use Amazon Elastic Compute Cloud (Amazon EC2) to run the automation scripts used to extract the workload from the source.

Configure Amazon Redshift Data Warehouse

  1. Create a snapshot following the guidance in the Amazon Redshift Management Guide.
  2. Enable audit logging following the guidance in the Amazon Redshift Management Guide.
  3. Enable user activity logging on the source cluster following the guidance in the Amazon Redshift Management Guide.

Enabling logging requires a change to the parameter group. Audit logging needs to be enabled before the workload that will be replayed runs, because the connections and SQL queries of the workload are extracted from these logs.

  4. Launch the baseline replica by restoring the snapshot to a 2-node ra3.4xlarge provisioned cluster.
  5. Launch the producer warehouse by restoring the snapshot to a 32 RPU serverless namespace.
  6. The consumer should not contain the schema and tables that will be shared from the producer. You can either launch the 64 RPU Serverless consumer from the snapshot and then drop the relevant objects, or create a new 64 RPU Serverless consumer warehouse and recreate the consumer users.
  7. Create a datashare from the producer to the consumer and add the relevant objects, as shown in the example following this list.
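
The following commands are a minimal sketch of this step, assuming a hypothetical datashare named etl_datashare that shares a schema named etl_schema; replace the namespace IDs and the consumer database name (datashare_database_name, reused in the external schema command later in this section) with your own values.

-- On the producer: create the datashare and add the objects to share
CREATE DATASHARE etl_datashare;
ALTER DATASHARE etl_datashare ADD SCHEMA etl_schema;
ALTER DATASHARE etl_datashare ADD ALL TABLES IN SCHEMA etl_schema;
GRANT USAGE ON DATASHARE etl_datashare TO NAMESPACE '{consumer_namespace_id}';

-- On the consumer: create a local database from the datashare
CREATE DATABASE datashare_database_name FROM DATASHARE etl_datashare
OF NAMESPACE '{producer_namespace_id}';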

Data share objects can be read using two mechanisms: using three-part notation (database.schema.table), or by creating an external schema pointing to a shared schema and querying that using two-part notation (external_schema.table). Because we want to seamlessly run the source workload, which uses two-part notation on the local objects, this post demonstrates the latter approach. For each schema shared from the producer, run the following command on the consumer:

CREATE EXTERNAL SCHEMA schema_name 
FROM REDSHIFT DATABASE 'datashare_database_name' SCHEMA 'schema_name';

Make sure to use the same schema name as the source for the external schema. Also, if any queries are run on the public schema, drop the local public schema first before creating the external equivalent.
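
For example, if the workload queries the public schema, a sketch of that replacement on the consumer (assuming the shared database is named datashare_database_name as in the preceding example) would be:

-- Replace the local public schema with the shared equivalent
-- (add CASCADE to the DROP if it still contains objects you intend to discard)
DROP SCHEMA public;
CREATE EXTERNAL SCHEMA public
FROM REDSHIFT DATABASE 'datashare_database_name' SCHEMA 'public';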

  8. Grant usage on the schema to any relevant users, as shown in the example that follows.
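
For example, assuming a hypothetical BI user named bi_user on the consumer:

-- Allow the BI user to query the shared schema through the external schema
GRANT USAGE ON SCHEMA schema_name TO bi_user;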

Configure Redshift Test Drive Workload Replicator

  1. Create an S3 bucket to store the artifacts required by the utility (such as the metrics, extracted workload, and output data from running UNLOAD commands).
  2. Launch the following three types of EC2 instances using the recommended configuration of m5.8xlarge, 32GB of SSD storage, and Amazon Linux AMI:
    1. Baseline instance
    2. Target-producer instance
    3. Target-consumer instance

Make sure you can connect to the EC2 instance to run the utility.

  3. For each instance, install the required libraries by completing steps 2.i through 2.v from the GitHub repository (step 2.ii is only needed if an ODBC driver should be used; the default is the Amazon Redshift Python driver).
  4. Create an AWS Identity and Access Management (IAM) role for the EC2 instances to access the Amazon Redshift warehouses, to read from the S3 audit logging bucket, and with both read and write access to the new S3 bucket created for storing replay artifacts.
  5. If you are going to run COPY and UNLOAD commands, create an IAM role with access to the required S3 buckets, and attach it to the Amazon Redshift warehouses that will execute the loads and unloads.

In this example, the IAM role is attached to the baseline replica and producer warehouses because these will be executing the ETL processes. The utility will update UNLOAD commands to unload data to a bucket you define, which as a best practice should be the bucket created for S3 artifacts. Write permissions need to be granted to the Amazon Redshift warehouse for this location.
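
For illustration only, a rewritten UNLOAD during replay might look like the following; the table name and key prefix are hypothetical, and the exact path layout under the artifacts bucket is determined by the utility.

-- The utility rewrites the TO clause of replayed UNLOAD commands to the bucket you define
UNLOAD ('SELECT * FROM etl_schema.daily_sales_summary')
TO 's3://testdriveartifacts/unloads/daily_sales_summary/'
IAM_ROLE '{redshift_iam_role}';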

Run Redshift Test Drive Workload Replicator

  1. Run aws configure on the EC2 instances and populate the default Region with the Region the utility is being executed in.
  2. The extract only needs to be run once, so connect to the baseline EC2 instance and run vi config/extract.yaml to open the extract.yaml file and configure the extraction details (press i to enter insert mode, press Escape to leave edit mode, and enter :wq! to save and exit vi). For more details on the parameters, see Configure parameters.

The following code is an example of a configured extract that unloads the logs for a half-hour window to the Test Drive artifacts bucket and updates COPY commands to run with the POC Amazon Redshift role.

Configuration Extract File

  3. Run make extract to extract the workload. When completed, make note of the folder created at the path specified for the workload_location parameter in the extract (s3://testdriveartifacts/myworkload/Extraction_xxxx-xx-xxTxx:xx:xx.xxxxxx+00:00).
  4. On the same baseline EC2 instance that will run the full workload on the source replica, run vi config/replay.yaml and configure the details with the workload location copied in step 3 and the baseline warehouse endpoint. (See Configure parameters to run an extract job for additional details on the parameters. The values after the analysis_iam_role parameter can be left as the default.)

The following code is an example of the beginning of a replay configuration for the source replica.

Config Replay File

  5. On the EC2 instance that will run the target-producer workload, run vi config/replay.yaml. Configure the details with the workload location copied in step 3, the producer warehouse endpoint, and the other configuration values as in step 4. To replay only the producer workload, add the appropriate users to include or exclude for the filters parameter.

The following code is an example of the filters used to exclude the BI workload from the producer.

Producer Configuration

  6. On the EC2 instance that will run the target-consumer workload, run vi config/replay.yaml and configure the details with the workload location copied in step 3, the consumer warehouse endpoint, and appropriate filters as in step 5. The same users that were excluded from the producer workload replay should be included in the consumer workload replay.

The following is an example of the filters used to only run the BI workload from the consumer.

Consumer Configuration

  7. Run make replay on the baseline instance, target-producer instance, and target-consumer instance simultaneously to run the workload on the target warehouses.

Analyze the Workload Replicator output

  1. Create the following folder structure in the S3 bucket that you created earlier:
'{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/'

For comparison_stats_s3_path, enter the S3 bucket and path name. For what_if_timestamp, enter the replay start time. For cluster_identifier, enter the target cluster name for easy identification.

The following screenshot shows an example of this folder structure in Amazon S3.

Configuration of S3

  2. Use the following script in QEv2 to unload system table data for each cluster to the corresponding Amazon S3 target path that was created previously.
UNLOAD ($$
SELECT a.*,Trim(u.usename) as username FROM sys_query_history a , pg_user u
WHERE a.user_id = u.usesysid
and a.start_time > to_timestamp('{what_if_timestamp}','YYYY-MM-DD-HH24-MI-SS')
$$) TO '{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/'
FORMAT AS PARQUET PARALLEL OFF ALLOWOVERWRITE 
IAM_ROLE '{redshift_iam_role}';

For what_if_timestamp, enter the replay start time. For comparison_stats_s3_path, enter the S3 bucket and path name. For cluster_identifier, enter the target cluster name for easy identification. For redshift_iam_role, enter the Amazon Resource Name (ARN) of the Redshift IAM role for the target cluster.

Unload File

  3. Create an external schema in Amazon Redshift with the name comparison_stats.
CREATE EXTERNAL SCHEMA comparison_stats from DATA CATALOG
DATABASE 'redshift_config_comparison'
IAM_ROLE '{redshift-iam-role}'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

  4. Create an external table in Amazon Redshift with the name redshift_config_comparison_aggregate based on the Amazon S3 file location.
CREATE EXTERNAL TABLE comparison_stats.redshift_config_comparison_aggregate 
(user_id int,
query_id bigint,
query_label VARCHAR,
transaction_id bigint,
session_id int,
database_name VARCHAR,
query_type VARCHAR,
status VARCHAR,
result_cache_hit boolean,
start_time timestamp,
end_time timestamp,
elapsed_time bigint,
queue_time bigint,
execution_time bigint,
error_message VARCHAR,
returned_rows bigint,
returned_bytes bigint,
query_text VARCHAR,
redshift_version VARCHAR,
usage_limit VARCHAR,
compute_type VARCHAR,
compile_time bigint,
planning_time bigint,
lock_wait_time bigint,
username VARCHAR)
PARTITIONED BY (cluster_identifier VARCHAR)
STORED AS PARQUET
LOCATION '{comparison_stats_s3_path}/{what_if_timestamp}';

  5. After creating a partitioned table, alter the table using the following statement to register partitions to the external catalog.

When you add a partition, you define the location of the subfolder on Amazon S3 that contains the partition data. Run that statement for each cluster identifier.

ALTER TABLE comparison_stats.redshift_config_comparison_aggregate 
ADD PARTITION (cluster_identifier='{cluster_identifier}')
LOCATION '{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/';

Example:

ALTER TABLE comparison_stats.redshift_config_comparison_aggregate 
ADD PARTITION (cluster_identifier='baseline-ra3-4xlarge-2')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/baseline-ra3-4xlarge-2/';

ALTER TABLE comparison_stats.redshift_config_comparison_aggregate 
ADD PARTITION (cluster_identifier='producer-serverless32RPU')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/producer-serverless32RPU/';

ALTER TABLE comparison_stats.redshift_config_comparison_aggregate 
ADD PARTITION (cluster_identifier='consumer-serverless64RPU')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/consumer-serverless64RPU/';

Deploy the QEv2 SQL Notebook and analyze workload

In this section, we analyze the queries that were replayed in both the baseline and target clusters. We analyze the workload based on the common queries that are executed in the baseline and target clusters.

  1. Download the analysis notebook from Amazon S3.
  2. Import the notebook into the baseline Redshift cluster using QEv2. For guidance, refer to Authoring and running notebooks.
  3. Create the stored procedure common_queries_sp in the same database that was used to create the external schema.
  4. The stored procedure will create a view called common_queries by querying the external table redshift_config_comparison_aggregate that was created in previous steps.

The view will identify the queries common to both the baseline and target clusters as mentioned in the notebook.
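
The notebook contains the actual procedure definition; the following is only a rough sketch of the idea, assuming common queries are matched on their query text across the two cluster identifiers passed in.

-- Rough sketch only; the notebook's logic may differ (for example, deduplication or cache-hit filtering)
CREATE OR REPLACE PROCEDURE common_queries_sp(baseline_id VARCHAR(500), target_id VARCHAR(500))
AS $$
BEGIN
  -- A late-binding view is required because the view references an external table
  EXECUTE 'CREATE OR REPLACE VIEW common_queries AS '
       || 'SELECT b.query_text, b.query_type, '
       || '       b.elapsed_time AS baseline_elapsed_time, t.elapsed_time AS target_elapsed_time, '
       || '       b.execution_time AS baseline_execution_time, t.execution_time AS target_execution_time, '
       || '       b.compile_time AS baseline_compile_time, t.compile_time AS target_compile_time '
       || 'FROM comparison_stats.redshift_config_comparison_aggregate b '
       || 'JOIN comparison_stats.redshift_config_comparison_aggregate t ON b.query_text = t.query_text '
       || 'WHERE b.cluster_identifier = ' || quote_literal(baseline_id)
       || ' AND t.cluster_identifier = ' || quote_literal(target_id)
       || ' WITH NO SCHEMA BINDING';
END;
$$ LANGUAGE plpgsql;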

  5. Execute the stored procedure by passing the cluster identifiers for the baseline and target clusters as parameters.

For this post, we passed the baseline and producer cluster identifier as the parameters. Passing the cluster identifiers as parameters will retrieve the data only for those specific clusters.
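
For example, using the cluster identifiers registered as partitions earlier (the exact parameter signature is defined by the notebook):

CALL common_queries_sp('baseline-ra3-4xlarge-2', 'producer-serverless32RPU');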

Once the common_queries view is created, you can perform further analysis using subsequent queries that are available in the notebook. If you have more than one target cluster, you can follow the same analysis process for each one. For this post, we have two target clusters: producer and consumer. We first performed the analysis between the baseline and producer clusters, then repeated the same process to analyze the data for the baseline versus consumer clusters.

To analyze our workload, we will use the sys_query_history view. We frequently use several columns from this view, including the following:

  • elapsed_time: The end-to-end time of the query run
  • execution_time: The time the query spent running. In the case of a SELECT query, this also includes the return time.
  • compile_time: The time the query spent compiling

For more information on sys_query_history, refer to SYS_QUERY_HISTORY in the Amazon Redshift Database Developer Guide. The following table shows the descriptions of the analysis queries.

| # | Name of the query | Description |
| --- | --- | --- |
| 1 | Overall workload by user | Count of common queries between baseline and target clusters based on user |
| 2 | Overall workload by query type | Count of common queries between baseline and target clusters based on query type |
| 3 | Overall workload comparison (in seconds) | Compare the overall workload between the baseline and target clusters by analyzing the execution time, compile time, and elapsed time |
| 4 | Percentile workload comparison | The percentage of queries that perform at or below that runtime (for example, p50_s having the value of 5 seconds means 50% of queries in that workload were 5 seconds or faster) |
| 5 | Number of improve/degrade/stay same queries | The number of queries that degraded/stayed the same/improved when comparing the elapsed time between the baseline and target clusters |
| 6 | Degree of query-level performance change (proportion) | The degree of change of the query from the baseline to target relative to the baseline performance |
| 7 | Comparison by query type (in seconds) | Compare the elapsed time of different query types such as SELECT, INSERT, and COPY commands between the baseline cluster and target cluster |
| 8 | Top 10 slowest running queries (in seconds) | Top 10 slowest queries between the baseline and target cluster by comparing the elapsed time of both clusters |
| 9 | Top 10 improved queries (in seconds) | The top 10 queries with the most improved elapsed time when comparing the baseline cluster to the target cluster |
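
As a rough illustration of the kind of SQL behind these comparisons (not the notebook's exact query, which works against the common_queries view), the following aggregates the elapsed, execution, and compile time per cluster directly from the external table, converting microseconds to seconds:

-- Sketch: overall workload totals (in seconds) per cluster
SELECT cluster_identifier,
       COUNT(*)                                   AS query_count,
       ROUND(SUM(elapsed_time)   / 1000000.0, 2)  AS total_elapsed_s,
       ROUND(SUM(execution_time) / 1000000.0, 2)  AS total_execution_s,
       ROUND(SUM(compile_time)   / 1000000.0, 2)  AS total_compile_s
FROM comparison_stats.redshift_config_comparison_aggregate
WHERE result_cache_hit = false   -- optional: ignore queries served from the result cache
GROUP BY cluster_identifier
ORDER BY cluster_identifier;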

Sample results analysis

In our example, the overall workload improvement of the workload isolation architecture using data sharing for the ETL workload between the baseline and the producer is 858 seconds (baseline_elapsed_time - target_elapsed_time) for the sample TPC data, as shown in the following screenshots.

Baseline vs Producer Analysis

Baseline vs Producer Analysis Graph

The overall workload improvement of the workload isolation architecture using data sharing for the BI workload between the baseline and the consumer is 1148 seconds (baseline_elapsed_time - target_elapsed_time) for the sample TPC data, as shown in the following screenshots.

Baseline vs Consumer Analysis

Baseline vs Consumer Analysis Graph

Cleanup

Complete the following steps to clean up your resources:

  1. Delete the Redshift provisioned replica cluster and the two Redshift serverless endpoints (32 RPU and 64 RPU)
  2. Delete the S3 bucket used to store the artifacts
  3. Delete the baseline, target-producer, and target-consumer EC2 instances
  4. Delete the IAM role created for the EC2 instances to access Redshift clusters and S3 buckets
  5. Delete the IAM roles created for Amazon Redshift warehouses to access S3 buckets for COPY and UNLOAD commands

Conclusion

In this post, we walked you through the process of testing a workload isolation architecture using Amazon Redshift data sharing and the Redshift Test Drive utility. We demonstrated how you can use SQL for advanced price-performance analysis and compare different workloads on different target Redshift configurations. We encourage you to evaluate your Amazon Redshift data sharing architecture using the Redshift Test Drive tool, and to use the provided SQL script to analyze the price-performance of your Amazon Redshift cluster.


About the Authors

Ayan Majumder is an Analytics Specialist Solutions Architect at AWS. His expertise lies in designing robust, scalable, and efficient cloud solutions for customers. Beyond his professional life, he derives joy from traveling, photography, and outdoor activities.

Ekta Ahuja is an Amazon Redshift Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys landscape photography, traveling, and board games.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She is passionate about supporting customers in validating and optimizing analytics solutions by architecting proof of concept workloads designed to meet their specific needs.

