Streamlining AWS Glue Studio visual jobs: Building an integrated CI/CD pipeline for seamless environment synchronization


Many Amazon Web Services (AWS) customers have integrated their data across multiple sources using AWS Glue, a serverless data integration service. By providing seamless integration throughout the development lifecycle, AWS Glue enables organizations to make data-driven business decisions.

AWS Glue Studio visual jobs provide a graphical interface called the visual editor that you can use to author extract, transform, and load (ETL) jobs in AWS Glue visually. The visual editor maintains a visual representation of a variety of data sources, transformations, and data sinks. With its intuitive interface, you can create large-scale data integration jobs without needing coding expertise, simplifying workflows and eliminating the need for manual ETL script programming.

As data engineers increasingly rely on the AWS Glue Studio visual editor to create data integration jobs, the need for a streamlined development lifecycle and seamless synchronization between environments has become paramount. Additionally, managing versions of visual directed acyclic graphs (DAGs) is crucial for tracking changes, collaboration, and maintaining consistency across environments.

This post introduces an end-to-end solution that addresses these needs by combining the AWS Glue Visual Job API, a custom AWS Glue Resource Sync Utility, and an AWS Cloud Development Kit (AWS CDK) based continuous integration and continuous deployment (CI/CD) pipeline.

A few common questions from our customers include:

  • What are the best practices for moving our workloads from a pre-production environment to production?
  • What are the recommended best practices for provisioning data integration components?
  • How can I build AWS Glue visual jobs in the development environment and automatically propagate them to the production account using the CI/CD pipeline?
  • How can I version control and track changes to my AWS Glue Studio visual jobs?

End-to-end development lifecycle for data integration pipeline

The software development lifecycle on AWS has six phases: plan, design, implement, test, deploy, and maintain, as shown in the following diagram.

[Figure: software development lifecycle (SDLC) phases]

For more information regarding each component, check out End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue.

AWS Glue Resource Sync Utility

As part of synchronizing AWS Glue visual jobs across different environments, requirements include:

  • Manage version control of visual DAGs by tracking changes to AWS Glue Studio visual jobs using version control systems such as Git
  • Promote AWS Glue visual jobs from a pre-production environment to a production environment
  • Transfer ownership of AWS Glue visual jobs between different AWS accounts
  • Replicate AWS Glue visual jobs from one AWS Region to another as part of a disaster recovery strategy

The AWS Glue Resource Sync Utility is a Python application developed on top of the AWS Glue Visual Job API, designed to synchronize AWS Glue Studio visual jobs across different accounts without losing the visual representation. It operates by using source and target AWS environment profiles. Optionally, a list of jobs for synchronization can be provided along with a mapping file to replace environment-specific resources.

For more information on the AWS Glue Resource Sync Utility, refer to Synchronize your AWS Glue Studio Visual Jobs to different environments.

Solution overview

As shown in the following diagram, this solution uses three separate AWS accounts. One account is designated for the development environment, another for the production environment, and a third to host the CI/CD infrastructure and pipeline.

[Figure: solution overview]

The solution emphasizes version controlling AWS Glue Studio visual jobs by serializing them into JSON files and storing them in a Git repository. As a result, you can:

  • Track changes to your visual DAGs over time.
  • Collaborate with team members.
  • Restore and deploy visual DAGs in different environments seamlessly.

The AWS account responsible for hosting the CI/CD pipeline is composed of three key components:

  • Managing AWS Glue Job updates – Provides smooth updates and maintenance of AWS Glue jobs.
  • Cross-Account Access Management – Enables secure promotion of updates from the development environment to the production environment.
  • Version Control Integration – Incorporates serialized visual DAGs into the CI/CD pipeline for deployment to target environments.

You can create AWS Glue Studio visual jobs using the visual editor in your development account. After these jobs are configured, you can serialize the visual DAGs into JSON files and commit them to a Git repository. The CI/CD pipeline detects changes to the repository and automatically triggers the deployment process.

The pipeline includes a step where the AWS Glue Resource Sync Utility deserializes the visual DAGs from the committed JSON files and deploys them to the production environment. This approach promotes consistent deployment of jobs while maintaining their visual representation.

The solution uses the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and AWS CDK to streamline deployment across environments. It enables seamless synchronization and consistent versioning of AWS Glue jobs between development and production, preserving visual workflows and reducing manual tasks. The solution consists of two main parts:

  • Initial steps (one-time setup) – Setting up the development environment, bootstrapping AWS environments, deploying the CI/CD pipeline, and integrating the AWS Glue Resource Sync Utility
  • Day-to-day development (repeated) – Ongoing activities such as creating visual jobs, serializing them, committing changes to the repository, deploying to production through the pipeline, and verifying the jobs

The solution follows these high-level steps for the initial setup:

  1. Set up the development environment
  2. Bootstrap your AWS environments
  3. Deploy the CI/CD pipeline
  4. Configure AWS developer tools connection to GitHub
  5. Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility

The solution follows these high-level steps for the day-to-day development:

  1. Create visual jobs in the development account
  2. Serialize visual jobs
  3. Commit changes to Git repository
  4. Deploy visual jobs to production
  5. Verify visual jobs in production

Prerequisites

Before you begin, make sure you have the following:

  • GitHub account
  • Git (git command)
  • Python 3.9 or later
  • Package installer for Python (pip command)
  • AWS CDK Toolkit (cdk command) 2.155.0 or later
  • AWS CLI configured with appropriate credentials for your accounts
  • Three AWS accounts:
    • Development account
    • Production account
    • Pipeline account (for hosting the CI/CD pipeline)

Technical solution walkthrough

This section provides a detailed guide to setting up and using an automated CI/CD pipeline for AWS Glue Studio visual jobs.

Initial steps (one-time setup)

In this section, we walk through the foundational steps required to establish the CI/CD pipeline for AWS Glue Studio visual jobs. These initial steps set up the necessary infrastructure and configurations, providing a smooth and automated deployment process across your development and production environments.

Set up the development environment

To set up the development environment, follow these steps:

  1. Fork the aws-glue-cdk-baseline repository
  2. Clone the forked repository:
git clone https://github.com/<YOUR-GITHUB-USERNAME>/aws-glue-cdk-baseline.git

cd aws-glue-cdk-baseline

  3. Create and activate a Python virtual environment:
python3 -m venv .venv

# On Windows, use .venv\Scripts\activate.bat
source .venv/bin/activate

  4. Install required dependencies:
pip install -r requirements.txt

pip install -r requirements-dev.txt

  5. To configure the default settings, edit the default-config.yaml file and replace the placeholders with your AWS account details:
    • Pipeline account: awsAccountId and awsRegion
    • Development account: awsAccountId and awsRegion
    • Production account: awsAccountId and awsRegion
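
Before bootstrapping, it can help to sanity-check the edited configuration. The following is a minimal sketch, assuming the file has already been loaded into a dictionary with the pipelineAccount, devAccount, and prodAccount keys used by the code later in this post. The validate_config function is a hypothetical helper and is not part of the repository.

```python
import re

# Section names taken from the code that reads default-config.yaml in this post.
REQUIRED_SECTIONS = ("pipelineAccount", "devAccount", "prodAccount")
ACCOUNT_ID_PATTERN = re.compile(r"^\d{12}$")


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for section in REQUIRED_SECTIONS:
        settings = config.get(section)
        if not isinstance(settings, dict):
            problems.append(f"{section}: section is missing")
            continue
        # YAML may parse an unquoted account ID as an integer, so normalize it.
        account_id = str(settings.get("awsAccountId", ""))
        if not ACCOUNT_ID_PATTERN.match(account_id):
            problems.append(f"{section}.awsAccountId: expected 12 digits")
        if not settings.get("awsRegion"):
            problems.append(f"{section}.awsRegion: value is missing")
    return problems


if __name__ == "__main__":
    example = {
        "pipelineAccount": {"awsAccountId": 111111111111, "awsRegion": "us-east-1"},
        "devAccount": {"awsAccountId": 222222222222, "awsRegion": "us-east-1"},
        # A placeholder that was left unedited is reported as a problem.
        "prodAccount": {"awsAccountId": "<PROD-ACCOUNT-NUMBER>", "awsRegion": ""},
    }
    for problem in validate_config(example):
        print(problem)
```

A check like this catches placeholders that were left unedited before they surface as harder-to-read deployment errors.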

Bootstrap your AWS environments

Bootstrapping prepares your AWS accounts for AWS CDK deployments. To bootstrap your AWS environments, run the following commands, replacing placeholders with your account numbers, Regions, and AWS CLI profiles:

# Bootstrap the pipeline account
cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE>

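# Bootstrap the development account, trusting the pipeline account
# (when --trust is used, the AWS CDK CLI also expects --cloudformation-execution-policies)
cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> \
  --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
  --trust <PIPELINE-ACCOUNT-NUMBER>

# Bootstrap the production account, trusting the pipeline account
cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> \
  --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
  --trust <PIPELINE-ACCOUNT-NUMBER>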

Deploy the CI/CD pipeline

Deploy the pipeline stack to your pipeline account:

cdk deploy --profile <PIPELINE-PROFILE>

This command creates:

  • The pipeline stack in the pipeline account
  • The AWS Glue app stack in the development account

Configure AWS developer tools connection to GitHub

To establish a connection between AWS CodePipeline and your GitHub repository, follow these steps:

  1. Create a GitHub connection:
    • In the AWS Management Console for your pipeline account, navigate to AWS CodePipeline
    • In the navigation pane, choose Connections
    • Choose Create connection
    • Select GitHub as the source provider
  2. Authorize the connection:
    • Provide a connection name (such as MyGitHubConnection)
    • Choose Connect to GitHub
    • Follow the prompts to authorize AWS CodePipeline to access your GitHub account
    • Make sure that the connection has access to your forked aws-glue-cdk-baseline repository
  3. Note the connection Amazon Resource Name (ARN):
    • After the connection is established, note the connection ARN because you’ll need it when configuring the pipeline
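
Because the connection ARN is pasted into the pipeline code by hand, a quick format check can catch a placeholder that was left unedited. The following is a rough sketch. The pattern is an assumption based on the commonly seen codestar-connections and codeconnections ARN shapes, and is_connection_arn is a hypothetical helper.

```python
import re

# Assumed shape: arn:<partition>:<service>:<region>:<12-digit account>:connection/<id>
CONNECTION_ARN_PATTERN = re.compile(
    r"^arn:aws[\w-]*:(codestar-connections|codeconnections):"
    r"[a-z0-9-]+:\d{12}:connection/[0-9A-Za-z-]+$"
)


def is_connection_arn(value: str) -> bool:
    """Return True when value looks like a developer tools connection ARN."""
    return CONNECTION_ARN_PATTERN.match(value) is not None


if __name__ == "__main__":
    sample = (
        "arn:aws:codestar-connections:us-east-1:111111111111:"
        "connection/12345678-90ab-cdef-1234-567890abcdef"
    )
    print(is_connection_arn(sample))
    print(is_connection_arn("YOUR-GITHUB-CONNECTION-ARN"))
```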

Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility

To integrate the AWS Glue Resource Sync Utility into the pipeline to automate the synchronization of AWS Glue visual jobs, follow these steps:

  1. Download the sync.py script from the AWS Glue Samples repository:
wget https://raw.githubusercontent.com/aws-samples/aws-glue-samples/master/utilities/resource_sync/sync.py \
-O aws_glue_cdk_baseline/job_scripts/sync.py

  2. Create a new file aws_glue_cdk_baseline/job_scripts/generate_mapping.py with the following content:
import json

import yaml  # provided by the PyYAML package


def generate_mapping():
    """Write mapping.json, which maps development resources to production ones."""
    with open('default-config.yaml', 'r') as config_file:
        config = yaml.safe_load(config_file)
    dev = config['devAccount']
    prod = config['prodAccount']
    mapping = {
        # AWS Glue assets bucket
        f"s3://aws-glue-assets-{dev['awsAccountId']}-{dev['awsRegion']}": f"s3://aws-glue-assets-{prod['awsAccountId']}-{prod['awsRegion']}",
        # IAM role used by the jobs
        f"arn:aws:iam::{dev['awsAccountId']}:role/service-role/AWSGlueServiceRole": f"arn:aws:iam::{prod['awsAccountId']}:role/service-role/AWSGlueServiceRole",
        # Data bucket (the development key uses the development Region)
        f"s3://dev-glue-data-{dev['awsAccountId']}-{dev['awsRegion']}": f"s3://prod-glue-data-{prod['awsAccountId']}-{prod['awsRegion']}"
    }
    with open('mapping.json', 'w') as mapping_file:
        json.dump(mapping, mapping_file, indent=2)


if __name__ == "__main__":
    generate_mapping()

This script generates a mapping.json file that the sync.py script will use to synchronize the jobs between the development and production environments. The mapping.json file contains the mapping of the development environment assets to the production environment assets:

  • The s3://aws-glue-assets-* Amazon Simple Storage Service (Amazon S3) bucket contains the AWS Glue Studio visual job definitions
  • The arn:aws:iam::*:role/service-role/AWSGlueServiceRole AWS Identity and Access Management (IAM) role is used by the AWS Glue Studio jobs to access AWS resources
  • The s3://dev-glue-data-* and s3://prod-glue-data-* S3 buckets contain scripts and data used by the AWS Glue Studio jobs
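
Conceptually, the mapping is a set of string substitutions applied to a job definition. The following sketch illustrates that idea only. It is not the sync.py implementation, which may apply the mapping differently, and the apply_mapping helper and the sample job fields are illustrative assumptions.

```python
def apply_mapping(value, mapping):
    """Recursively replace every mapped source string found inside value."""
    if isinstance(value, str):
        for source, target in mapping.items():
            value = value.replace(source, target)
        return value
    if isinstance(value, list):
        return [apply_mapping(item, mapping) for item in value]
    if isinstance(value, dict):
        return {key: apply_mapping(item, mapping) for key, item in value.items()}
    # Numbers, booleans, and None are returned unchanged.
    return value


if __name__ == "__main__":
    # Hypothetical accounts: 111111111111 is development, 222222222222 is production.
    mapping = {
        "s3://aws-glue-assets-111111111111-us-east-1": "s3://aws-glue-assets-222222222222-us-east-1",
        "arn:aws:iam::111111111111:role/service-role/AWSGlueServiceRole": "arn:aws:iam::222222222222:role/service-role/AWSGlueServiceRole",
    }
    dev_job = {
        "Name": "sample-visual-job",
        "Role": "arn:aws:iam::111111111111:role/service-role/AWSGlueServiceRole",
        "Command": {
            "ScriptLocation": "s3://aws-glue-assets-111111111111-us-east-1/scripts/sample.py"
        },
        "NumberOfWorkers": 2,
    }
    print(apply_mapping(dev_job, mapping))
```

Values that don't match any key, such as the job name or the worker count, pass through unchanged, which is why only environment-specific resources need an entry in the mapping.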
  3. Update the aws_glue_cdk_baseline/pipeline_stack.py file to include a step that deserializes the JSON file and deploys the AWS Glue jobs to the production environment:
from typing import Dict
import aws_cdk as cdk
from aws_cdk import (
    Stack,
    aws_iam as iam
)
from constructs import Construct
from aws_cdk.pipelines import CodePipeline, CodePipelineSource, CodeBuildStep
from aws_glue_cdk_baseline.glue_app_stage import GlueAppStage
 
GITHUB_REPO = "YOUR-GITHUB-USERNAME/aws-glue-cdk-baseline"
GITHUB_BRANCH = "main"
GITHUB_CONNECTION_ARN = "YOUR-GITHUB-CONNECTION-ARN"
 
class PipelineStack(Stack):
 
    def __init__(self, scope: Construct, construct_id: str, config: Dict, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
 
        source = CodePipelineSource.connection(
            GITHUB_REPO,
            GITHUB_BRANCH,
            connection_arn=GITHUB_CONNECTION_ARN
        )
 
        pipeline = CodePipeline(self, "GluePipeline",
            pipeline_name="GluePipeline",
            cross_account_keys=True,
            docker_enabled_for_synth=True,
            synth=CodeBuildStep("CdkSynth",
                input=source,
                install_commands=[
                    "pip install -r requirements.txt",
                    "pip install -r requirements-dev.txt",
                    "npm install -g aws-cdk",
                ],
                commands=[
                    "cdk synth",
                ]
            )
        )
 
        # Add development stage
        dev_stage = GlueAppStage(self, "DevStage", config=config, stage="dev", 
            env=cdk.Environment(
                account=str(config['devAccount']['awsAccountId']),
                region=config['devAccount']['awsRegion']
            ))
        pipeline.add_stage(dev_stage)

        # Add production stage
        prod_stage = GlueAppStage(self, "ProdStage", config=config, stage="prod", 
            env=cdk.Environment(
                account=str(config['prodAccount']['awsAccountId']),
                region=config['prodAccount']['awsRegion']
            ))
        pipeline.add_stage(prod_stage)
 
        # Glue Resource Sync as a separate step in the pipeline
        pipeline.add_wave("GlueJobSync").add_post(CodeBuildStep("GlueJobSync",
            input=source,
            commands=[
                "python $(pwd)/aws_glue_cdk_baseline/job_scripts/generate_mapping.py",
                "python aws_glue_cdk_baseline/job_scripts/sync.py "
                   "--dst-role-arn arn:aws:iam::{0}:role/GlueCrossAccountRole-prod "
                   "--dst-region {1} "
                   "--deserialize-from-file aws_glue_cdk_baseline/resources/resources.json "
                   "--config-path mapping.json "
                   "--targets job,catalog "
                   "--skip-prompt".format(
                       config['prodAccount']['awsAccountId'],
                       config['prodAccount']['awsRegion']
                   ),
            ],
            role_policy_statements=[
                iam.PolicyStatement(
                    actions=[
                        "sts:AssumeRole",
                    ],
                    resources=["*"]
                )
            ]
        ))

Replace the placeholders in the pipeline_stack.py file with your values:

  • GITHUB_REPO with the name of your GitHub repository
  • GITHUB_BRANCH with the name of the branch you want to use for the pipeline
  • GITHUB_CONNECTION_ARN with the ARN of the GitHub connection you created in Step 4
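
To see what the GlueJobSync step ends up running, the following sketch renders the same command outside AWS CDK. The helper names are hypothetical, and the account ID is a made-up example. The role ARN helper also suggests one possible tightening: passing that single ARN as the resource of the sts:AssumeRole statement instead of "*". That is an assumption about your setup, not something the original pipeline does.

```python
def cross_account_role_arn(account_id, stage: str) -> str:
    """ARN of the role named GlueCrossAccountRole-<stage> in the given account."""
    return f"arn:aws:iam::{account_id}:role/GlueCrossAccountRole-{stage}"


def build_sync_command(config: dict) -> str:
    """Render the sync.py invocation used to deploy jobs to production."""
    prod = config["prodAccount"]
    return " ".join([
        "python aws_glue_cdk_baseline/job_scripts/sync.py",
        f"--dst-role-arn {cross_account_role_arn(prod['awsAccountId'], 'prod')}",
        f"--dst-region {prod['awsRegion']}",
        "--deserialize-from-file aws_glue_cdk_baseline/resources/resources.json",
        "--config-path mapping.json",
        "--targets job,catalog",
        "--skip-prompt",
    ])


if __name__ == "__main__":
    example_config = {
        "prodAccount": {"awsAccountId": 222222222222, "awsRegion": "us-east-1"}
    }
    print(build_sync_command(example_config))
```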
  4. Update the aws_glue_cdk_baseline/glue_app_stack.py file to create a cross-account role in each environment that the pipeline account can assume:
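    # In the stack's __init__: create a role that the pipeline account can assume
    self.cross_account_role = self.create_cross_account_role(
        f"GlueCrossAccountRole-{stage}",
        str(config['pipelineAccount']['awsAccountId'])
    )
 
    # Helper method on the stack class
    def create_cross_account_role(self, role_name: str, trusted_account_id: str):
        return iam.Role(self, f"{role_name}CrossAccountRole",
            role_name=role_name,
            assumed_by=iam.AccountPrincipal(trusted_account_id),
            # AdministratorAccess keeps the example short; consider a narrower policy
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name("AdministratorAccess")]
        )
 
    @property
    def cross_account_role_arn(self):
        return self.cross_account_role.role_arn

    # This second property reads self.glue_app_stack, so it presumably belongs to
    # the stage class that wraps the stack (GlueAppStage), not to the stack itself
    @property
    def cross_account_role_arn(self):
        return self.glue_app_stack.cross_account_role_arn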

Check the andreimaksimov/aws-glue-cdk-baseline repository for a complete diff.
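
For readers less familiar with AWS CDK, the trust relationship that iam.AccountPrincipal sets up on the role is roughly equivalent to the following policy document. This is a sketch for illustration: trust_policy_for_account is a hypothetical helper, and the document that AWS CDK actually synthesizes may differ in detail (for example, in how the partition is expressed).

```python
import json


def trust_policy_for_account(trusted_account_id: str) -> dict:
    """Build a trust policy that lets principals in one account assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account root principal delegates the decision to IAM
                # policies inside the trusted (pipeline) account.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # 111111111111 is a made-up pipeline account ID.
    print(json.dumps(trust_policy_for_account("111111111111"), indent=2))
```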

  5. Commit your changes to the repository:
git add aws_glue_cdk_baseline/job_scripts/sync.py
git add aws_glue_cdk_baseline/job_scripts/generate_mapping.py
git add aws_glue_cdk_baseline/pipeline_stack.py
git add aws_glue_cdk_baseline/glue_app_stack.py

git commit -m "Integrate Glue Resource Sync Utility into the pipeline"

git push

Day-to-day development (repeated)

With the initial setup complete, you can now proceed with your regular development activities. This section outlines the steps you’ll repeat during your day-to-day work to develop, version control, and deploy AWS Glue visual jobs.

Create visual jobs in the development account

In this step, you’ll use AWS Glue Studio to create and configure your visual jobs within the development environment.

  1. In your development account, open AWS Glue Studio
  2. To create a new visual job, choose Create job
  3. Choose Visual with a blank canvas and use the visual editor to design your ETL job
  4. Configure the job settings:
    • Job name: Provide a meaningful name
    • IAM role: Select an IAM role with the necessary permissions
    • Other configurations: Adjust as needed
  5. To save the job, choose Save

Repeat these steps to create additional jobs as required.

Serialize visual jobs

To serialize your visual jobs to enable version control and preparation for deployment, follow these steps:

  1. From the repository root, run the AWS Glue Resource Sync Utility:
python aws_glue_cdk_baseline/job_scripts/sync.py \
  --src-role-arn arn:aws:iam::<DEV-ACCOUNT-NUMBER>:role/GlueCrossAccountRole-dev \
  --src-region <DEV-REGION> \
  --serialize-to-file aws_glue_cdk_baseline/resources/resources.json \
  --targets job,catalog \
  --skip-prompt

  2. Replace the placeholders in the command:
    • <DEV-ACCOUNT-NUMBER> with your development account number
    • <DEV-REGION> with your development Region (for example, us-east-1)
  3. Verify the serialized file:
    • Locate the JSON file in aws_glue_cdk_baseline/resources/
    • Make sure it contains the definitions of your visual jobs
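
Part of this verification can be scripted. The following is a minimal sketch that only checks that the file is valid, non-empty JSON. It deliberately makes no assumption about the internal layout that sync.py writes, and both function names are hypothetical.

```python
import json
from pathlib import Path


def summarize_serialized_text(text: str) -> dict:
    """Parse serialized content and report its top-level shape."""
    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, (dict, list)):
        raise ValueError("expected a JSON object or array at the top level")
    if not data:
        raise ValueError("the serialized content is empty")
    return {"type": type(data).__name__, "entries": len(data)}


def summarize_serialized_file(path: str) -> dict:
    """Read a serialized file from disk and summarize it."""
    return summarize_serialized_text(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    target = Path("aws_glue_cdk_baseline/resources/resources.json")
    if target.exists():
        print(summarize_serialized_file(str(target)))
    else:
        print(f"{target} not found; run the serialization step first")
```

A check like this does not replace opening the file and confirming that your job names appear in it, but it fails fast on an empty or truncated export.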

Commit changes to Git repository

To commit changes to the Git repository, follow these steps:

  1. Add the serialized resources to Git:
git add aws_glue_cdk_baseline/resources/resources.json

  2. Commit your changes:
git commit -m "Add serialized Glue Visual Jobs"

  3. Push to GitHub:
git push

This action triggers the CI/CD pipeline.

Deploy visual jobs to production

The CI/CD pipeline automatically performs the following steps:

  • Synthesize the AWS CDK application
  • Deploy to the development environment
  • Deploy to the production environment
  • Execute the AWS Glue Resource Sync Utility

The following screenshot shows the CI/CD pipeline.

[Figure: CI/CD pipeline]

Verify visual jobs in production

After the pipeline has completed the deployment, it’s important to verify that the visual jobs are correctly reflected in the production environment. To do so, follow these steps:

  1. In the production account, open AWS Glue Studio
  2. Verify the deployed jobs:
    • Make sure that the visual jobs are present
    • Open each job to confirm that the visual DAGs are preserved
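
If you have many jobs, the presence check can also be scripted against the AWS Glue API. The following sketch assumes that a visual job reports a non-empty CodeGenConfigurationNodes field in the get_jobs response. The function name and the job names are hypothetical, and opening a job in the visual editor remains the definitive check.

```python
def find_jobs_missing_visual_dag(glue_client, expected_job_names):
    """Return (missing, non_visual) lists for the expected job names.

    missing: expected jobs that are not present in the account and Region.
    non_visual: expected jobs that exist but report no visual DAG nodes.
    """
    has_dag_by_name = {}
    paginator = glue_client.get_paginator("get_jobs")
    for page in paginator.paginate():
        for job in page.get("Jobs", []):
            has_dag_by_name[job["Name"]] = bool(job.get("CodeGenConfigurationNodes"))
    missing = sorted(
        name for name in expected_job_names if name not in has_dag_by_name
    )
    non_visual = sorted(
        name
        for name in expected_job_names
        if name in has_dag_by_name and not has_dag_by_name[name]
    )
    return missing, non_visual


# Example usage (requires boto3 and credentials for the production account):
#   import boto3
#   session = boto3.Session(profile_name="<PROD-PROFILE>", region_name="<REGION>")
#   print(find_jobs_missing_visual_dag(session.client("glue"), ["my-visual-job"]))
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.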

By following these steps in your day-to-day workflow, you make sure that your AWS Glue visual jobs are version-controlled, consistent across environments, and that your production environment reflects the latest tested changes.

Version control for AWS Glue visual jobs

By serializing AWS Glue Studio visual jobs to JSON files and committing them to a Git repository, you enable version control for your data integration workflows. By following this approach you can:

  • Track Changes – Monitor modifications to your AWS Glue jobs over time
  • Collaborate – Work with team members on developing and refining jobs
  • Restore and deploy – Easily restore jobs in other accounts or environments

The serialization and deserialization steps are integral to your development and deployment process, making sure that all changes are captured and seamlessly propagated.
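
Git already shows line-level changes between commits, but serialized JSON can produce noisy diffs when key order shifts. As a small sketch of one way to compare two versions, the following normalizes both documents before diffing them. The diff_serialized helper is hypothetical and works on any JSON text, without assuming the layout of the serialized file.

```python
import difflib
import json


def diff_serialized(old_text: str, new_text: str) -> list:
    """Return unified-diff lines between two JSON documents.

    Both documents are re-serialized with sorted keys and fixed indentation,
    so differences in key order or whitespace do not show up as changes.
    """
    def normalize(text):
        return json.dumps(json.loads(text), indent=2, sort_keys=True).splitlines()

    return list(
        difflib.unified_diff(
            normalize(old_text),
            normalize(new_text),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


if __name__ == "__main__":
    previous = '{"job": {"workers": 2, "name": "sample"}}'
    current = '{"job": {"name": "sample", "workers": 4}}'
    for line in diff_serialized(previous, current):
        print(line)
```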

Conclusion

By combining the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and an AWS CDK based CI/CD pipeline, we’ve crafted a comprehensive solution for managing AWS Glue Studio visual jobs across different environments. This integrated approach offers several benefits:

  • Version control integration – Manage and track changes to your AWS Glue visual jobs using Git, enabling collaboration and change tracking
  • Streamlined development – Easily develop and test AWS Glue jobs using the Visual Editor in the development environment
  • Automated deployment – Use a CI/CD pipeline to automatically deploy serialized visual DAGs to the production environment
  • Environment consistency – Promote consistency across development and production environments by using the same job definitions
  • Visual representation preservation – Maintain the visual DAG representation when synchronizing jobs between environments

This solution empowers data engineers to focus on building robust data integration pipelines while automating the complexities of managing and deploying AWS Glue Studio visual jobs across multiple environments.

We encourage you to try this solution and adapt it to your needs. As always, we welcome your feedback and suggestions for further improvements.


About the Authors

Andrei Maksimov is an AWS Senior Cloud Infrastructure Architect specializing in cloud infrastructure, software development, and DevOps. He designs and implements scalable, secure, and efficient cloud solutions and helps customers optimize their cloud environments. Outside of work, Andrei enjoys participating in hackathons, contributing to open source projects, and exploring the latest advancements in AI. You can connect with him on LinkedIn.

David Zhang is an AWS Data Architect specializing in designing and implementing analytics infrastructure, data management, ETL, and extensive data systems. He helps customers modernize their AWS data platforms. David is also an active speaker at AWS conferences and contributor to AWS conferences, technical content, and open source initiatives. He enjoys playing volleyball, tennis, and weightlifting in his free time. Feel free to connect with him on LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for designing AWS features, implementing software artifacts, and helping with customer architectures. In his spare time, he enjoys watching anime on Prime Video. You can connect with him on LinkedIn.

