Businesses of all sizes are challenged with the complexities and constraints posed by traditional extract, transform and load (ETL) tools. These intricate solutions, while powerful, often come with a significant financial burden, particularly for small and medium enterprise customers. Beyond the substantial costs of procurement and licensing, customers must also contend with the expenses associated with installation, maintenance, and upgrades—a perpetual cycle of investment that can strain even the most robust budgets. Through its customer engagements, Wipro has found that scalability and automation of data pipelines remain persistent concerns for its customers and are rarely achievable without considerable effort. As data volumes continue to swell, these tools can struggle to keep pace with the ever-increasing demand, leading to processing delays and disruptions in data delivery—a critical bottleneck in an era when timely insights are paramount.

This blog post discusses how a programmatic data processing framework developed by Wipro can help data engineers overcome obstacles and streamline their organization’s ETL processes. The framework uses the Amazon EMR runtime for Apache Spark and integrates with AWS managed services. The framework is robust and capable of connecting with multiple data sources and targets. By using capabilities from AWS managed services, the framework eliminates the undifferentiated heavy lifting typically associated with infrastructure management in traditional ETL tools, enabling customers to allocate resources more strategically. Furthermore, we show you how the framework’s inherent scalability helps businesses adapt to increasing data volumes, fostering agility and responsiveness in an evolving digital landscape.

Solution overview

The proposed solution helps build a fully automated data processing pipeline that streamlines the entire workflow. It triggers processes when code is pushed to Git, orchestrates and schedules processing of jobs, validates data with the help of defined rules, transforms data as prescribed within code, and loads the transformed datasets into a specified target. The primary component of this solution is the robust framework, developed using the Amazon EMR runtime for Apache Spark. The framework can be used for any ETL process where input might be fetched from various data sources, transformed, and loaded into specified targets. To enable gaining valuable insights and provide overall job monitoring and automation, the framework is integrated with AWS managed services, as described in the following walkthrough.

Solution walkthrough

solution architecture

The solution architecture is shown in the preceding figure and includes:

  1. Continuous integration and delivery (CI/CD) for data processing
    • Data engineers can define the underlying data processing job within a JSON template. A pre-defined template is available on GitHub for you to review the syntax. At a high level, this step includes the following objectives:
      • Write the Spark job configuration to be executed on Amazon EMR.
      • Split the data processing into three phases:
        • Parallelize fetching data from the source, validate the source data, and prepare the dataset for further processing.
        • Provide flexibility to write business transformation rules defined in JSON, including data validation checks such as duplicate record and null value checks and their removal. The rules can also include any SQL-based transformation written in Apache Spark SQL.
        • Take the transformed dataset, load it to the target, and perform reconciliation as needed.

It’s important to highlight that each step of the three phases is recorded for auditing, error reporting, troubleshooting, and security purposes. A minimal sketch of the three-phase structure is included at the end of this step.

    • After the data engineer has prepared the configuration file following the prescribed template in step 1 and committed it to the Git repository, the commit triggers the Jenkins pipeline. Jenkins is an open source continuous integration tool running on an EC2 instance. It takes the configuration file, processes it, and builds the end artifacts: a JAR file (the compiled Spark application code) and a configuration file (.conf). Both artifacts are copied to an S3 bucket and used later by Amazon EMR.
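
To make the three-phase structure concrete, the following is a minimal sketch written in PySpark for illustration (the framework itself is packaged as a compiled JAR). The function names and configuration keys shown here (fetch_and_validate, apply_transformations, load_to_target, source, target) are assumptions for the example, not the framework’s actual API.

import json
import sys

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("etl-framework-sketch").getOrCreate()

def fetch_and_validate(conf: dict) -> DataFrame:
    """Phase 1: read the source data and apply basic technical validations."""
    df = spark.read.format(conf["source"]["format"]).load(conf["source"]["path"])
    # Example technical validations: drop exact duplicates and fully empty rows.
    return df.dropDuplicates().na.drop(how="all")

def apply_transformations(df: DataFrame, conf: dict) -> DataFrame:
    """Phase 2: run the SQL-based business transformations defined in the configuration."""
    df.createOrReplaceTempView(conf["source"]["table_name"])
    for step in conf.get("data_transformations", []):
        df = spark.sql(step["sqlQuery"])
        df.createOrReplaceTempView(step["outputDFName"])
    return df

def load_to_target(df: DataFrame, conf: dict) -> None:
    """Phase 3: load the transformed dataset to the target (reconciliation not shown)."""
    df.write.mode("overwrite").format(conf["target"]["format"]).save(conf["target"]["path"])

if __name__ == "__main__":
    with open(sys.argv[1]) as f:  # path of the job configuration file
        conf = json.load(f)
    load_to_target(apply_transformations(fetch_and_validate(conf), conf), conf)
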
  2. CI/CD for data pipeline

The CI/CD for the data pipeline is shown in the following figure.

CICD for the data pipeline

    • After the data processing job is written, data engineers can use a similar code-driven development approach to define the data processing pipeline that schedules, orchestrates, and executes the data processing job. Apache Airflow is a popular open source platform used for developing, scheduling, and monitoring batch-oriented workflows. In this solution, we use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to execute the data pipeline through a Directed Acyclic Graph (DAG). To make it easier for engineers to build the required DAG, you define the data pipeline in simple YAML. A sample of the YAML file is available on GitHub for review.
    • When a user commits the YAML file containing the DAG details to the project Git repository, another Jenkins pipeline is triggered.
    • The Jenkins pipeline then reads the YAML configuration file and, based on the tasks and dependencies given, generates the DAG script file, which is copied to the configured S3 bucket.
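
To illustrate the generation step, the following is a hedged sketch of a simplified YAML pipeline definition and how a generator job could read it. The keys shown (dag_id, schedule, steps, depends_on) are assumptions for this example; the actual YAML template used by the solution is the one published on GitHub.

import yaml

SAMPLE_PIPELINE_YAML = """
dag_id: sample_etl_pipeline
schedule: "@daily"
steps:
  - name: create_emr_cluster
  - name: customer_data_etl
    depends_on: [create_emr_cluster]
  - name: terminate_emr_cluster
    depends_on: [customer_data_etl]
"""

pipeline = yaml.safe_load(SAMPLE_PIPELINE_YAML)

# The generator would emit one Airflow task per step, wire the dependencies, and
# write the resulting DAG script to the S3 bucket configured for Amazon MWAA.
for step in pipeline["steps"]:
    for upstream in step.get("depends_on", []):
        print(f"{upstream} >> {step['name']}")
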
  3. Airflow DAG execution
    • After both the data processing job and the data pipeline are configured, Amazon MWAA retrieves the most recent scripts from the S3 bucket to display the latest DAG definition in the Airflow user interface. These DAGs contain at least three tasks; except for the tasks that create and terminate the EMR cluster, every task represents an ETL process. Sample DAG code is available in GitHub. The following figure shows the DAG grid view within Amazon MWAA.

    • At the schedule specified in the job, Airflow executes the create Amazon EMR task, which launches the Amazon EMR cluster on Amazon EC2 instances. After the cluster is created, the ETL processes are submitted to Amazon EMR as steps.
    • Amazon EMR executes these steps concurrently (step concurrency settings define how many steps can run at the same time). After the tasks are finished, the Amazon EMR cluster is terminated to save costs.
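
The following is a hedged sketch of what a generated DAG of this shape could look like, using the Amazon provider operators for EMR in Apache Airflow. The cluster configuration, artifact and configuration file paths, and step definition are placeholders, not the solution’s actual generated code.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Placeholder ETL step: spark-submit the framework JAR with the S3 path of the
# job configuration file passed as an argument.
ETL_STEP = [
    {
        "Name": "sample-etl-process",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://your-bucket/artifacts/etl-framework.jar",
                "s3://your-bucket/config/sample-job.conf",
            ],
        },
    }
]

with DAG(
    dag_id="sample_etl_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Placeholder cluster definition; a real cluster also needs instance fleets or
    # groups, IAM roles, and logging configuration.
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides={
            "Name": "etl-cluster",
            "ReleaseLabel": "emr-6.15.0",
            "Instances": {"KeepJobFlowAliveWhenNoSteps": True},
        },
    )

    submit_step = EmrAddStepsOperator(
        task_id="submit_etl_step",
        job_flow_id=create_cluster.output,
        steps=ETL_STEP,
    )

    watch_step = EmrStepSensor(
        task_id="watch_etl_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='submit_etl_step', key='return_value')[0] }}",
    )

    # Terminate the cluster even if a step fails, so that costs stay under control.
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
        trigger_rule="all_done",
    )

    create_cluster >> submit_step >> watch_step >> terminate_cluster

Setting trigger_rule to all_done on the terminate task helps ensure the cluster is shut down even when an ETL step fails.
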
  4. ETL processing
    • Each step submitted by Airflow to Amazon EMR with a Spark submit command also includes the S3 bucket path of the configuration file passed as an argument.
    • Based on the configuration file, the input data is fetched and technical validations are applied. If data mapping has been enabled within the data processing job, then the structured data is prepared based on the given schema. This output is passed to the next phase, where data transformations and business validations can be applied.
    • A set of reconciliation rules is applied to the transformed data to verify data quality; a small sketch of such checks follows the figure below. After this step, the data is loaded to the specified target.

The following figure shows the ETL data processing job as executed by Amazon EMR.

ETL data processing job
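
To make the reconciliation idea concrete, the following is a small hedged sketch of the kind of checks that can run on the transformed data before it is loaded to the target. The specific rules shown (row count match, no nulls in key columns) are illustrative examples, not the framework’s fixed rule set.

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def reconcile(source_df: DataFrame, transformed_df: DataFrame, key_columns: list) -> list:
    """Return a list of data quality issues; an empty list means the data can be loaded."""
    issues = []
    # Rule 1: the transformation should not silently drop or duplicate records.
    if source_df.count() != transformed_df.count():
        issues.append("row count mismatch between source and transformed data")
    # Rule 2: key columns must not contain nulls after transformation.
    for column in key_columns:
        null_count = transformed_df.filter(F.col(column).isNull()).count()
        if null_count > 0:
            issues.append(f"{null_count} null values found in key column '{column}'")
    return issues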

  5. Logging, monitoring, and notification
    • Amazon MWAA provides the logs generated by each task of the DAG within the Airflow UI. Using these logs, you can monitor Apache Airflow task details, delays, and workflow errors.
    • Amazon MWAA also frequently pings the Amazon EMR cluster to fetch the latest status of the step being executed and updates the task status accordingly.
    • If a step has failed, for example, if the database connection was not established because of high traffic, Amazon MWAA repeats the process.
    • Whenever a task has failed, an email notification is sent to the configured recipients with the failure cause and logs using Amazon SNS.
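
The following is a hedged sketch of how such a notification could be wired up using an Airflow on_failure_callback that publishes to an Amazon SNS topic; the topic ARN and message fields are placeholders, and the solution may implement the notification differently.

import boto3

# Placeholder SNS topic for failure notifications.
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:etl-failure-notifications"

def notify_failure(context):
    """Airflow on_failure_callback: publish DAG, task, and log details to SNS."""
    task_instance = context["task_instance"]
    message = (
        f"DAG: {task_instance.dag_id}\n"
        f"Task: {task_instance.task_id}\n"
        f"Execution date: {context['ds']}\n"
        f"Log URL: {task_instance.log_url}\n"
        f"Exception: {context.get('exception')}"
    )
    boto3.client("sns").publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject=f"ETL task failed: {task_instance.task_id}",
        Message=message,
    )

# Attach the callback when defining the DAG, for example:
# default_args = {"on_failure_callback": notify_failure}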

The key capabilities of this solution are:

  • Full automation: After the user commits the configuration files to Git, the remainder of the process is fully automated: the CI/CD pipelines deploy the artifacts and DAG code, and the DAG is later executed in Airflow at the scheduled time. The entire ETL job is logged and monitored, and email notifications are sent in case of failures.
  • Flexible integration: The application can seamlessly accommodate a new ETL process with minimal effort. To create a new process, prepare a configuration file that contains the source and target details and the necessary transformation logic. An example of how to specify your data transformation is shown in the following sample code.
    "data_transformations": [{
    "functionName": "cast column date_processed",
    "sqlQuery": "Select *, from_unixtime(UNIX_TIMESTAMP(date_processed, 'yyyy-MM-dd HH:mm:ss'), 'dd/MM/yyyy') as dateprocessed from table_details",
    "outputDFName": "table_details"
    },
    {
    "functionName": "find the reference data from lookup",
    "sqlQuery": "join_query_table_lookup.sql",
    "outputDFName": "super_search_table_details"
    }]
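
As the second entry in the sample shows, sqlQuery can reference an external .sql file (join_query_table_lookup.sql) instead of an inline statement. The following is a minimal sketch of how a framework like this might resolve both forms; the helper name resolve_sql and the local sql directory are hypothetical.

def resolve_sql(sql_query: str, sql_dir: str = "sql") -> str:
    """Return the SQL text, reading it from a .sql file when a file name is given."""
    if sql_query.lower().endswith(".sql"):
        # Illustrative: the file could equally be read from Amazon S3.
        with open(f"{sql_dir}/{sql_query}") as f:
            return f.read()
    return sql_query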

  • Fault tolerance: In addition to Apache Spark’s fault-tolerant capabilities, this solution offers the ability to recover data even after the Amazon EMR cluster has been terminated. The solution processes data in three phases. In the event of a failure in the Apache Spark job, the output of the last successful phase is temporarily stored in Amazon S3. When the job is triggered again through the Airflow DAG, the Apache Spark job resumes from the point at which it previously failed, ensuring continuity and minimizing disruptions to the workflow. The following figure shows job reporting in the Amazon MWAA UI; a minimal sketch of the resume logic follows the figure.

job reporting in the Amazon MWAA UI.
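
The following is a minimal sketch of the resume idea under the assumption that each phase writes its intermediate output under a well-known Amazon S3 prefix and a restarted job skips phases whose output already exists. The bucket, prefixes, and helper names are placeholders.

import boto3

s3 = boto3.client("s3")

def phase_output_exists(bucket: str, prefix: str) -> bool:
    """Return True if any object already exists under the phase's S3 prefix."""
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0

def run_phase(phase_name, run_fn, bucket="your-intermediate-bucket"):
    """Run a phase only if its intermediate output is not already in S3."""
    prefix = f"checkpoints/sample-job/{phase_name}/"
    if phase_output_exists(bucket, prefix):
        print(f"Skipping {phase_name}: output found at s3://{bucket}/{prefix}")
        return
    run_fn(f"s3://{bucket}/{prefix}")  # the phase writes its output to this prefix

# On a retried run, only the phases that have not completed are re-executed, for example:
# run_phase("fetch_and_validate", fetch_and_validate_fn)
# run_phase("transform", transform_fn)
# run_phase("load", load_fn)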

  • Scalability: As shown in the following figure, the Amazon EMR cluster is configured to use instance fleet options to scale the number of nodes up or down depending on the size of the data, which makes this application an ideal choice for businesses with growing data needs. A sample instance fleet definition is sketched after the figure.

instance fleet options to scale up or down
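
As an illustration, the following is a hedged sketch of an instance fleet definition that could be supplied in the EMR cluster configuration (for example, through the job_flow_overrides of the cluster creation task). The instance types and target capacities are placeholders to tune for your own data volumes.

# Placeholder instance fleet configuration; tune instance types and capacities
# for your own data volumes.
INSTANCE_FLEETS = [
    {
        "Name": "primary-fleet",
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,
        "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
    },
    {
        "Name": "core-fleet",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 2,
        # Mix On-Demand and Spot capacity to balance reliability and cost.
        "TargetSpotCapacity": 4,
        "InstanceTypeConfigs": [
            {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
            {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
        ],
    },
]

# The fleets would be referenced from the cluster definition, for example:
# job_flow_overrides = {"Name": "etl-cluster", "Instances": {"InstanceFleets": INSTANCE_FLEETS}}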

  • Customizable: This solution can be customized to meet the needs of specific use cases, allowing you to add your own transformations, validations, and reconciliations according to your unique data management requirements.
  • Enhanced data flexibility: By incorporating support for multiple file formats, the Apache Spark application and Airflow DAGs gain the ability to seamlessly integrate and process data from various sources. This advantage allows data engineers to work with a wide range of file formats, including JSON, XML, Text, CSV, Parquet, Avro, and so on.
  • Concurrent execution: Amazon MWAA submits the tasks to Amazon EMR for concurrent execution, using the scalability and performance of distributed computing to process large volumes of data efficiently.
  • Proactive error notification: Email notifications are sent to configured recipients whenever a task fails, providing timely awareness of failures and facilitating prompt troubleshooting.

Considerations

In our deployments, we have noticed that the average DAG completion time is 15–20 minutes for a DAG containing 18 ETL processes running concurrently, each dealing with 900,000 to 1.2 million records. However, if you want to further optimize the DAG completion time, you can configure the core.dag_concurrency setting from the Amazon MWAA console as described in Example high performance use case.
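
If you prefer to manage this setting programmatically rather than through the console, the following is a hedged sketch that applies the same Airflow configuration override to an existing environment using the Amazon MWAA API; the environment name and value are placeholders.

import boto3

# Placeholder environment name and value; adjust for your own workload.
mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="your-mwaa-environment",
    AirflowConfigurationOptions={"core.dag_concurrency": "32"},
)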

Conclusion

The proposed code-driven data processing framework developed by Wipro using Amazon EMR Runtime for Apache Spark and Amazon MWAA demonstrates a solution to address the challenges associated with traditional ETL tools. By harnessing the capabilities from open source frameworks and seamlessly integrating with AWS services, you can build cost-effective, scalable, and automated approaches for your enterprise data processing pipelines.

Now that you have seen how to use Amazon EMR Runtime for Apache Spark with Amazon MWAA, we encourage you to create your own Amazon MWAA workflow that runs your ETL jobs on Amazon EMR Runtime for Apache Spark.

The configuration file samples and example DAG code referred to in this blog post can be found on GitHub.

Disclaimer

Sample code, software libraries, command line tools, proofs of concept, templates, or other related technology are provided as AWS Content or third-party content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content or third-party content in your production accounts, or on production or other critical data. Performance metrics, including the stated DAG completion time, may vary based on the specific deployment environment. You are responsible for testing, securing, and optimizing the AWS Content or third-party content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content or Third-Party Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.


About the Authors

Deependra Shekhawat is a Senior Energy and Utilities Industry Specialist Solutions Architect based in Sydney, Australia. In his role, Deependra helps Energy companies across APJ region use cloud technologies to drive sustainability and operational efficiency. He specializes in creating robust data foundations and advanced workflows that enable organizations to harness the power of big data, analytics, and machine learning for solving critical industry challenges.

Senaka Ariyasinghe is a Senior Partner Solutions Architect working with Global Systems Integrators at AWS. In his role, Senaka guides AWS partners in the APJ region to design and scale well-architected solutions, focusing on generative AI, machine learning, cloud migrations, and application modernization initiatives.

Sandeep Kushwaha is a Senior Data Scientist at Wipro and has extensive experience in big data and machine learning. With a strong command of Apache Spark, Sandeep has designed and implemented cutting-edge cloud solutions that optimize data processing and drive efficiency. His expertise in using AWS services and best practices, combined with his deep knowledge of data management and automation, has enabled him to lead successful projects that meet complex technical challenges and deliver high-impact results.