In today’s rapidly evolving digital landscape, enterprises across regulated industries face a critical challenge as they navigate their digital transformation journeys: effectively managing and governing data from legacy systems that are being phased out or replaced. This historical data, often containing valuable insights and subject to stringent regulatory requirements, must be preserved and made accessible to authorized users throughout the organization.
Failure to address this issue can lead to significant consequences, including data loss, operational inefficiencies, and potential compliance violations. Moreover, organizations are seeking solutions that not only safeguard this legacy data but also provide seamless access based on existing user entitlements, while maintaining robust audit trails and governance controls. As regulatory scrutiny intensifies and data volumes continue to grow exponentially, enterprises must develop comprehensive strategies to tackle these complex data management and governance challenges, making sure they can use their historical information assets while remaining compliant and agile in an increasingly data-driven business environment.
In this post, we explore a solution using AWS Lake Formation and AWS IAM Identity Center to address the complex challenges of managing and governing legacy data during digital transformation. We demonstrate how enterprises can effectively preserve historical data while enforcing compliance and maintaining user entitlements. This solution enables your organization to maintain robust audit trails, enforce governance controls, and provide secure, role-based access to data.
Solution overview
This comprehensive AWS-based solution is designed to address the complex challenges of managing and governing legacy data during digital transformation.
In this blog post, there are three personas:
- Data Lake Administrator (with admin-level access)
- User `Silver` from the Data Engineering group
- User `Lead Auditor` from the Auditor group
You will see how different personas in an organization can access the data without the need to modify their existing enterprise entitlements.
Note: Most of the steps here are performed by the Data Lake Administrator, unless specifically mentioned for other federated/user logins. If the text specifies “You” to perform a step, it assumes that you are a Data Lake Administrator with admin-level access.
In this solution you move your historical data into Amazon Simple Storage Service (Amazon S3) and apply data governance using Lake Formation. The following diagram illustrates the end-to-end solution.
The workflow steps are as follows:
- You will use IAM Identity Center to apply fine-grained access control through permission sets. You can integrate IAM Identity Center with an external corporate identity provider (IdP). In this post, we have used Microsoft Entra ID as an IdP, but you can use another external IdP like Okta.
- The data ingestion process is streamlined through a robust pipeline that combines AWS Database Migration Service (AWS DMS) for efficient data transfer and AWS Glue for data cleansing and cataloging.
- You will use AWS Lake Formation to preserve existing entitlements during the transition. This makes sure the workforce users retain the appropriate access levels in the new data store.
- User personas `Silver` and `Lead Auditor` can use their existing IdP credentials to securely access the data using federated access.
- For analytics, Amazon Athena provides a serverless query engine, allowing users to effortlessly explore and analyze the ingested data. Athena workgroups further enhance security and governance by isolating users, teams, applications, or workloads into logical groups.
The following sections walk through how to configure access management for two different groups and demonstrate how the groups access data using the permissions granted in Lake Formation.
Prerequisites
To follow along with this post, you should have the following:
- An AWS account with IAM Identity Center enabled. For more information, see Enabling AWS IAM Identity Center.
- Set up IAM Identity Center with Entra ID as an external IdP.
- In this post, we use users and groups in Entra ID. We have created two groups: `Data Engineering` and `Auditor`. The user `Silver` belongs to the `Data Engineering` group, and `Lead Auditor` belongs to the `Auditor` group.
Configure identity and access management with IAM Identity Center
Entra ID automatically provisions (synchronizes) the users and groups created in Entra ID into IAM Identity Center. You can validate this by examining the groups listed on the Groups page on the IAM Identity Center console. The following screenshot shows the group Data Engineering, which was created in Entra ID.
If you navigate to the group `Data Engineering` in IAM Identity Center, you should see the user `Silver`. Similarly, the group `Auditor` has the user `Lead Auditor`.
You now create a permission set, which aligns with a workforce job role in IAM Identity Center. This makes sure that your workforce operates within the boundary of the permissions that you have defined for the user.
- On the IAM Identity Center console, choose Permission sets in the navigation pane.
- Choose Create permission set. Select Custom permission set, then choose Next. On the next screen, specify the permission set details.
- Provide a permission set name (for this post, `Data-Engineer`), keeping the rest of the options at their default values.
- To enhance security controls, attach the inline policy described here to the `Data-Engineer` permission set to restrict users’ access to certain Athena workgroups. This additional layer of access management makes sure that users can only operate within the designated workgroups, preventing unauthorized access to sensitive data or resources.
For this post, we are using separate Athena workgroups for Data Engineering and Auditors. Pick a meaningful workgroup name (for example, `Data-Engineer`, used in this post), which you will use during the Athena setup. Replace the AWS Region and account number in the following code with the values relevant to your AWS account.
Edit the inline policy for the `Data-Engineer` permission set. Copy and paste the following JSON policy text, replace the parameters in the ARN as suggested earlier, and save the policy.
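The original policy text is not reproduced in this excerpt, so the following is a minimal, hypothetical sketch of what such a workgroup-restriction policy can look like. Replace `<region>` and `<account-number>` with your values; a complete permission set also needs the AWS Glue, Lake Formation, and Amazon S3 permissions that Athena queries rely on.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAssignedWorkgroup",
      "Effect": "Allow",
      "Action": [
        "athena:StartQueryExecution",
        "athena:StopQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:GetWorkGroup"
      ],
      "Resource": "arn:aws:athena:<region>:<account-number>:workgroup/Data-Engineer"
    },
    {
      "Sid": "DenyOtherWorkgroups",
      "Effect": "Deny",
      "Action": [
        "athena:StartQueryExecution",
        "athena:StopQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:GetWorkGroup"
      ],
      "NotResource": "arn:aws:athena:<region>:<account-number>:workgroup/Data-Engineer"
    }
  ]
}
```

The Deny statement with `NotResource` blocks these Athena actions against every workgroup except `Data-Engineer`, so the workgroup isolation holds even if a broader Allow exists elsewhere in the permission set.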
The preceding inline policy restricts anyone mapped to the `Data-Engineer` permission set to only the `Data-Engineer` workgroup in Athena. Users with this permission set will not be able to access any other Athena workgroup.
Next, you assign the `Data-Engineer` permission set to the Data Engineering group in IAM Identity Center.
- Choose AWS accounts in the navigation pane, then select the AWS account (for this post, `workshopsandbox`).
- Choose Assign users and groups to choose your groups and permission sets. Choose the group Data Engineering from the list of groups, then choose Next. Choose the permission set Data-Engineer from the list of permission sets, then choose Next. Finally, review and submit.
- Follow the previous steps to create another permission set with the name `Auditor`.
- Use an inline policy similar to the preceding one to restrict access to a specific Athena workgroup for `Auditor`.
- Assign the permission set `Auditor` to the group `Auditor`.
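The console assignment above can also be scripted. The following is a minimal sketch (not code from the original post) of the parameters you would pass to the IAM Identity Center `sso-admin` CreateAccountAssignment API via boto3; the instance ARN, account ID, group ID, and permission set ARN are all placeholder values.

```python
import json

# Placeholder identifiers -- replace with values from your environment.
INSTANCE_ARN = "arn:aws:sso:::instance/ssoins-EXAMPLE"
ACCOUNT_ID = "111122223333"
GROUP_ID = "11111111-2222-3333-4444-555555555555"  # Data Engineering group ID
PERMISSION_SET_ARN = "arn:aws:sso:::permissionSet/ssoins-EXAMPLE/ps-EXAMPLE"

# Parameters for CreateAccountAssignment, which binds the Data-Engineer
# permission set to the Data Engineering group in a single AWS account.
assignment_params = {
    "InstanceArn": INSTANCE_ARN,
    "TargetId": ACCOUNT_ID,
    "TargetType": "AWS_ACCOUNT",
    "PrincipalType": "GROUP",
    "PrincipalId": GROUP_ID,
    "PermissionSetArn": PERMISSION_SET_ARN,
}

# With AWS credentials configured, you would run:
#   boto3.client("sso-admin").create_account_assignment(**assignment_params)
print(json.dumps(assignment_params, indent=2))
```

Repeating the call with the `Auditor` group ID and permission set ARN completes the second assignment.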
This completes the first section of the solution. In the next section, we create the data ingestion and processing pipeline.
Create the data ingestion and processing pipeline
In this step, you create a source database and move the data to Amazon S3. Although the enterprise data often resides on premises, for this post, we create an Amazon Relational Database Service (Amazon RDS) for Oracle instance in a separate virtual private cloud (VPC) to mimic the enterprise setup.
- Create an RDS for Oracle DB instance and populate it with sample data. For this post, we use the `HR` schema, which you can find in Oracle Database Sample Schemas.
- Create source and target endpoints in AWS DMS:
  - The source endpoint `demo-sourcedb` points to the Oracle instance.
  - The target endpoint `demo-targetdb` is an Amazon S3 location where the relational database will be stored in Apache Parquet format.
The source database endpoint will have the configurations required to connect to the RDS for Oracle DB instance, as shown in the following screenshot.
The target endpoint for the Amazon S3 location will have an S3 bucket name and folder where the relational database will be stored. Additional connection attributes, like `DataFormat`, can be provided on the Endpoint settings tab. The following screenshot shows the configurations for `demo-targetdb`.
Set `DataFormat` to Parquet for the data stored in the S3 bucket. Enterprise users can then use Athena to query the data held in Parquet format.
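As a sketch of what these console settings amount to, the following shows the kind of parameters you would pass to the AWS DMS `CreateEndpoint` API for a Parquet target. The bucket name, folder, and role ARN are placeholder assumptions, not values from the post.

```python
import json

# Hypothetical values -- replace the bucket, folder, and role with your own.
target_endpoint_params = {
    "EndpointIdentifier": "demo-targetdb",
    "EndpointType": "target",
    "EngineName": "s3",
    "S3Settings": {
        "BucketName": "my-legacy-data-bucket",   # placeholder bucket
        "BucketFolder": "hr-schema",             # placeholder folder
        # Placeholder role that grants DMS write access to the bucket.
        "ServiceAccessRoleArn": "arn:aws:iam::111122223333:role/dms-s3-access",
        "DataFormat": "parquet",                 # store output as Apache Parquet
        "ParquetVersion": "parquet-2-0",
    },
}

# With AWS credentials configured, you would run:
#   boto3.client("dms").create_endpoint(**target_endpoint_params)
print(json.dumps(target_endpoint_params, indent=2))
```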
Next, you use AWS DMS to transfer the data from the RDS for Oracle instance to Amazon S3. In large organizations, the source database could be located anywhere, including on premises.
- On the AWS DMS console, create a replication instance that will connect to the source database and move the data.
Select the instance class carefully; it should be proportionate to the volume of data being migrated. The following screenshot shows the replication instance used in this post.
- Provide the database migration task with the source and target endpoints, which you created in the previous steps.
The following screenshot shows the configuration for the task `datamigrationtask`.
- After you create the migration task, select your task and start the job.
The full data load process will take a few minutes to complete.
You have data available in Parquet format, stored in an S3 bucket. To make this data accessible for analysis by your users, you need to create an AWS Glue crawler. The crawler will automatically crawl and catalog the data stored in your Amazon S3 location, making it available in Lake Formation.
- When creating the crawler, specify the S3 location where the data is stored as the data source.
- Provide the database name `myappdb` for the crawler to catalog the data into.
- Run the crawler you created.
After the crawler has completed its job, your users will be able to access and analyze the data in the AWS Glue Data Catalog with Lake Formation securing access.
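Equivalently, the crawler can be defined through the AWS Glue `CreateCrawler` API. The following is a minimal sketch; the crawler name, IAM role, and S3 path are placeholder assumptions.

```python
import json

crawler_params = {
    "Name": "legacy-data-crawler",        # placeholder crawler name
    "Role": "AWSGlueServiceRole-legacy",  # placeholder IAM role for Glue
    "DatabaseName": "myappdb",            # catalog database used in this post
    "Targets": {
        "S3Targets": [
            # Placeholder path -- the DMS target location from the prior step.
            {"Path": "s3://my-legacy-data-bucket/hr-schema/"}
        ]
    },
}

# With AWS credentials configured, you would run:
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_params)
#   glue.start_crawler(Name=crawler_params["Name"])
print(json.dumps(crawler_params, indent=2))
```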
- On the Lake Formation console, choose Databases in the navigation pane.
You will find `myappdb` in the list of databases.
Configure data lake and entitlement access
With Lake Formation, you can lay the foundation for a robust, secure, and compliant data lake environment. Lake Formation plays a crucial role in our solution by centralizing data access control and preserving existing entitlements during the transition from legacy systems. This powerful service enables you to implement fine-grained permissions, so your workforce users retain appropriate access levels in the new data environment.
- On the Lake Formation console, choose Data lake locations in the navigation pane.
- Choose Register location to register the Amazon S3 location with Lake Formation so it can access Amazon S3 on your behalf.
- For Amazon S3 path, enter your target Amazon S3 location.
- For IAM role, keep the IAM role as `AWSServiceRoleForLakeFormationDataAccess`.
- For Permission mode, select the Lake Formation option to manage access.
- Choose Register location.
You can use tag-based access control to manage access to the database `myappdb`.
- Create an LF-Tag `data classification` with the following values:
- General – To imply that the data is not sensitive in nature.
- Restricted – To imply generally sensitive data.
- HighlyRestricted – To imply that the data is highly restricted in nature and only accessible to certain job functions.
- Navigate to the database `myappdb` and on the Actions menu, choose Edit LF-Tags to assign an LF-Tag to the database. Choose Save to apply the change.
As shown in the following screenshot, we have assigned the value General to the `myappdb` database.
The database `myappdb` has seven tables. For simplicity, we work with the `jobs` table in this post. We apply restrictions to the columns of this table so that its data is visible only to users who are authorized to view it.
- Navigate to the jobs table and choose Edit schema to add LF-Tags at the column level.
- Assign the value `HighlyRestricted` to the two columns `min_salary` and `max_salary`.
- Choose Save as new version to apply these changes.
The goal is to restrict access to these columns for all users except `Auditor`.
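The tagging steps above map to two Lake Formation API calls: `CreateLFTag` to define the tag and its values, and `AddLFTagsToResource` to attach a value to the salary columns. The following sketch shows the request parameters; the catalog ID (your account ID) is a placeholder.

```python
import json

CATALOG_ID = "111122223333"  # placeholder AWS account ID

# CreateLFTag: define the tag key and its allowed values.
create_tag_params = {
    "CatalogId": CATALOG_ID,
    "TagKey": "data classification",
    "TagValues": ["General", "Restricted", "HighlyRestricted"],
}

# AddLFTagsToResource: tag the two salary columns as HighlyRestricted.
tag_columns_params = {
    "CatalogId": CATALOG_ID,
    "Resource": {
        "TableWithColumns": {
            "DatabaseName": "myappdb",
            "Name": "jobs",
            "ColumnNames": ["min_salary", "max_salary"],
        }
    },
    "LFTags": [
        {"TagKey": "data classification", "TagValues": ["HighlyRestricted"]}
    ],
}

# With AWS credentials configured, you would run:
#   lf = boto3.client("lakeformation")
#   lf.create_lf_tag(**create_tag_params)
#   lf.add_lf_tags_to_resource(**tag_columns_params)
print(json.dumps(tag_columns_params, indent=2))
```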
- Choose Databases in the navigation pane.
- Select your database and on the Actions menu, choose Grant to provide permissions to your enterprise users.
- For IAM users and roles, choose the role created by IAM Identity Center for the group Data Engineering. Choose the IAM role with the prefix `AWSReservedSSO_Data-Engineer` from the list. This role is created as a result of creating permission sets in IAM Identity Center.
- In the LF-Tags section, select Resources matched by LF-Tags, then choose Add LF-Tag key-value pair. Provide the LF-Tag key `data classification` and the values `General` and `Restricted`. This grants the Data Engineering group access to the database `myappdb` as long as the database is tagged with the values `General` or `Restricted`.
- In the Database permissions and Table permissions sections, select the specific permissions you want to give to the users in the group Data Engineering. Choose Grant to apply these changes.
- Repeat these steps to grant permissions to the role for the group `Auditor`. In this example, choose the IAM role with the prefix `AWSReservedSSO_Auditor` and grant the `data classification` LF-Tag with all possible values.
- This grant implies that personas logging in with the `Auditor` permission set will have access to data tagged with the values `General`, `Restricted`, and `HighlyRestricted`.
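In API terms, each grant above is a Lake Formation `GrantPermissions` call with an LF-tag policy resource. The following sketch shows the Data Engineering grant with `SELECT` as an example permission; the role ARN is a placeholder, since the real `AWSReservedSSO_...` role name carries a random suffix.

```python
import json

# Placeholder ARN -- the actual role name includes a random suffix.
DATA_ENGINEER_ROLE_ARN = (
    "arn:aws:iam::111122223333:role/aws-reserved/sso.amazonaws.com/"
    "AWSReservedSSO_Data-Engineer_0000000000000000"
)

grant_params = {
    "Principal": {"DataLakePrincipalIdentifier": DATA_ENGINEER_ROLE_ARN},
    "Resource": {
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            # Tables tagged General or Restricted match this policy;
            # HighlyRestricted columns stay out of reach.
            "Expression": [
                {
                    "TagKey": "data classification",
                    "TagValues": ["General", "Restricted"],
                }
            ],
        }
    },
    "Permissions": ["SELECT"],
}

# With AWS credentials configured, you would run:
#   boto3.client("lakeformation").grant_permissions(**grant_params)
print(json.dumps(grant_params, indent=2))
```

The `Auditor` grant is the same call with the Auditor role ARN and all three tag values in the expression.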
You have now completed the third section of the solution. In the next sections, we demonstrate how users from the two groups, Data Engineering and Auditor, access data using the permissions granted in Lake Formation.
Log in with federated access using Entra ID
Complete the following steps to log in using federated access:
- On the IAM Identity Center console, choose Settings in the navigation pane.
- Locate the URL for the AWS access portal.
- Log in as the user Silver.
- Choose your job function `Data-Engineer` (this is the permission set from IAM Identity Center).
Perform data analytics and run queries in Athena
Athena serves as the final piece in our solution, working with Lake Formation to make sure individual users can only query the datasets they’re entitled to access. By using Athena workgroups, we create dedicated spaces for different user groups or departments, further reinforcing our access controls and maintaining clear boundaries between different data domains.
You can create an Athena workgroup by navigating to Amazon Athena on the AWS Management Console.
- Select Workgroups in the navigation pane and choose Create workgroup.
- On the next screen, provide the workgroup name `Data-Engineer` and leave the other fields at their default values.
- For the query result configuration, select the S3 location for the `Data-Engineer` workgroup.
- Choose Create workgroup.
Similarly, create a workgroup for the Auditor group. Choose a separate S3 bucket for Athena query results for each workgroup. Make sure that the workgroup name matches the name used in the ARN string of the inline policy of the corresponding permission set.
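The workgroup creation can likewise be sketched as an Athena `CreateWorkGroup` call; the results bucket below is a placeholder.

```python
import json

workgroup_params = {
    # Must match the workgroup name in the permission set's inline policy ARN.
    "Name": "Data-Engineer",
    "Configuration": {
        "ResultConfiguration": {
            # Placeholder bucket -- use a separate results location per workgroup.
            "OutputLocation": "s3://athena-results-data-engineer/"
        },
        # Prevent clients from overriding the workgroup's settings.
        "EnforceWorkGroupConfiguration": True,
    },
    "Description": "Workgroup for the Data Engineering group",
}

# With AWS credentials configured, you would run:
#   boto3.client("athena").create_work_group(**workgroup_params)
print(json.dumps(workgroup_params, indent=2))
```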
In this setup, users can only view and query tables that align with their Lake Formation granted entitlements. This seamless integration of Athena with our broader data governance strategy means that as users explore and analyze data, they’re doing so within the strict confines of their authorized data scope.
This approach not only enhances our security posture but also streamlines the user experience, eliminating the risk of inadvertent access to sensitive information while empowering users to derive insights efficiently from their relevant data subsets.
Let’s explore how Athena provides this powerful, yet tightly controlled, analytical capability to our organization.
When user `Silver` accesses Athena, they’re redirected to the Athena console. According to the inline policy in the permission set, they have access to the `Data-Engineer` workgroup only.
After they select the workgroup `Data-Engineer` from the Workgroup drop-down menu and the `myappdb` database, the `jobs` table displays all columns except two. The `min_salary` and `max_salary` columns, which were tagged as `HighlyRestricted`, are not displayed.
This outcome aligns with the permissions granted to the `Data-Engineer` group in Lake Formation, making sure that sensitive information remains protected.
If you repeat the same steps for federated access and log in as `Lead Auditor`, you’re similarly redirected to the Athena console. In accordance with the inline policy in the permission set, this user has access to the `Auditor` workgroup only.
When they select the workgroup `Auditor` from the Workgroup drop-down menu and the `myappdb` database, the `jobs` table displays all columns.
This behavior aligns with the permissions granted to the group `Auditor` in Lake Formation, making sure all information is accessible to that group.
Enabling users to access only the data they are entitled to based on their existing permissions is a powerful capability. Large organizations often want to store data without having to modify queries or adjust access controls.
This solution enables seamless data access while maintaining data governance standards by allowing users to rely on their current permissions. This selective accessibility balances organizational needs for data retention and compliance: companies can store data without exposing sensitive information across environments.
This granular level of access within data stores is a game changer for regulated industries or businesses seeking to manage data responsibly.
Clean up
To clean up the resources that you created for this post and avoid ongoing charges, delete the following:
- IAM Identity Center application in Entra ID
- IAM Identity Center configurations
- RDS for Oracle and AWS DMS replication instances
- Athena workgroups and the query results in Amazon S3
- S3 buckets
Conclusion
This AWS powered solution tackles the critical challenges of preserving, safeguarding, and scrutinizing historical data in a scalable and cost-efficient way. The centralized data lake, reinforced by robust access controls and self-service analytics capabilities, empowers organizations to maintain their invaluable data assets while enabling authorized users to extract valuable insights from them.
By harnessing the combined strength of AWS services, this approach addresses key difficulties related to legacy data retention, security, and analysis. The centralized repository, coupled with stringent access management and user-friendly analytics tools, enables enterprises to safeguard their critical information resources while simultaneously empowering sanctioned personnel to derive meaningful intelligence from these data sources.
If your organization grapples with similar obstacles surrounding the preservation and management of data, we encourage you to explore this solution and evaluate how it could potentially benefit your operations.
For more information on Lake Formation and its data governance features, refer to AWS Lake Formation Features.
About the authors
Manjit Chakraborty is a Senior Solutions Architect at AWS. He is a seasoned, results-driven professional with extensive experience in the financial domain, having worked with customers across the globe on advising, designing, leading, and implementing core business enterprise solutions. In his spare time, Manjit enjoys fishing, practicing martial arts, and playing with his daughter.
Neeraj Roy is a Principal Solutions Architect at AWS based out of London. He works with Global Financial Services customers to accelerate their AWS journey. In his spare time, he enjoys reading and spending time with his family.
Evren Sen is a Principal Solutions Architect at AWS, focusing on strategic financial services customers. He helps his customers create Cloud Center of Excellence and design, and deploy solutions on the AWS Cloud. Outside of AWS, Evren enjoys spending time with family and friends, traveling, and cycling.