AWS Lake Formation: Part 1 Architectural Deep Dive

By Todd Bernson BSC Analytics

25 Jun 2024

This series investigates how Lake Formation can define governance for your data lake or data mesh. It takes a detailed technical look at AWS Lake Formation: its architecture, its internal mechanics, and its interactions with other AWS services. The aim is to give readers a deeper understanding of how Lake Formation works at a granular level and how it fits into the broader AWS ecosystem.

Clone the project here.

Intro

AWS Lake Formation simplifies setting up a secure and efficient data lake. It automates many of the tedious tasks associated with data lakes, such as data loading, cataloging, cleaning, and securing. This section will break down the architectural components that enable these functionalities.

Key Components

Data Catalog: The central metadata repository, an extension of the AWS Glue Data Catalog.
Security and Access Control: Manages permissions down to the column and row level from a central policy definition.
Workflow and Pipeline Management: Coordinates data ingestion, transformation, and cleaning processes.

Deep Dive into Lake Formation Components

Data Catalog

The Data Catalog is the core component where metadata about all data assets is stored. It supports searching and querying of data, making it easier to manage large datasets.
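As an illustration of what "metadata about data assets" looks like in practice, here is a minimal Terraform sketch of a catalog database and one table definition. The resource name aws_glue_catalog_database.this matches the reference used later in this article, but the database name, table name, columns, and S3 location are all hypothetical, and in this project tables may well be created by a crawler rather than declared by hand.

```hcl
# Sketch only: names, columns, and the S3 location are assumptions.
resource "aws_glue_catalog_database" "this" {
  name        = "aws_lakeformation_poc" # hypothetical database name
  description = "Catalog database governed by Lake Formation"
}

# A hand-declared table showing the kind of metadata the catalog stores:
# where the data lives, how it is serialized, and its column schema.
resource "aws_glue_catalog_table" "orders" {
  name          = "orders" # hypothetical table name
  database_name = aws_glue_catalog_database.this.name
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    location      = "s3://example-data-lake-bucket/orders/" # hypothetical
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
      parameters = {
        "field.delim" = ","
      }
    }

    columns {
      name = "order_id"
      type = "string"
    }

    columns {
      name = "order_date"
      type = "date"
    }
  }
}
```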

Security and Access Control

Lake Formation integrates tightly with AWS Identity and Access Management (IAM) to provide detailed access controls, so that only authorized users and roles can access specific data resources. Here we grant the role that Glue uses the ALTER, CREATE_TABLE, and DROP permissions on the project's catalog database.

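# Grant the Glue service role database-level permissions in Lake Formation
resource "aws_lakeformation_permissions" "glue_catalog_database_permissions" {
  principal = aws_iam_role.glue_service_role.arn

  permissions = [
    "ALTER",
    "CREATE_TABLE",
    "DROP",
  ]

  # Scope the grant to the project's Glue catalog database
  database {
    name = aws_glue_catalog_database.this.name
  }
}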
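The grant above is scoped to a whole database. To show the column-level control mentioned under Key Components, the following is a minimal sketch of a SELECT grant restricted to named columns. The analyst role variable, the table name, and the column names are assumptions for illustration and are not part of the original project.

```hcl
# Sketch only: the principal, table, and columns below are hypothetical.
variable "analyst_role_arn" {
  type        = string
  description = "ARN of an IAM role that should get read access (hypothetical)"
}

resource "aws_lakeformation_permissions" "analyst_column_select" {
  principal   = var.analyst_role_arn
  permissions = ["SELECT"]

  # Limit the grant to two columns of a single table
  table_with_columns {
    database_name = aws_glue_catalog_database.this.name
    name          = "orders"                     # hypothetical table name
    column_names  = ["order_id", "order_date"]   # hypothetical columns
  }
}
```

With a grant like this, a query from that role that selects any other column of the table should be rejected by Lake Formation rather than by an IAM policy.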

Integration with Other AWS Services

Lake Formation does not operate in isolation; it integrates with other AWS services that crawl, catalog, and query the data it governs. We use several of those services in this project.

Integration with Amazon Athena

Lake Formation's Data Catalog is natively accessible from Athena, allowing for serverless querying of data lake content.

# Start the query using the lfrole profile; the command returns a QueryExecutionId
aws athena start-query-execution \
    --query-string "SELECT * FROM data_50600a86b68063ce3940961a3222e0bf LIMIT 10;" \
    --query-execution-context Database=aws_lakeformation_poc_dqbn \
    --result-configuration OutputLocation=s3://aws-lakeformation-poc-dqbn/ \
    --profile lfrole

# Check the state of the query using the returned QueryExecutionId
aws athena get-query-execution \
    --output text --query 'QueryExecution.Status.State' \
    --query-execution-id 3faf9805-4efc-43f7-88c5-b11d51b746ea

SUCCEEDED
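The start-query-execution call above passes the result location on every invocation. One alternative, sketched here as an assumption rather than something the project is known to do, is to set the output location once on an Athena workgroup in Terraform. The workgroup name is hypothetical; the bucket is the one used in the command above.

```hcl
# Sketch only: the workgroup name is an assumption.
resource "aws_athena_workgroup" "lakeformation_poc" {
  name = "lakeformation-poc" # hypothetical workgroup name

  configuration {
    result_configuration {
      # Same bucket the CLI example writes results to
      output_location = "s3://aws-lakeformation-poc-dqbn/"
    }
  }
}
```

Queries could then be started with --work-group lakeformation-poc in place of the --result-configuration flag.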

This deep dive into AWS Lake Formation's architecture provides insights into its internal mechanisms and demonstrates its flexible integration with other AWS services, establishing a robust foundation for managing and securing data lakes.

Integration with Glue

Lake Formation uses AWS Glue to crawl and ingest the data, and it provides the governance over that data.

Lake Formation also manages fine-grained permissions on Glue resources such as catalog databases and tables.
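To make the crawl-and-govern relationship concrete, here is a minimal sketch that registers an S3 location with Lake Formation, lets the Glue role create catalog tables that point at it, and defines a crawler that writes into the catalog database. The bucket variable and crawler name are hypothetical; aws_iam_role.glue_service_role and aws_glue_catalog_database.this are the resources referenced earlier in this article.

```hcl
# Sketch only: the bucket variable and crawler name are assumptions.
variable "data_lake_bucket" {
  type        = string
  description = "Name of the S3 bucket holding the raw data (hypothetical)"
}

# Register the bucket so that Lake Formation governs access to it
resource "aws_lakeformation_resource" "data_lake" {
  arn = "arn:aws:s3:::${var.data_lake_bucket}"
}

# Allow the Glue role to create catalog tables that point at the registered location
resource "aws_lakeformation_permissions" "glue_data_location" {
  principal   = aws_iam_role.glue_service_role.arn
  permissions = ["DATA_LOCATION_ACCESS"]

  data_location {
    arn = aws_lakeformation_resource.data_lake.arn
  }
}

# Crawl the bucket and write the discovered tables into the catalog database
resource "aws_glue_crawler" "this" {
  name          = "lakeformation-poc-crawler" # hypothetical crawler name
  role          = aws_iam_role.glue_service_role.arn
  database_name = aws_glue_catalog_database.this.name

  s3_target {
    path = "s3://${var.data_lake_bucket}/"
  }
}
```

The database-level grant shown earlier (ALTER, CREATE_TABLE, DROP) is what would allow a crawler running under this role to create and update tables in that database.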

