AWS Lake Formation: Part 3 Configuring Complex AWS Glue Workflows

Todd Bernson

2024-09-28

In this installment of the AWS Lake Formation series, I'll discuss configuring complex AWS Glue workflows with Terraform.

Specifically, I focus on setting up multi-source data crawlers and managing dependencies and triggers for AWS Glue jobs. The goal is to demonstrate how Terraform can automate and streamline the management of AWS Glue resources.

Clone the project here.

Setting Up Multi-Source Data Crawlers

AWS Glue crawlers scan data sources and populate the AWS Glue Data Catalog with metadata tables. ETL jobs and query services such as Amazon Athena then use these tables. In a complex data environment, a single crawler might need to cover several sources, such as multiple S3 buckets or prefixes, JDBC-accessible databases, or DynamoDB tables.

Terraform Configuration for a Multi-Source Crawler:

resource "aws_glue_crawler" "multi_source_crawler" {
  name          = "${local.environment}_multi_source_crawler"
  role          = aws_iam_role.glue_service_role.arn
  database_name = aws_glue_catalog_database.this.name

  s3_target {
    path = "s3://${data.aws_s3_bucket.bucket.bucket}/data_source_one/"
  }

  s3_target {
    path = "s3://${data.aws_s3_bucket.bucket.bucket}/data_source_two/"
  }

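  jdbc_target {
    # Name of an existing AWS Glue connection.
    connection_name = "db-connection"
    # Placeholder; JDBC include paths usually take the form "database/schema/%".
    path            = "database-schema"
  }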

  schema_change_policy {
    delete_behavior = "LOG"
    update_behavior = "UPDATE_IN_DATABASE"
  }

  # Run daily at 03:00 UTC.
  schedule = "cron(0 3 * * ? *)"
}

In this configuration, the crawler scans two S3 prefixes and one JDBC target, for example a relational database reached through a connection defined in AWS Glue, and writes the resulting tables to a single Data Catalog database. It runs daily at 03:00 UTC. When the crawler detects a schema change, it updates the table in the catalog; when a source object disappears, it only logs the deletion and leaves the table in place.
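The `jdbc_target` above refers to a Glue connection named `db-connection` that the snippet itself does not define. A minimal sketch of what such a connection might look like follows. The JDBC URL, the variable names, and the network settings are placeholders assumed for illustration; they are not taken from the project.

```hcl
# Hypothetical variables: supply real values through your own tfvars or secrets tooling.
variable "db_username" {
  type = string
}

variable "db_password" {
  type      = string
  sensitive = true
}

variable "db_subnet_id" {
  type = string
}

variable "db_security_group_id" {
  type = string
}

variable "db_availability_zone" {
  type = string
}

resource "aws_glue_connection" "db" {
  # Must match connection_name in the crawler's jdbc_target.
  name            = "db-connection"
  connection_type = "JDBC"

  connection_properties = {
    # Placeholder endpoint and database name (assumption).
    JDBC_CONNECTION_URL = "jdbc:postgresql://db.example.internal:5432/analytics"
    USERNAME            = var.db_username
    PASSWORD            = var.db_password
  }

  # Network placement so Glue can reach a database inside a VPC.
  physical_connection_requirements {
    availability_zone      = var.db_availability_zone
    subnet_id              = var.db_subnet_id
    security_group_id_list = [var.db_security_group_id]
  }
}
```

With a resource like this in place, the crawler could set `connection_name = aws_glue_connection.db.name` instead of the literal string, so that Terraform creates the connection before the crawler.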

Managing Dependencies and Triggers for AWS Glue Jobs

Triggers in AWS Glue control when ETL jobs run: on a schedule, on demand, or when another job or crawler reaches a given state. Managing them in Terraform keeps the execution order and schedule of the pipeline in code, alongside the jobs themselves.

Example Terraform Configuration for Glue Job Triggers:

resource "aws_glue_trigger" "data_processing_trigger" {
  name     = "daily_data_processing_trigger"
  type     = "SCHEDULED"
  schedule = "cron(0 4 * * ? *)"

  actions {
    job_name = aws_glue_job.data_processing_job.name
  }
}

resource "aws_glue_job" "data_processing_job" {
  name     = "daily_data_processing"
  role_arn = aws_iam_role.glue_service_role.arn

  # Assumption: pin a Glue version that supports Python 3; adjust to the version you run.
  glue_version = "4.0"

  command {
    script_location = "s3://${data.aws_s3_bucket.bucket.bucket}/scripts/data_processing.py"
    python_version  = "3"
  }

  default_arguments = {
    "--TempDir"        = "s3://${data.aws_s3_bucket.bucket.bucket}/tempdir"
    "--extra-py-files" = "s3://${data.aws_s3_bucket.bucket.bucket}/libs/common_libs.zip"
  }
}

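In this configuration, a scheduled trigger starts the Glue job every day at 04:00 UTC, one hour after the crawler's 03:00 UTC run. Each run therefore processes the data available up to the previous day against a freshly updated catalog, provided the crawl finishes within that hour. Note that the ordering here rests on timing alone: a SCHEDULED trigger does not check whether the crawler succeeded.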
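The scheduled trigger above expresses timing, not a dependency. If the job should run only after the crawl has actually succeeded, one option is a Glue workflow with a conditional trigger. The following is a rough sketch under that assumption; the workflow and trigger names are hypothetical, and it reuses the crawler and job defined earlier in this post.

```hcl
# Hypothetical workflow that groups the crawler and the job.
resource "aws_glue_workflow" "daily" {
  name = "${local.environment}_daily_workflow"
}

# Start trigger: kicks off the crawler on a schedule (03:00 UTC).
resource "aws_glue_trigger" "start_crawl" {
  name          = "${local.environment}_start_crawl"
  type          = "SCHEDULED"
  schedule      = "cron(0 3 * * ? *)"
  workflow_name = aws_glue_workflow.daily.name

  actions {
    crawler_name = aws_glue_crawler.multi_source_crawler.name
  }
}

# Conditional trigger: runs the job only after the crawl succeeds.
resource "aws_glue_trigger" "after_crawl" {
  name          = "${local.environment}_after_crawl"
  type          = "CONDITIONAL"
  workflow_name = aws_glue_workflow.daily.name

  predicate {
    conditions {
      crawler_name = aws_glue_crawler.multi_source_crawler.name
      crawl_state  = "SUCCEEDED"
    }
  }

  actions {
    job_name = aws_glue_job.data_processing_job.name
  }
}
```

If you adopt this pattern, you would likely drop the crawler's own `schedule` argument and the standalone `daily_data_processing_trigger`, so that the crawler and the job do not each run twice a day.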

With Terraform, the crawlers, jobs, and triggers that make up a Glue workflow live in version-controlled configuration. That makes the setup repeatable across environments, reviewable like any other code change, and less prone to the manual errors that come with configuring these resources by hand.
