Setting Up Infrastructure for SageMaker Training

This article outlines the process of automating the setup of infrastructure for Amazon SageMaker training workflows using Terraform. It highlights the key components, including S3 buckets for data storage, IAM roles for secure access, VPCs for network security, and VPC endpoints for private communication. By leveraging Infrastructure as Code (IaC) with Terraform, the solution ensures consistent, scalable, and secure deployments for machine learning model training in SageMaker.

Todd Bernson

2024-12-09

Objective

This article explains how to set up the infrastructure required for SageMaker training, focusing on provisioning AWS resources using Terraform. We'll cover setting up S3 buckets, IAM roles, VPCs, and VPC endpoints to ensure secure and efficient training workflows.

Introduction to SageMaker Training Workflows

Amazon SageMaker provides an end-to-end machine learning platform that simplifies the process of building, training, and deploying ML models. SageMaker training workflows involve:

Data Storage: Input data stored in S3 buckets.
Compute Resources: Provisioned securely within a VPC for training and inference.
Model Artifacts: Saved back to S3 for further use or deployment.
Endpoints: Securely deployed for real-time inference; batch inference typically runs as a batch transform job rather than against a persistent endpoint.

Setting up the correct infrastructure is critical for scalability, security, and efficiency.

Terraform Configuration for SageMaker Infrastructure

1. S3 Buckets for Raw Data, Processed Data, and Model Artifacts

S3 buckets are essential for storing:

Raw data: Used for preprocessing.
Processed data: Cleaned and split data ready for training.
Model artifacts: Output of the training job.

Example Terraform Code:

module "sagemaker_s3_bucket" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "~> 4.2.1"

  bucket = "${local.environment}-sagemaker-${random_string.this.result}"

  attach_public_policy = true
  attach_policy        = true
  policy               = data.aws_iam_policy_document.sagemaker.json

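  # Block all public access; the bucket only needs to be reachable by the
  # SageMaker execution role. This assumes the attached policy is not public.
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true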

  control_object_ownership = true
  object_ownership         = "ObjectWriter"

  expected_bucket_owner = data.aws_caller_identity.current.account_id

  server_side_encryption_configuration = {
    rule = {
      apply_server_side_encryption_by_default = {
        sse_algorithm = "AES256"
      }
    }
  }

  tags = var.tags
}

resource "aws_s3_object" "sagemaker" {
  bucket = module.sagemaker_s3_bucket.s3_bucket_id
  key    = "${local.sagemaker_folders[0]}${local.csv_file_name}"
  source = local.csv_source

  depends_on = [aws_s3_object.sagemaker_folders]
}

resource "aws_s3_object" "sagemaker_folders" {
  for_each = toset(local.sagemaker_folders)

  bucket = module.sagemaker_s3_bucket.s3_bucket_id
  key    = each.key
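  # Upload an empty object as a folder placeholder. /dev/null exists only on
  # Unix-like systems, so this will not work when Terraform runs on Windows.
  source = "/dev/null"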
}
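The snippet above references several names it does not define. The following is a minimal sketch of the supporting declarations, not the author's actual configuration: the environment name, folder names, and CSV file name and path are all hypothetical placeholders, and `random_string` requires the hashicorp/random provider.

```hcl
# Hypothetical supporting declarations; every literal value is an assumption.

# Account ID used for expected_bucket_owner.
data "aws_caller_identity" "current" {}

# Random suffix that keeps the bucket name globally unique.
resource "random_string" "this" {
  length  = 8
  special = false
  upper   = false # S3 bucket names must be lowercase
}

locals {
  environment = "dev" # assumed environment name

  # Keys ending in "/" are displayed as folders in the S3 console.
  # The first entry is where the CSV file is uploaded.
  sagemaker_folders = ["raw/", "processed/", "model-artifacts/"]

  # Assumed file name and local path of the training data.
  csv_file_name = "training-data.csv"
  csv_source    = "${path.module}/data/training-data.csv"
}
```

With these values, the CSV object would be uploaded under the key `raw/training-data.csv`.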

2. IAM Roles and Policies for SageMaker Execution

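SageMaker requires an IAM execution role with permissions to access S3 and other AWS resources. The example below defines the role and its trust policy, which allows the SageMaker service to assume it; permissions such as S3 access must be attached to the role separately.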

Example Terraform Code:

resource "aws_iam_role" "sagemaker_execution_role" {
  name               = "${var.environment}_sagemaker_execution_role"
  assume_role_policy = data.aws_iam_policy_document.sagemaker_execution_role.json
}

data "aws_iam_policy_document" "sagemaker_execution_role" {
  statement {
    actions = ["sts:AssumeRole"]
    effect  = "Allow"
    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}
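The role above carries only a trust policy. One way to grant it access to the training bucket is an inline policy scoped to that bucket. This is a sketch under assumptions: the policy name and the exact set of S3 actions are illustrative, and a real training job may need additional permissions (for example CloudWatch Logs or ECR).

```hcl
# Hypothetical least-privilege S3 policy for the execution role.
data "aws_iam_policy_document" "sagemaker_s3_access" {
  # Listing applies to the bucket itself.
  statement {
    sid       = "ListBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [module.sagemaker_s3_bucket.s3_bucket_arn]
  }

  # Reading input data and writing model artifacts applies to objects.
  statement {
    sid       = "ReadWriteObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${module.sagemaker_s3_bucket.s3_bucket_arn}/*"]
  }
}

# Attach the policy inline to the execution role.
resource "aws_iam_role_policy" "sagemaker_s3_access" {
  name   = "${var.environment}_sagemaker_s3_access"
  role   = aws_iam_role.sagemaker_execution_role.id
  policy = data.aws_iam_policy_document.sagemaker_s3_access.json
}
```

Scoping the resources to a single bucket, rather than attaching a broad managed policy, keeps the role limited to the data it actually needs.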

3. VPC Setup for Secure SageMaker Endpoints

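Placing SageMaker resources in a VPC's private subnets lets them communicate without being directly exposed to the public internet. The configuration below also creates NAT gateways so that private subnets keep outbound internet access: one per availability zone when var.vpc_redundancy is true, otherwise a single shared gateway. VPC flow logs are enabled and retained in CloudWatch for 7 days.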

Terraform Configuration:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.14.0"

  azs                                             = local.availability_zones
  cidr                                            = var.vpc_cidr
  create_database_subnet_group                    = true
  create_flow_log_cloudwatch_iam_role             = true
  create_flow_log_cloudwatch_log_group            = true
  database_subnets                                = local.database_subnets
  enable_dhcp_options                             = true
  enable_dns_hostnames                            = true
  enable_dns_support                              = true
  enable_flow_log                                 = true
  enable_nat_gateway                              = true
  flow_log_cloudwatch_log_group_retention_in_days = 7
  flow_log_max_aggregation_interval               = 60
  name                                            = var.environment
  one_nat_gateway_per_az                          = var.vpc_redundancy
  private_subnet_suffix                           = "private"
  private_subnets                                 = local.private_subnets
  public_subnets                                  = local.public_subnets
  single_nat_gateway                              = !var.vpc_redundancy
  tags                                            = var.tags
}
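The module call depends on variables and locals that are not shown. A possible set of definitions is sketched below; the default CIDR, the choice of three availability zones, and the subnet numbering are assumptions made for illustration, and the variable blocks should be skipped if they are already declared elsewhere.

```hcl
# Hypothetical inputs for the VPC module; all defaults are assumptions.
variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "vpc_redundancy" {
  type        = bool
  default     = false
  description = "When true, create one NAT gateway per availability zone."
}

data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  # Assumes the region has at least three availability zones.
  availability_zones = slice(data.aws_availability_zones.available.names, 0, 3)

  # cidrsubnet(prefix, newbits, netnum): a /16 plus 8 new bits gives /24 subnets.
  # The offsets keep the three tiers in separate, non-overlapping ranges.
  private_subnets  = [for i in range(3) : cidrsubnet(var.vpc_cidr, 8, i)]
  public_subnets   = [for i in range(3) : cidrsubnet(var.vpc_cidr, 8, i + 10)]
  database_subnets = [for i in range(3) : cidrsubnet(var.vpc_cidr, 8, i + 20)]
}
```

With the assumed default CIDR, the private subnets would be 10.0.0.0/24 through 10.0.2.0/24, the public subnets 10.0.10.0/24 through 10.0.12.0/24, and the database subnets 10.0.20.0/24 through 10.0.22.0/24.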

4. Setting Up VPC Endpoints

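VPC endpoints let resources inside the VPC reach AWS services such as S3 and the SageMaker API privately, so that traffic does not cross the public internet. The configuration below creates a gateway endpoint for S3, which is attached to route tables, and interface endpoints for the SageMaker API and runtime, which are placed in the private subnets with private DNS enabled.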

Terraform Configuration:

module "vpc_endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = "~> 5.14.0"

  vpc_id = module.vpc.vpc_id

  endpoints = {
    s3 = {
      service         = "s3"
      service_type    = "Gateway"
      route_table_ids = local.vpc_route_tables
    }
    sagemaker_api = {
      service             = "sagemaker.api"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
    sagemaker_runtime = {
      service             = "sagemaker.runtime"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
  }
}
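The endpoint configuration refers to `local.vpc_route_tables`, which is not defined in the snippet, and it does not assign a security group to the interface endpoints. A sketch of both pieces follows; using only the private route tables and allowing HTTPS from the whole VPC CIDR are assumptions, and the security group name is hypothetical.

```hcl
# Hypothetical definitions supporting the VPC endpoints.
locals {
  # Route tables that should receive the S3 gateway endpoint route.
  # Assumes only private subnets need private access to S3.
  vpc_route_tables = module.vpc.private_route_table_ids
}

# Security group for the interface endpoints. AWS API calls use HTTPS,
# so only port 443 from inside the VPC is allowed.
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "${var.environment}-vpc-endpoints-"
  description = "Allow HTTPS to interface VPC endpoints from within the VPC"
  vpc_id      = module.vpc.vpc_id

  ingress {
    description = "HTTPS from inside the VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [module.vpc.vpc_cidr_block]
  }

  tags = var.tags
}
```

To apply the group, it can be passed to the endpoints module, for example as `security_group_ids = [aws_security_group.vpc_endpoints.id]`; check the module documentation for the version in use to confirm how that input is handled.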

Conclusion

Setting up the right infrastructure is crucial for SageMaker training workflows. By leveraging Terraform, we ensure reproducibility, scalability, and secure communication for all resources required for training machine learning models.

Todd Bernson

CTO
