Setting Up Infrastructure for SageMaker Training
This article outlines the process of automating the setup of infrastructure for Amazon SageMaker training workflows using Terraform. It highlights the key components, including S3 buckets for data storage, IAM roles for secure access, VPCs for network security, and VPC endpoints for private communication. By leveraging Infrastructure as Code (IaC) with Terraform, the solution ensures consistent, scalable, and secure deployments for machine learning model training in SageMaker.

Todd Bernson
2024-12-09

Objective
This article explains how to set up the infrastructure required for SageMaker training, focusing on provisioning AWS resources using Terraform. We'll cover setting up S3 buckets, IAM roles, VPCs, and VPC endpoints to ensure secure and efficient training workflows.
Introduction to SageMaker Training Workflows
Amazon SageMaker provides an end-to-end machine learning platform that simplifies the process of building, training, and deploying ML models. SageMaker training workflows involve:
- Data Storage: Input data stored in S3 buckets.
- Compute Resources: Provisioned securely within a VPC for training and inference.
- Model Artifacts: Saved back to S3 for further use or deployment.
- Endpoints: Securely deployed for real-time inference; batch inference runs as batch transform jobs rather than through a persistent endpoint.
Setting up the correct infrastructure is critical for scalability, security, and efficiency.
Terraform Configuration for SageMaker Infrastructure
1. S3 Buckets for Raw Data, Processed Data, and Model Artifacts
S3 buckets are essential for storing:
- Raw data: Used for preprocessing.
- Processed data: Cleaned and split data ready for training.
- Model artifacts: Output of the training job.
Example Terraform Code:
module "sagemaker_s3_bucket" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "~> 4.2.1"

  bucket = "${local.environment}-sagemaker-${random_string.this.result}"

  # Attach the bucket policy defined in data.aws_iam_policy_document.sagemaker.
  attach_policy = true
  policy        = data.aws_iam_policy_document.sagemaker.json

  # Keep all four public access blocks enabled; training data should not be public.
  attach_public_policy    = true
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true

  control_object_ownership = true
  object_ownership         = "ObjectWriter"
  expected_bucket_owner    = data.aws_caller_identity.current.account_id

  # Encrypt objects at rest with S3-managed keys (SSE-S3).
  server_side_encryption_configuration = {
    rule = {
      apply_server_side_encryption_by_default = {
        sse_algorithm = "AES256"
      }
    }
  }

  tags = var.tags
}

# Upload the training CSV into the first folder once the folders exist.
resource "aws_s3_object" "sagemaker" {
  bucket = module.sagemaker_s3_bucket.s3_bucket_id
  key    = "${local.sagemaker_folders[0]}${local.csv_file_name}"
  source = local.csv_source

  depends_on = [aws_s3_object.sagemaker_folders]
}

# Create one zero-byte object per folder prefix.
# Note: /dev/null exists only on Unix-like systems.
resource "aws_s3_object" "sagemaker_folders" {
  for_each = toset(local.sagemaker_folders)

  bucket = module.sagemaker_s3_bucket.s3_bucket_id
  key    = each.key
  source = "/dev/null"
}
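The configuration above references several values the article does not show: `local.environment`, `local.sagemaker_folders`, `local.csv_file_name`, `local.csv_source`, `random_string.this`, `data.aws_caller_identity.current`, and `data.aws_iam_policy_document.sagemaker`. A minimal sketch of what they might look like follows. Every value here (folder names, file paths, the policy's actions and principal) is an assumption for illustration, not the author's actual setup.

```hcl
# Hypothetical supporting definitions. The names match the references in the
# bucket configuration, but all values are assumptions made for illustration.
variable "environment" {
  type    = string
  default = "dev"
}

data "aws_caller_identity" "current" {}

# Random suffix to keep the bucket name globally unique.
resource "random_string" "this" {
  length  = 8
  special = false
  upper   = false # S3 bucket names must be lowercase
}

locals {
  environment       = var.environment
  bucket_name       = "${local.environment}-sagemaker-${random_string.this.result}"
  sagemaker_folders = ["raw/", "processed/", "models/"]
  csv_file_name     = "training_data.csv"
  csv_source        = "${path.module}/data/training_data.csv"
}

# Assumed bucket policy: only the SageMaker execution role may list the
# bucket and read or write its objects.
data "aws_iam_policy_document" "sagemaker" {
  statement {
    effect  = "Allow"
    actions = ["s3:ListBucket", "s3:GetObject", "s3:PutObject"]

    principals {
      type        = "AWS"
      identifiers = [aws_iam_role.sagemaker_execution_role.arn]
    }

    # The ARN is built from the bucket name rather than a module output,
    # because this policy is itself an input to the bucket module.
    resources = [
      "arn:aws:s3:::${local.bucket_name}",
      "arn:aws:s3:::${local.bucket_name}/*",
    ]
  }
}
```

Because the first entry of `local.sagemaker_folders` is `raw/` in this sketch, the CSV would land at `raw/training_data.csv`.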
2. IAM Roles and Policies for SageMaker Execution
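SageMaker assumes an IAM execution role to access S3 and other AWS resources on your behalf. The example below creates that role together with its trust policy, which allows the SageMaker service to assume it; permissions such as S3 access are granted by attaching policies to the role.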
Example Terraform Code:
resource "aws_iam_role" "sagemaker_execution_role" {
  name               = "${var.environment}_sagemaker_execution_role"
  assume_role_policy = data.aws_iam_policy_document.sagemaker_execution_role.json
}

# Trust policy: allow the SageMaker service to assume this role.
data "aws_iam_policy_document" "sagemaker_execution_role" {
  statement {
    actions = ["sts:AssumeRole"]
    effect  = "Allow"

    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}
```
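The role above carries only a trust policy, so it cannot yet read or write anything. One way to grant the S3 access described in this section is an inline policy scoped to the training bucket. This is a sketch under assumptions: the policy name, statement IDs, and action list are hypothetical, and a real training job would likely need further permissions (for example CloudWatch Logs and ECR image pulls).

```hcl
# Hypothetical permissions policy: access limited to the training bucket.
data "aws_iam_policy_document" "sagemaker_s3_access" {
  # Listing applies to the bucket itself.
  statement {
    sid       = "ListTrainingBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [module.sagemaker_s3_bucket.s3_bucket_arn]
  }

  # Reads and writes apply to the objects inside the bucket.
  statement {
    sid       = "ReadWriteTrainingObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${module.sagemaker_s3_bucket.s3_bucket_arn}/*"]
  }
}

# Attach the policy inline to the execution role.
resource "aws_iam_role_policy" "sagemaker_s3_access" {
  name   = "${var.environment}_sagemaker_s3_access"
  role   = aws_iam_role.sagemaker_execution_role.id
  policy = data.aws_iam_policy_document.sagemaker_s3_access.json
}
```

Splitting bucket-level and object-level actions into separate statements keeps each action paired with the resource type it applies to.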
3. VPC Setup for Secure SageMaker Endpoints
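Placing SageMaker resources in the private subnets of a VPC keeps them from being directly reachable from the public internet. The configuration below also enables a NAT gateway for outbound access and VPC flow logs for auditing network traffic.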
Terraform Configuration:
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.14.0"

  name = var.environment
  cidr = var.vpc_cidr
  azs  = local.availability_zones

  private_subnets              = local.private_subnets
  private_subnet_suffix        = "private"
  public_subnets               = local.public_subnets
  database_subnets             = local.database_subnets
  create_database_subnet_group = true

  enable_dhcp_options  = true
  enable_dns_hostnames = true
  enable_dns_support   = true

  enable_nat_gateway     = true
  one_nat_gateway_per_az = var.vpc_redundancy
  single_nat_gateway     = !var.vpc_redundancy

  enable_flow_log                                 = true
  create_flow_log_cloudwatch_iam_role             = true
  create_flow_log_cloudwatch_log_group            = true
  flow_log_cloudwatch_log_group_retention_in_days = 7
  flow_log_max_aggregation_interval               = 60

  tags = var.tags
}
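The VPC module takes its CIDR, Availability Zones, and subnet lists from variables and locals that are not shown. A possible way to define them is sketched below; the default CIDR, the choice of three Availability Zones, and the subnet numbering are all assumptions, and the region must offer at least three Availability Zones for `slice` to succeed.

```hcl
# Hypothetical inputs for the VPC module; names match the references above.
variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "vpc_redundancy" {
  type    = bool
  default = false # false => a single shared NAT gateway
}

data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  # Use the first three Availability Zones in the region.
  availability_zones = slice(data.aws_availability_zones.available.names, 0, 3)

  # Add 8 bits to the VPC prefix; with a /16 VPC this yields /24 subnets.
  # Offsets of 0, 10, and 20 keep the three tiers from overlapping.
  private_subnets  = [for i in range(3) : cidrsubnet(var.vpc_cidr, 8, i)]
  public_subnets   = [for i in range(3) : cidrsubnet(var.vpc_cidr, 8, i + 10)]
  database_subnets = [for i in range(3) : cidrsubnet(var.vpc_cidr, 8, i + 20)]
}
```

With the default CIDR, the private subnets would be `10.0.0.0/24` through `10.0.2.0/24`, the public subnets `10.0.10.0/24` through `10.0.12.0/24`, and the database subnets `10.0.20.0/24` through `10.0.22.0/24`.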
4. Setting Up VPC Endpoints
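VPC endpoints let resources in the VPC reach AWS services over private connectivity rather than the public internet. The configuration below creates a gateway endpoint for S3 and interface endpoints for the SageMaker API and runtime.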
Terraform Configuration:
module "vpc_endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = "~> 5.14.0"

  vpc_id = module.vpc.vpc_id

  endpoints = {
    s3 = {
      service         = "s3"
      service_type    = "Gateway"
      route_table_ids = local.vpc_route_tables
    }
    sagemaker_api = {
      service             = "sagemaker.api"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
    sagemaker_runtime = {
      service             = "sagemaker.runtime"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
  }
}
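The endpoint configuration references `local.vpc_route_tables`, which is not defined in the article, and it does not assign a security group to the interface endpoints. The sketch below shows one plausible definition of each. The choice of route tables, the security group name, and the ingress rule are assumptions rather than the author's configuration.

```hcl
locals {
  # Assumed: only the private route tables need the S3 gateway route,
  # since training runs in the private subnets.
  vpc_route_tables = module.vpc.private_route_table_ids
}

# Hypothetical security group for the interface endpoints:
# accept HTTPS only from addresses inside the VPC.
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "${var.environment}-vpc-endpoints-"
  description = "Allow HTTPS from the VPC to interface endpoints"
  vpc_id      = module.vpc.vpc_id

  ingress {
    description = "HTTPS from within the VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [module.vpc.vpc_cidr_block]
  }

  tags = var.tags
}
```

The group could then be passed to the endpoints module through its `security_group_ids` argument, for example `security_group_ids = [aws_security_group.vpc_endpoints.id]`.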
Conclusion
Setting up the right infrastructure is crucial for SageMaker training workflows. By leveraging Terraform, we ensure reproducibility, scalability, and secure communication for all resources required for training machine learning models.