Objective
This article explains how to set up the infrastructure required for SageMaker training, focusing on provisioning AWS resources using Terraform. We'll cover setting up S3 buckets, IAM roles, VPCs, and VPC endpoints to ensure secure and efficient training workflows.
Introduction to SageMaker Training Workflows
Amazon SageMaker provides an end-to-end machine learning platform that simplifies the process of building, training, and deploying ML models. SageMaker training workflows involve:
- Data Storage: Input data stored in S3 buckets.
- Compute Resources: Provisioned securely within a VPC for training and inference.
- Model Artifacts: Saved back to S3 for further use or deployment.
- Endpoints: Deployed inside the VPC for real-time inference; batch inference runs as batch transform jobs rather than through a persistent endpoint.
Setting up the correct infrastructure is critical for scalability, security, and efficiency.
Terraform Configuration for SageMaker Infrastructure
1. S3 Buckets for Raw Data, Processed Data, and Model Artifacts
S3 buckets are essential for storing:
- Raw data: Used for preprocessing.
- Processed data: Cleaned and split data ready for training.
- Model artifacts: Output of the training job.
Example Terraform Code:
module "sagemaker_s3_bucket" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "~> 4.2.1"

  bucket = "${local.environment}-sagemaker-${random_string.this.result}"

  # Attach the bucket policy defined in data.aws_iam_policy_document.sagemaker.
  attach_policy = true
  policy        = data.aws_iam_policy_document.sagemaker.json

  # Block all public access; SageMaker reaches the bucket through IAM.
  attach_public_policy    = true
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true

  control_object_ownership = true
  object_ownership         = "ObjectWriter"
  expected_bucket_owner    = data.aws_caller_identity.current.account_id

  # Encrypt objects at rest with S3-managed keys (SSE-S3).
  server_side_encryption_configuration = {
    rule = {
      apply_server_side_encryption_by_default = {
        sse_algorithm = "AES256"
      }
    }
  }

  tags = var.tags
}
# Upload the local CSV file under the first folder prefix.
resource "aws_s3_object" "sagemaker" {
  bucket     = module.sagemaker_s3_bucket.s3_bucket_id
  key        = "${local.sagemaker_folders[0]}${local.csv_file_name}"
  source     = local.csv_source
  depends_on = [aws_s3_object.sagemaker_folders]
}

# Create one empty placeholder object per folder prefix.
# Note: "/dev/null" exists only on Unix-like systems.
resource "aws_s3_object" "sagemaker_folders" {
  for_each = toset(local.sagemaker_folders)
  bucket   = module.sagemaker_s3_bucket.s3_bucket_id
  key      = each.key
  source   = "/dev/null"
}
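The snippet above references several values it does not define. The sketch below shows one way they could be declared. Every value here is an assumption made for illustration: the folder names, the CSV path, the `dev` environment name, and the contents of the bucket policy are hypothetical and not taken from the original configuration.

```hcl
# Random suffix to keep the bucket name globally unique.
# Lowercase only, because S3 bucket names may not contain uppercase letters.
resource "random_string" "this" {
  length  = 8
  special = false
  upper   = false
}

data "aws_caller_identity" "current" {}

locals {
  environment = "dev" # hypothetical environment name

  # Hypothetical folder prefixes; the trailing slash makes S3 show them as folders.
  sagemaker_folders = ["raw/", "processed/", "models/"]

  # Hypothetical sample file shipped alongside the Terraform code.
  csv_file_name = "train.csv"
  csv_source    = "${path.module}/data/train.csv"
}

# Hypothetical bucket policy: lets the SageMaker execution role list the
# bucket and read and write its objects.
data "aws_iam_policy_document" "sagemaker" {
  statement {
    sid     = "AllowSageMakerExecutionRole"
    effect  = "Allow"
    actions = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]

    principals {
      type        = "AWS"
      identifiers = [aws_iam_role.sagemaker_execution_role.arn]
    }

    # The ARNs are built from the bucket name rather than from module outputs.
    resources = [
      "arn:aws:s3:::${local.environment}-sagemaker-${random_string.this.result}",
      "arn:aws:s3:::${local.environment}-sagemaker-${random_string.this.result}/*",
    ]
  }
}
```

Because `sagemaker_folders[0]` is used as a key prefix for the uploaded CSV, the first entry in the list decides where the file lands (here, `raw/train.csv`).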
2. IAM Roles and Policies for SageMaker Execution
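SageMaker assumes an IAM execution role to read training data from S3, write model artifacts, and call other AWS services on your behalf. The code below defines the role and its trust policy, which allows the SageMaker service to assume it; the permissions the role grants are attached as separate policies.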
Example Terraform Code:
# Execution role that SageMaker assumes when it runs jobs.
resource "aws_iam_role" "sagemaker_execution_role" {
  name               = "${var.environment}_sagemaker_execution_role"
  assume_role_policy = data.aws_iam_policy_document.sagemaker_execution_role.json
}

# Trust policy: only the SageMaker service may assume the role.
data "aws_iam_policy_document" "sagemaker_execution_role" {
  statement {
    actions = ["sts:AssumeRole"]
    effect  = "Allow"

    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}
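The role above only carries a trust policy, so on its own it grants no access to S3. A minimal sketch of a permissions policy follows, assuming the bucket module from section 1 is named `sagemaker_s3_bucket`. The policy name and the exact set of actions are illustrative assumptions, not part of the original setup.

```hcl
# Hypothetical least-privilege S3 permissions for the execution role.
data "aws_iam_policy_document" "sagemaker_s3_access" {
  # Listing applies to the bucket itself.
  statement {
    sid       = "ListBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [module.sagemaker_s3_bucket.s3_bucket_arn]
  }

  # Reading input data and writing model artifacts applies to objects.
  statement {
    sid       = "ReadWriteObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${module.sagemaker_s3_bucket.s3_bucket_arn}/*"]
  }
}

# Attach the permissions to the role as an inline policy.
resource "aws_iam_role_policy" "sagemaker_s3_access" {
  name   = "${var.environment}_sagemaker_s3_access"
  role   = aws_iam_role.sagemaker_execution_role.id
  policy = data.aws_iam_policy_document.sagemaker_s3_access.json
}
```

A role used for real training jobs would likely need more than this, for example permissions for CloudWatch Logs, for pulling container images from ECR, and for managing network interfaces when the job runs inside a VPC. Scoping the S3 statements to a single bucket keeps the role from reading unrelated data.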
3. VPC Setup for Secure SageMaker Endpoints
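Running SageMaker resources in the private subnets of a VPC keeps them from being directly reachable from the public internet. Outbound traffic leaves only where it is explicitly routed, here through a NAT gateway.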
Terraform Configuration:
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.14.0"

  name = var.environment
  cidr = var.vpc_cidr
  azs  = local.availability_zones

  public_subnets        = local.public_subnets
  private_subnets       = local.private_subnets
  private_subnet_suffix = "private"
  database_subnets      = local.database_subnets

  create_database_subnet_group = true

  enable_dhcp_options  = true
  enable_dns_hostnames = true
  enable_dns_support   = true

  # One NAT gateway per AZ when redundancy is requested, otherwise a single shared one.
  enable_nat_gateway     = true
  one_nat_gateway_per_az = var.vpc_redundancy
  single_nat_gateway     = !var.vpc_redundancy

  # VPC flow logs delivered to CloudWatch Logs and retained for 7 days.
  enable_flow_log                                 = true
  create_flow_log_cloudwatch_iam_role             = true
  create_flow_log_cloudwatch_log_group            = true
  flow_log_cloudwatch_log_group_retention_in_days = 7
  flow_log_max_aggregation_interval               = 60

  tags = var.tags
}
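The VPC module relies on variables and locals that are not shown. One possible way to define them is sketched below. The default CIDR, the choice of three availability zones, and the subnet layout are assumptions for illustration; the slice fails in a region with fewer than three available zones.

```hcl
variable "vpc_cidr" {
  description = "CIDR block for the VPC (example default)"
  type        = string
  default     = "10.0.0.0/16"
}

variable "vpc_redundancy" {
  description = "true creates one NAT gateway per AZ, false a single shared one"
  type        = bool
  default     = false
}

data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  # Use the first three available zones in the current region.
  availability_zones = slice(data.aws_availability_zones.available.names, 0, 3)

  # cidrsubnet adds 4 bits to the VPC prefix, so a /16 yields /20 subnets.
  # Each tier uses its own range of subnet numbers to avoid overlap.
  public_subnets   = [for i, az in local.availability_zones : cidrsubnet(var.vpc_cidr, 4, i)]
  private_subnets  = [for i, az in local.availability_zones : cidrsubnet(var.vpc_cidr, 4, i + 4)]
  database_subnets = [for i, az in local.availability_zones : cidrsubnet(var.vpc_cidr, 4, i + 8)]
}
```

With the default `10.0.0.0/16`, the public tier starts at `10.0.0.0/20`, the private tier at `10.0.64.0/20`, and the database tier at `10.0.128.0/20`. Deriving subnets from the VPC CIDR means a different `vpc_cidr` produces a consistent layout without editing each subnet by hand.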
4. Setting Up VPC Endpoints
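VPC endpoints let resources inside the VPC reach AWS services privately instead of over the internet. The configuration below creates a gateway endpoint for S3 and interface endpoints for the SageMaker API and the SageMaker runtime.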
Terraform Configuration:
module "vpc_endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = "~> 5.14.0"

  vpc_id = module.vpc.vpc_id

  endpoints = {
    # Gateway endpoint: adds S3 routes to the listed route tables.
    s3 = {
      service         = "s3"
      service_type    = "Gateway"
      route_table_ids = local.vpc_route_tables
    }

    # Interface endpoints: network interfaces in the private subnets,
    # resolved through private DNS.
    sagemaker_api = {
      service             = "sagemaker.api"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }

    sagemaker_runtime = {
      service             = "sagemaker.runtime"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
  }
}
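The endpoint configuration references `local.vpc_route_tables`, which is not defined above, and the interface endpoints do not name a security group. The sketch below fills both gaps under stated assumptions: the route table selection and the security group (including its name and rule) are hypothetical choices, not part of the original configuration.

```hcl
locals {
  # Assumption: only the private route tables need a route to the S3 gateway endpoint.
  vpc_route_tables = module.vpc.private_route_table_ids
}

# Hypothetical security group for the interface endpoints:
# allows HTTPS from any address inside the VPC.
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "${var.environment}-vpc-endpoints-"
  description = "Allow HTTPS from inside the VPC to interface endpoints"
  vpc_id      = module.vpc.vpc_id

  ingress {
    description = "HTTPS from the VPC CIDR"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [module.vpc.vpc_cidr_block]
  }

  tags = var.tags
}
```

To use the security group, `security_group_ids = [aws_security_group.vpc_endpoints.id]` could be passed to the `vpc_endpoints` module. Interface endpoints serve AWS API traffic over HTTPS, so port 443 is the only ingress rule this sketch opens.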
Conclusion
Setting up the right infrastructure is crucial for SageMaker training workflows. By leveraging Terraform, we ensure reproducibility, scalability, and secure communication for all resources required for training machine learning models.