Setting Up Infrastructure for SageMaker Training
todd-bernson-leadership

Objective

This article explains how to provision the AWS infrastructure that SageMaker training depends on, using Terraform. It covers S3 buckets, IAM roles, a VPC, and VPC endpoints, which together keep training workflows secure and efficient.


Introduction to SageMaker Training Workflows

Amazon SageMaker provides an end-to-end machine learning platform that simplifies the process of building, training, and deploying ML models. SageMaker training workflows involve:

  1. Data Storage: Input data stored in S3 buckets.
  2. Compute Resources: Provisioned securely within a VPC for training and inference.
  3. Model Artifacts: Saved back to S3 for further use or deployment.
  4. Endpoints: Securely deployed for real-time inference; batch inference runs as batch transform jobs rather than through an endpoint.

Setting up the correct infrastructure is critical for scalability, security, and efficiency.


Terraform Configuration for SageMaker Infrastructure

1. S3 Buckets for Raw Data, Processed Data, and Model Artifacts

S3 buckets are essential for storing:

  • Raw data: Used for preprocessing.
  • Processed data: Cleaned and split data ready for training.
  • Model artifacts: Output of the training job.

Example Terraform Code:

module "sagemaker_s3_bucket" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "~> 4.2.1"

  bucket = "${local.environment}-sagemaker-${random_string.this.result}"

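  # Manage the bucket's public access block and attach the bucket policy.
  attach_public_policy = true
  attach_policy        = true
  policy               = data.aws_iam_policy_document.sagemaker.json

  # Block every form of public access; training data should stay private.
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true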

  control_object_ownership = true
  object_ownership         = "ObjectWriter"

  expected_bucket_owner = data.aws_caller_identity.current.account_id

  server_side_encryption_configuration = {
    rule = {
      apply_server_side_encryption_by_default = {
        sse_algorithm = "AES256"
      }
    }
  }

  tags = var.tags
}

resource "aws_s3_object" "sagemaker" {
  bucket = module.sagemaker_s3_bucket.s3_bucket_id
  key    = "${local.sagemaker_folders[0]}${local.csv_file_name}"
  source = local.csv_source

  depends_on = [aws_s3_object.sagemaker_folders]
}

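# Create one zero-byte object per prefix so each folder shows up in the S3 console.
resource "aws_s3_object" "sagemaker_folders" {
  for_each = toset(local.sagemaker_folders)

  bucket  = module.sagemaker_s3_bucket.s3_bucket_id
  key     = each.key
  content = "" # empty body; avoids relying on /dev/null, which exists only on Unix-like systems
}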
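
The bucket configuration above references several values that the article does not show. The following is a minimal sketch of what they might look like. The folder names, the CSV file name and path, and the random suffix settings are all assumptions chosen for illustration, not the author's actual values.

```hcl
# Hypothetical supporting definitions; every name and value here is an assumption.
variable "environment" {
  type        = string
  description = "Deployment environment name, for example dev or prod"
}

# Random suffix that keeps the bucket name globally unique.
resource "random_string" "this" {
  length  = 8
  special = false
  upper   = false # S3 bucket names must be lowercase
}

# Supplies the account ID used for expected_bucket_owner.
data "aws_caller_identity" "current" {}

locals {
  environment = var.environment

  # Keys ending in "/" are displayed as folders in the S3 console.
  # The first entry is where the raw CSV file is uploaded.
  sagemaker_folders = ["raw/", "processed/", "models/"]

  csv_file_name = "training_data.csv"
  csv_source    = "${path.module}/data/${local.csv_file_name}"
}
```

With these values, the `aws_s3_object.sagemaker` resource would upload the file to the key `raw/training_data.csv`.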

2. IAM Roles and Policies for SageMaker Execution

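SageMaker assumes an IAM execution role to access S3 and other AWS resources on your behalf. The trust policy below allows the SageMaker service to assume the role; the permissions the role needs, such as S3 access, are attached to it separately.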

Example Terraform Code:

resource "aws_iam_role" "sagemaker_execution_role" {
  name               = "${var.environment}_sagemaker_execution_role"
  assume_role_policy = data.aws_iam_policy_document.sagemaker_execution_role.json
}

# Trust policy: allows the SageMaker service to assume the execution role.
data "aws_iam_policy_document" "sagemaker_execution_role" {
  statement {
    actions = ["sts:AssumeRole"]
    effect  = "Allow"
    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
  }
}
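
The role above only defines who may assume it. As a hedged sketch of how S3 permissions could be granted, the following attaches an inline policy scoped to the bucket created earlier. The resource names `sagemaker_s3_access` and the exact list of actions are assumptions; a real training role typically also needs CloudWatch Logs permissions and, for jobs that run inside a VPC, permissions to manage EC2 network interfaces, which are omitted here.

```hcl
# Hypothetical permissions policy; names and action list are assumptions.
data "aws_iam_policy_document" "sagemaker_s3_access" {
  # Listing applies to the bucket itself.
  statement {
    sid       = "ListBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [module.sagemaker_s3_bucket.s3_bucket_arn]
  }

  # Reading input data and writing model artifacts applies to objects.
  statement {
    sid       = "ReadWriteObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${module.sagemaker_s3_bucket.s3_bucket_arn}/*"]
  }
}

# Inline policy attached to the execution role defined above.
resource "aws_iam_role_policy" "sagemaker_s3_access" {
  name   = "${var.environment}_sagemaker_s3_access"
  role   = aws_iam_role.sagemaker_execution_role.id
  policy = data.aws_iam_policy_document.sagemaker_s3_access.json
}
```

Scoping the resources to a single bucket ARN, rather than using a wildcard, keeps the role limited to the data it actually needs.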

3. VPC Setup for Secure SageMaker Endpoints

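Running SageMaker resources in the private subnets of a VPC keeps their traffic off the public internet. The module below creates public, private, and database subnets, NAT gateways for outbound access, and VPC flow logs retained in CloudWatch for 7 days.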

Terraform Configuration:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.14.0"

  azs                                             = local.availability_zones
  cidr                                            = var.vpc_cidr
  create_database_subnet_group                    = true
  create_flow_log_cloudwatch_iam_role             = true
  create_flow_log_cloudwatch_log_group            = true
  database_subnets                                = local.database_subnets
  enable_dhcp_options                             = true
  enable_dns_hostnames                            = true
  enable_dns_support                              = true
  enable_flow_log                                 = true
  enable_nat_gateway                              = true
  flow_log_cloudwatch_log_group_retention_in_days = 7
  flow_log_max_aggregation_interval               = 60
  name                                            = var.environment
  one_nat_gateway_per_az                          = var.vpc_redundancy
  private_subnet_suffix                           = "private"
  private_subnets                                 = local.private_subnets
  public_subnets                                  = local.public_subnets
  single_nat_gateway                              = !var.vpc_redundancy
  tags                                            = var.tags
}
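
The VPC module depends on locals and variables that are not shown in the article. One possible way to derive them is sketched below. It assumes a /16 VPC CIDR, a region with at least three availability zones, and an arbitrary subnet numbering scheme; none of these come from the original configuration.

```hcl
# Hypothetical inputs for the VPC module; defaults and offsets are assumptions.
variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "vpc_redundancy" {
  type        = bool
  default     = false
  description = "true: one NAT gateway per AZ; false: a single shared NAT gateway"
}

data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  # Use the first three AZs that the region reports as available.
  availability_zones = slice(data.aws_availability_zones.available.names, 0, 3)

  # Carve /24 subnets out of the /16, one per AZ and tier.
  # The offsets 0, 10, and 20 keep the three tiers in separate ranges.
  public_subnets   = [for i, az in local.availability_zones : cidrsubnet(var.vpc_cidr, 8, i)]
  private_subnets  = [for i, az in local.availability_zones : cidrsubnet(var.vpc_cidr, 8, i + 10)]
  database_subnets = [for i, az in local.availability_zones : cidrsubnet(var.vpc_cidr, 8, i + 20)]
}
```

With the default CIDR, this yields public subnets starting at 10.0.0.0/24, private subnets starting at 10.0.10.0/24, and database subnets starting at 10.0.20.0/24.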

4. Setting Up VPC Endpoints

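VPC endpoints let resources in the VPC reach AWS services such as S3 and the SageMaker API privately, without sending traffic over the internet. S3 uses a gateway endpoint attached to route tables, while the SageMaker API and runtime use interface endpoints placed in the private subnets.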

Terraform Configuration:

module "vpc_endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = "~> 5.14.0"

  vpc_id = module.vpc.vpc_id

  endpoints = {
    s3 = {
      service         = "s3"
      service_type    = "Gateway"
      route_table_ids = local.vpc_route_tables
    }
    sagemaker_api = {
      service             = "sagemaker.api"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
    sagemaker_runtime = {
      service             = "sagemaker.runtime"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
  }
}
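
The endpoint module refers to `local.vpc_route_tables`, which the article does not define, and the interface endpoints fall back to the VPC's default security group when none is supplied. The sketch below shows one way both could be handled. The security group name and the choice of route tables are assumptions for illustration.

```hcl
# Hypothetical supporting resources for the endpoints; names are assumptions.
locals {
  # Route tables that should receive a route to the S3 gateway endpoint.
  vpc_route_tables = concat(
    module.vpc.private_route_table_ids,
    module.vpc.public_route_table_ids
  )
}

# Security group for the interface endpoints: HTTPS from inside the VPC only.
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "${var.environment}-vpc-endpoints-"
  description = "Allow HTTPS from inside the VPC to interface endpoints"
  vpc_id      = module.vpc.vpc_id

  ingress {
    description = "HTTPS from the VPC CIDR"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [module.vpc.vpc_cidr_block]
  }

  tags = var.tags
}
```

To use the security group, the endpoints module could be given `security_group_ids = [aws_security_group.vpc_endpoints.id]` alongside `vpc_id`.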

Conclusion

Setting up the right infrastructure is crucial for SageMaker training workflows. With the S3 bucket, execution role, VPC, and VPC endpoints defined in Terraform, the environment is reproducible, scales with your needs, and keeps communication between training resources private.
