Automating the Training Pipeline with Terraform and SageMaker

Learn how to automate the end-to-end SageMaker ML pipeline with Terraform and the SageMaker SDK. This guide covers S3 data uploads, training job configurations, real-time endpoint deployments, and the seamless integration of infrastructure provisioning and model workflows. Achieve scalability, reproducibility, and cost-efficiency by combining Terraform's infrastructure as code and SageMaker's machine learning capabilities.

Todd Bernson

2025-01-14

Objective

This article demonstrates how to automate the entire SageMaker training and deployment pipeline using Terraform and the SageMaker SDK. We will cover data uploads, model training, endpoint deployment, and the benefits of automation for scalability and reproducibility.

Combining Terraform and Python SDK for ML Pipelines

By integrating Terraform and the SageMaker SDK, we can achieve a fully automated ML pipeline. Terraform provisions and configures AWS resources, while the SageMaker SDK handles training and inference workflows programmatically.

Key Components of the Pipeline

Terraform:
- Configures S3 buckets, SageMaker endpoints, and IAM roles.
- Automates resource creation and updates.
SageMaker SDK:
- Prepares data.
- Initiates training jobs and endpoint deployments.

Automating the Pipeline

1. Data Upload to S3

The first step in the pipeline is to upload training and testing data to an S3 bucket provisioned by Terraform.

Code Example:

import boto3

s3 = boto3.client('s3')
# Bucket provisioned by Terraform
bucket_name = 'telco-machinelearning-churn-sagemaker-jhug'

# Map each local file path to its destination key in the bucket
files_to_upload = {
    'train_combined.csv': 'data/train_combined.csv',
    'test_features.csv': 'data/test_features.csv',
    'test_labels.csv': 'data/test_labels.csv'
}

for local_file, s3_key in files_to_upload.items():
    s3.upload_file(local_file, bucket_name, s3_key)
    print(f"Uploaded {local_file} to s3://{bucket_name}/{s3_key}")
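
Before calling upload_file, it can help to confirm that every local file exists and to compute the S3 URIs that later steps will read. The sketch below is one possible way to do that. plan_uploads and its base_dir parameter are hypothetical names introduced for illustration; they are not part of boto3 or of the original script, and the function makes no AWS calls.

```python
from pathlib import Path


def plan_uploads(files_to_upload, bucket_name, base_dir="."):
    """Return (local_path, s3_uri) pairs, failing early if a local file is missing.

    Hypothetical helper: it only inspects the local filesystem.
    """
    base = Path(base_dir)
    missing = [name for name in files_to_upload if not (base / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing local files: {', '.join(sorted(missing))}")
    return [
        (str(base / name), f"s3://{bucket_name}/{key}")
        for name, key in files_to_upload.items()
    ]
```

The returned URIs can then be passed to the training step instead of being typed out a second time.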

2. Model Training

Training jobs are configured and executed using the SageMaker SDK.

Code Example:

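from sagemaker.inputs import TrainingInput

# xgb is a SageMaker estimator configured earlier (not shown here).
# fit() blocks until the training job finishes and raises if the job fails.
xgb.fit({
    'train': TrainingInput(
        s3_data='s3://telco-machinelearning-churn-sagemaker-jhug/data/train_combined.csv',
        content_type='text/csv'
    )
})
print("Training job completed successfully.")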
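
The snippet above uses an estimator named xgb without showing how it is built. As a minimal sketch, assuming version 2 of the SageMaker Python SDK and the built-in XGBoost container, it might be configured as follows. The role ARN, region, container version, instance type, output prefix, and hyperparameters are all assumptions chosen for illustration, not values from the original pipeline.

```python
def estimator_settings(bucket_name, role_arn):
    """Collect the estimator arguments in one place (all values are assumptions)."""
    return {
        "role": role_arn,
        "instance_count": 1,
        "instance_type": "ml.m5.large",
        "output_path": f"s3://{bucket_name}/output",
    }


def build_estimator(settings, region, xgboost_version="1.7-1"):
    """Create an Estimator for the built-in XGBoost container."""
    # Imported lazily so the settings can be inspected without the SDK installed.
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    image_uri = image_uris.retrieve(
        framework="xgboost", region=region, version=xgboost_version
    )
    xgb = Estimator(image_uri=image_uri, **settings)
    # Hypothetical hyperparameters for a binary churn classifier.
    xgb.set_hyperparameters(objective="binary:logistic", num_round=100)
    return xgb
```

Keeping the settings in a plain dictionary makes it easy to pass in the role ARN and bucket name that Terraform created rather than hardcoding them.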

3. Endpoint Deployment

After training, the model is deployed to a SageMaker endpoint for real-time inference.

Terraform Snippet for Endpoint Setup:

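# Assumes aws_sagemaker_endpoint_configuration.this is defined elsewhere
# in the same Terraform configuration.
resource "aws_sagemaker_endpoint" "this" {
    endpoint_config_name = aws_sagemaker_endpoint_configuration.this.name
}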

Code Example:

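from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Deploy the trained model behind a real-time endpoint
predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer()
)
print(f"Endpoint deployed: {predictor.endpoint_name}")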
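
Once the endpoint is in service, it can also be called without the predictor object, for example from another application. The sketch below uses the boto3 sagemaker-runtime client for this. rows_to_csv and invoke are hypothetical helpers, and the sketch assumes the model accepts headerless CSV rows; the shape of the response body depends on the container and is returned here as raw text.

```python
def rows_to_csv(rows):
    """Serialize feature rows as CSV text, one line per row, without a header."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def invoke(endpoint_name, rows):
    """Send rows to the endpoint and return the raw response body as text."""
    import boto3  # imported lazily so rows_to_csv can be used without AWS access

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=rows_to_csv(rows),
    )
    return response["Body"].read().decode("utf-8")
```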

Python Script for Invoking Terraform Commands

Automation is extended by invoking Terraform commands programmatically from Python, ensuring a seamless pipeline from resource provisioning to model deployment.

Code Example:

import subprocess

def run_terraform(command):
    """Run a Terraform subcommand in the current directory and fail loudly on error."""
    result = subprocess.run(
        ['terraform'] + command.split(),
        capture_output=True,
        text=True
    )
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr)
        raise RuntimeError(
            f"terraform {command} failed with exit code {result.returncode}"
        )

# Initialize and apply Terraform configurations
run_terraform("init")
run_terraform("apply -auto-approve")
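
After the apply step, the Python side usually needs the names of the resources Terraform created. One way to get them is to parse the result of terraform output -json, as sketched below. This assumes the Terraform configuration declares output blocks; the bucket_name output used in the usage note is a hypothetical example, and both function names are introduced here for illustration.

```python
import json
import subprocess


def parse_terraform_outputs(raw_json):
    """Flatten `terraform output -json` into a {name: value} dict."""
    return {name: entry["value"] for name, entry in json.loads(raw_json).items()}


def read_terraform_outputs():
    """Run `terraform output -json` in the current directory and parse it."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_terraform_outputs(result.stdout)
```

With an output named bucket_name declared, read_terraform_outputs()["bucket_name"] would supply the bucket for the upload step, so the name does not have to be repeated in the Python script.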

Benefits of Automation for Scaling and Reproducibility

Scalability:
- Scale training jobs and endpoints by changing instance types and counts in configuration rather than by hand.
- Quickly adjust configurations for different datasets or models.
Reproducibility:
- Terraform ensures infrastructure consistency across environments.
- Automating the pipeline reduces manual errors and ensures repeatable workflows.
Cost Efficiency:
- Resources are provisioned only when needed and terminated after use.
Flexibility:
- Easily integrate additional steps, such as data preprocessing or post-processing.
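
The cost point above depends on endpoints actually being removed when they are no longer needed. As a minimal sketch, assuming the endpoint was created with the SDK's deploy() call and is only needed for a short job such as batch evaluation, a context manager can guarantee the cleanup. temporary_endpoint is a hypothetical helper; if the endpoint is managed by Terraform instead, terraform destroy would be the matching cleanup step.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(predictor):
    """Yield the predictor, then delete its endpoint even if an error occurs."""
    try:
        yield predictor
    finally:
        # Predictor.delete_endpoint() removes the endpoint and stops its billing.
        predictor.delete_endpoint()
```

Used as `with temporary_endpoint(xgb.deploy(...)) as predictor:`, the endpoint exists only for the duration of the block.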

By combining Terraform and the SageMaker SDK, we created an automated, scalable, and reproducible pipeline for training and deploying ML models. Automation streamlines workflows, reduces manual effort, and ensures consistency across environments.

Todd Bernson

CTO
