Training a Model in SageMaker
This article demonstrates how to train an XGBoost model with AWS SageMaker and the SageMaker SDK. It uses SageMaker's prebuilt XGBoost container, sets hyperparameters such as max_depth, eta, and num_round, reads the training data directly from an S3 bucket, and stores the model artifacts in S3 after training. The job is configured programmatically, and training metrics such as AUC-PR can be monitored in SageMaker Studio.

Todd Bernson
2024-12-16

Objective
This article walks through training an XGBoost model in SageMaker using the SageMaker SDK. We'll cover how to configure the XGBoost container, set hyperparameters, prepare input data, and execute the training job.
Introduction to XGBoost in SageMaker
XGBoost is a highly efficient and scalable gradient boosting algorithm for supervised learning tasks like classification and regression. AWS SageMaker provides prebuilt XGBoost containers optimized for distributed training and inference.
We use the SageMaker SDK to configure and run the training job programmatically.
Retrieving the XGBoost Container
To use XGBoost in SageMaker, we retrieve the optimized container image for the AWS region and specific version.
Code Example:
from sagemaker import image_uris, Session

# Get the XGBoost container image
container = image_uris.retrieve(
    framework='xgboost',
    region=Session().boto_region_name,
    version='1.5-1'
)
print(f"XGBoost container image: {container}")
The container image ensures SageMaker can load XGBoost and run the training job seamlessly.
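As a quick sanity check, the returned URI can be split into its parts to confirm that the region and version match what was requested. The sketch below is a minimal illustration that assumes the usual ECR URI shape; the helper name `parse_image_uri` and the example URI (including its placeholder account ID) are hypothetical and not taken from the SDK.

```python
import re

# Generic shape of an ECR image URI:
#   <account>.dkr.ecr.<region>.amazonaws.com[.cn]/<repository>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?"
    r"/(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_image_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a recognizable ECR image URI: {uri}")
    return match.groupdict()


# Hypothetical URI for illustration only; the account ID is a placeholder.
example = "123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1"
parts = parse_image_uri(example)
print(parts["region"], parts["repository"], parts["tag"])
```

In a notebook, the same helper could be applied to the `container` value to compare `parts["region"]` with the session's region.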
Defining Hyperparameters
Hyperparameters control the XGBoost training process and performance. For this task, the following hyperparameters were selected:
- objective: binary:logistic for binary classification.
- num_round: Number of boosting rounds.
- max_depth: Maximum depth of a tree.
- eta: Learning rate.
- subsample and colsample_bytree: Fractions of training rows and feature columns, respectively, that are sampled when building each tree.
Code Example:
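import sagemaker
from sagemaker.estimator import Estimator

# Configure the training job: container, IAM role, compute, and output location
xgb = Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    # ml.p3.2xlarge is a GPU instance; XGBoost only uses the GPU
    # when tree_method='gpu_hist' is also set as a hyperparameter
    instance_type='ml.p3.2xlarge',
    output_path='s3://telco-machinelearning-churn-sagemaker-jhug/models',
    sagemaker_session=sagemaker.Session()
)

xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=700,
    # Early stopping is meant to track a validation score;
    # this job supplies only a 'train' channel
    early_stopping_rounds=20,
    max_depth=8,
    eta=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    gamma=2,
    min_child_weight=5,
    alpha=0.1,
    eval_metric='aucpr'
)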
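Because a typo in a hyperparameter only surfaces after the training instance has been provisioned, it can be worth checking the values locally first. The following is a minimal sketch, not part of the SageMaker SDK: `validate_hyperparameters` is a hypothetical helper, and the ranges it checks follow the documented XGBoost parameter ranges for the hyperparameters used in this article.

```python
def validate_hyperparameters(params: dict) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []

    def check(name, is_valid, rule):
        # Only validate parameters that are actually present
        if name in params and not is_valid(params[name]):
            problems.append(f"{name}={params[name]!r} violates: {rule}")

    check("num_round", lambda v: isinstance(v, int) and v >= 1, "integer >= 1")
    check("max_depth", lambda v: isinstance(v, int) and v >= 0, "integer >= 0")
    check("eta", lambda v: 0 <= v <= 1, "0 <= eta <= 1")
    check("subsample", lambda v: 0 < v <= 1, "0 < subsample <= 1")
    check("colsample_bytree", lambda v: 0 < v <= 1, "0 < colsample_bytree <= 1")
    check("gamma", lambda v: v >= 0, "gamma >= 0")
    check("min_child_weight", lambda v: v >= 0, "min_child_weight >= 0")
    check("alpha", lambda v: v >= 0, "alpha >= 0")
    return problems


# The values used in this article
params = {
    "objective": "binary:logistic",
    "num_round": 700,
    "early_stopping_rounds": 20,
    "max_depth": 8,
    "eta": 0.05,
    "subsample": 0.9,
    "colsample_bytree": 0.9,
    "gamma": 2,
    "min_child_weight": 5,
    "alpha": 0.1,
    "eval_metric": "aucpr",
}
problems = validate_hyperparameters(params)
print(problems or "No problems found")
```

If the list comes back empty, the same dictionary can be passed on with `xgb.set_hyperparameters(**params)`.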
Specifying Training Data in S3
The training input data was preprocessed and stored in S3 as train_combined.csv. We configure SageMaker to read the input data directly from S3.
Code Example:
from sagemaker.inputs import TrainingInput

train_features_path = 's3://telco-machinelearning-churn-sagemaker-jhug/data/train_combined.csv'
train_input = TrainingInput(
    s3_data=train_features_path,
    content_type='text/csv'
)
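For CSV input, SageMaker's built-in XGBoost expects a file with no header row and the target variable in the first column. The article does not show how train_combined.csv was produced, so the sketch below only illustrates that layout; the helper `to_training_csv` and the sample column names are hypothetical.

```python
import csv
import io


def to_training_csv(records, label_column):
    """Serialize records as header-less CSV with the label in the first column."""
    if not records:
        return ""
    # Keep feature columns in their original order, minus the label
    feature_columns = [c for c in records[0] if c != label_column]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        writer.writerow([record[label_column]] + [record[c] for c in feature_columns])
    return buffer.getvalue()


# Hypothetical churn records; the column names are illustrative only.
records = [
    {"tenure": 12, "monthly_charges": 70.5, "churn": 1},
    {"tenure": 48, "monthly_charges": 29.9, "churn": 0},
]
csv_text = to_training_csv(records, label_column="churn")
print(csv_text)
```

The resulting text would then be uploaded to the S3 location that `TrainingInput` points to.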
Executing the Training Job
The training job is initiated using the SageMaker SDK, with the specified hyperparameters, container image, and input data.
Code Example:
xgb.fit({'train': train_input})
print("Training job completed successfully.")
Once the training is complete:
- Model artifacts are saved to the specified S3 path.
- Logs and the configured evaluation metric (AUC-PR) are visible in Amazon SageMaker Studio or CloudWatch.
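SageMaker writes the trained model under the configured output path, in a folder named after the training job, as output/model.tar.gz. The sketch below builds that location by hand to make the layout explicit; `model_artifact_uri` is a hypothetical helper and the job name is a made-up example. In practice, the estimator's `model_data` attribute holds this value after training.

```python
def model_artifact_uri(output_path: str, job_name: str) -> str:
    """Build the S3 URI where SageMaker stores the model for a training job."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


output_path = "s3://telco-machinelearning-churn-sagemaker-jhug/models"
job_name = "sagemaker-xgboost-2024-12-16-00-00-00-000"  # hypothetical job name
print(model_artifact_uri(output_path, job_name))
```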
Visual Representation of Training Workflow
Below is a simplified representation of the SageMaker training process:
- Data Preparation:
  - Preprocessed data stored in S3.
- Training Job Execution:
  - SageMaker provisions the compute instance (ml.p3.2xlarge).
  - XGBoost container runs the training job.
- Model Artifacts:
  - Trained model is saved to s3://telco-machinelearning-churn-sagemaker-jhug/models.
- Logs and Metrics:
  - Logs captured in CloudWatch.
  - Training metrics displayed in SageMaker Studio.
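The metrics in the last step come from the per-round lines that XGBoost prints to the training log. As a rough sketch, those lines can be parsed after exporting the log text. This assumes lines of the form `[round]#011train-aucpr:value`, where `#011` is how a tab character typically appears in CloudWatch; the sample lines and the helper `parse_metrics` are illustrative, not real job output.

```python
import re

ROUND = re.compile(r"\[(\d+)\]")
METRIC = re.compile(r"([A-Za-z][\w-]*):([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_metrics(line: str):
    """Return (round, {metric_name: value}) for a metric line, else None."""
    round_match = ROUND.search(line)
    if round_match is None:
        return None
    # Only look after the round marker so timestamps are not mistaken for metrics
    tail = line[round_match.end():]
    metrics = {name: float(value) for name, value in METRIC.findall(tail)}
    if not metrics:
        return None
    return int(round_match.group(1)), metrics


# Illustrative log lines, not output from a real training job
sample_lines = [
    "[0]#011train-aucpr:0.71234",
    "[1]#011train-aucpr:0.74012",
    "Starting training",
]
for line in sample_lines:
    print(parse_metrics(line))
```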
Code Recap: Complete Workflow
Below is the complete code to configure and train the XGBoost model in SageMaker:
from sagemaker import image_uris, Session
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator
import sagemaker

# Retrieve the XGBoost container image
container = image_uris.retrieve(
    framework='xgboost',
    region=Session().boto_region_name,
    version='1.5-1'
)

# Define the SageMaker Estimator
xgb = Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path='s3://telco-machinelearning-churn-sagemaker-jhug/models',
    sagemaker_session=sagemaker.Session()
)

# Set hyperparameters
xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=700,
    early_stopping_rounds=20,
    max_depth=8,
    eta=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    gamma=2,
    min_child_weight=5,
    alpha=0.1,
    eval_metric='aucpr'
)

# Define training input data
train_features_path = 's3://telco-machinelearning-churn-sagemaker-jhug/data/train_combined.csv'
train_input = TrainingInput(
    s3_data=train_features_path,
    content_type='text/csv'
)

# Run the training job
xgb.fit({'train': train_input})
By using the SageMaker SDK, we successfully trained an XGBoost model in SageMaker with custom hyperparameters and input data stored in S3. The trained model artifacts are saved back to S3 for deployment or further evaluation.