Objective
This article walks through training an XGBoost model in SageMaker using the SageMaker SDK. We'll cover how to configure the XGBoost container, set hyperparameters, prepare input data, and execute the training job.
Introduction to XGBoost in SageMaker
XGBoost is a highly efficient and scalable gradient boosting algorithm for supervised learning tasks like classification and regression. AWS SageMaker provides prebuilt XGBoost containers optimized for distributed training and inference.
We use the SageMaker SDK to configure and run the training job programmatically.
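The examples below assume the code runs in an environment where the SageMaker Python SDK is installed and an IAM execution role can be resolved, such as a SageMaker notebook instance or Studio; here is a minimal setup sketch (the session and role variable names are illustrative):
import sagemaker

# Create a SageMaker session and resolve the IAM execution role
# (get_execution_role() works inside SageMaker notebooks and Studio)
session = sagemaker.Session()
role = sagemaker.get_execution_role()

print(f"Region: {session.boto_region_name}")
print(f"Execution role: {role}")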
Retrieving the XGBoost Container
To use XGBoost in SageMaker, we retrieve the optimized container image for the AWS region and specific version.
Code Example:
from sagemaker import image_uris, Session
# Get the XGBoost container image
container = image_uris.retrieve(
    framework='xgboost',
    region=Session().boto_region_name,
    version='1.5-1'
)
print(f"XGBoost container image: {container}")
Retrieving the image URI this way gives us the correct, region-specific XGBoost container that SageMaker will use to run the training job.
Defining Hyperparameters
Hyperparameters control the XGBoost training process and performance. For this task, the following hyperparameters were selected:
- objective: binary:logistic for binary classification.
- num_round: Number of boosting rounds.
- max_depth: Maximum depth of a tree.
- eta: Learning rate.
- subsample and colsample_bytree: Control data sampling.
- gamma, min_child_weight, and alpha: Regularization settings that help control overfitting.
- eval_metric: aucpr, the area under the precision-recall curve, often preferred for imbalanced classes.
- early_stopping_rounds: Stops training when the evaluation metric has not improved for the given number of rounds.
Code Example:
import sagemaker
from sagemaker.estimator import Estimator
# Define the SageMaker Estimator
xgb = Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path='s3://telco-machinelearning-churn-sagemaker-jhug/models',
    sagemaker_session=sagemaker.Session()
)

# Set the training hyperparameters
xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=700,
    early_stopping_rounds=20,
    max_depth=8,
    eta=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    gamma=2,
    min_child_weight=5,
    alpha=0.1,
    eval_metric='aucpr'
)
Specifying Training Data in S3
The training input data was preprocessed and stored in S3 as train_combined.csv. We configure SageMaker to read the input data directly from S3. For CSV input, the built-in SageMaker XGBoost algorithm expects the target column to come first and the file to have no header row.
Code Example:
from sagemaker.inputs import TrainingInput
train_features_path = 's3://telco-machinelearning-churn-sagemaker-jhug/data/train_combined.csv'
train_input = TrainingInput(
    s3_data=train_features_path,
    content_type='text/csv'
)
Executing the Training Job
The training job is initiated using the SageMaker SDK, with the specified hyperparameters, container image, and input data.
Code Example:
xgb.fit({'train': train_input})
print("Training job completed successfully.")
Once the training is complete:
- Model artifacts are saved to the specified S3 path (see the snippet after this list for locating them programmatically).
- Logs and metrics (e.g., AUC-PR, accuracy) are visible in Amazon SageMaker Studio or CloudWatch.
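To locate the artifacts programmatically, the estimator exposes the S3 path of the model tarball once fit() returns:
# S3 location of the model.tar.gz produced by the completed training job
print(f"Model artifacts: {xgb.model_data}")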
Visual Representation of Training Workflow
Below is a simplified representation of the SageMaker training process:
- Data Preparation:
  - Preprocessed data stored in S3.
- Training Job Execution:
  - SageMaker provisions the compute instance (ml.p3.2xlarge).
  - XGBoost container runs the training job.
- Model Artifacts:
  - Trained model is saved to s3://telco-machinelearning-churn-sagemaker-jhug/models.
- Logs and Metrics:
  - Logs captured in CloudWatch.
  - Training metrics displayed in SageMaker Studio.
Code Recap: Complete Workflow
Below is the complete code to configure and train the XGBoost model in SageMaker:
from sagemaker import image_uris, Session
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator
import sagemaker
# Retrieve the XGBoost container image
container = image_uris.retrieve(
    framework='xgboost',
    region=Session().boto_region_name,
    version='1.5-1'
)

# Define the SageMaker Estimator
xgb = Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path='s3://telco-machinelearning-churn-sagemaker-jhug/models',
    sagemaker_session=sagemaker.Session()
)

# Set hyperparameters
xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=700,
    early_stopping_rounds=20,
    max_depth=8,
    eta=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    gamma=2,
    min_child_weight=5,
    alpha=0.1,
    eval_metric='aucpr'
)

# Define training input data
train_features_path = 's3://telco-machinelearning-churn-sagemaker-jhug/data/train_combined.csv'
train_input = TrainingInput(
    s3_data=train_features_path,
    content_type='text/csv'
)

# Run the training job
xgb.fit({'train': train_input})
By using the SageMaker SDK, we successfully trained an XGBoost model in SageMaker with custom hyperparameters and input data stored in S3. The trained model artifacts are saved back to S3 for deployment or further evaluation.