Training a Model in SageMaker

This article demonstrates how to train an XGBoost model with Amazon SageMaker and the SageMaker Python SDK. It uses SageMaker's prebuilt XGBoost container, sets hyperparameters such as max_depth, eta, and num_round, reads the training data directly from an S3 bucket, and writes the model artifacts back to S3 after training. The whole job is configured in code, and metrics such as AUC-PR can be monitored from SageMaker Studio.

Todd Bernson

2024-12-16

Objective

This article walks through training an XGBoost model in SageMaker using the SageMaker SDK. We'll cover how to configure the XGBoost container, set hyperparameters, prepare input data, and execute the training job.


Introduction to XGBoost in SageMaker

XGBoost is an efficient, scalable gradient boosting algorithm for supervised learning tasks such as classification and regression. Amazon SageMaker provides prebuilt XGBoost containers that support distributed training and inference.

We use the SageMaker SDK to configure and run the training job programmatically.


Retrieving the XGBoost Container

To use XGBoost in SageMaker, we retrieve the URI of the prebuilt container image for the current AWS region and a specific XGBoost version.

Code Example:

from sagemaker import image_uris, Session

# Get the XGBoost container image
container = image_uris.retrieve(
    framework='xgboost',
    region=Session().boto_region_name,
    version='1.5-1'
)
print(f"XGBoost container image: {container}")

SageMaker pulls this image onto the training instance and runs the XGBoost training code inside it.
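The URI returned by image_uris.retrieve is an Amazon ECR image reference of the form <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>. To check that the region and version tag are the intended ones, the URI can be split into its parts. The sketch below uses only the standard library. parse_image_uri is a hypothetical helper, and the example URI uses a placeholder account ID rather than a real AWS account.

```python
import re

# Pattern for an ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
IMAGE_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?"
    r"/(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_image_uri(uri):
    """Split an ECR image URI into account, region, repository, and tag."""
    match = IMAGE_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri}")
    return match.groupdict()


# Placeholder account ID, for illustration only.
example = "123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1"
parts = parse_image_uri(example)
print(parts["region"], parts["repository"], parts["tag"])
```

In a notebook, the same helper could be applied to the container variable from the code example above.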


Defining Hyperparameters

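Hyperparameters control how XGBoost trains and how well the resulting model performs. For this task, the following hyperparameters were selected:

  • objective: binary:logistic for binary classification.
  • num_round: Number of boosting rounds.
  • early_stopping_rounds: Stop training when the validation score has not improved for this many rounds. It relies on a validation channel, which this example does not supply.
  • max_depth: Maximum depth of a tree.
  • eta: Learning rate.
  • subsample and colsample_bytree: Fraction of training rows and of feature columns sampled for each tree.
  • gamma: Minimum loss reduction required to split a node further.
  • min_child_weight: Minimum sum of instance weights required in a child node.
  • alpha: L1 regularization term on the weights.
  • eval_metric: aucpr, the area under the precision-recall curve.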
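Several of these hyperparameters have documented valid ranges in XGBoost (for example, eta in [0, 1] and subsample in (0, 1]), and a value outside them only surfaces as an error once the training job has started. A small local check can catch such mistakes earlier. The following is a minimal sketch, not part of the SageMaker SDK: check_hyperparameters and the RANGES table are illustrative names, and only a handful of parameters are covered.

```python
# Sanity-check hyperparameter values before submitting a training job.
# The ranges follow the XGBoost parameter documentation; the function
# name and the RANGES table are illustrative, not part of any SDK.

RANGES = {
    # name: (lower bound, lower inclusive?, upper bound or None, upper inclusive?)
    "num_round": (1, True, None, True),
    "max_depth": (0, True, None, True),
    "eta": (0.0, True, 1.0, True),
    "subsample": (0.0, False, 1.0, True),
    "colsample_bytree": (0.0, False, 1.0, True),
}


def check_hyperparameters(params):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name, (low, low_incl, high, high_incl) in RANGES.items():
        if name not in params:
            continue
        value = params[name]
        too_low = value < low or (value == low and not low_incl)
        too_high = high is not None and (
            value > high or (value == high and not high_incl)
        )
        if too_low or too_high:
            problems.append(f"{name}={value} is outside its valid range")
    return problems


# Values used in this article.
params = {"num_round": 700, "max_depth": 8, "eta": 0.05,
          "subsample": 0.9, "colsample_bytree": 0.9}
print(check_hyperparameters(params))
print(check_hyperparameters({"eta": 1.5, "subsample": 0.0}))
```

The first call prints an empty list because the article's values are all in range. The second reports both eta and subsample.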

Code Example:

import sagemaker
from sagemaker.estimator import Estimator

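xgb = Estimator(
    image_uri=container,                  # XGBoost image retrieved above
    role=sagemaker.get_execution_role(),  # IAM role of the SageMaker notebook or Studio environment
    instance_count=1,
    # GPU instance. Per the SageMaker XGBoost documentation, the GPU is
    # used only when the tree_method hyperparameter is set to 'gpu_hist'.
    instance_type='ml.p3.2xlarge',
    output_path='s3://telco-machinelearning-churn-sagemaker-jhug/models',  # S3 prefix for model artifacts
    sagemaker_session=sagemaker.Session()
)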

xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=700,
    early_stopping_rounds=20,
    max_depth=8,
    eta=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    gamma=2,
    min_child_weight=5,
    alpha=0.1,
    eval_metric='aucpr'
)
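The early_stopping_rounds=20 setting means training stops once the monitored score has gone 20 consecutive rounds without a new best. As far as the SageMaker documentation describes it, the monitored score is the validation score. The rule itself can be sketched in plain Python. stopping_round is a hypothetical helper, the scores are made up for illustration, and a patience of 3 is used to keep the example short.

```python
def stopping_round(scores, patience):
    """Return the index of the round at which training would stop.

    scores: one evaluation score per boosting round, higher is better
            (as with AUC-PR).
    patience: number of consecutive rounds without a new best score
              that triggers the stop (the role of early_stopping_rounds).
    """
    best_score = float("-inf")
    best_round = -1
    for round_index, score in enumerate(scores):
        if score > best_score:
            best_score, best_round = score, round_index
        elif round_index - best_round >= patience:
            return round_index  # no new best for `patience` rounds
    return len(scores) - 1  # every round ran without triggering the rule


# Made-up scores for illustration only, not output from a real job.
scores = [0.60, 0.65, 0.70, 0.69, 0.70, 0.68, 0.71, 0.70]
print(stopping_round(scores, patience=3))
```

Here the best score so far is reached at index 2, and the three rounds that follow bring no improvement, so training would stop at index 5 and never see the higher score at index 6. A larger patience trades extra training time for a lower risk of stopping too early.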

Specifying Training Data in S3

The training input data was preprocessed and stored in S3 as train_combined.csv, and SageMaker reads it directly from S3. For CSV input, the built-in XGBoost algorithm expects the label in the first column and no header row.

Code Example:

from sagemaker.inputs import TrainingInput

train_features_path = 's3://telco-machinelearning-churn-sagemaker-jhug/data/train_combined.csv'

train_input = TrainingInput(
    s3_data=train_features_path,
    content_type='text/csv'
)
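For CSV input, SageMaker's built-in XGBoost expects the target in the first column and no header row, and with binary:logistic the labels should be 0 or 1. A quick local check before uploading can save a failed job. The following is a minimal sketch using only the standard library. check_training_csv is a hypothetical helper, and the column names in the bad sample are invented for illustration.

```python
import csv
import io


def check_training_csv(text, max_rows=1000):
    """Return a list of problems found in the first `max_rows` rows of CSV text."""
    problems = []
    for row_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if row_number > max_rows:
            break
        if not row:
            continue  # skip blank lines
        try:
            # This sketch also reports empty cells as non-numeric; relax
            # this if your data uses blanks for missing values.
            values = [float(cell) for cell in row]
        except ValueError:
            hint = " (header row?)" if row_number == 1 else ""
            problems.append(f"row {row_number}: non-numeric value{hint}")
            continue
        if values[0] not in (0.0, 1.0):
            problems.append(f"row {row_number}: label {row[0]!r} is not 0 or 1")
    return problems


good = "1,0.5,3\n0,1.2,7\n"
bad = "churn,tenure,charges\n1,0.5,3\n2,1.2,7\n"  # invented column names
print(check_training_csv(good))
print(check_training_csv(bad))
```

For a real file, the text could be read from a local copy of train_combined.csv before it is uploaded to S3.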

Executing the Training Job

The training job is initiated using the SageMaker SDK, with the specified hyperparameters, container image, and input data.

Code Example:

xgb.fit({'train': train_input})
print("Training job completed successfully.")

Once the training is complete:

  • Model artifacts are saved to the specified S3 path.
  • Logs and the configured evaluation metric (AUC-PR) are visible in Amazon SageMaker Studio or CloudWatch.
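SageMaker writes the model archive under the configured output path, in a folder named after the training job: <output_path>/<job-name>/output/model.tar.gz. After fit() returns, the estimator's model_data attribute holds the same URI. The sketch below rebuilds the location by hand. model_artifact_uri is a hypothetical helper, and the job name is a placeholder, since real job names are generated when the job is submitted.

```python
def model_artifact_uri(output_path, job_name):
    """Build the S3 URI where SageMaker writes the model archive for a job."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


output_path = "s3://telco-machinelearning-churn-sagemaker-jhug/models"
job_name = "sagemaker-xgboost-example-job"  # placeholder job name
print(model_artifact_uri(output_path, job_name))
```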

Training Workflow Overview

Below is a simplified outline of the SageMaker training process:

  1. Data Preparation:
    • Preprocessed data stored in S3.
  2. Training Job Execution:
    • SageMaker provisions the compute instance (ml.p3.2xlarge).
    • XGBoost container runs the training job.
  3. Model Artifacts:
    • Trained model is saved to s3://telco-machinelearning-churn-sagemaker-jhug/models.
  4. Logs and Metrics:
    • Logs captured in CloudWatch.
    • Training metrics displayed in SageMaker Studio.

Code Recap: Complete Workflow

Below is the complete code to configure and train the XGBoost model in SageMaker:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Create one session and reuse it throughout
session = sagemaker.Session()

# Retrieve the XGBoost container image
container = image_uris.retrieve(
    framework='xgboost',
    region=session.boto_region_name,
    version='1.5-1'
)

# Define the SageMaker Estimator
xgb = Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path='s3://telco-machinelearning-churn-sagemaker-jhug/models',
    sagemaker_session=session
)

# Set hyperparameters
xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=700,
    early_stopping_rounds=20,
    max_depth=8,
    eta=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    gamma=2,
    min_child_weight=5,
    alpha=0.1,
    eval_metric='aucpr'
)

# Define training input data
train_features_path = 's3://telco-machinelearning-churn-sagemaker-jhug/data/train_combined.csv'
train_input = TrainingInput(
    s3_data=train_features_path,
    content_type='text/csv'
)

# Run the training job
xgb.fit({'train': train_input})

By using the SageMaker SDK, we successfully trained an XGBoost model in SageMaker with custom hyperparameters and input data stored in S3. The trained model artifacts are saved back to S3 for deployment or further evaluation.

Todd Bernson

CTO