Preprocessing and Preparing Data for SageMaker
todd-bernson-leadership

Objective

This article details the preprocessing steps necessary to prepare data for SageMaker training, focusing on handling missing values, balancing datasets, encoding categorical variables, and splitting datasets into training and testing sets. These steps ensure the model is trained on clean, well-structured data for better performance.


Why Preprocessing is Critical for ML Models

Machine learning models require structured and clean data to perform effectively. Preprocessing ensures:

  • Missing values and outliers do not skew model performance.
  • Class imbalances are addressed to avoid biased predictions.
  • Categorical variables are encoded numerically for compatibility with ML algorithms.
  • Data is split into training and testing sets to evaluate model performance reliably.

Handling Missing Values

In this dataset, the TotalCharges column is stored as text and contains blank entries that represent missing values. We handle this as follows:

  1. Convert TotalCharges to numeric, coercing unparseable entries to NaN.
  2. Replace the resulting missing values with 0.

Code Example:

import pandas as pd

df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(0)
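In the raw Telco file the missing charges typically appear as blank strings, which errors='coerce' turns into NaN before the fill. A minimal sketch of that behavior on a toy series:

```python
import pandas as pd

# ' ' mimics how the raw data encodes a missing charge
s = pd.Series(['29.85', ' ', '1889.50'])
s = pd.to_numeric(s, errors='coerce')  # unparseable blank string -> NaN
s = s.fillna(0)
print(s.tolist())  # [29.85, 0.0, 1889.5]
```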


Using SMOTE for Handling Class Imbalance

The dataset has a class imbalance, with far fewer examples of customers who have churned. To address this, we use SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic samples for the minority class. Note that SMOTE interpolates between numeric feature vectors, so the categorical columns must be encoded (as shown in the next section) before resampling is applied.

Code Example:

from imblearn.over_sampling import SMOTE

# Convert the target column to binary
df['Churn'] = (df['Churn'] == 'Yes').astype(int)

# Split into features and target
X = df.drop(columns=['Churn', 'customerID'])
y = df['Churn']

# Apply SMOTE for class balancing (X must already be fully numeric)
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)

Dataset Before and After Balancing:

| Class     | Before Balancing | After Balancing |
|-----------|------------------|-----------------|
| No Churn  | 5,174            | 5,174           |
| Yes Churn | 1,869            | 5,174           |

Encoding Categorical Variables

SageMaker's built-in algorithms require numeric inputs. We use LabelEncoder to encode the categorical variables.

Code Example:

from sklearn.preprocessing import LabelEncoder

# Identify the categorical (object-dtype) feature columns
categorical_columns = X.select_dtypes(include=['object']).columns
label_encoders = {}

for col in categorical_columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le  # Save encoder so predictions can be decoded later
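Keeping the fitted encoders around lets us map encoded values back to the original category names later. A small sketch of the round trip, using hypothetical InternetService values:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['DSL', 'Fiber optic', 'No', 'DSL'])
print(codes.tolist())  # [0, 1, 2, 0] -- classes get codes in sorted order
print(le.inverse_transform(codes).tolist())  # recovers the original strings
```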

Aligning and Splitting Datasets

After encoding, we split the dataset into training and testing sets for evaluation.

Code Example:

from sklearn.model_selection import train_test_split

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.2, random_state=42
)
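Because the resampled classes are balanced, a plain random split usually works, but passing stratify keeps the class ratio exact in both sets. A minimal sketch with stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)  # balanced labels, as after SMOTE

# stratify=y preserves the 50/50 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_tr), len(X_te))  # 80 20
```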

Exporting Preprocessed Data for SageMaker Training

SageMaker's built-in algorithms accept training data in CSV format. We export the training and testing sets to CSV files.

Code Example:

# Save training and testing data to CSV
X_train.to_csv('train_features.csv', index=False, header=False)
y_train.to_csv('train_labels.csv', index=False, header=False)
X_test.to_csv('test_features.csv', index=False, header=False)
y_test.to_csv('test_labels.csv', index=False, header=False)
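One caveat worth noting: SageMaker's built-in XGBoost algorithm expects CSV training input with the label in the first column and no header row. If that is the target algorithm, labels and features can be combined into a single file; a sketch with hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical stand-ins for the balanced training data
X_train = pd.DataFrame({'TotalCharges': [29.85, 1889.5], 'Gender': [1, 0]})
y_train = pd.Series([0, 1], name='Churn')

# Label first, then features -- the layout built-in XGBoost expects
train = pd.concat([y_train, X_train], axis=1)
train.to_csv('train.csv', index=False, header=False)
print(open('train.csv').read())  # each row: label,TotalCharges,Gender
```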

Uploading Data to S3:

import boto3

s3 = boto3.client('s3')
bucket = 'telco-machinelearning-churn-sagemaker-jhug'

files_to_upload = {
    'train_features.csv': 'data/train_features.csv',
    'train_labels.csv': 'data/train_labels.csv',
    'test_features.csv': 'data/test_features.csv',
    'test_labels.csv': 'data/test_labels.csv'
}

for local_file, s3_key in files_to_upload.items():
    s3.upload_file(local_file, bucket, s3_key)
    print(f"Uploaded {local_file} to s3://{bucket}/{s3_key}")

Dataset Sample Preview Before and After Preprocessing

Before Preprocessing (Raw Data):

| customerID | TotalCharges | Churn | Gender | InternetService |
|------------|--------------|-------|--------|-----------------|
| 7590-VHVEG | 29.85        | No    | Female | DSL             |
| 5575-GNVDE | 1889.50      | Yes   | Male   | Fiber optic     |

After Preprocessing (Encoded and Balanced Data):

| TotalCharges | Churn | Gender | InternetService |
|--------------|-------|--------|-----------------|
| 29.85        | 0     | 0      | 0               |
| 1889.50      | 1     | 1      | 1               |

Conclusion

Preprocessing is a critical step in any machine learning workflow. By cleaning, balancing, and encoding the data, we set the foundation for effective SageMaker training.
