Preprocessing and Preparing Data for SageMaker
todd-bernson-leadership

Objective

This article details the preprocessing steps necessary to prepare data for SageMaker training, focusing on handling missing values, balancing datasets, encoding categorical variables, and splitting datasets into training and testing sets. These steps ensure the model is trained on clean, well-structured data for better performance.


Why Preprocessing is Critical for ML Models

Machine learning models require structured and clean data to perform effectively. Preprocessing ensures:

  • Missing values and outliers do not skew model performance.
  • Class imbalances are addressed to avoid biased predictions.
  • Categorical variables are encoded numerically for compatibility with ML algorithms.
  • Data is split into training and testing sets to evaluate model performance reliably.

Handling Missing Values

In this dataset, the TotalCharges column is stored as text and contains blank entries that represent missing values. We handle this as follows:

  1. Convert TotalCharges to numeric, coercing unparseable entries to NaN.
  2. Replace the resulting missing values with 0.

Code Example:

import pandas as pd

df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(0)
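In the raw Telco file the missing charges typically appear as blank strings, which errors='coerce' turns into NaN before the fill. A minimal sketch of that behavior on a toy series:

```python
import pandas as pd

# ' ' mimics how the raw data encodes a missing charge
s = pd.Series(['29.85', ' ', '1889.50'])
s = pd.to_numeric(s, errors='coerce')  # unparseable blank string -> NaN
s = s.fillna(0)
print(s.tolist())  # [29.85, 0.0, 1889.5]
```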


Using SMOTE for Handling Class Imbalance

The dataset has a class imbalance, with far fewer examples of customers who have churned. To address this, we use SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic samples for the minority class. Note that SMOTE interpolates between numeric feature vectors, so the categorical columns must be encoded (as shown in the next section) before resampling is applied.

Code Example:

from imblearn.over_sampling import SMOTE

# Convert the target column to binary
df['Churn'] = (df['Churn'] == 'Yes').astype(int)

# Split into features and target
X = df.drop(columns=['Churn', 'customerID'])
y = df['Churn']

# Apply SMOTE for class balancing (X must already be fully numeric)
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)

Dataset Before and After Balancing:

| Class     | Before Balancing | After Balancing |
|-----------|------------------|-----------------|
| No Churn  | 5,174            | 5,174           |
| Yes Churn | 1,869            | 5,174           |

Encoding Categorical Variables

SageMaker's built-in algorithms require numeric inputs. We use LabelEncoder to encode the categorical variables.

Code Example:

from sklearn.preprocessing import LabelEncoder

# Identify the categorical (object-dtype) feature columns
categorical_columns = X.select_dtypes(include=['object']).columns
label_encoders = {}

for col in categorical_columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le  # Save encoder so predictions can be decoded later
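Keeping the fitted encoders around lets us map encoded values back to the original category names later. A small sketch of the round trip, using hypothetical InternetService values:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['DSL', 'Fiber optic', 'No', 'DSL'])
print(codes.tolist())  # [0, 1, 2, 0] -- classes get codes in sorted order
print(le.inverse_transform(codes).tolist())  # recovers the original strings
```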

Aligning and Splitting Datasets

After encoding, we split the dataset into training and testing sets for evaluation.

Code Example:

from sklearn.model_selection import train_test_split

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.2, random_state=42
)
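Because the resampled classes are balanced, a plain random split usually works, but passing stratify keeps the class ratio exact in both sets. A minimal sketch with stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)  # balanced labels, as after SMOTE

# stratify=y preserves the 50/50 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_tr), len(X_te))  # 80 20
```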

Exporting Preprocessed Data for SageMaker Training

SageMaker's built-in algorithms accept training data in CSV format. We export the training and testing sets to CSV files.

Code Example:

# Save training and testing data to CSV
X_train.to_csv('train_features.csv', index=False, header=False)
y_train.to_csv('train_labels.csv', index=False, header=False)
X_test.to_csv('test_features.csv', index=False, header=False)
y_test.to_csv('test_labels.csv', index=False, header=False)
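One caveat worth noting: SageMaker's built-in XGBoost algorithm expects CSV training input with the label in the first column and no header row. If that is the target algorithm, labels and features can be combined into a single file; a sketch with hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical stand-ins for the balanced training data
X_train = pd.DataFrame({'TotalCharges': [29.85, 1889.5], 'Gender': [1, 0]})
y_train = pd.Series([0, 1], name='Churn')

# Label first, then features -- the layout built-in XGBoost expects
train = pd.concat([y_train, X_train], axis=1)
train.to_csv('train.csv', index=False, header=False)
print(open('train.csv').read())  # each row: label,TotalCharges,Gender
```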

Uploading Data to S3:

import boto3

s3 = boto3.client('s3')
bucket = 'telco-machinelearning-churn-sagemaker-jhug'

files_to_upload = {
    'train_features.csv': 'data/train_features.csv',
    'train_labels.csv': 'data/train_labels.csv',
    'test_features.csv': 'data/test_features.csv',
    'test_labels.csv': 'data/test_labels.csv'
}

for local_file, s3_key in files_to_upload.items():
    s3.upload_file(local_file, bucket, s3_key)
    print(f"Uploaded {local_file} to s3://{bucket}/{s3_key}")

Dataset Sample Preview Before and After Preprocessing

Before Preprocessing (Raw Data):

| customerID | TotalCharges | Churn | Gender | InternetService |
|------------|--------------|-------|--------|-----------------|
| 7590-VHVEG | 29.85        | No    | Female | DSL             |
| 5575-GNVDE | 1889.50      | Yes   | Male   | Fiber optic     |

After Preprocessing (Encoded and Balanced Data):

| TotalCharges | Churn | Gender | InternetService |
|--------------|-------|--------|-----------------|
| 29.85        | 0     | 0      | 0               |
| 1889.50      | 1     | 1      | 1               |

Conclusion

Preprocessing is a critical step in any machine learning workflow. By cleaning, balancing, and encoding the data, we set the foundation for effective SageMaker training.
