Objective
This article details the preprocessing steps necessary to prepare data for SageMaker training, focusing on handling missing values, balancing datasets, encoding categorical variables, and splitting datasets into training and testing sets. These steps ensure the model is trained on clean, well-structured data for better performance.
Why Preprocessing is Critical for ML Models
Machine learning models require structured and clean data to perform effectively. Preprocessing ensures:
- Missing values and outliers do not skew model performance.
- Class imbalances are addressed to avoid biased predictions.
- Categorical variables are encoded numerically for compatibility with ML algorithms.
- Data is split into training and testing sets to evaluate model performance reliably.
Handling Missing Values and Outliers
In this dataset, the `TotalCharges` column contains missing values and potential outliers. We handle these issues as follows:
- Convert `TotalCharges` to numeric.
- Replace missing values with `0`.
Code Example:
```
import pandas as pd

# Coerce non-numeric entries (e.g., blank strings) to NaN, then fill with 0
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(0)
```
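The prose above also mentions potential outliers, which the snippet does not treat. One common option is IQR-based clipping; the sketch below is an illustration (the 1.5 multiplier is a conventional choice, not something specified in this workflow):

```
# Sketch: cap TotalCharges outliers using the IQR rule (1.5x IQR is a common convention)
q1 = df['TotalCharges'].quantile(0.25)
q3 = df['TotalCharges'].quantile(0.75)
iqr = q3 - q1
df['TotalCharges'] = df['TotalCharges'].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```

Clipping preserves the row count, which matters here because every row is a customer we still want to score.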
Using SMOTE for Handling Class Imbalance
The dataset has a class imbalance, with far fewer examples of customers who have churned. To address this, we use SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic samples for the minority class. Note that SMOTE operates on numeric feature vectors, so the categorical encoding shown in the next section must be applied to `X` before calling `fit_resample`.
Code Example:
```
from imblearn.over_sampling import SMOTE

# Convert the target column to binary
df['Churn'] = (df['Churn'] == 'Yes').astype(int)

# Split into features and target
X = df.drop(columns=['Churn', 'customerID'])
y = df['Churn']

# Apply SMOTE for class balancing (X must be fully numeric at this point)
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)
```
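A quick way to confirm the resampling is to compare class counts before and after. Assuming a recent version of imbalanced-learn (which returns pandas objects for pandas inputs), something like:

```
# Compare class distributions before and after SMOTE
print(y.value_counts())           # original, imbalanced counts
print(y_balanced.value_counts())  # after SMOTE, classes are equal
```

These counts are what the table below summarizes.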
Dataset Before and After Balancing:
| Class | Before Balancing | After Balancing |
|---|---|---|
| No Churn | 5,174 | 5,174 |
| Churn | 1,869 | 5,174 |
Encoding Categorical Variables
SageMaker's built-in algorithms require numeric inputs, so we use `LabelEncoder` to encode the categorical variables. As noted above, this step must run on `X` before SMOTE resampling.
Code Example:
```
from sklearn.preprocessing import LabelEncoder

# Encode each categorical feature column (customerID was already dropped from X)
label_encoders = {}
for col in X.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le  # Save encoder for future use
```
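The saved encoders are what make scoring new data consistent: incoming records must be mapped with the same training-time encodings. A minimal sketch, assuming a hypothetical `new_df` with the same categorical columns:

```
# Re-apply training-time mappings to new data (new_df is hypothetical)
for col, le in label_encoders.items():
    # transform() raises ValueError on categories unseen during training
    new_df[col] = le.transform(new_df[col])
```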
Aligning and Splitting Datasets
After encoding and balancing, we split the dataset into training and testing sets for evaluation.
Code Example:
```
from sklearn.model_selection import train_test_split

# Split data into train and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.2, random_state=42
)
```
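A brief sanity check on the split (not part of the original workflow, just a convenience):

```
# Confirm split sizes and that class balance carried through
print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))
```

Because the data was balanced before splitting, a plain random split suffices; for imbalanced targets you could pass `stratify=y_balanced` to `train_test_split`.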
Exporting Preprocessed Data for SageMaker Training
SageMaker's built-in algorithms accept CSV input for training, so we export the training and testing sets to CSV files.
Code Example:
```
# Save training and testing data to CSV (no index or header rows)
X_train.to_csv('train_features.csv', index=False, header=False)
y_train.to_csv('train_labels.csv', index=False, header=False)
X_test.to_csv('test_features.csv', index=False, header=False)
y_test.to_csv('test_labels.csv', index=False, header=False)
```
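One caveat: separate feature and label files suit custom training scripts, but SageMaker's built-in XGBoost algorithm expects a single CSV with the label in the first column and no header row. If you plan to use the built-in algorithm, a sketch of that alternative layout:

```
# Alternative layout for SageMaker built-in XGBoost: label first, no header
pd.concat([y_train, X_train], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([y_test, X_test], axis=1).to_csv('validation.csv', index=False, header=False)
```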
Uploading Data to S3:
```
import boto3

s3 = boto3.client('s3')
bucket = 'telco-machinelearning-churn-sagemaker-jhug'

files_to_upload = {
    'train_features.csv': 'data/train_features.csv',
    'train_labels.csv': 'data/train_labels.csv',
    'test_features.csv': 'data/test_features.csv',
    'test_labels.csv': 'data/test_labels.csv'
}

# Upload each local file to the corresponding S3 key
for local_file, s3_key in files_to_upload.items():
    s3.upload_file(local_file, bucket, s3_key)
    print(f"Uploaded {local_file} to s3://{bucket}/{s3_key}")
```
Dataset Sample Preview Before and After Preprocessing
Before Preprocessing (Raw Data):
| customerID | TotalCharges | Churn | Gender | InternetService |
|---|---|---|---|---|
| 7590-VHVEG | 29.85 | No | Female | DSL |
| 5575-GNVDE | 1889.50 | Yes | Male | Fiber optic |
After Preprocessing (Encoded and Balanced Data):
| TotalCharges | Churn | Gender | InternetService |
|---|---|---|---|
| 29.85 | 0 | 0 | 0 |
| 1889.50 | 1 | 1 | 1 |
Conclusion
Preprocessing is a critical step in any machine learning workflow. By cleaning, balancing, and encoding the data, we set the foundation for effective SageMaker training.