For machine learning workflows to operate effectively, seamless data ingestion and preprocessing are essential. Azure Storage and Azure Machine Learning Studio provide a robust platform to automate these tasks, ensuring data flows efficiently into training pipelines.
This article demonstrates how to automate customer churn data pipelines using Azure Blob Storage for data ingestion and Azure ML pipelines for preprocessing and training. We’ll cover:
- Uploading datasets to Azure Blob Storage with Terraform.
- Setting up automated pipelines in Azure ML Studio.
- Integrating new data updates into the training process.
Step 1: Uploading Datasets to Azure Blob Storage Using Terraform
Azure Blob Storage serves as the central repository for your datasets. Below is a Terraform configuration for creating a storage account, container, and uploading the churn dataset.
Terraform Configuration
Create a storage.tf file:
resource "azurerm_storage_account" "churn_storage" {
name = "mlchurnstorage"
resource_group_name = azurerm_resource_group.this.name
location = azurerm_resource_group.this.location
account_tier = "Standard"
account_replication_type = "LRS"
tags = {
environment = "production"
}
}
resource "azurerm_storage_container" "data_container" {
name = "customer-churn-data"
storage_account_name = azurerm_storage_account.churn_storage.name
container_access_type = "private"
}
resource "azurerm_storage_blob" "churn_data" {
name = "customer_churn.csv"
storage_account_name = azurerm_storage_account.churn_storage.name
storage_container_name = azurerm_storage_container.data_container.name
type = "Block"
source = "data/customer_churn.csv"
}
Deployment Steps
- Initialize Terraform:
terraform init
- Apply the configuration:
terraform apply
Your dataset is now available in Azure Blob Storage in the customer-churn-data container.
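The pipeline script in Step 2 looks up an Azure ML datastore named mlchurnstorage, so the new container must be registered with the workspace first. Below is a minimal sketch using the Azure ML SDK; the datastore, container, and account names follow the Terraform configuration above, and the account key placeholder is yours to fill in.
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Register the Blob container created by Terraform as an Azure ML datastore.
# The account key can be read from the portal or via "az storage account keys list".
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="mlchurnstorage",       # name referenced by the pipeline in Step 2
    container_name="customer-churn-data",  # container created by Terraform
    account_name="mlchurnstorage",         # storage account created by Terraform
    account_key="<storage-account-key>",   # replace with your key (or use a SAS token)
)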
Step 2: Setting Up Automated Data Pipelines in Azure Machine Learning Studio
Azure Machine Learning pipelines automate data preprocessing and model training. The following steps demonstrate how to configure and run an ML pipeline using the Azure ML SDK.
Pipeline Configuration Script
Create a Python script named automated_pipeline.py:
from azureml.core import Workspace, Dataset, Experiment
from azureml.pipeline.core import Pipeline, StepSequence
from azureml.pipeline.steps import PythonScriptStep
from azureml.data import OutputFileDatasetConfig

# Connect to the Azure ML workspace
ws = Workspace.from_config()

# Reference the registered Blob datastore and the churn dataset.
# The path is relative to the datastore's container (customer-churn-data).
datastore = ws.datastores["mlchurnstorage"]
input_data = Dataset.File.from_files(path=(datastore, "customer_churn.csv"))

# Define the output location for the preprocessed data
preprocessed_data = OutputFileDatasetConfig(destination=(datastore, "preprocessed_data"))

# Step 1: Preprocessing
preprocess_step = PythonScriptStep(
    name="Preprocess Data",
    script_name="preprocess_data.py",
    arguments=[
        "--input-data", input_data.as_named_input("churn_raw").as_mount(),
        "--output-data", preprocessed_data,
    ],
    compute_target="cpu-cluster",
    source_directory="scripts",
)

# Step 2: Model training, consuming the preprocessed output
train_step = PythonScriptStep(
    name="Train Model",
    script_name="train_model.py",
    arguments=["--input-data", preprocessed_data.as_input()],
    compute_target="gpu-cluster",
    source_directory="scripts",
)

# Build and submit the pipeline
pipeline = Pipeline(workspace=ws, steps=StepSequence([preprocess_step, train_step]))
experiment = Experiment(ws, "automated-churn-pipeline")
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
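The step definitions above assume that compute targets named cpu-cluster and gpu-cluster already exist in the workspace and that the run environment includes pandas and scikit-learn. If they don't, here is a minimal provisioning sketch; the VM size, node counts, and package list are illustrative assumptions.
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import RunConfiguration

ws = Workspace.from_config()

# Provision the CPU cluster used by the preprocessing step if it does not exist yet.
# Repeat with a GPU VM size for "gpu-cluster".
if "cpu-cluster" not in ws.compute_targets:
    cpu_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_DS3_V2", min_nodes=0, max_nodes=2
    )
    ComputeTarget.create(ws, "cpu-cluster", cpu_config).wait_for_completion(show_output=True)

# A run configuration carrying the packages the step scripts need;
# pass it to each PythonScriptStep through its runconfig parameter.
run_config = RunConfiguration()
run_config.environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["pandas", "scikit-learn", "joblib", "azureml-defaults"]
)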
Step 3: Preprocessing the Data
The preprocess_data.py script, saved under scripts/, handles data cleaning, feature engineering, and scaling:
import argparse
import glob
import os

import pandas as pd
from azureml.core import Run

# Parse arguments passed by the pipeline step
parser = argparse.ArgumentParser()
parser.add_argument("--input-data", type=str)
parser.add_argument("--output-data", type=str)
args = parser.parse_args()

run = Run.get_context()

# Load the data (the input may arrive as a mounted folder or a direct file path)
input_path = args.input_data
if os.path.isdir(input_path):
    input_path = glob.glob(os.path.join(input_path, "**", "*.csv"), recursive=True)[0]
data = pd.read_csv(input_path)

# Clean the data
data = data.dropna()
data["TotalCharges"] = pd.to_numeric(data["TotalCharges"], errors="coerce").fillna(0)

# Feature scaling (standardization)
data["tenure"] = (data["tenure"] - data["tenure"].mean()) / data["tenure"].std()

# Save the processed data and log the row count for monitoring (see Step 5)
os.makedirs(args.output_data, exist_ok=True)
data.to_csv(os.path.join(args.output_data, "preprocessed_churn.csv"), index=False)
run.log("preprocessed_rows", len(data))
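The training step (scripts/train_model.py, run on gpu-cluster) consumes this preprocessed output. A minimal sketch of such a script, assuming a binary Churn label column, numeric feature columns, and scikit-learn/joblib available in the run environment:
import argparse
import os

import joblib
import pandas as pd
from azureml.core import Run
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Parse arguments passed by the pipeline step
parser = argparse.ArgumentParser()
parser.add_argument("--input-data", type=str)
args = parser.parse_args()

run = Run.get_context()

# Load the preprocessed output written by preprocess_data.py
data = pd.read_csv(os.path.join(args.input_data, "preprocessed_churn.csv"))

# Assumes a "Churn" label column; use the numeric columns as features
X = data.drop(columns=["Churn"]).select_dtypes("number")
y = data["Churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Log accuracy so it appears in Azure ML Studio (see Step 5)
run.log("test_accuracy", float(model.score(X_test, y_test)))

# Save the trained model to the step's outputs folder
os.makedirs("outputs", exist_ok=True)
joblib.dump(model, "outputs/churn_model.pkl")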
Step 4: Integrating New Data Updates into Training
To integrate new data updates seamlessly:
- Automate dataset updates using a scheduled job to upload new files to Azure Blob Storage.
- Trigger Azure ML pipelines automatically whenever new data arrives. This can be done using Azure Data Factory or Event Grid.
Example: Azure Data Factory triggers a pipeline when a new file is uploaded.
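Azure Data Factory and Event Grid are configured outside the SDK. As an alternative, the Azure ML SDK itself can publish the pipeline and poll the datastore so that new blobs trigger a run. A minimal sketch, assuming the pipeline object from the Step 2 script; the schedule name and polling interval are illustrative:
from azureml.core import Workspace, Datastore
from azureml.pipeline.core import Schedule

ws = Workspace.from_config()
datastore = Datastore.get(ws, "mlchurnstorage")

# Publish the pipeline built in Step 2 so it can be triggered outside the notebook
published = pipeline.publish(
    name="churn-training-pipeline",
    description="Preprocess and retrain on new churn data",
)

# Poll the datastore every 5 minutes; new blobs in the container launch a run
schedule = Schedule.create(
    workspace=ws,
    name="churn-data-trigger",
    pipeline_id=published.id,
    experiment_name="automated-churn-pipeline",
    datastore=datastore,
    polling_interval=5,
)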
Step 5: Monitoring Pipeline Runs in Azure ML Studio
- Navigate to Azure ML Studio > Experiments.
- Select the pipeline run to view its status, logs, and outputs.
- Review metrics such as preprocessing time, training time, and model performance.
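The same run information can also be pulled programmatically, which is useful in CI jobs or notebooks. A brief sketch with the SDK, assuming the experiment name used in Step 2:
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
experiment = Experiment(ws, "automated-churn-pipeline")

# List recent pipeline runs and their status
for run in experiment.get_runs():
    print(run.id, run.get_status())

# Inspect the metrics logged by the steps of the latest run
# (e.g. preprocessed_rows and test_accuracy from the scripts above)
latest = next(experiment.get_runs())
for step_run in latest.get_children():
    print(step_run.id, step_run.get_metrics())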
Key Takeaway
Automating data pipelines with Azure Storage and Machine Learning Studio simplifies the ingestion, preprocessing, and training of machine learning models. By combining Terraform for infrastructure deployment and Azure ML pipelines for automation, you can create a seamless, scalable workflow that keeps your models up to date with new data.