Automating Customer Churn Data Pipelines with Azure Storage and ML Studio
This article demonstrates how to automate customer churn data pipelines using Azure Blob Storage for data ingestion and Azure Machine Learning Studio for preprocessing and training. The workflow integrates Terraform for infrastructure deployment, setting up storage accounts and containers to host datasets. Azure ML pipelines streamline data preprocessing and model training, leveraging Python scripts and Azure SDK configurations. The approach ensures scalability, seamless integration of new data updates, and efficient monitoring of pipeline runs in Azure ML Studio, offering a robust solution for managing machine learning workflows.

Todd Bernson
2024-12-23

Machine learning workflows depend on reliable data ingestion and preprocessing. Azure Storage and Azure Machine Learning Studio can automate both, so that new data reaches the training pipeline without manual steps.
This article uses Azure Blob Storage for data ingestion and Azure ML pipelines for preprocessing and training. We’ll cover:
- Uploading datasets to Azure Blob Storage with Terraform.
- Setting up automated pipelines in Azure ML Studio.
- Integrating new data updates into the training process.
Step 1: Uploading Datasets to Azure Blob Storage Using Terraform
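Azure Blob Storage serves as the central repository for your datasets. Below is a Terraform configuration that creates a storage account and a container, then uploads the churn dataset.
Terraform Configuration
Create a storage.tf file. The configuration references a resource group, azurerm_resource_group.this, which is assumed to be defined elsewhere in your Terraform project: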
# Storage account names must be globally unique and use only
# 3 to 24 lowercase letters and digits.
resource "azurerm_storage_account" "churn_storage" {
  name                     = "mlchurnstorage"
  resource_group_name      = azurerm_resource_group.this.name
  location                 = azurerm_resource_group.this.location
  account_tier             = "Standard"
  account_replication_type = "LRS"

  tags = {
    environment = "production"
  }
}

# Private container that holds the churn dataset
resource "azurerm_storage_container" "data_container" {
  name                  = "customer-churn-data"
  storage_account_name  = azurerm_storage_account.churn_storage.name
  container_access_type = "private"
}

# Upload the local CSV (path is relative to the Terraform working directory)
resource "azurerm_storage_blob" "churn_data" {
  name                   = "customer_churn.csv"
  storage_account_name   = azurerm_storage_account.churn_storage.name
  storage_container_name = azurerm_storage_container.data_container.name
  type                   = "Block"
  source                 = "data/customer_churn.csv"
}
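Azure rejects storage account and container names that break its naming rules, so it can help to check them before deploying. The following is a minimal sketch of such a check in Python. The helper names are hypothetical and not part of any Azure SDK; the rules encoded are the documented ones (storage accounts: 3 to 24 lowercase letters and digits; containers: 3 to 63 lowercase letters, digits, and single hyphens, starting and ending with a letter or digit).

```python
import re

# Hypothetical helpers for illustration; not part of any Azure SDK.
_ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
# One or more alphanumeric groups separated by single hyphens.
_CONTAINER_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_storage_account_name(name: str) -> bool:
    """Check the storage account rule: 3-24 lowercase letters and digits."""
    return _ACCOUNT_RE.fullmatch(name) is not None


def is_valid_container_name(name: str) -> bool:
    """Check the container rule: 3-63 chars, no leading, trailing, or double hyphens."""
    return 3 <= len(name) <= 63 and _CONTAINER_RE.fullmatch(name) is not None


if __name__ == "__main__":
    print(is_valid_storage_account_name("mlchurnstorage"))
    print(is_valid_container_name("customer-churn-data"))
```

A check like this covers only the format of a name. Whether a storage account name is still available globally can only be determined by Azure itself.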
Deployment Steps
- Initialize Terraform:
terraform init
- Apply the configuration:
terraform apply
Your dataset is now available in Azure Blob Storage, in the customer-churn-data container.
Step 2: Setting Up Automated Data Pipelines in Azure Machine Learning Studio
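Azure Machine Learning pipelines automate data preprocessing and model training. The following steps configure and run a pipeline with the Azure ML Python SDK v1 (the azureml.core and azureml.pipeline packages). The script assumes the workspace already has a registered datastore named mlchurnstorage that points at the storage from Step 1, and compute clusters named cpu-cluster and gpu-cluster.
Pipeline Configuration Script
Create a Python script automated_pipeline.py: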
from azureml.core import Workspace, Dataset, Experiment
from azureml.pipeline.core import Pipeline, StepSequence
from azureml.pipeline.steps import PythonScriptStep
from azureml.data import OutputFileDatasetConfig

# Connect to the Azure ML workspace (reads the local config.json)
ws = Workspace.from_config()

# Define the dataset from Azure Blob Storage.
# Assumption: the 'mlchurnstorage' datastore is registered against the
# customer-churn-data container, so the path is relative to that container.
datastore = ws.datastores['mlchurnstorage']
input_data = Dataset.File.from_files(path=(datastore, 'customer_churn.csv'))

# Define the output location for preprocessed data
preprocessed_data = OutputFileDatasetConfig(destination=(datastore, 'preprocessed_data'))

# Step 1: preprocessing script
preprocess_step = PythonScriptStep(
    name="Preprocess Data",
    script_name="preprocess_data.py",
    arguments=[
        "--input-data", input_data.as_named_input("raw_churn").as_mount(),
        "--output-data", preprocessed_data,
    ],
    compute_target="cpu-cluster",
    source_directory="scripts"
)

# Step 2: model training script, consuming the output of step 1
train_step = PythonScriptStep(
    name="Train Model",
    script_name="train_model.py",
    arguments=["--input-data", preprocessed_data.as_input()],
    compute_target="gpu-cluster",
    source_directory="scripts"
)

# Build the pipeline (steps run in sequence) and submit it
pipeline = Pipeline(workspace=ws, steps=StepSequence([preprocess_step, train_step]))
experiment = Experiment(ws, "automated-churn-pipeline")
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
Step 3: Preprocessing the Data
The preprocess_data.py script handles data cleaning and feature scaling: it drops rows with missing values, converts TotalCharges to a numeric column, and standardizes tenure.
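import argparse
import os

import pandas as pd

# Parse the arguments passed in by the pipeline step
parser = argparse.ArgumentParser()
parser.add_argument("--input-data", type=str)
parser.add_argument("--output-data", type=str)
args = parser.parse_args()

# Load data. The input may arrive as a folder or as the file itself,
# depending on how the dataset is mounted, so handle both cases.
input_path = args.input_data
if os.path.isdir(input_path):
    input_path = os.path.join(input_path, "customer_churn.csv")
data = pd.read_csv(input_path)

# Drop rows with missing values, then coerce TotalCharges to numeric
# (non-numeric entries such as blank strings become 0)
data = data.dropna()
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce').fillna(0)

# Feature scaling (standardization) of tenure
data['tenure'] = (data['tenure'] - data['tenure'].mean()) / data['tenure'].std()

# Save processed data, creating the output folder if it does not exist
os.makedirs(args.output_data, exist_ok=True)
data.to_csv(os.path.join(args.output_data, "preprocessed_churn.csv"), index=False)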
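To make the two transformations concrete, here is a small worked sketch in plain Python that mirrors them without pandas. The sample values are made up for illustration, and the helper names are hypothetical. It uses the sample standard deviation, which is what the pandas std() method computes by default.

```python
import math
import statistics


def standardize(values):
    """Return z-scores using the sample standard deviation (ddof=1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def coerce_charge(raw):
    """Roughly mimic pd.to_numeric(..., errors='coerce').fillna(0) for one value."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.0
    return 0.0 if math.isnan(value) else value


# Made-up sample values, not taken from the churn dataset
tenure = [1, 12, 24, 48, 72]
scaled = standardize(tenure)
charges = [coerce_charge(x) for x in ["29.85", " ", "1889.5"]]

if __name__ == "__main__":
    print([round(z, 3) for z in scaled])
    print(charges)
```

After standardization the column has mean 0 and sample standard deviation 1. Note that the statistics are computed on whatever data the script sees, so a model trained on one batch and scored on another would need the training-time mean and standard deviation stored and reused.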
Step 4: Integrating New Data Updates into Training
To integrate new data updates:
- Automate dataset updates with a scheduled job that uploads new files to Azure Blob Storage.
- Trigger the Azure ML pipeline automatically whenever new data arrives, using Azure Data Factory or Event Grid.
Example: an Azure Data Factory storage event trigger fires when a new blob is created in the customer-churn-data container and runs the Azure ML pipeline.
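The decision of whether an incoming storage event should start a pipeline run can be sketched as follows. This is a minimal illustration that assumes the Event Grid event shape for blob storage (an eventType of Microsoft.Storage.BlobCreated and a subject of the form /blobServices/default/containers/<container>/blobs/<blob>). The function names and the submit_pipeline callback are hypothetical; in practice the callback would submit the Azure ML pipeline.

```python
CONTAINER = "customer-churn-data"
SUBJECT_PREFIX = f"/blobServices/default/containers/{CONTAINER}/blobs/"
# Skip the pipeline's own output folder so that writing preprocessed
# data does not trigger another run (assumes output lands in this container).
IGNORED_PREFIX = "preprocessed_data/"


def blob_to_process(event: dict):
    """Return the blob name if the event should trigger retraining, else None."""
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return None
    subject = event.get("subject", "")
    if not subject.startswith(SUBJECT_PREFIX):
        return None
    blob_name = subject[len(SUBJECT_PREFIX):]
    if blob_name.startswith(IGNORED_PREFIX):
        return None
    return blob_name if blob_name.lower().endswith(".csv") else None


def handle_event(event: dict, submit_pipeline) -> bool:
    """Call submit_pipeline(blob_name) for qualifying events; report whether it ran."""
    blob_name = blob_to_process(event)
    if blob_name is None:
        return False
    submit_pipeline(blob_name)
    return True


if __name__ == "__main__":
    sample = {
        "eventType": "Microsoft.Storage.BlobCreated",
        "subject": SUBJECT_PREFIX + "customer_churn.csv",
    }
    handle_event(sample, lambda name: print(f"would submit pipeline for {name}"))
```

Filtering on the container, the file extension, and the output folder keeps unrelated uploads and the pipeline's own writes from starting extra runs.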
Step 5: Monitoring Pipeline Runs in Azure ML Studio
- Navigate to Azure ML Studio > Experiments (labeled Jobs in newer versions of the studio).
- Select the pipeline run to view its status, logs, and outputs.
- Review metrics such as preprocessing time, training time, and model performance.
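Step timings can also be computed outside the studio. The sketch below assumes each step's details are available as a dictionary with startTimeUtc and endTimeUtc keys holding ISO 8601 timestamps that end in Z; in SDK v1 such dictionaries would typically come from calling get_details() on each step run, but that mapping is an assumption here. The function names and sample timestamps are made up for illustration.

```python
from datetime import datetime


def _parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp that ends in 'Z' (assumed format)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def step_durations(step_details: dict) -> dict:
    """Map step name to duration in seconds, skipping unfinished steps."""
    durations = {}
    for name, details in step_details.items():
        start = details.get("startTimeUtc")
        end = details.get("endTimeUtc")
        if start and end:
            durations[name] = (_parse_utc(end) - _parse_utc(start)).total_seconds()
    return durations


# Made-up sample details for the two steps defined in the pipeline
sample_details = {
    "Preprocess Data": {
        "startTimeUtc": "2024-12-23T10:00:00.000000Z",
        "endTimeUtc": "2024-12-23T10:01:30.000000Z",
    },
    "Train Model": {
        "startTimeUtc": "2024-12-23T10:02:00.000000Z",
        "endTimeUtc": "2024-12-23T10:12:00.000000Z",
    },
}

if __name__ == "__main__":
    for step, seconds in step_durations(sample_details).items():
        print(f"{step}: {seconds:.0f}s")
```

Tracking these durations across runs makes it easier to notice when preprocessing or training slows down as the dataset grows.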
Key Takeaway
Automating data pipelines with Azure Storage and Machine Learning Studio simplifies the ingestion, preprocessing, and training stages of a machine learning workflow. By combining Terraform for infrastructure deployment with Azure ML pipelines for automation, you get a repeatable workflow that keeps your models up to date as new data arrives.