BSC Analytics

Building the Churn Prediction Model with BigQuery ML

By Todd Bernson
Chief Technical Officer, BSC Analytics

14 Nov 2024

Overview

Predicting customer churn matters in any business that depends on repeat customers, because it lets a company identify at-risk customers and act before they leave. For this project, we experiment with three popular model types: Logistic Regression, XGBoost, and Deep Neural Network (DNN). Each offers different trade-offs in complexity, interpretability, and performance.

Logistic Regression: Best suited for quickly interpretable results and cases where the relationship between the features and the log-odds of churn is approximately linear.
XGBoost: Often chosen for complex data with non-linear relationships, where it is effective at capturing intricate patterns.
DNN: Ideal for learning highly complex, non-linear interactions within the data, particularly when working with large datasets.

Using BigQuery ML allows us to train these models directly within the data warehouse, streamlining the process and keeping data in a centralized location.

Model Creation

We create each model using SQL commands in BigQuery ML, specifying relevant configuration options to tailor the model to our churn dataset. The SQL-based approach simplifies ML model building for those familiar with SQL, and BigQuery ML handles most of the optimization and scalability behind the scenes.

Logistic Regression Model Creation

Logistic Regression is a fast, interpretable model often used as a baseline for binary classification tasks, like predicting churn. It works well when the relationship between input features and the target is approximately linear.

CREATE OR REPLACE MODEL `<DATASET_NAME>.logistic_reg_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churn'],
  data_split_method = 'AUTO_SPLIT',
  max_iterations = 10
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;

Explanation

model_type: Specifies the type of model as LOGISTIC_REG.
input_label_cols: Identifies churn as the target column.
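data_split_method: Uses AUTO_SPLIT to let BigQuery ML divide the data into training and evaluation sets automatically, with the split depending on the number of input rows.
max_iterations: Caps training at 10 iterations. A low cap keeps training fast, but the model may stop before it has converged.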

Logistic Regression provides quick results and insights, making it useful for initial model evaluation.
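To make the interpretability point concrete, the learned coefficients can be read back with ML.WEIGHTS. The query below is a minimal sketch that assumes the model above has already been trained. The standardize argument is included on the assumption that comparable magnitudes are wanted; drop it to see weights on the original feature scale.

```sql
-- Sketch: list the logistic regression coefficients, largest magnitude first.
-- Numeric features report a value in `weight`; categorical features report
-- NULL there and carry per-category values in `category_weights` instead.
SELECT
  processed_input,
  weight,
  category_weights
FROM
  ML.WEIGHTS(
    MODEL `<DATASET_NAME>.logistic_reg_model`,
    STRUCT(TRUE AS standardize))
ORDER BY
  ABS(weight) DESC;
```

A positive weight pushes the prediction toward the positive churn label and a negative weight pushes it away, which is what makes this model a convenient baseline to sanity-check against business intuition.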

XGBoost Model Creation

XGBoost is a powerful gradient-boosting algorithm optimized for performance. It’s well-suited for datasets with complex relationships and is often chosen when interpretability is less of a priority than accuracy.

CREATE OR REPLACE MODEL `<DATASET_NAME>.xgboost_model`
OPTIONS (
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  auto_class_weights = TRUE,
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.2,
  input_label_cols = ['churn'],
  num_parallel_tree = 8,
  max_iterations = 50
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;

Explanation

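model_type: Specifies a boosted tree classifier, which BigQuery ML trains with the XGBoost library.
auto_class_weights: Balances the classes by weighting each one in inverse proportion to its frequency, which helps when churners are a small minority of the rows.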
data_split_method: Uses a RANDOM split, with data_split_eval_fraction set to 0.2 for a 20% evaluation set.
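num_parallel_tree: Sets the number of trees built in parallel during each boosting iteration. A value greater than 1 trains a boosted random forest, which can make the model more robust but increases training time and cost rather than reducing it.
max_iterations: Defines the maximum number of boosting iterations.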

XGBoost is highly effective for capturing complex data patterns, making it a strong candidate for churn prediction.
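Boosted trees are harder to read than a set of coefficients, but BigQuery ML exposes per-feature importance scores for them. The following is a minimal sketch, assuming the xgboost_model above has finished training:

```sql
-- Sketch: rank features by how much they contributed to the boosted trees.
-- importance_gain reflects the average improvement from splits on a feature;
-- importance_weight counts how often the feature was used to split.
SELECT
  feature,
  importance_weight,
  importance_gain,
  importance_cover
FROM
  ML.FEATURE_IMPORTANCE(MODEL `<DATASET_NAME>.xgboost_model`)
ORDER BY
  importance_gain DESC;
```

Features that rank highly here are reasonable starting points for retention analysis, though importance scores describe the model, not causation.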

DNN (Deep Neural Network) Model Creation

A DNN is a type of neural network that is adept at modeling complex, non-linear relationships. While it typically requires more data and computation, DNNs often provide high accuracy when dealing with intricate datasets.

CREATE OR REPLACE MODEL `<DATASET_NAME>.dnn_model`
OPTIONS (
  model_type = 'DNN_CLASSIFIER',
  hidden_units = [64, 32, 16],
  batch_size = 128,
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.2,
  input_label_cols = ['churn']
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;

Explanation

model_type: Specifies a DNN classifier.
hidden_units: Sets up layers with 64, 32, and 16 neurons respectively, capturing complex patterns in the data.
batch_size: Defines the number of samples per batch, balancing training speed and memory usage.
data_split_method: Uses a RANDOM split with a 20% evaluation fraction.
input_label_cols: Sets churn as the target.

The DNN model is ideal for handling large datasets with non-linear relationships, though it may require more computational resources.
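Because a DNN is more sensitive to its training settings than the other two models, it can help to look at how the loss evolved during training. This is a minimal sketch using ML.TRAINING_INFO, assuming the dnn_model above has been created:

```sql
-- Sketch: inspect training and evaluation loss per iteration for the DNN.
-- A training loss that keeps falling while eval_loss rises suggests overfitting.
SELECT
  iteration,
  loss,
  eval_loss,
  duration_ms
FROM
  ML.TRAINING_INFO(MODEL `<DATASET_NAME>.dnn_model`)
ORDER BY
  iteration;
```

If the two loss columns diverge early, smaller hidden_units values or a different batch_size are among the first settings worth revisiting.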

Model Configuration Options

When creating models in BigQuery ML, understanding the configuration options is essential to optimizing model performance:

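Data Split: Options like AUTO_SPLIT and RANDOM partition the data for training and evaluation. A RANDOM split is deterministic as long as the underlying training data does not change, so repeated runs evaluate on the same rows. Models trained with different split settings, however, are evaluated on different rows.
Batch Size: For DNN models, a larger batch_size can speed up training but may increase memory usage and can change how well the model converges, so it is worth tuning, especially for deep networks.
Max Iterations: This controls the maximum number of training steps. For XGBoost, more boosting iterations can improve the fit up to a point, but they also increase training time and cost and raise the risk of overfitting.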
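One way to make the split explicit and identical across all three models is the CUSTOM split method, which reads the train/evaluation assignment from a BOOL column. The sketch below is illustrative only: customer_id is a hypothetical key column that may not exist in the churn table, is_eval is a name introduced here, and the model name is chosen so it does not overwrite the models above.

```sql
-- Sketch: hash a stable key so each customer always lands in the same split.
-- Rows where is_eval is TRUE are used for evaluation (about 20% here);
-- rows where it is FALSE are used for training.
CREATE OR REPLACE MODEL `<DATASET_NAME>.logistic_reg_model_custom_split`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churn'],
  data_split_method = 'CUSTOM',
  data_split_col = 'is_eval'
) AS
SELECT
  * EXCEPT (customer_id),  -- hypothetical ID column, excluded as a feature
  ABS(MOD(FARM_FINGERPRINT(CAST(customer_id AS STRING)), 10)) < 2 AS is_eval
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;
```

Reusing the same SELECT for the boosted tree and DNN models would give all three the same evaluation rows, which makes their metrics directly comparable.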

These configurations and model choices allow us to build a churn prediction model optimized for our specific dataset and business requirements. BigQuery ML simplifies the process by integrating model training directly within the data warehouse, allowing for rapid iteration and testing.
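To choose between the three models, their evaluation metrics can be pulled into one result with ML.EVALUATE. This is a minimal sketch that assumes all three models above exist; the model_name labels are introduced here for readability.

```sql
-- Sketch: compare the three churn models on their held-out evaluation data.
-- Without an input table, ML.EVALUATE uses the evaluation split that each
-- model reserved during training.
SELECT 'logistic_reg' AS model_name, *
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.logistic_reg_model`)
UNION ALL
SELECT 'xgboost' AS model_name, *
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.xgboost_model`)
UNION ALL
SELECT 'dnn' AS model_name, *
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.dnn_model`)
ORDER BY
  roc_auc DESC;
```

Since the logistic regression model uses AUTO_SPLIT while the other two use a 20% RANDOM split, the evaluation rows are not guaranteed to match, so small metric differences should be read with caution.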

Google Cloud, SQL, Machine Learning, BigQuery
