Overview
Predicting customer churn helps companies proactively retain customers at risk of leaving. For this project, we experiment with three popular model types: Logistic Regression, XGBoost, and a Deep Neural Network (DNN). Each model offers different advantages depending on the complexity, interpretability, and performance requirements.
- Logistic Regression: Best suited for quickly interpretable results and cases where the relationship between the features and the target is approximately linear.
- XGBoost: Often chosen for complex data with non-linear relationships, where it is effective at capturing intricate patterns.
- DNN: Ideal for learning highly complex, non-linear interactions within the data, particularly when working with large datasets.
Using BigQuery ML allows us to train these models directly within the data warehouse, streamlining the process and keeping data in a centralized location.
Model Creation
We create each model using SQL commands in BigQuery ML, specifying relevant configuration options to tailor the model to our churn dataset. The SQL-based approach simplifies ML model building for those familiar with SQL, and BigQuery ML handles most of the optimization and scalability behind the scenes.
Logistic Regression Model Creation
Logistic Regression is a fast, interpretable model often used as a baseline for binary classification tasks, like predicting churn. It works well when the relationship between input features and the target is approximately linear.
CREATE OR REPLACE MODEL `<DATASET_NAME>.logistic_reg_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churn'],
  data_split_method = 'AUTO_SPLIT',
  max_iterations = 10
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;
Explanation
- model_type: Specifies the type of model as LOGISTIC_REG.
- input_label_cols: Identifies churn as the target column.
- data_split_method: Uses AUTO_SPLIT to automatically divide the data into training and evaluation sets.
- max_iterations: Sets a maximum number of iterations for training, balancing speed with accuracy.
Logistic Regression provides quick results and insights, making it useful for initial model evaluation.
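To make the interpretability point concrete, the learned coefficients can be inspected with ML.WEIGHTS. The query below is a minimal sketch that assumes the model above has finished training under the same placeholder dataset name. Sorting by absolute weight is only one convenient way to read the output, not a step the original workflow prescribes.

```sql
-- Inspect the coefficients learned by the logistic regression model.
-- Numeric features report a single value in `weight`; categorical features
-- report NULL there and list per-category values in `category_weights`.
SELECT
  processed_input,
  weight,
  category_weights
FROM ML.WEIGHTS(MODEL `<DATASET_NAME>.logistic_reg_model`)
ORDER BY ABS(weight) DESC;  -- NULL weights (categorical features) sort last
```

Weights on features with different scales are not directly comparable, so treat the ordering as a rough guide rather than a ranking of importance.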
XGBoost Model Creation
XGBoost is a powerful gradient-boosting algorithm optimized for performance. It’s well-suited for datasets with complex relationships and is often chosen when interpretability is less of a priority than accuracy.
CREATE OR REPLACE MODEL `<DATASET_NAME>.xgboost_model`
OPTIONS (
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  auto_class_weights = TRUE,
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.2,
  input_label_cols = ['churn'],
  num_parallel_tree = 8,
  max_iterations = 50
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;
Explanation
- model_type: Specifies a boosted tree classifier, which is XGBoost in BigQuery ML.
- auto_class_weights: Automatically adjusts weights to address class imbalances.
- data_split_method: Uses a RANDOM split, with data_split_eval_fraction set to 0.2 for a 20% evaluation set.
- num_parallel_tree: Sets the number of trees built in parallel during each boosting iteration. A value greater than 1 trains a boosted random forest rather than a single tree per round.
- max_iterations: Sets the maximum number of boosting iterations.
XGBoost is highly effective for capturing complex data patterns, making it a strong candidate for churn prediction.
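Because boosted trees are harder to read than a linear model, it can help to check which features the trained model relies on. The sketch below uses ML.FEATURE_IMPORTANCE, which BigQuery ML provides for boosted tree models. It assumes the model above was created under the same placeholder dataset name.

```sql
-- List per-feature importance for the boosted tree model.
-- importance_weight: how often the feature is used to split
-- importance_gain:   average gain across splits that use the feature
-- importance_cover:  average coverage across splits that use the feature
SELECT
  feature,
  importance_weight,
  importance_gain,
  importance_cover
FROM ML.FEATURE_IMPORTANCE(MODEL `<DATASET_NAME>.xgboost_model`)
ORDER BY importance_gain DESC;
```

Ordering by gain is a judgment call; ordering by weight or cover can surface a different set of leading features.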
DNN (Deep Neural Network) Model Creation
A DNN is a neural network with multiple hidden layers, which makes it adept at modeling complex, non-linear relationships. DNNs typically require more data and computation than the other two models, but they can provide high accuracy on intricate datasets.
CREATE OR REPLACE MODEL `<DATASET_NAME>.dnn_model`
OPTIONS (
  model_type = 'DNN_CLASSIFIER',
  hidden_units = [64, 32, 16],
  batch_size = 128,
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.2,
  input_label_cols = ['churn']
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;
Explanation
- model_type: Specifies a DNN classifier.
- hidden_units: Sets up three hidden layers with 64, 32, and 16 neurons, respectively, allowing the network to capture complex patterns in the data.
- batch_size: Defines the number of samples per batch, balancing training speed and memory usage.
- data_split_method: Uses a RANDOM split with a 20% evaluation fraction.
- input_label_cols: Sets churn as the target.
The DNN model is ideal for handling large datasets with non-linear relationships, though it may require more computational resources.
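Since DNN training is the most resource-intensive of the three, it is worth checking how the loss evolved before trusting the model. The following sketch reads the per-iteration training statistics with ML.TRAINING_INFO, assuming the model above exists under the same placeholder dataset name.

```sql
-- Review training progress for the DNN model, one row per iteration.
-- `loss` is measured on the training split and `eval_loss` on the
-- held-out evaluation split.
SELECT
  iteration,
  loss,
  eval_loss,
  duration_ms
FROM ML.TRAINING_INFO(MODEL `<DATASET_NAME>.dnn_model`)
ORDER BY iteration;
```

An eval_loss that rises while loss keeps falling is a common sign of overfitting, and may suggest a smaller network or fewer iterations.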
Model Configuration Options
When creating models in BigQuery ML, understanding the configuration options is essential to optimizing model performance:
- Data Split: Options like AUTO_SPLIT and RANDOM help partition the data for training and evaluation. A consistent split ensures reproducibility across model runs.
- Batch Size: For DNN models, a larger batch_size can speed up training but may increase memory usage. Balancing this is key, especially for deep networks.
- Max Iterations: This caps the number of training steps. For XGBoost, more iterations can improve the fit, but they also increase training time and cost and can lead to overfitting.
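When every model should see exactly the same evaluation rows, one option is a CUSTOM split driven by a boolean column. The sketch below is illustrative only. It assumes the table has a stable identifier column, here called customer_id, which is a hypothetical name, and it writes to a hypothetical model name. Rows where is_eval is TRUE are used for evaluation and the rest for training.

```sql
-- Sketch of a reproducible, hash-based 80/20 split.
-- ASSUMPTION: `customer_id` is a hypothetical identifier column; replace it
-- with whatever stable key the churn table actually has.
CREATE OR REPLACE MODEL `<DATASET_NAME>.logistic_reg_model_custom_split`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churn'],
  data_split_method = 'CUSTOM',
  data_split_col = 'is_eval'  -- BOOL column: TRUE = evaluation row
) AS
SELECT
  * EXCEPT (customer_id),  -- keep the raw identifier out of the feature set
  -- Hash the identifier into 10 buckets and send 2 of them to evaluation.
  ABS(MOD(FARM_FINGERPRINT(CAST(customer_id AS STRING)), 10)) < 2 AS is_eval
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;
```

Because the assignment depends only on the identifier, the same customers land in the evaluation set on every run and for every model type.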
These configurations and model choices allow us to build a churn prediction model optimized for our specific dataset and business requirements. BigQuery ML simplifies the process by integrating model training directly within the data warehouse, allowing for rapid iteration and testing.
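To choose among the three models, their evaluation metrics can be placed side by side with ML.EVALUATE. This is a minimal sketch that assumes all three models above have been trained under the same placeholder dataset name. Called without an input table, ML.EVALUATE reports metrics on the evaluation data each model reserved during training.

```sql
-- Compare the three churn classifiers on their held-out evaluation data.
SELECT 'logistic_reg' AS model_name,
       precision, recall, accuracy, f1_score, log_loss, roc_auc
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.logistic_reg_model`)
UNION ALL
SELECT 'xgboost' AS model_name,
       precision, recall, accuracy, f1_score, log_loss, roc_auc
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.xgboost_model`)
UNION ALL
SELECT 'dnn' AS model_name,
       precision, recall, accuracy, f1_score, log_loss, roc_auc
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.dnn_model`)
ORDER BY roc_auc DESC;
```

The logistic regression model uses AUTO_SPLIT while the other two use a 20% RANDOM split, so these numbers are not computed on identical rows. For a stricter comparison, pass the same held-out table as a second argument to each ML.EVALUATE call.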