Building the Churn Prediction Model with BigQuery ML

Todd Bernson, CTO, guides you through creating churn prediction models using Logistic Regression, XGBoost, and Deep Neural Networks (DNN) in Google BigQuery ML. This article covers model creation, configuration options, and best practices, making it easy to deploy and optimize churn prediction directly within GCP's data warehouse.

Todd Bernson

2024-11-14

Overview

Predicting customer churn matters in any business that depends on repeat customers, because it lets companies identify at-risk customers and intervene before they leave. For this project, we experiment with three popular model types: Logistic Regression, XGBoost, and Deep Neural Network (DNN). Each offers different trade-offs in complexity, interpretability, and performance.

Logistic Regression: Best suited for quickly interpretable results and cases where the relationship between the features and the log-odds of churn is approximately linear.
XGBoost: Often chosen for complex data with non-linear relationships; effective at capturing intricate patterns.
DNN: Ideal for learning highly complex, non-linear interactions within the data, particularly when working with large datasets.

Using BigQuery ML allows us to train these models directly within the data warehouse, streamlining the process and keeping data in a centralized location.

Model Creation

We create each model using SQL commands in BigQuery ML, specifying relevant configuration options to tailor the model to our churn dataset. The SQL-based approach simplifies ML model building for those familiar with SQL, and BigQuery ML handles most of the optimization and scalability behind the scenes.

Logistic Regression Model Creation

Logistic Regression is a fast, interpretable model often used as a baseline for binary classification tasks, like predicting churn. It works well when the relationship between input features and the target is approximately linear.

CREATE OR REPLACE MODEL `<DATASET_NAME>.logistic_reg_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churn'],
  data_split_method = 'AUTO_SPLIT',
  max_iterations = 10
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;

Explanation

model_type: Specifies the type of model as LOGISTIC_REG.
input_label_cols: Identifies churn as the target column.
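data_split_method: Uses AUTO_SPLIT, which lets BigQuery ML decide how to divide the data into training and evaluation sets based on the number of input rows.
max_iterations: Caps training at 10 iterations, trading some potential accuracy for speed. Training may stop sooner when early stopping (enabled by default) detects little further improvement.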
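
To judge whether 10 iterations were enough, one option is to look at the per-iteration loss that BigQuery ML records during training. The query below is a minimal sketch that assumes the logistic_reg_model created above already exists; it uses the ML.TRAINING_INFO function. If eval_loss is still falling at the last iteration, a higher max_iterations may be worth trying.

```sql
-- Per-iteration training loss and evaluation loss for the logistic regression model.
SELECT
  iteration,
  loss,        -- loss on the training split
  eval_loss,   -- loss on the held-out evaluation split
  duration_ms
FROM ML.TRAINING_INFO(MODEL `<DATASET_NAME>.logistic_reg_model`)
ORDER BY iteration;
```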

Logistic Regression provides quick results and insights, making it useful for initial model evaluation.
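
As a sketch of how those insights can be read off the trained model, the query below lists the learned coefficients with ML.WEIGHTS. It assumes the logistic_reg_model from above. The standardize option is used so that numeric features on different scales are roughly comparable; that is a choice made for this illustration, not something the original setup requires.

```sql
-- Rank numeric features by the magnitude of their standardized coefficient.
-- Positive weights push the prediction toward churn, negative weights away from it.
SELECT
  processed_input,
  weight
FROM ML.WEIGHTS(
  MODEL `<DATASET_NAME>.logistic_reg_model`,
  STRUCT(TRUE AS standardize)
)
-- Categorical features report per-category weights in the category_weights
-- column and have a NULL weight, so they are filtered out here.
WHERE weight IS NOT NULL
ORDER BY ABS(weight) DESC;
```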

XGBoost Model Creation

XGBoost is a powerful gradient-boosting algorithm optimized for performance. It’s well-suited for datasets with complex relationships and is often chosen when interpretability is less of a priority than accuracy.

CREATE OR REPLACE MODEL `<DATASET_NAME>.xgboost_model`
OPTIONS (
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  auto_class_weights = TRUE,
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.2,
  input_label_cols = ['churn'],
  num_parallel_tree = 8,
  max_iterations = 50
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;

Explanation

model_type: Specifies a boosted tree classifier, which BigQuery ML trains using the XGBoost library.
auto_class_weights: Automatically adjusts weights to address class imbalances.
data_split_method: Uses a RANDOM split, with data_split_eval_fraction set to 0.2 for a 20% evaluation set.
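num_parallel_tree: Sets the number of trees built in parallel during each boosting iteration. A value greater than 1 trains a boosted random forest, which adds work per iteration rather than speeding training up, and it is typically paired with row or column subsampling so that the parallel trees differ from one another.
max_iterations: Defines the maximum number of boosting iterations.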

XGBoost is highly effective for capturing complex data patterns, making it a strong candidate for churn prediction.
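
Boosted trees are harder to interpret than logistic regression, but BigQuery ML exposes per-feature importance scores for them. The following sketch assumes the xgboost_model created above and uses ML.FEATURE_IMPORTANCE to show which inputs the trees relied on most.

```sql
-- Feature importance for the boosted tree model.
-- importance_gain:   average improvement in the objective from splits on the feature
-- importance_weight: number of times the feature was used to split
-- importance_cover:  average number of rows affected by splits on the feature
SELECT
  feature,
  importance_gain,
  importance_weight,
  importance_cover
FROM ML.FEATURE_IMPORTANCE(MODEL `<DATASET_NAME>.xgboost_model`)
ORDER BY importance_gain DESC;
```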

DNN (Deep Neural Network) Model Creation

A DNN is a type of neural network that is adept at modeling complex, non-linear relationships. While it typically requires more data and computation, DNNs often provide high accuracy when dealing with intricate datasets.

CREATE OR REPLACE MODEL `<DATASET_NAME>.dnn_model`
OPTIONS (
  model_type = 'DNN_CLASSIFIER',
  hidden_units = [64, 32, 16],
  batch_size = 128,
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.2,
  input_label_cols = ['churn']
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;

Explanation

model_type: Specifies a DNN classifier.
hidden_units: Sets up layers with 64, 32, and 16 neurons respectively, capturing complex patterns in the data.
batch_size: Defines the number of samples per batch, balancing training speed and memory usage.
data_split_method: Uses a RANDOM split with a 20% evaluation fraction.
input_label_cols: Sets churn as the target.

The DNN model is ideal for handling large datasets with non-linear relationships, though it may require more computational resources.
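
Once trained, the model can score customers with ML.PREDICT. The sketch below assumes that rows with a NULL churn value, which were excluded from training above, represent customers whose outcome is not yet known; adjust the inner query if your table is organized differently.

```sql
-- Score unlabeled customers with the DNN model.
SELECT
  predicted_churn,
  -- Probability of the most likely class, taken from the per-class array.
  (
    SELECT p.prob
    FROM UNNEST(predicted_churn_probs) AS p
    ORDER BY p.prob DESC
    LIMIT 1
  ) AS top_class_probability,
  -- Pass through the input columns so each score can be tied back to a customer.
  * EXCEPT (predicted_churn, predicted_churn_probs)
FROM ML.PREDICT(
  MODEL `<DATASET_NAME>.dnn_model`,
  (
    SELECT *
    FROM `<DATASET_NAME>.<TABLE_NAME>`
    WHERE churn IS NULL
  )
);
```

The same query works for the other two models by swapping the model name.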

Model Configuration Options

When creating models in BigQuery ML, understanding the configuration options is essential to optimizing model performance:

Data Split: Options like AUTO_SPLIT and RANDOM partition the data into training and evaluation sets. Using the same split for every model keeps their evaluation metrics comparable and the results reproducible.
Batch Size: For DNN models, a larger batch_size can speed up training, but may increase memory usage. Balancing this is key, especially for deep networks.
Max Iterations: This caps the number of training steps. For XGBoost, more iterations can improve the fit up to a point, but they also increase training time and cost and can lead to overfitting.
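
If you want full control over which rows land in the evaluation set, one approach is the CUSTOM split method with a Boolean column. The sketch below is an illustration only: customer_id is a hypothetical key column, is_eval is a name introduced here, and the model name is made up so that it does not overwrite the model created earlier. Hashing the key sends roughly 20% of customers to evaluation, and the same customers every time.

```sql
-- Deterministic, key-based train/eval split (customer_id is an assumed column).
CREATE OR REPLACE MODEL `<DATASET_NAME>.logistic_reg_model_custom_split`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churn'],
  data_split_method = 'CUSTOM',
  data_split_col = 'is_eval'   -- TRUE rows are used for evaluation, FALSE for training
) AS
SELECT
  *,
  -- Hash the key into 10 buckets; buckets 0 and 1 (about 20%) become the eval set.
  ABS(MOD(FARM_FINGERPRINT(CAST(customer_id AS STRING)), 10)) < 2 AS is_eval
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;
```

Reusing the same is_eval expression for the XGBoost and DNN models would evaluate all three on identical rows.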

These configurations and model choices allow us to build a churn prediction model optimized for our specific dataset and business requirements. BigQuery ML simplifies the process by integrating model training directly within the data warehouse, allowing for rapid iteration and testing.
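
To compare the three models side by side, ML.EVALUATE can be run against each one and the results combined. This is a minimal sketch that assumes all three models above have been trained. Each call reports metrics on that model's own held-out evaluation set, so the numbers are only strictly comparable when the models share the same split.

```sql
-- Side-by-side evaluation metrics for the three churn models.
SELECT 'logistic_reg' AS model_name, roc_auc, precision, recall, f1_score, log_loss
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.logistic_reg_model`)
UNION ALL
SELECT 'xgboost' AS model_name, roc_auc, precision, recall, f1_score, log_loss
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.xgboost_model`)
UNION ALL
SELECT 'dnn' AS model_name, roc_auc, precision, recall, f1_score, log_loss
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.dnn_model`)
ORDER BY roc_auc DESC;
```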
