Building the Churn Prediction Model with BigQuery ML
Todd Bernson, CTO, guides you through creating churn prediction models using Logistic Regression, XGBoost, and Deep Neural Networks (DNN) in Google BigQuery ML. This article covers model creation, configuration options, and best practices, making it easy to deploy and optimize churn prediction directly within GCP's data warehouse.

Todd Bernson
2024-11-14
Overview
Predicting customer churn is important in every industry, as it helps companies proactively retain customers at risk of leaving. For this project, we experiment with three popular model types: Logistic Regression, XGBoost, and Deep Neural Network (DNN). Each model offers unique advantages depending on the complexity, interpretability, and performance requirements.
- Logistic Regression: Best suited for quickly interpretable results and cases where the relationship between the features and the log-odds of churn is roughly linear.
- XGBoost: Often chosen for complex data with non-linear relationships, where it is effective at capturing intricate patterns.
- DNN: Ideal for learning highly complex, non-linear interactions within the data, particularly when working with large datasets.
Using BigQuery ML allows us to train these models directly within the data warehouse, streamlining the process and keeping data in a centralized location.
Model Creation
We create each model using SQL commands in BigQuery ML, specifying relevant configuration options to tailor the model to our churn dataset. The SQL-based approach simplifies ML model building for those familiar with SQL, and BigQuery ML handles most of the optimization and scalability behind the scenes.
Logistic Regression Model Creation
Logistic Regression is a fast, interpretable model often used as a baseline for binary classification tasks, like predicting churn. It works well when the relationship between input features and the target is approximately linear.
CREATE OR REPLACE MODEL `<DATASET_NAME>.logistic_reg_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churn'],
  data_split_method = 'AUTO_SPLIT',
  max_iterations = 10
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;
Explanation
- model_type: Specifies the type of model as LOGISTIC_REG.
- input_label_cols: Identifies churn as the target column.
- data_split_method: Uses AUTO_SPLIT to automatically divide the data into training and evaluation sets.
- max_iterations: Sets a maximum number of iterations for training, balancing speed with accuracy.
Logistic Regression provides quick results and insights, making it useful for initial model evaluation.
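To make the interpretability point concrete, the learned coefficients can be read back with ML.WEIGHTS. The query below is a minimal sketch that assumes the logistic_reg_model above has finished training. The standardize option is set so that numeric weights are on a comparable scale, and ordering by absolute weight is just one convenient way to scan them.

```sql
-- Sketch: inspect the coefficients of the trained logistic regression model.
-- Assumes `<DATASET_NAME>.logistic_reg_model` exists (created by the statement above).
SELECT
  processed_input,   -- feature name after BigQuery ML preprocessing
  weight,            -- coefficient for numeric features (NULL for categorical features)
  category_weights   -- per-category coefficients for categorical features
FROM
  ML.WEIGHTS(
    MODEL `<DATASET_NAME>.logistic_reg_model`,
    STRUCT(TRUE AS standardize)  -- put numeric weights on a comparable scale
  )
ORDER BY
  ABS(weight) DESC;  -- largest numeric effects first; NULL weights sort last
```

A positive weight pushes the prediction toward the positive churn label and a negative weight pushes it away, which is what makes this model a useful baseline to sanity-check the features.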
XGBoost Model Creation
XGBoost is a powerful gradient-boosting algorithm optimized for performance. It’s well-suited for datasets with complex relationships and is often chosen when interpretability is less of a priority than accuracy.
CREATE OR REPLACE MODEL `<DATASET_NAME>.xgboost_model`
OPTIONS (
  model_type = 'BOOSTED_TREE_CLASSIFIER',
  auto_class_weights = TRUE,
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.2,
  input_label_cols = ['churn'],
  num_parallel_tree = 8,
  max_iterations = 50
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;
Explanation
- model_type: Specifies a boosted tree classifier, which is XGBoost in BigQuery ML.
- auto_class_weights: Automatically adjusts weights to address class imbalances.
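- data_split_method: Uses a RANDOM split, with data_split_eval_fraction set to 0.2 for a 20% evaluation set.
- num_parallel_tree: Sets the number of trees built in parallel during each boosting iteration. The default is 1; a larger value (8 here) trains a boosted random forest, which adds computation per iteration rather than saving it.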
- max_iterations: Defines the number of boosting iterations.
XGBoost is highly effective for capturing complex data patterns, making it a strong candidate for churn prediction.
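Because boosted trees are harder to read than a linear model, a feature importance query is one way to recover some interpretability. The sketch below assumes the xgboost_model above has been trained. Sorting by gain is a choice made for illustration, not the only sensible ordering.

```sql
-- Sketch: rank features by their contribution to the boosted tree model.
-- Assumes `<DATASET_NAME>.xgboost_model` exists (created by the statement above).
SELECT
  feature,
  importance_weight,  -- number of times the feature is used to split across all trees
  importance_gain,    -- average gain across the splits that use the feature
  importance_cover    -- average coverage across the splits that use the feature
FROM
  ML.FEATURE_IMPORTANCE(MODEL `<DATASET_NAME>.xgboost_model`)
ORDER BY
  importance_gain DESC;
```

If an identifier-like column ranks highly here, that is a hint that SELECT * pulled in a column that should be excluded from training.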
DNN (Deep Neural Network) Model Creation
A DNN is a type of neural network that is adept at modeling complex, non-linear relationships. While DNNs typically require more data and computation, they often provide high accuracy on intricate datasets.
CREATE OR REPLACE MODEL `<DATASET_NAME>.dnn_model`
OPTIONS (
  model_type = 'DNN_CLASSIFIER',
  hidden_units = [64, 32, 16],
  batch_size = 128,
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.2,
  input_label_cols = ['churn']
) AS
SELECT *
FROM `<DATASET_NAME>.<TABLE_NAME>`
WHERE churn IS NOT NULL;
Explanation
- model_type: Specifies a DNN classifier.
- hidden_units: Sets up layers with 64, 32, and 16 neurons respectively, capturing complex patterns in the data.
- batch_size: Defines the number of samples per batch, balancing training speed and memory usage.
- data_split_method: Uses a RANDOM split with a 20% evaluation fraction.
- input_label_cols: Sets churn as the target.
The DNN model is ideal for handling large datasets with non-linear relationships, though it may require more computational resources.
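Once a model is trained, scoring uses the same SQL surface. The following sketch runs ML.PREDICT with the DNN model. It assumes, hypothetically, that the source table also holds unlabeled rows (churn IS NULL) that were excluded from training; if that is not the case, substitute any table with the same feature columns.

```sql
-- Sketch: score customers with the trained DNN model.
-- Assumption (not from the article): rows with churn IS NULL are the ones to score.
SELECT
  predicted_churn,        -- predicted label
  predicted_churn_probs,  -- array of (label, prob) pairs, one per class
  * EXCEPT (predicted_churn, predicted_churn_probs, churn)  -- input columns passed through
FROM
  ML.PREDICT(
    MODEL `<DATASET_NAME>.dnn_model`,
    (
      SELECT *
      FROM `<DATASET_NAME>.<TABLE_NAME>`
      WHERE churn IS NULL
    )
  );
```

The same query shape works for the other two models by swapping the model name, which makes it straightforward to compare predictions side by side.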
Model Configuration Options
When creating models in BigQuery ML, understanding the configuration options is essential to optimizing model performance:
- Data Split: Options like AUTO_SPLIT and RANDOM partition the data into training and evaluation sets. Keeping the split consistent makes results reproducible and comparable across model runs.
- Batch Size: For DNN models, a larger batch_size can speed up training but may increase memory usage. Balancing the two is key, especially for deep networks.
- Max Iterations: This caps the number of training steps. For XGBoost, more iterations can improve the fit, but they also increase training time and cost, and the benefit fades once the evaluation loss stops improving.
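One way to judge whether max_iterations is set sensibly is to look at the per-iteration losses that BigQuery ML records. This is a minimal sketch against the xgboost_model from earlier; the same query applies to the other models by changing the model name.

```sql
-- Sketch: review the training curve to see where evaluation loss flattens out.
-- Assumes `<DATASET_NAME>.xgboost_model` exists (created earlier in this article).
SELECT
  training_run,
  iteration,
  loss,         -- loss on the training split
  eval_loss,    -- loss on the held-out evaluation split
  duration_ms   -- time spent on this iteration
FROM
  ML.TRAINING_INFO(MODEL `<DATASET_NAME>.xgboost_model`)
ORDER BY
  iteration;
```

If eval_loss has stopped falling well before the last iteration, a lower max_iterations would likely cost less with little loss in quality; if it is still falling at the end, a higher cap may be worth testing.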
These configurations and model choices allow us to build a churn prediction model optimized for our specific dataset and business requirements. BigQuery ML simplifies the process by integrating model training directly within the data warehouse, allowing for rapid iteration and testing.
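To compare the three models, ML.EVALUATE can be run for each and the results stacked into one table. The sketch below assumes all three models from this article have been trained; the model_name labels are illustrative. Called without an input table, ML.EVALUATE uses the evaluation split reserved during training, and since the logistic model uses AUTO_SPLIT while the others use a 20% RANDOM split, the evaluation sets may differ. Passing the same held-out table as a second argument to each call would give a stricter comparison.

```sql
-- Sketch: compare evaluation metrics across the three churn models.
-- Each call uses the evaluation data reserved when that model was trained.
SELECT 'logistic_reg' AS model_name,
       precision, recall, accuracy, f1_score, log_loss, roc_auc
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.logistic_reg_model`)
UNION ALL
SELECT 'xgboost' AS model_name,
       precision, recall, accuracy, f1_score, log_loss, roc_auc
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.xgboost_model`)
UNION ALL
SELECT 'dnn' AS model_name,
       precision, recall, accuracy, f1_score, log_loss, roc_auc
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.dnn_model`)
ORDER BY roc_auc DESC;
```

For churn, where the positive class is usually the minority, recall and roc_auc tend to be more informative than accuracy alone.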