Evaluating Model Performance: Metrics and Insights
This article provides an in-depth evaluation of customer churn prediction models using metrics like AUC-ROC, F1 Score, Precision, and Recall. It compares Logistic Regression, XGBoost, and Deep Neural Networks (DNN), analyzing their strengths and weaknesses. SQL-based BigQuery ML commands demonstrate how to assess model performance, interpret results through confusion matrices, and refine decision thresholds for optimal business impact.

Todd Bernson
2024-11-18

Overview
Model evaluation is the next step after training in any machine learning pipeline. It tells us how well our models perform and whether they meet the business goals of the project. In churn prediction, evaluation goes beyond accuracy alone: on an imbalanced dataset, a model that predicts "no churn" for every customer can still score high accuracy while being useless. Metrics like AUC-ROC, F1 score, precision, and recall give a more nuanced view of a model's strengths and weaknesses. This article focuses on evaluating the performance of Logistic Regression, XGBoost, and DNN models in predicting customer churn.
Primary Metrics
Understanding and selecting the right metrics for evaluation ensures that the model aligns with the business objectives. Here’s an explanation of the key metrics used:
AUC-ROC (Area Under the Curve - Receiver Operating Characteristic)
- What it measures: The ability of the model to distinguish between churn and non-churn customers across various thresholds.
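- Why it matters: AUC-ROC summarizes how well the model ranks churners above non-churners without depending on any single classification threshold, which makes it a convenient first comparison across models. On heavily imbalanced data it can look optimistic, so read it alongside precision and recall.
- Ideal value: Close to 1.0 (better discrimination); 0.5 is no better than random guessing.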
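One way to build intuition for AUC-ROC is to read it as the fraction of (churner, non-churner) pairs in which the churner receives the higher score. The sketch below computes it that way in plain Python; the labels and scores are made-up illustration data, and `roc_auc` is a hypothetical helper, not a BigQuery ML function.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = churned, 0 = stayed; scores are model outputs.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of rows, so it is only suitable for small samples; it is meant to show what the number means rather than to replace a library implementation.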
F1 Score
- What it measures: The harmonic mean of precision and recall.
- Why it matters: Balances false positives and false negatives, making it suitable for scenarios where both are costly.
- Ideal value: Closer to 1.0 is better; a high F1 requires both precision and recall to be high.
Precision
- What it measures: The percentage of correctly predicted churn cases out of all predicted churn cases.
- Why it matters: High precision ensures that we’re not overpredicting churn and unnecessarily targeting non-churn customers.
- Ideal value: Higher is better for targeting accuracy.
Recall
- What it measures: The percentage of actual churn cases correctly identified by the model.
- Why it matters: High recall ensures we capture as many actual churn cases as possible, reducing missed opportunities.
- Ideal value: Higher is better for comprehensive detection.
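The definitions of precision, recall, and F1 above reduce to a few lines of arithmetic on three counts: true positives (churners correctly flagged), false positives (non-churners flagged as churn), and false negatives (churners missed). The following is a minimal sketch with hypothetical counts; the function name is an assumption for illustration.

```python
def classification_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from raw prediction counts."""
    # Precision: of everything predicted as churn, how much really churned.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that really churned, how much was caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 60 churners caught, 20 false alarms, 40 churners missed.
precision, recall, f1 = classification_metrics(tp=60, fp=20, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two inputs, which is why a model cannot reach a high F1 by maximizing precision or recall alone.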
Results Comparison
After training the models, we evaluate their performance using the following SQL query in BigQuery ML:
SELECT
  roc_auc,
  precision,
  recall,
  f1_score
FROM
  ML.EVALUATE(MODEL `<DATASET_NAME>.<MODEL_NAME>`,
    TABLE `<DATASET_NAME>.<TABLE_NAME>`);
Example Results:
| Metric | Logistic Regression | XGBoost | DNN |
|---|---|---|---|
| AUC-ROC | 0.84 | 0.86 | 0.85 |
| F1 Score | 0.65 | 0.67 | 0.66 |
| Precision | 0.83 | 0.85 | 0.84 |
| Recall | 0.74 | 0.76 | 0.75 |
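Insights:
- Logistic Regression: Scores lowest on every metric, though only 0.02 behind XGBoost. Its precision (0.83) is well above its recall (0.74), indicating it predicts churn conservatively.
- XGBoost: Achieves the best score on all four metrics, but shows the same gap between precision and recall as the other two models.
- DNN: Performs comparably to XGBoost (within 0.01 on each metric) but typically requires more computational resources to train.
- Caveat: The F1 values listed are lower than the harmonic mean of the listed precision and recall (0.83 and 0.74 imply an F1 of about 0.78), so treat these figures as illustrative.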
Confusion Matrix and Insights
A confusion matrix helps break down the model’s predictions into four categories:
- True Positives (TP): Correctly identified churn cases.
- True Negatives (TN): Correctly identified non-churn cases.
- False Positives (FP): Non-churn cases incorrectly predicted as churn.
- False Negatives (FN): Churn cases missed by the model.
SQL Query for Confusion Matrix
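-- Assumes the label column is named `label`, so ML.PREDICT emits `predicted_label`.
-- Grouping by both actual and predicted values yields the four confusion-matrix cells.
SELECT
  label AS actual_label,
  predicted_label,
  COUNT(*) AS count
FROM
  ML.PREDICT(MODEL `<DATASET_NAME>.<MODEL_NAME>`,
    TABLE `<DATASET_NAME>.<TABLE_NAME>`)
GROUP BY
  label,
  predicted_label;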
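To make the four categories concrete, here is a small sketch that tallies them from paired actual and predicted labels, for example after exporting prediction rows from BigQuery. The label lists are hypothetical, and `confusion_counts` is an illustrative helper rather than part of any library.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Tally TP, FP, FN, TN for a binary classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted churn: correct if the customer actually churned.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted no churn: a miss if the customer actually churned.
            counts["FN" if a == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}


# Hypothetical labels: 1 = churned, 0 = stayed.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
matrix = confusion_counts(actual, predicted)
print(matrix)
```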
Real-World Implications:
- False Positives: Lead to unnecessary targeting of non-churn customers, wasting resources.
- False Negatives: Result in missed opportunities to retain at-risk customers, directly impacting revenue.
By analyzing the confusion matrix, we can adjust the decision thresholds or hyperparameters to address these issues.
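Threshold adjustment can be framed as a cost question: if a missed churner costs more than an unnecessary retention offer, a lower threshold may be worth the extra false positives. The sketch below picks the candidate threshold with the lowest total cost. The probabilities, the candidate thresholds, and both cost figures are assumptions chosen for illustration, not values from the models above.

```python
def best_threshold(labels, probs, fp_cost, fn_cost, candidates):
    """Return (threshold, total_cost) minimizing the cost of FP and FN errors.

    A customer is predicted to churn when their probability >= threshold.
    On ties, the first candidate in the list wins.
    """
    best = None
    for t in candidates:
        fp = sum(1 for y, p in zip(labels, probs) if p >= t and y == 0)
        fn = sum(1 for y, p in zip(labels, probs) if p < t and y == 1)
        cost = fp * fp_cost + fn * fn_cost
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


# Hypothetical data: 1 = churned; probs are predicted churn probabilities.
labels = [1, 0, 1, 0, 0, 1, 0, 1]
probs = [0.9, 0.4, 0.35, 0.6, 0.1, 0.7, 0.3, 0.45]
candidates = [0.3, 0.4, 0.5, 0.6, 0.7]

# Assumed costs: a missed churner is five times as costly as a wasted offer.
threshold, cost = best_threshold(labels, probs, fp_cost=1.0, fn_cost=5.0,
                                 candidates=candidates)
print(threshold, cost)
```

With these made-up numbers, the expensive false negatives push the choice toward the lowest candidate threshold; with equal costs, the same data favors the highest one. In BigQuery ML, a chosen cutoff can be applied by passing `STRUCT(<value> AS threshold)` as the third argument to `ML.EVALUATE` or `ML.PREDICT`.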
Code Snippets for Metric Evaluation
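To compare performance across all models, run ML.EVALUATE once per model and combine the results with UNION ALL. (ML.TRAINING_INFO reports per-iteration training statistics such as loss, not these evaluation metrics.) The model names below are placeholders:
SELECT 'logistic_regression' AS model_name, roc_auc, precision, recall, f1_score
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.<LOGISTIC_REGRESSION_MODEL>`,
  TABLE `<DATASET_NAME>.<TABLE_NAME>`)
UNION ALL
SELECT 'xgboost' AS model_name, roc_auc, precision, recall, f1_score
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.<XGBOOST_MODEL>`,
  TABLE `<DATASET_NAME>.<TABLE_NAME>`)
UNION ALL
SELECT 'dnn' AS model_name, roc_auc, precision, recall, f1_score
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.<DNN_MODEL>`,
  TABLE `<DATASET_NAME>.<TABLE_NAME>`);
This query returns one row of key metrics per model, evaluated on the same table, helping you choose the best one for deployment.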
Evaluating model performance is an iterative process. By using metrics like AUC-ROC, F1 score, precision, and recall, and by analyzing confusion matrices, we can refine our models and decision thresholds to balance prediction accuracy against business value.