Evaluating Model Performance: Metrics and Insights

This article provides an in-depth evaluation of customer churn prediction models using metrics like AUC-ROC, F1 Score, Precision, and Recall. It compares Logistic Regression, XGBoost, and Deep Neural Networks (DNN), analyzing their strengths and weaknesses. SQL-based BigQuery ML commands demonstrate how to assess model performance, interpret results through confusion matrices, and refine decision thresholds for optimal business impact.

Todd Bernson

2024-11-18

Overview

Model evaluation is the next step in any machine learning pipeline. It helps us determine how well our models perform and whether they meet the business goals of the project. In churn prediction, evaluation goes beyond just accuracy; metrics like AUC-ROC, F1 score, precision, and recall provide a nuanced understanding of a model's strengths and weaknesses, especially when dealing with imbalanced datasets. This article focuses on evaluating the performance of Logistic Regression, XGBoost, and DNN models in predicting customer churn.

Primary Metrics

Understanding and selecting the right metrics for evaluation ensures that the model aligns with the business objectives. Here’s an explanation of the key metrics used:

AUC-ROC (Area Under the Curve - Receiver Operating Characteristic)

What it measures: The ability of the model to distinguish between churn and non-churn customers across various thresholds.
Why it matters: AUC-ROC summarizes how well the model ranks churners above non-churners without committing to a single classification threshold, and unlike accuracy it is not inflated by a large non-churn majority.
Ideal value: Close to 1.0 (better discrimination).
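
One way to build intuition for AUC-ROC is its ranking interpretation: it equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way in plain Python. The function name, labels, and scores are illustrative assumptions, not output from the models in this article, and the pairwise loop is only practical for small samples.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical churn labels (1 = churned) and model scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
auc = roc_auc(labels, scores)
print(f"AUC-ROC: {auc:.3f}")
```

Because only the ordering of scores matters, rescaling every score by the same increasing function leaves the result unchanged.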

F1 Score

What it measures: The harmonic mean of precision and recall.
Why it matters: Balances false positives and false negatives, making it suitable for scenarios where both are costly.
Ideal value: Close to 1.0; F1 is high only when precision and recall are both high.
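
To see why the harmonic mean is used, it can help to compare two models with the same arithmetic mean of precision and recall. The sketch below is a minimal illustration; the function name and the input values are hypothetical and are not taken from the results in this article.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two hypothetical models, both with an arithmetic mean of 0.5.
balanced = f1_score(0.5, 0.5)
lopsided = f1_score(0.9, 0.1)
print(f"balanced: {balanced:.2f}, lopsided: {lopsided:.2f}")
```

The lopsided model scores far lower, because the harmonic mean is pulled toward the smaller of the two inputs.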

Precision

What it measures: The percentage of correctly predicted churn cases out of all predicted churn cases.
Why it matters: High precision ensures that we’re not overpredicting churn and unnecessarily targeting non-churn customers.
Ideal value: Higher is better for targeting accuracy.

Recall

What it measures: The percentage of actual churn cases correctly identified by the model.
Why it matters: High recall ensures we capture as many actual churn cases as possible, reducing missed opportunities.
Ideal value: Higher is better for comprehensive detection.
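
Precision and recall share the same numerator (correctly predicted churners) and differ only in the denominator. The sketch below makes that explicit using raw counts. The helper name and the counts are assumptions chosen for illustration.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive,
    and false negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical campaign: 100 customers flagged as churn, 80 of them correctly
# (tp=80, fp=20), while 40 actual churners were not flagged (fn=40).
precision, recall = precision_recall(tp=80, fp=20, fn=40)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

In this example most flagged customers really are churners, yet a third of all churners are still missed, which is why the two metrics are reported together.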

Results Comparison

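After training the models, we evaluate each one with the following SQL query in BigQuery ML. For binary classifiers, precision, recall, and F1 score are reported at the default decision threshold of 0.5 unless a different threshold is supplied: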

SELECT
  roc_auc,
  precision,
  recall,
  f1_score
FROM
  ML.EVALUATE(MODEL `<DATASET_NAME>.<MODEL_NAME>`, 
              TABLE `<DATASET_NAME>.<TABLE_NAME>`);

Example Results:

Metric	Logistic Regression	XGBoost	DNN
AUC-ROC	0.84	0.86	0.85
F1 Score	0.65	0.67	0.66
Precision	0.83	0.85	0.84
Recall	0.74	0.76	0.75

Insights:

Logistic Regression: Precision (0.83) is well above recall (0.74), indicating it predicts churn conservatively. It scores lowest on all four metrics, but only by 0.01 to 0.02.
XGBoost: Achieves the best score on every metric, although its margin over the other two models is small.
DNN: Performs comparably to XGBoost (within 0.01 on every metric) but requires more computational resources.

Confusion Matrix and Insights

A confusion matrix helps break down the model’s predictions into four categories:

True Positives (TP): Correctly identified churn cases.
True Negatives (TN): Correctly identified non-churn cases.
False Positives (FP): Non-churn cases incorrectly predicted as churn.
False Negatives (FN): Churn cases missed by the model.
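
The four categories above can be tallied directly from paired actual and predicted labels. The following is a minimal sketch in plain Python; the function name and the sample labels are hypothetical, with 1 standing for churn and 0 for non-churn.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count TP, TN, FP, and FN for binary labels (1 = churn, 0 = non-churn)."""
    counts = Counter(zip(actual, predicted))  # keys are (actual, predicted) pairs
    return {
        "TP": counts[(1, 1)],
        "TN": counts[(0, 0)],
        "FP": counts[(0, 1)],
        "FN": counts[(1, 0)],
    }


# Hypothetical labels for eight customers.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
matrix = confusion_counts(actual, predicted)
print(matrix)
```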

SQL Query for Confusion Matrix

-- ML.CONFUSION_MATRIX returns one row per actual label, with one column
-- per predicted label holding the number of rows in that cell.
SELECT
  *
FROM
  ML.CONFUSION_MATRIX(MODEL `<DATASET_NAME>.<MODEL_NAME>`,
                      TABLE `<DATASET_NAME>.<TABLE_NAME>`);

Real-World Implications:

False Positives: Lead to unnecessary targeting of non-churn customers, wasting resources.
False Negatives: Result in missed opportunities to retain at-risk customers, directly impacting revenue.

By analyzing the confusion matrix, we can adjust the decision thresholds or hyperparameters to address these issues.
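
One way to choose a decision threshold is to assign a cost to each error type and pick the threshold with the lowest total cost on a validation set. The sketch below illustrates the idea. The function names, scores, candidate thresholds, and cost figures are all assumptions made for the example; real costs would come from the business.

```python
def expected_cost(labels, scores, threshold, fp_cost, fn_cost):
    """Total error cost when scores >= threshold are predicted as churn."""
    cost = 0.0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 0:
            cost += fp_cost  # wasted retention offer
        elif predicted == 0 and y == 1:
            cost += fn_cost  # churner we failed to flag
    return cost


def best_threshold(labels, scores, thresholds, fp_cost, fn_cost):
    """Return the candidate threshold with the lowest total cost
    (the first one listed wins if several tie)."""
    return min(
        thresholds,
        key=lambda t: expected_cost(labels, scores, t, fp_cost, fn_cost),
    )


# Hypothetical validation data (1 = churned) and candidate thresholds.
labels = [1, 0, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.45, 0.55, 0.1, 0.8, 0.35, 0.3]
thresholds = [0.3, 0.4, 0.5, 0.6]

# Assumed costs: a wasted offer costs 10, a lost customer costs 100.
chosen = best_threshold(labels, scores, thresholds, fp_cost=10, fn_cost=100)
print(f"chosen threshold: {chosen}")
```

When missed churners are the expensive error, the search favors a low threshold that raises recall; reversing the two costs pushes it toward a high threshold that protects precision.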

Code Snippets for Metric Evaluation

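To compare performance across all models, run ML.EVALUATE once per model and combine the results with UNION ALL. The model placeholders below are illustrative; substitute your own model names:

SELECT 'logistic_regression' AS model_name, roc_auc, precision, recall, f1_score
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.<LOGISTIC_REGRESSION_MODEL>`,
                 TABLE `<DATASET_NAME>.<TABLE_NAME>`)
UNION ALL
SELECT 'xgboost' AS model_name, roc_auc, precision, recall, f1_score
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.<XGBOOST_MODEL>`,
                 TABLE `<DATASET_NAME>.<TABLE_NAME>`)
UNION ALL
SELECT 'dnn' AS model_name, roc_auc, precision, recall, f1_score
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.<DNN_MODEL>`,
                 TABLE `<DATASET_NAME>.<TABLE_NAME>`);

This query returns one row of key metrics per model, all computed on the same evaluation table, helping you choose the best one for deployment.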

Evaluating model performance is an iterative process. By using metrics like AUC-ROC, F1 score, precision, and recall, and by analyzing confusion matrices, we can refine our models to strike the right balance between prediction accuracy and business value.

Todd Bernson

CTO
