BSC Analytics

Evaluating Model Performance: Metrics and Insights

By Todd Bernson
Chief Technical Officer BSC Analytics

18 Nov 2024

Overview

Model evaluation is the next step in any machine learning pipeline. It tells us how well our models perform and whether they meet the project's business goals. In churn prediction, accuracy alone is not enough: when churners are a small minority, a model can score high accuracy while missing most of them. Metrics like AUC-ROC, F1 score, precision, and recall give a more nuanced picture of a model's strengths and weaknesses on imbalanced datasets. This article evaluates Logistic Regression, XGBoost, and deep neural network (DNN) models for predicting customer churn.

Primary Metrics

Understanding and selecting the right metrics for evaluation ensures that the model aligns with the business objectives. Here’s an explanation of the key metrics used:

AUC-ROC (Area Under the Curve - Receiver Operating Characteristic)

What it measures: The ability of the model to rank churn customers above non-churn customers, summarized across all classification thresholds.
Why it matters: AUC-ROC does not depend on picking a single cutoff, and unlike accuracy it cannot be inflated by always predicting the majority class, which makes it useful for imbalanced datasets.
Ideal value: Close to 1.0 (better discrimination); 0.5 is no better than random guessing.
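
One way to build intuition for AUC-ROC is its pairwise interpretation: it equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way on a handful of made-up labels and scores (the data and the function name `roc_auc` are illustrative assumptions, not taken from the models in this article).

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (churner, non-churner) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # churner ranked above non-churner
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = churn, 0 = no churn
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.4, 0.5, 0.2, 0.8]
auc = roc_auc(labels, scores)  # 8 of the 9 pairs are ranked correctly
```

The nested loop is quadratic in the number of rows, so this is only meant for illustration; BigQuery ML reports the same quantity as `roc_auc`.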

F1 Score

What it measures: The harmonic mean of precision and recall.
Why it matters: Balances false positives and false negatives, making it suitable for scenarios where both are costly.
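Ideal value: Ranges from 0 to 1. A high F1 score requires both precision and recall to be high, because the harmonic mean is pulled toward the lower of the two.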
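
To see why the harmonic mean rewards balance, the short sketch below compares two hypothetical models whose precision and recall have the same arithmetic mean (0.70). The numbers are made up for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two hypothetical models with the same arithmetic mean of 0.70
balanced = f1_score(0.70, 0.70)   # precision and recall are equal
lopsided = f1_score(0.95, 0.45)   # high precision, weak recall

print(f"balanced F1: {balanced:.3f}")
print(f"lopsided F1: {lopsided:.3f}")
```

The lopsided model ends up with the lower F1 even though its average of precision and recall is identical, which is the behavior the metric is designed to have.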

Precision

What it measures: The share of predicted churn cases that actually churned (true positives divided by all predicted positives).
Why it matters: High precision means we are not overpredicting churn and unnecessarily targeting non-churn customers.
Ideal value: Higher is better for targeting accuracy.

Recall

What it measures: The share of actual churn cases that the model correctly identifies (true positives divided by all actual positives).
Why it matters: High recall means we capture as many actual churn cases as possible, reducing missed retention opportunities.
Ideal value: Higher is better for comprehensive detection.
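
As a small worked example of the two definitions above, the sketch below computes precision and recall from raw counts, where TP is the number of churners correctly flagged, FP the number of non-churners flagged by mistake, and FN the number of churners missed. The campaign numbers are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical campaign: the model flags 100 customers, 80 of whom really
# churn, while another 40 churners go unflagged.
precision, recall = precision_recall(tp=80, fp=20, fn=40)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here 80 of the 100 flagged customers are real churners (precision 0.80), but only 80 of the 120 actual churners are caught (recall of about 0.67).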

Results Comparison

After training the models, we evaluate their performance using the following SQL query in BigQuery ML:

SELECT
  roc_auc,
  precision,
  recall,
  f1_score
FROM
  ML.EVALUATE(MODEL `<DATASET_NAME>.<MODEL_NAME>`, 
              TABLE `<DATASET_NAME>.<TABLE_NAME>`);

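Example results (illustrative values; in actual ML.EVALUATE output, f1_score is the harmonic mean of the reported precision and recall):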

Metric	Logistic Regression	XGBoost	DNN
AUC-ROC	0.84	0.86	0.85
F1 Score	0.65	0.67	0.66
Precision	0.83	0.85	0.84
Recall	0.74	0.76	0.75

Insights:

Logistic Regression: Scores lowest on every metric. Its precision (0.83) is well above its recall (0.74), indicating it predicts churn conservatively; the other two models show the same pattern.
XGBoost: Achieves the best score on every metric, although its margin over the other models is only 0.01 to 0.02.
DNN: Performs within 0.01 of XGBoost on every metric but requires more computational resources.

Confusion Matrix and Insights

A confusion matrix helps break down the model’s predictions into four categories:

True Positives (TP): Correctly identified churn cases.
True Negatives (TN): Correctly identified non-churn cases.
False Positives (FP): Non-churn cases incorrectly predicted as churn.
False Negatives (FN): Churn cases missed by the model.
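
The four categories above can be tallied directly from paired actual and predicted labels. The following is a minimal sketch with made-up labels (1 = churn, 0 = no churn); the function name `confusion_counts` is an illustrative assumption.

```python
def confusion_counts(actual, predicted):
    """Tally TP, TN, FP, and FN for binary labels (1 = churn, 0 = no churn)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # churner correctly flagged
        elif a == 0 and p == 0:
            counts["TN"] += 1   # non-churner correctly left alone
        elif a == 0 and p == 1:
            counts["FP"] += 1   # non-churner flagged by mistake
        else:
            counts["FN"] += 1   # churner the model missed
    return counts


# Hypothetical labels for eight customers
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts)
```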

SQL Query for Confusion Matrix

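-- Count rows for each (actual, predicted) pair to obtain TP, TN, FP, and FN.
-- ML.PREDICT names its output column predicted_<LABEL_COLUMN>.
SELECT
  <LABEL_COLUMN> AS actual_label,
  predicted_<LABEL_COLUMN> AS predicted_label,
  COUNT(*) AS count
FROM
  ML.PREDICT(MODEL `<DATASET_NAME>.<MODEL_NAME>`,
             TABLE `<DATASET_NAME>.<TABLE_NAME>`)
GROUP BY
  actual_label,
  predicted_label;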

Real-World Implications:

False Positives: Lead to unnecessary targeting of non-churn customers, wasting resources.
False Negatives: Result in missed opportunities to retain at-risk customers, directly impacting revenue.

By analyzing the confusion matrix, we can adjust the decision thresholds or hyperparameters to address these issues.
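
To illustrate what adjusting the decision threshold does, the sketch below recomputes precision and recall at three cutoffs on a small set of made-up scores. The data, the cutoffs, and the helper name `metrics_at_threshold` are assumptions chosen for illustration.

```python
def metrics_at_threshold(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted as churn."""
    tp = fp = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical churn labels and predicted churn probabilities
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10]

for t in (0.3, 0.5, 0.7):
    p, r = metrics_at_threshold(labels, scores, t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, lowering the threshold raises recall at the cost of precision, and raising it does the reverse. In BigQuery ML, the documentation describes an optional threshold setting for ML.EVALUATE that serves the same purpose; check the current reference for the exact syntax before relying on it.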

Code Snippets for Metric Evaluation

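To compare performance across all models, run ML.EVALUATE once per model and combine the rows with UNION ALL (ML.TRAINING_INFO reports per-iteration training loss, not these metrics):

-- ML.EVALUATE returns one row per model; label each row and stack them.
SELECT 'Logistic Regression' AS model_name, roc_auc, precision, recall, f1_score
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.<LOGISTIC_REGRESSION_MODEL>`,
                 TABLE `<DATASET_NAME>.<TABLE_NAME>`)
UNION ALL
SELECT 'XGBoost', roc_auc, precision, recall, f1_score
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.<XGBOOST_MODEL>`,
                 TABLE `<DATASET_NAME>.<TABLE_NAME>`)
UNION ALL
SELECT 'DNN', roc_auc, precision, recall, f1_score
FROM ML.EVALUATE(MODEL `<DATASET_NAME>.<DNN_MODEL>`,
                 TABLE `<DATASET_NAME>.<TABLE_NAME>`);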

This query provides a summary of all key metrics for each model, helping you choose the best one for deployment.

Evaluating model performance is an iterative process. By tracking AUC-ROC, F1 score, precision, and recall, and by analyzing confusion matrices, we can refine our models toward a better balance between prediction accuracy and business value.
