AI/ML

Monitoring and Debugging ML Models with Azure Application Insights

Learn how to monitor, debug, and optimize machine learning models in production using Azure Application Insights. This guide covers configuring telemetry with Azure ML Workspaces, tracking model performance metrics, setting up alerts for failures, and visualizing logs and dashboards to maintain reliability.

Todd Bernson

2025-01-13

Machine learning models in production are only as effective as their reliability and performance. Monitoring, debugging, and optimizing deployed models are critical to ensuring business value. Azure Application Insights provides a robust toolset to track model performance, identify bottlenecks, and maintain the reliability of machine learning pipelines.

In this article, we will:

Configure Application Insights with Azure Machine Learning (ML) Workspaces using Terraform.
Use telemetry data to monitor and debug models.
Set up alerts for underperforming models or pipeline failures.
Showcase sample logs and dashboards for performance metrics.

Step 1: Configuring Application Insights with ML Workspaces in Terraform

Azure Machine Learning workspaces can be linked to an Application Insights instance, which collects and visualizes telemetry from the workspace. The Terraform configuration below creates an Application Insights resource and attaches it to an ML workspace.

Terraform Configuration

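Create a main.tf file with the following configuration. The snippet assumes that the resource group, container registry, Key Vault, and storage account it references (each named `this`) are already defined elsewhere in your Terraform code: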

```
resource "azurerm_application_insights" "ml_insights" {
  name                = "ml-application-insights"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name
  application_type    = "web"

  tags = {
    environment = "production"
  }
}

resource "azurerm_machine_learning_workspace" "ml_workspace" {
  name                = "ml-workspace"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name

  application_insights_id = azurerm_application_insights.ml_insights.id
  container_registry_id   = azurerm_container_registry.this.id
  key_vault_id            = azurerm_key_vault.this.id
  storage_account_id      = azurerm_storage_account.this.id

  identity {
    type = "SystemAssigned"
  }

  tags = {
    environment = "production"
  }
}
```

Deployment Steps

Initialize Terraform:
```
terraform init
```
Apply the configuration:
```
terraform apply
```

Application Insights is now linked to your Azure ML Workspace.

Step 2: Using Telemetry Data to Track Model Performance

Application Insights can collect telemetry such as the following; custom events and metrics are not automatic and must be sent from your own code:

API Calls: Latency and success rates of model inference endpoints.
Pipeline Metrics: Execution times and success rates for ML pipelines.
Custom Events: User-defined metrics such as model accuracy or loss values.
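
As a minimal sketch of where such user-defined values might come from, the following computes accuracy and F1 for binary labels in plain Python. The function name `binary_metrics` and the sample labels are illustrative assumptions, not part of Application Insights; the resulting dictionary is the kind of numeric payload you could send as custom telemetry.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class never appears
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "Accuracy": correct / len(y_true),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


# Hypothetical evaluation labels, for illustration only
metrics = binary_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(metrics)
```

Tracking F1 alongside accuracy helps when classes are imbalanced, since accuracy alone can look healthy while the positive class degrades.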

Integrating Custom Metrics in Python

You can send custom metrics to Application Insights directly from your ML pipeline.

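```
# Requires the legacy `applicationinsights` package: pip install applicationinsights
from applicationinsights import TelemetryClient

# Initialize the Application Insights telemetry client
tc = TelemetryClient("<YOUR_INSTRUMENTATION_KEY>")

# Track a custom event; numeric values go in measurements rather than properties
tc.track_event("Model Accuracy", measurements={"Accuracy": 0.85})

# Track a custom metric (training time in seconds)
tc.track_metric("Training Time", 120)

# Send any buffered telemetry before the process exits
tc.flush()
```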
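
Rather than hardcoding the key, one option is to read it from the environment. The sketch below assumes the Application Insights connection string is exposed through the `APPLICATIONINSIGHTS_CONNECTION_STRING` environment variable and follows the usual semicolon-separated `Key=Value` layout. The helper name `instrumentation_key_from_connection_string` and the placeholder connection string are hypothetical.

```python
import os


def instrumentation_key_from_connection_string(connection_string):
    """Extract the InstrumentationKey field from a 'Key=Value;Key=Value' string."""
    fields = {}
    for part in connection_string.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip().lower()] = value.strip()
    if "instrumentationkey" not in fields:
        raise ValueError("connection string has no InstrumentationKey field")
    return fields["instrumentationkey"]


# Placeholder value, used only when the environment variable is unset or empty
example = (
    "InstrumentationKey=00000000-0000-0000-0000-000000000000;"
    "IngestionEndpoint=https://example.invalid/"
)
connection_string = os.environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING") or example
ikey = instrumentation_key_from_connection_string(connection_string)
# ikey could then be passed to TelemetryClient(ikey) instead of a literal string
```

Keeping the key out of source code also makes it easier to point the same pipeline at different Application Insights instances per environment.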

Example: Tracking Model Inference Latency

```
import time
from applicationinsights import TelemetryClient

tc = TelemetryClient("<YOUR_INSTRUMENTATION_KEY>")

# Time the inference call with a monotonic, high-resolution clock
start_time = time.perf_counter()
# Call model inference logic here
latency = time.perf_counter() - start_time

# Log latency in seconds and send buffered telemetry
tc.track_metric("Inference Latency", latency)
tc.flush()
```
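
If several code paths call the model, the timing logic can be wrapped once and reused. The following is a sketch under stated assumptions: `track_latency` is a hypothetical helper, and a simple list-based recorder stands in for the telemetry client so the example runs on its own. In a real pipeline you would pass `tc.track_metric` as the callback.

```python
import time
from contextlib import contextmanager


@contextmanager
def track_latency(track_metric, name="Inference Latency"):
    """Time the enclosed block and report elapsed seconds via track_metric(name, value)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Report even if the block raises, so failed calls still show up in latency data
        track_metric(name, time.perf_counter() - start)


# Stand-in recorder (an assumption for this sketch) instead of a real telemetry client
recorded = []


def record(name, value):
    recorded.append((name, value))


with track_latency(record):
    time.sleep(0.01)  # placeholder for the model inference call

print(recorded)
```

Accepting a callback rather than a client object keeps the helper easy to test and independent of any particular telemetry SDK.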

Step 3: Setting Up Alerts for Underperforming Models or Pipeline Failures

Azure Monitor allows you to set up alerts based on Application Insights telemetry. For instance, you can configure alerts for:

High latency in model inference.
Pipeline run failures.
Low accuracy or F1-score.

Setting Up Alerts

Navigate to Azure Portal > Application Insights > Alerts.
Create a new alert rule:
- Resource: Select your Application Insights instance.
- Condition: Define thresholds (e.g., latency > 2 seconds).
- Action Group: Specify email or webhook notifications.
Save and enable the alert.
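
To make the condition in such a rule concrete, here is a local sketch of a static-threshold check over a window of recent latency samples. It only illustrates the kind of comparison an alert rule expresses (for example, average latency above 2 seconds); it is not how Azure Monitor is implemented, and `latency_alert`, the sample windows, and the default threshold are assumptions for illustration.

```python
from statistics import mean


def latency_alert(samples, threshold_seconds=2.0):
    """Return True when the average of the latency samples exceeds the threshold."""
    if not samples:
        return False  # no data in the window, so there is nothing to compare
    return mean(samples) > threshold_seconds


# Hypothetical latency samples (seconds) from two evaluation windows
healthy_window = [1.2, 0.9, 1.4]
degraded_window = [1.2, 3.5, 2.8]

print(latency_alert(healthy_window))   # False
print(latency_alert(degraded_window))  # True
```

A check like this can be run against historical latency data to see how often a candidate threshold would have fired before you commit it to an alert rule.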

Step 4: Sample Logs and Dashboard Screenshots

Example Logs

Sample logs from Application Insights for model inference:

| Timestamp | Metric | Value | Status |
| --- | --- | --- | --- |
| 2025-01-13T10:00:00Z | Inference Latency (s) | 1.2 | Success |
| 2025-01-13T10:05:00Z | Inference Latency (s) | 3.5 | Warning |
| 2025-01-13T10:10:00Z | Training Accuracy (%) | 84.2 | Success |
| 2025-01-13T10:15:00Z | Pipeline Execution | Failed | Error |
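
If you export rows like these for offline triage, a few lines of Python can pull out the entries that need attention. The sketch below assumes the rows are available as tab-separated text with the same four columns as the sample; the `needs_attention` helper is hypothetical.

```python
import csv
import io

# The sample rows from the logs above, as tab-separated text
SAMPLE = (
    "Timestamp\tMetric\tValue\tStatus\n"
    "2025-01-13T10:00:00Z\tInference Latency (s)\t1.2\tSuccess\n"
    "2025-01-13T10:05:00Z\tInference Latency (s)\t3.5\tWarning\n"
    "2025-01-13T10:10:00Z\tTraining Accuracy (%)\t84.2\tSuccess\n"
    "2025-01-13T10:15:00Z\tPipeline Execution\tFailed\tError\n"
)


def needs_attention(tsv_text):
    """Return the rows whose Status is anything other than 'Success'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["Status"] != "Success"]


for row in needs_attention(SAMPLE):
    print(row["Timestamp"], row["Metric"], row["Value"], row["Status"])
```

For the sample data this selects the 3.5-second latency warning and the failed pipeline run.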

Key Takeaway

Azure Application Insights provides powerful tools for monitoring and debugging machine learning models. By integrating telemetry data and setting up alerts, you can ensure the reliability and performance of your ML deployments. With real-time insights and proactive monitoring, Application Insights helps maintain the health of your models and optimize them for better business outcomes.

Todd Bernson

CTO
