Batch Inference with SageMaker Endpoints
Batch inference is an alternative to real-time inference for processing large datasets in SageMaker. The dataset is split into manageable chunks, each chunk is sent to a SageMaker endpoint in a single request, and the predictions are collected batch by batch. The approach suits offline analysis and scheduled processing tasks, where it reduces the per-request overhead of calling the endpoint once per input. The article provides a complete workflow for implementing batch inference, including dataset preparation, invoking endpoints, and post-processing predictions.

Todd Bernson
2025-01-06

Objective
This article demonstrates how to perform batch inference using SageMaker endpoints, focusing on handling large datasets efficiently by dividing them into manageable chunks and processing predictions in batches.
Batch Processing vs. Real-Time Inference
Real-Time Inference
- Designed for single or small sets of inputs.
- Provides low-latency predictions ideal for real-time applications (e.g., user-facing APIs).
Batch Processing
- Handles large datasets by splitting them into smaller chunks.
- Optimized for scenarios where predictions are not time-sensitive, such as offline data analysis or scheduled processing jobs.
Batch inference reduces the overhead of invoking the endpoint repeatedly for individual inputs, making it more efficient for large-scale data.
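To make the saving concrete, the sketch below counts endpoint invocations with and without batching. The function name `request_count` and the 10,000-row dataset are illustrative assumptions; the batch size of 100 matches the one used later in this article.

```python
import math


def request_count(n_rows: int, batch_size: int) -> int:
    """Number of endpoint invocations needed to score n_rows rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(n_rows / batch_size)


# Hypothetical dataset of 10,000 rows
per_row_calls = request_count(10_000, 1)    # one call per row
batched_calls = request_count(10_000, 100)  # one call per 100 rows
print(per_row_calls, batched_calls)  # prints: 10000 100
```

A dataset whose size is not a multiple of the batch size simply produces one smaller final batch, which is why the count rounds up.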
Preparing Test Datasets for Inference
We prepare the test dataset by loading it into memory and formatting it as required by the SageMaker endpoint. In this case, the endpoint accepts CSV-formatted inputs.
Code Example:
import pandas as pd
# Load test dataset (header=None: the file contains only feature rows, no header line)
test_features = pd.read_csv('test_features.csv', header=None)
# Preview test data
print(test_features.head())
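Since the endpoint expects CSV input, each batch has to be serialized as header-less rows before it is sent. The following is a minimal sketch of that step using only the standard library; the helper name `rows_to_csv` and the sample values are assumptions made for illustration. Using `csv.writer` rather than a plain string join means a field that itself contains a comma is quoted correctly.

```python
import csv
import io


def rows_to_csv(rows) -> str:
    """Serialize an iterable of rows to a header-less CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    # Drop the trailing newline so the payload ends on the last record
    return buffer.getvalue().rstrip("\n")


# Hypothetical feature rows
sample_rows = [[5.1, 3.5, 1.4], [6.2, 2.9, 4.3]]
print(rows_to_csv(sample_rows))
```

With the DataFrame loaded above, the rows could be passed as `test_features.values.tolist()`.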
Invoking the SageMaker Endpoint in Batches
To perform batch inference, the dataset is divided into chunks (batches), and each batch is sent to the endpoint for predictions. The results are collected and post-processed.
Code for Dividing Datasets into Chunks
The following code processes the test dataset in chunks of 100 rows:
import boto3

# Define batch size
batch_size = 100
predictions = []

# SageMaker runtime client
runtime_client = boto3.client('sagemaker-runtime')

# Process data in batches
for i in range(0, len(test_features), batch_size):
    batch = test_features.iloc[i:i + batch_size]
    # Serialize the batch as header-less CSV, one row per line
    batch_data = '\n'.join([','.join(map(str, row)) for row in batch.values])

    # Invoke endpoint (predictor is assumed to be the object for the already deployed model)
    response = runtime_client.invoke_endpoint(
        EndpointName=predictor.endpoint_name,
        Body=batch_data,
        ContentType='text/csv'
    )

    # Decode predictions (assumes the response holds one value per line)
    raw_result = response['Body'].read().decode('utf-8').strip()
    batch_predictions = list(map(float, raw_result.split('\n')))
    predictions.extend(batch_predictions)

print("Batch inference completed.")
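The loop above assumes that every response contains exactly one newline-separated value per input row. The sketch below shows one way to make those assumptions explicit, and it can be run without an AWS account because the endpoint call is passed in as a function. The names `parse_predictions`, `predict_in_batches`, and `fake_invoke` are hypothetical, and accepting comma-separated responses is an assumption about what a model container might return, not something the article's endpoint is known to do.

```python
import re
from typing import Callable, List, Sequence


def parse_predictions(raw: str) -> List[float]:
    """Parse a response body whose values are separated by commas or newlines."""
    tokens = [t for t in re.split(r"[,\s]+", raw.strip()) if t]
    return [float(t) for t in tokens]


def predict_in_batches(
    rows: Sequence[Sequence[float]],
    batch_size: int,
    invoke: Callable[[str], str],
) -> List[float]:
    """Send rows to `invoke` in CSV batches and collect one prediction per row."""
    predictions: List[float] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        payload = "\n".join(",".join(map(str, row)) for row in batch)
        batch_predictions = parse_predictions(invoke(payload))
        # Fail fast if the endpoint returned a different number of values
        if len(batch_predictions) != len(batch):
            raise ValueError(
                f"batch at row {start}: expected {len(batch)} predictions, "
                f"got {len(batch_predictions)}"
            )
        predictions.extend(batch_predictions)
    return predictions


def fake_invoke(payload: str) -> str:
    """Hypothetical stand-in for the endpoint: returns 0.5 for every input row."""
    return ",".join("0.5" for _ in payload.split("\n"))


rows = [[float(i), float(i + 1)] for i in range(250)]
print(len(predict_in_batches(rows, 100, fake_invoke)))  # prints: 250
```

In real use, `invoke` could be a small function that calls `runtime_client.invoke_endpoint` with the payload and returns the decoded response body. The length check catches a misaligned response at the batch where it occurs, before predictions drift out of step with the input rows.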
Output Example
The predictions are returned as probabilities:
[0.12, 0.87, 0.45, 0.78, 0.23, ...]
These can be thresholded (e.g., > 0.5) to classify inputs into binary categories.
Handling Predictions and Post-Processing
After obtaining the predictions, additional steps may include:
- Thresholding probabilities to generate binary classifications.
- Merging predictions with input data for reporting.
- Saving the results to a file or database for further analysis.
Code Example:
import numpy as np
# Convert probabilities to binary predictions
binary_predictions = (np.array(predictions) > 0.5).astype(int)
# Save predictions to CSV
output = pd.DataFrame({
    'Prediction': predictions,
    'Binary Classification': binary_predictions
})
output.to_csv('batch_predictions.csv', index=False)
print("Predictions saved to batch_predictions.csv")
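The example above covers thresholding and saving; merging predictions with the input data, the second step in the list, can be sketched as follows. This version uses only the standard library so it runs on its own. The function name `merge_predictions`, the column names, and the sample values are all assumptions made for illustration.

```python
import csv
import io
from typing import List, Sequence


def merge_predictions(
    rows: Sequence[Sequence[float]],
    probabilities: Sequence[float],
    threshold: float = 0.5,
) -> List[list]:
    """Append the probability and its thresholded label to each input row."""
    if len(rows) != len(probabilities):
        raise ValueError("rows and probabilities must have the same length")
    return [
        list(row) + [p, int(p > threshold)]
        for row, p in zip(rows, probabilities)
    ]


# Hypothetical two-feature inputs paired with made-up probabilities
inputs = [[5.1, 3.5], [6.2, 2.9], [4.7, 3.2]]
probs = [0.12, 0.87, 0.45]
merged = merge_predictions(inputs, probs)

# Write the report to an in-memory buffer; a file opened with newline="" works the same way
report = io.StringIO()
writer = csv.writer(report, lineterminator="\n")
writer.writerow(["feature_0", "feature_1", "prediction", "binary_classification"])
writer.writerows(merged)
print(report.getvalue())
```

Keeping the inputs next to their predictions makes the report self-describing, so a reader does not have to rely on row order to match a prediction to the record that produced it.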
Visual Representation of the Batch Inference Pipeline
Below is a simplified pipeline for batch inference:
- Input Data Preparation:
  - Load and preprocess the test dataset.
- Batch Processing:
  - Split the dataset into smaller chunks.
  - Send each chunk to the SageMaker endpoint.
  - Collect predictions for each batch.
- Post-Processing:
  - Combine predictions.
  - Threshold probabilities to generate classifications.
  - Save the final results for downstream tasks.
Batch inference with SageMaker endpoints provides an efficient way to handle large datasets. Dividing the data into chunks and sending one request per chunk reduces the number of endpoint invocations, while the batch size remains a parameter that can be tuned to the workload.