Scaling an AI Voice Platform: Lessons in Performance and Cost Optimization on AWS

Todd Bernson, CTO of BSC Analytics and USMC veteran, shares real-world strategies for scaling an AI voice cloning platform on AWS. Learn when to use Lambda, EKS, or SageMaker for inference, how to queue jobs with SQS and Step Functions, and the CloudWatch metrics that actually matter. Todd breaks down performance tuning, cost optimization, and workload types, making this a tactical guide for any engineering team deploying AI voice solutions at scale.

Todd Bernson

2025-06-18

By Todd Bernson, CTO of BSC Analytics, USMC Veteran, and Guy Who Tunes Inference and Deadlifts

Building an AI-powered voice cloning platform is fun. Watching it get crushed under load because you didn’t scale it properly? Not so much.

In this post, we’re talking about real-world lessons from scaling a voice cloning solution that generates and serves thousands of audio messages — personalized, on-demand, and secured in AWS. Not in theory. In production. With logs to prove it.

TL;DR

You’ll learn:

When to use Lambda, EKS, or SageMaker for inference
How to batch workloads and queue intelligently
Cost control levers that keep your CFO from panicking
Why CloudWatch is your best friend and worst critic

The Problem

Generating voice responses isn’t like querying a database. Every request involves:

Model inference (heavy compute)
Audio storage (and sometimes conversion)
Input validation
Possibly authentication

Multiply that by tens of thousands of requests per day, and things start to sweat.

So how do you scale?

Step 1: Know Your Workload Types

Not all voice generation is equal.

Lightweight:

Short responses (“Your appointment is confirmed.”)
Real-time generation (user is waiting)
Low concurrency

Use: AWS Lambda

Heavyweight:

Longform responses
Background jobs (e.g., batch generation of 5,000 voicemails)
High concurrency

Use: EKS (spot for batch, on-demand for latency-sensitive)

GPU-Intensive:

Complex voices, multi-speaker, multi-language synthesis
Real-time delivery with very low latency
High fidelity outputs

Use: SageMaker endpoints (with multi-model endpoints if needed)
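To make the three tiers concrete, here is a minimal routing sketch. The `VoiceRequest` fields, the `choose_backend` name, and the 200-character cutoff are all assumptions for illustration, not values from the production system. Tune them against your own latency data.

```python
from dataclasses import dataclass

# Hypothetical cutoff for a "short" response; measure before you trust it.
SHORT_TEXT_MAX_CHARS = 200


@dataclass(frozen=True)
class VoiceRequest:
    text: str
    realtime: bool           # True when a user is waiting on the audio
    needs_gpu: bool = False  # multi-speaker, multi-language, or high-fidelity voices


def choose_backend(req: VoiceRequest) -> str:
    """Map a request onto one of the three tiers: lambda, eks, or sagemaker."""
    if req.needs_gpu:
        return "sagemaker"
    if req.realtime and len(req.text) <= SHORT_TEXT_MAX_CHARS:
        return "lambda"
    # Longform or background work goes to the container fleet.
    return "eks"
```

The point of keeping this as one small function is that the routing rules live in a single place you can adjust when the numbers change.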

Step 2: Queue Everything

Even the fastest systems benefit from decoupling.

API Gateway writes to SQS → workers on EKS poll the queue (SQS does not push to EKS on its own)
Use Step Functions for batch orchestration
Prioritize workloads (e.g., VIP client messages jump the queue)

This buys you buffer time, allows retry logic, and improves overall system health.
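One way to get the "VIP jumps the queue" behavior is to use separate queues and have each worker check them in priority order. The sketch below assumes that setup. `drain_once` and the queue ordering are hypothetical names, and `sqs` is expected to behave like `boto3.client("sqs")`.

```python
import json
from typing import Any, Callable, Sequence


def drain_once(sqs: Any, queue_urls: Sequence[str], handle: Callable[[dict], None]) -> int:
    """Process one batch from the highest-priority queue that has messages.

    queue_urls is ordered from highest to lowest priority (VIP first).
    Returns the number of messages handled and deleted.
    """
    for position, url in enumerate(queue_urls):
        is_last = position == len(queue_urls) - 1
        resp = sqs.receive_message(
            QueueUrl=url,
            MaxNumberOfMessages=10,
            # Long-poll only the lowest-priority queue so VIP checks stay fast.
            WaitTimeSeconds=20 if is_last else 0,
        )
        messages = resp.get("Messages", [])
        if not messages:
            continue
        processed = 0
        for msg in messages:
            try:
                handle(json.loads(msg["Body"]))
            except Exception:
                # Leave the message alone; SQS redelivers it after the
                # visibility timeout, which is the retry logic for free.
                continue
            sqs.delete_message(QueueUrl=url, ReceiptHandle=msg["ReceiptHandle"])
            processed += 1
        return processed
    return 0
```

In a worker pod you would call this in a loop. Pairing each queue with a dead-letter queue keeps a poison message from being retried forever.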

Step 3: Watch the Watchers (aka CloudWatch)

What to monitor:

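EKS CPU/memory % over time (via Container Insights)
Lambda duration and cold start counts (cold starts show up as Init Duration in the REPORT log lines, not as a built-in metric)
API Gateway 5xx and latency percentiles
SQS queue length, i.e. ApproximateNumberOfMessagesVisible (spikes = backlog = unhappy customers)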

Set alarms. Send alerts. Watch for cost and scale patterns.
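As a sketch of what "set alarms" can look like for the queue backlog, the function below builds the arguments for CloudWatch's `put_metric_alarm`. The function names, alarm name, threshold, and evaluation window are assumptions; pick numbers that match your own SLA.

```python
def backlog_alarm(queue_name: str, threshold: int, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for a sustained SQS backlog."""
    return {
        "AlarmName": f"{queue_name}-backlog",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 2,   # alarm after roughly 10 minutes over threshold
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params: dict) -> None:
    """Send the alarm definition to CloudWatch (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above stays dependency-free

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the definition separate from the API call makes it easy to review alarm settings in a pull request before anything touches the account.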

Step 4: Storage Strategy

Don't just dump audio into S3 and forget it. Be strategic.

Use S3 Standard for recently accessed files
Transition to Infrequent Access after 30 days
Lifecycle delete after 90–180 days unless marked otherwise

Bonus: tag files by use case (e.g., welcome-message, alert, promo) and optimize access patterns.
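The tiering above maps onto an S3 lifecycle configuration fairly directly. One caveat: lifecycle filters select objects, they cannot exclude them, so "unless marked otherwise" is modeled here by applying the rule only to objects carrying a hypothetical `retention=standard` tag. The rule ID, tag, and function names are assumptions.

```python
def audio_lifecycle_rules(expire_after_days: int = 180) -> dict:
    """Build a lifecycle configuration: Standard-IA at 30 days, then expire."""
    if expire_after_days <= 30:
        raise ValueError("expiration must come after the 30-day IA transition")
    return {
        "Rules": [
            {
                "ID": "audio-ia-then-expire",  # hypothetical rule name
                "Status": "Enabled",
                # Only objects tagged retention=standard are affected;
                # anything tagged differently (or untagged) is kept.
                "Filter": {"Tag": {"Key": "retention", "Value": "standard"}},
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def apply_rules(bucket: str, config: dict) -> None:
    """Attach the configuration to a bucket (requires boto3 and credentials)."""
    import boto3  # lazy import keeps the builder testable without AWS

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

Note that this call replaces the bucket's existing lifecycle configuration, so include every rule you want to keep.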

Step 5: Cost Optimization Tactics

EKS

Spot nodes for batch jobs (up to 90% cheaper than On-Demand)
Tune pod CPU/memory requests and limits to match actual model requirements
Use CloudWatch metrics to scale pods up and down
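For queue-driven workers, the scaling math is usually just backlog divided by what one pod can chew through, clamped to sane bounds. This is a sketch of that arithmetic only. The function name, per-pod target, and min/max defaults are assumptions, and the actual scaling would be done by whatever autoscaler you run on the cluster.

```python
import math


def desired_replicas(
    backlog: int,
    messages_per_pod: int,
    min_replicas: int = 1,
    max_replicas: int = 20,
) -> int:
    """Return the pod count needed to work off `backlog` queued messages."""
    if messages_per_pod <= 0:
        raise ValueError("messages_per_pod must be positive")
    wanted = math.ceil(backlog / messages_per_pod)
    # Clamp so an empty queue keeps a warm pod and a spike cannot blow the budget.
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 50 messages per pod, a backlog of 120 asks for 3 pods, and a backlog of 5,000 is capped at the 20-pod ceiling.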

API Gateway

If you exceed 10M calls/month, consider putting an ALB in front of Lambda, or using Lambda Function URLs

CloudFront

Cache voice files when possible
Use signed URLs for access control (not public-read S3)
What I did instead of the above was mount S3 directly into the EKS pods to simplify permissions.

Architecture Snapshot

[Frontend] → [API Gateway]
     ↓             ↓
[Auth Layer] →   [SQS]
                   ↓
                 [EKS]
              ↓         ↓
         [S3 Audio]   [CloudWatch Logs]

Success Metrics That Matter

✅ Avg response time
✅ Batch jobs processed within SLA window
✅ Cost per voice file
✅ API success rate

If you’re not measuring these, you’re flying blind.
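None of these need a fancy dashboard to get started. A minimal sketch of the arithmetic, assuming you can pull latencies, request counts, and a cost figure for the same time window (the `Summary` type and `summarize` function are hypothetical names):

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Summary:
    avg_response_ms: float
    success_rate: float   # 0.0 to 1.0
    cost_per_file: float  # USD


def summarize(
    latencies_ms: Sequence[float],
    succeeded: int,
    total_requests: int,
    total_cost_usd: float,
    files_generated: int,
) -> Summary:
    """Compute average response time, API success rate, and cost per voice file."""
    if not latencies_ms or total_requests <= 0 or files_generated <= 0:
        raise ValueError("need at least one latency, request, and generated file")
    return Summary(
        avg_response_ms=sum(latencies_ms) / len(latencies_ms),
        success_rate=succeeded / total_requests,
        cost_per_file=total_cost_usd / files_generated,
    )
```

Averages hide tail latency, so once this is in place it is worth tracking percentiles alongside it.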

Final Thoughts

Scaling a voice AI platform isn’t about tossing more compute at the problem. It’s about:

Understanding what type of workload you’re running
Decoupling smartly
Tuning services like an engine, not a hammer
Building enough observability to know when things go sideways

The best part? With AWS, you can build something that scales to millions — and still fits in a startup budget. If you design it right.
