Scaling an AI Voice Platform: Lessons in Performance and Cost Optimization on AWS
Todd Bernson, CTO of BSC Analytics and USMC veteran, shares real-world strategies for scaling an AI voice cloning platform on AWS. Learn when to use Lambda, EKS, or SageMaker for inference, how to queue jobs with SQS and Step Functions, and the CloudWatch metrics that actually matter. Todd breaks down performance tuning, cost optimization, and workload types, making this a tactical guide for any engineering team deploying AI voice solutions at scale.

Todd Bernson
2025-06-18

By Todd Bernson, CTO of BSC Analytics, USMC Veteran, and Guy Who Tunes Inference and Deadlifts
Building an AI-powered voice cloning platform is fun. Watching it get crushed under load because you didn’t scale it properly? Not so much.
In this post, we’re talking about real-world lessons from scaling a voice cloning solution that generates and serves thousands of audio messages — personalized, on-demand, and secured in AWS. Not in theory. In production. With logs to prove it.
TL;DR
You’ll learn:
- When to use Lambda, EKS, or SageMaker for inference
- How to batch workloads and queue intelligently
- Cost control levers that keep your CFO from panicking
- Why CloudWatch is your best friend and worst critic
The Problem
Generating voice responses isn’t like querying a database. Every request involves:
- Model inference (heavy compute)
- Audio storage (and sometimes conversion)
- Input validation
- Possibly authentication
Multiply that by tens of thousands of requests per day, and things start to sweat.
So how do you scale?
Step 1: Know Your Workload Types
Not all voice generation is equal.
Lightweight:
- Short responses (“Your appointment is confirmed.”)
- Real-time generation (user is waiting)
- Low concurrency
Use: AWS Lambda
Heavyweight:
- Longform responses
- Background jobs (e.g., batch generation of 5,000 voicemails)
- High concurrency
Use: EKS (spot for batch, on-demand for latency-sensitive)
GPU-Intensive:
- Complex voices, multi-speaker, multi-language synthesis
- Real-time delivery with near-zero latency
- High fidelity outputs
Use: SageMaker endpoints (with multi-model containers if needed)
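The three tiers above boil down to a routing decision. Here is a minimal sketch of that decision in Python. The `VoiceRequest` fields, the `SHORT_TEXT_CHARS` cutoff, and `route_request` itself are illustrative assumptions, not the platform's actual code.

```python
from dataclasses import dataclass

# Assumed cutoff for a "short" response; tune it against real latency data.
SHORT_TEXT_CHARS = 200


@dataclass
class VoiceRequest:
    text: str
    realtime: bool = False   # a user is waiting on the result
    needs_gpu: bool = False  # multi-speaker, multi-language, high fidelity
    batch_size: int = 1      # number of messages in the job


def route_request(req: VoiceRequest) -> str:
    """Pick a compute tier for one request (hypothetical tier names)."""
    if req.needs_gpu:
        return "sagemaker"
    if req.batch_size > 1 or len(req.text) > SHORT_TEXT_CHARS:
        # Spot capacity for batch work, on-demand when someone is waiting.
        return "eks-on-demand" if req.realtime else "eks-spot"
    return "lambda"
```

For example, `route_request(VoiceRequest("Your appointment is confirmed.", realtime=True))` lands on Lambda, while a 5,000-voicemail batch lands on spot-backed EKS.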
Step 2: Queue Everything
Even the fastest systems benefit from decoupling.
- API Gateway writes to SQS → workers on EKS poll the queue (SQS does not push to EKS on its own)
- Use Step Functions for batch orchestration
- Prioritize workloads (e.g., VIP client messages jump the queue)
This buys you buffer time, allows retry logic, and improves overall system health.
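One way to let VIP messages jump the queue is to run two queues and always check the VIP one first. The sketch below assumes that setup. `poll_once`, the queue URLs, and the `handle` callback are hypothetical names, and `sqs` is expected to behave like a boto3 SQS client (`receive_message` and `delete_message`).

```python
from typing import Any, Callable, Sequence


def poll_once(sqs: Any, queue_urls: Sequence[str], handle: Callable[[str], None]) -> int:
    """Process one batch from the first non-empty queue, in priority order.

    Returns the number of messages handled and deleted.
    """
    for url in queue_urls:
        resp = sqs.receive_message(
            QueueUrl=url,
            MaxNumberOfMessages=10,  # SQS maximum per receive call
            WaitTimeSeconds=0,       # short poll so an empty VIP queue adds no delay
        )
        messages = resp.get("Messages", [])
        if not messages:
            continue
        done = 0
        for msg in messages:
            try:
                handle(msg["Body"])
            except Exception:
                # Leave the message alone: it becomes visible again after the
                # visibility timeout, which is the retry mechanism.
                continue
            sqs.delete_message(QueueUrl=url, ReceiptHandle=msg["ReceiptHandle"])
            done += 1
        return done
    return 0
```

A worker would call `poll_once(client, [vip_url, standard_url], generate_audio)` in a loop and sleep briefly when it returns 0. Messages that keep failing are better routed to a dead-letter queue than retried forever.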
Step 3: Watch the Watchers (aka CloudWatch)
What to monitor:
- EKS CPU/memory % over time
- Lambda duration and cold start counts
- API Gateway 5xx and latency percentiles
- SQS queue length (spikes = backlog = unhappy customers)
Set alarms. Send alerts. Watch for cost and scale patterns.
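As an example of "set alarms", here is a sketch that builds the parameters for a CloudWatch alarm on SQS backlog. The function name, the alarm naming scheme, the threshold, and the evaluation window are assumptions. The metric, namespace, and parameter names are the standard ones for SQS and `put_metric_alarm`.

```python
def queue_backlog_alarm(queue_name: str, threshold: int, sns_topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm (sketch)."""
    return {
        "AlarmName": f"{queue_name}-backlog",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,              # seconds per datapoint (assumed)
        "EvaluationPeriods": 2,     # two bad periods in a row before alarming (assumed)
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**queue_backlog_alarm(...))
```

Keeping the parameters in a plain function makes them easy to review and unit test before anything touches the account.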
Step 4: Storage Strategy
Don't just dump audio into S3 and forget it. Be strategic.
- Use S3 Standard for recently accessed files
- Transition to S3 Standard-IA (Infrequent Access) after 30 days
- Lifecycle delete after 90–180 days unless marked otherwise
Bonus: tag files by use case (e.g., welcome-message, alert, promo) and optimize access patterns.
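The tiering rules above map onto an S3 lifecycle configuration. This sketch builds one as a plain dictionary. The `retention=standard` tag is an assumed convention for "not marked otherwise": objects that should be kept longer would carry a different tag and fall outside the rule. The rule ID and function name are also hypothetical.

```python
def audio_lifecycle_rules(expire_days: int = 180) -> dict:
    """Build an S3 lifecycle configuration for generated audio (sketch)."""
    if expire_days <= 30:
        raise ValueError("expiration must come after the 30-day transition")
    return {
        "Rules": [
            {
                "ID": "audio-tiering",  # hypothetical rule name
                "Status": "Enabled",
                # Only objects tagged retention=standard are tiered and expired.
                "Filter": {"Tag": {"Key": "retention", "Value": "standard"}},
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": expire_days},
            }
        ]
    }


# With boto3 this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-audio-bucket",  # placeholder bucket name
#       LifecycleConfiguration=audio_lifecycle_rules(),
#   )
```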
Step 5: Cost Optimization Tactics
EKS
- Spot nodes for batch jobs (up to 90% cheaper than On-Demand)
- Tune pod CPU/memory requests and limits to match actual model requirements
- Use CloudWatch metrics to scale pods up and down
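For metric-driven scaling, a common proportional rule is to size the worker pool from queue depth. The sketch below assumes the metric is the SQS backlog and that each pod can comfortably own `msgs_per_pod` queued messages. Both the function and its default bounds are illustrative, not values from this platform.

```python
import math


def desired_replicas(queue_depth: int, msgs_per_pod: int,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Pods needed to work off the backlog, clamped to the allowed range."""
    if msgs_per_pod <= 0:
        raise ValueError("msgs_per_pod must be positive")
    wanted = math.ceil(queue_depth / msgs_per_pod)
    return max(min_replicas, min(max_replicas, wanted))
```

With 120 queued messages and 50 messages per pod, this asks for 3 pods. The `max_replicas` cap is the cost control lever: it bounds spend even when the backlog spikes.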
API Gateway
- If you exceed 10M calls/month, consider putting Lambda behind an ALB or using Lambda Function URLs instead
CloudFront
- Cache voice files when possible
- Use signed URLs for access control (not public-read S3)
- What I actually did instead of signed URLs: mounted S3 directly into the EKS pod to simplify permissions.
Architecture Snapshot
[Frontend]   → [API Gateway]
    ↓                ↓
[Auth Layer] →     [SQS]
                     ↓
                   [EKS]
                  ↓     ↓
         [S3 Audio]     [CloudWatch Logs]
Success Metrics That Matter
- ✅ Avg response time
- ✅ Batch jobs processed within SLA window
- ✅ Cost per voice file
- ✅ API success rate
If you’re not measuring these, you’re flying blind.
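Three of these metrics can be computed from per-request records. A minimal sketch follows, assuming you log latency, outcome, and an attributed cost for each request. `JobRecord` and `summarize` are hypothetical names. The batch SLA check is left out because it depends on how your batch windows are defined.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class JobRecord:
    latency_s: float   # end-to-end response time in seconds
    succeeded: bool    # request produced a usable voice file
    cost_usd: float    # compute + storage cost attributed to the request


def summarize(records: Sequence[JobRecord]) -> dict:
    """Average response time, success rate, and cost per delivered file."""
    if not records:
        raise ValueError("no records to summarize")
    delivered = sum(1 for r in records if r.succeeded)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "avg_response_s": mean(r.latency_s for r in records),
        "success_rate": delivered / len(records),
        # Failed requests still cost money, so spread total spend over successes.
        "cost_per_file_usd": total_cost / delivered if delivered else float("inf"),
    }
```

Dividing total spend by successful files only is a deliberate choice: retries and failures show up as a higher cost per file instead of hiding in the average.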
Final Thoughts
Scaling a voice AI platform isn’t about tossing more compute at the problem. It’s about:
- Understanding what type of workload you’re running
- Decoupling smartly
- Tuning services like an engine, not a hammer
- Building enough observability to know when things go sideways
The best part? With AWS, you can build something that scales to millions — and still fits in a startup budget. If you design it right.