
By Todd Bernson, CTO of BSC Analytics, USMC Veteran, and Guy Who Tunes Inference and Deadlifts
Building an AI-powered voice cloning platform is fun. Watching it get crushed under load because you didn’t scale it properly? Not so much.
In this post, we’re talking about real-world lessons from scaling a voice cloning solution that generates and serves thousands of audio messages — personalized, on-demand, and secured in AWS. Not in theory. In production. With logs to prove it.
TL;DR
You’ll learn:
- When to use EKS vs. SageMaker for inference
- How to batch workloads and queue intelligently
- Cost control levers that keep your CFO from panicking
- Why CloudWatch is your best friend and worst critic
The Problem
Generating voice responses isn’t like querying a database. Every request involves:
- Model inference (heavy compute)
- Audio storage (and sometimes conversion)
- Input validation
- Possibly authentication
Multiply that by tens of thousands of requests per day, and things start to sweat.
So how do you scale?
Step 1: Know Your Workload Types
Not all voice generation is equal.
Lightweight:
- Short responses (“Your appointment is confirmed.”)
- Real-time generation (user is waiting)
- Low concurrency
Use: AWS Lambda
Heavyweight:
- Longform responses
- Background jobs (e.g., batch generation of 5,000 voicemails)
- High concurrency
Use: EKS (spot for batch, on-demand for latency-sensitive)
GPU-Intensive:
- Complex voices, multi-speaker, multi-language synthesis
- Real-time delivery with near-zero latency
- High fidelity outputs
Use: SageMaker endpoints (with multi-model containers if needed)
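To make that split concrete, here's a minimal routing sketch in Python. The field names and thresholds (needs_gpu, is_batch, the 500-character cutoff) are illustrative assumptions, not part of the actual platform:

```python
def pick_target(request: dict) -> str:
    """Decide which AWS service should handle a synthesis request."""
    if request.get("needs_gpu") or request.get("speakers", 1) > 1:
        return "sagemaker"  # GPU-heavy, multi-speaker, high-fidelity work
    if request.get("is_batch") or len(request.get("text", "")) > 500:
        return "eks"        # long-form or background batch jobs
    return "lambda"         # short, real-time responses the user is waiting on
```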
Step 2: Queue Everything
Even the fastest systems benefit from decoupling.
- API Gateway publishes to SQS → workers on EKS poll the queue
- Use Step Functions for batch orchestration
- Prioritize workloads (e.g., VIP client messages jump the queue)
This buys you buffer time, allows retry logic, and improves overall system health.
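SQS has no built-in message priority, so one common way to let VIP messages jump the queue is to run two queues and have workers drain the priority queue first. A minimal sketch of the producer side, with placeholder queue URLs and a hypothetical vip flag:

```python
import json
import boto3

sqs = boto3.client("sqs")

# Placeholder queue URLs -- substitute your own.
PRIORITY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/voice-jobs-priority"
STANDARD_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/voice-jobs"

def enqueue_job(job: dict) -> None:
    """Send VIP jobs to the priority queue, everything else to the standard one."""
    queue_url = PRIORITY_QUEUE_URL if job.get("vip") else STANDARD_QUEUE_URL
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(job))
```

On the consumer side, workers poll the priority queue first and only fall back to the standard queue when it's empty.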
Step 3: Watch the Watchers (aka CloudWatch)
What to monitor:
- EKS CPU/memory % over time
- Lambda duration and cold start counts
- API Gateway 5xx and latency percentiles
- SQS queue length (spikes = backlog = unhappy customers)
Set alarms. Send alerts. Watch for cost and scale patterns.
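As one example of "set alarms," here's a sketch of a CloudWatch alarm on SQS queue depth using boto3. The queue name, threshold, and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="voice-jobs-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "voice-jobs"}],
    Statistic="Average",
    Period=300,                      # evaluate in 5-minute windows
    EvaluationPeriods=2,             # two consecutive breaches before alarming
    Threshold=1000,                  # ~1,000 queued jobs = customers are waiting
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```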
Step 4: Storage Strategy
Don't just dump audio into S3 and forget it. Be strategic.
- Use S3 Standard for recently accessed files
- Transition to Infrequent Access after 30 days
- Lifecycle delete after 90–180 days unless marked otherwise
Bonus: tag files by use case (e.g., welcome-message, alert, promo) and optimize access patterns.
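The lifecycle rules above can be expressed in a few lines with boto3. This is a rough sketch with a placeholder bucket and prefix; the "unless marked otherwise" exception would need a tag-based filter on top of it:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="voice-audio-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-audio",
                "Filter": {"Prefix": "audio/"},
                "Status": "Enabled",
                # Standard -> Infrequent Access at 30 days
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                # Delete at 180 days
                "Expiration": {"Days": 180},
            }
        ]
    },
)
```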
Step 5: Cost Optimization Tactics
EKS
- Spot instances for batch jobs (up to 90% cheaper than on-demand)
- Tune pod CPU/memory requests to match actual model requirements
- Use CloudWatch metrics to scale pods up and down
API Gateway
- If you exceed 10M calls/month, consider ALB + Lambda, or Lambda Function URLs, to cut per-request costs
CloudFront
- Cache voice files when possible
- Use signed URLs for access control (not public-read S3)
- What I did instead of ☝️ was mount S3 directly to the pod in EKS to simplify permissions.
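If you do go the signed-URL route (rather than mounting S3 into the pods), the shortest possible illustration is an S3 presigned URL via boto3. CloudFront has its own signed-URL mechanism built on key pairs; the S3 version is shown here only because it fits in a few lines. Bucket and key are placeholders:

```python
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "voice-audio-bucket", "Key": "audio/welcome-message-123.mp3"},
    ExpiresIn=900,  # link expires after 15 minutes
)
print(url)
```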
Architecture Snapshot
[Frontend] → [API Gateway]
     ↓             ↓
[Auth Layer]  →  [SQS]
                   ↓
                 [EKS]
                 ↓    ↓
        [S3 Audio]  [CloudWatch Logs]
Success Metrics That Matter
- ✅ Avg response time
- ✅ Batch jobs processed within SLA window
- ✅ Cost per voice file
- ✅ API success rate
If you’re not measuring these, you’re flying blind.
Final Thoughts
Scaling a voice AI platform isn’t about tossing more compute at the problem. It’s about:
- Understanding what type of workload you’re running
- Decoupling smartly
- Tuning services like an engine, not a hammer
- Building enough observability to know when things go sideways
The best part? With AWS, you can build something that scales to millions — and still fits in a startup budget. If you design it right.