
By Todd Bernson, CTO of BSC Analytics, USMC Veteran, and Guy Who Tunes Inference and Deadlifts
Building an AI-powered voice cloning platform is fun. Watching it get crushed under load because you didn’t scale it properly? Not so much.
In this post, we’re talking about real-world lessons from scaling a voice cloning solution that generates and serves thousands of audio messages — personalized, on-demand, and secured in AWS. Not in theory. In production. With logs to prove it.
TL;DR
You’ll learn:
- When to use EKS vs. SageMaker for inference
- How to batch workloads and queue intelligently
- Cost control levers that keep your CFO from panicking
- Why CloudWatch is your best friend and worst critic
The Problem
Generating voice responses isn’t like querying a database. Every request involves:
- Model inference (heavy compute)
- Audio storage (and sometimes conversion)
- Input validation
- Possibly authentication
Multiply that by tens of thousands of requests per day, and things start to sweat.
So how do you scale?
Step 1: Know Your Workload Types
Not all voice generation is equal.
Lightweight:
- Short responses (“Your appointment is confirmed.”)
- Real-time generation (user is waiting)
- Low concurrency
Use: AWS Lambda
Heavyweight:
- Longform responses
- Background jobs (e.g., batch generation of 5,000 voicemails)
- High concurrency
Use: EKS (spot for batch, on-demand for latency-sensitive)
GPU-Intensive:
- Complex voices, multi-speaker, multi-language synthesis
- Real-time delivery with near-zero latency
- High fidelity outputs
Use: SageMaker endpoints (with multi-model containers if needed)
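To make that split concrete, here's a minimal routing sketch in Python. The field names and thresholds (needs_gpu, is_batch, the 500-character cutoff) are illustrative assumptions, not part of the actual platform:

```python
def pick_target(request: dict) -> str:
    """Decide which AWS service should handle a synthesis request."""
    if request.get("needs_gpu") or request.get("speakers", 1) > 1:
        return "sagemaker"  # GPU-heavy, multi-speaker, high-fidelity work
    if request.get("is_batch") or len(request.get("text", "")) > 500:
        return "eks"        # long-form or background batch jobs
    return "lambda"         # short, real-time responses the user is waiting on
```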
Step 2: Queue Everything
Even the fastest systems benefit from decoupling.
- API Gateway publishes to SQS → workers on EKS poll the queue
- Use Step Functions for batch orchestration
- Prioritize workloads (e.g., VIP client messages jump the queue)
This buys you buffer time, allows retry logic, and improves overall system health.
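SQS has no built-in message priority, so one common way to let VIP messages jump the queue is to run two queues and have workers drain the priority queue first. A minimal sketch of the producer side, with placeholder queue URLs and a hypothetical vip flag:

```python
import json
import boto3

sqs = boto3.client("sqs")

# Placeholder queue URLs -- substitute your own.
PRIORITY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/voice-jobs-priority"
STANDARD_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/voice-jobs"

def enqueue_job(job: dict) -> None:
    """Send VIP jobs to the priority queue, everything else to the standard one."""
    queue_url = PRIORITY_QUEUE_URL if job.get("vip") else STANDARD_QUEUE_URL
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(job))
```

On the consumer side, workers poll the priority queue first and only fall back to the standard queue when it's empty.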
Step 3: Watch the Watchers (aka CloudWatch)
What to monitor:
- EKS CPU/memory % over time
- Lambda duration and cold start counts
- API Gateway 5xx and latency percentiles
- SQS queue length (spikes = backlog = unhappy customers)
Set alarms. Send alerts. Watch for cost and scale patterns.
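As one example of "set alarms," here's a sketch of a CloudWatch alarm on SQS queue depth using boto3. The queue name, threshold, and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="voice-jobs-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "voice-jobs"}],
    Statistic="Average",
    Period=300,                      # evaluate in 5-minute windows
    EvaluationPeriods=2,             # two consecutive breaches before alarming
    Threshold=1000,                  # ~1,000 queued jobs = customers are waiting
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```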
Step 4: Storage Strategy
Don't just dump audio into S3 and forget it. Be strategic.
- Use S3 Standard for recently accessed files
- Transition to Infrequent Access after 30 days
- Lifecycle delete after 90–180 days unless marked otherwise
Bonus: tag files by use case (e.g., welcome-message, alert, promo) and optimize access patterns.
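The lifecycle rules above can be expressed in a few lines with boto3. This is a rough sketch with a placeholder bucket and prefix; the "unless marked otherwise" exception would need a tag-based filter on top of it:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="voice-audio-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-audio",
                "Filter": {"Prefix": "audio/"},
                "Status": "Enabled",
                # Standard -> Infrequent Access at 30 days
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                # Delete at 180 days
                "Expiration": {"Days": 180},
            }
        ]
    },
)
```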
Step 5: Cost Optimization Tactics
EKS
- Spot instances for batch jobs (up to 90% cheaper than on-demand)
- Tune pod CPU/memory requests to match actual model requirements
- Use CloudWatch metrics to scale pods up and down
API Gateway
- If you exceed 10M calls/month, consider ALB + Lambda, or Lambda Function URLs, to cut per-request costs
CloudFront
- Cache voice files when possible
- Use signed URLs for access control (not public-read S3)
- What I did instead of ☝️ was mount S3 directly to the pod in EKS to simplify permissions.
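If you do go the signed-URL route (rather than mounting S3 into the pods), the shortest possible illustration is an S3 presigned URL via boto3. CloudFront has its own signed-URL mechanism built on key pairs; the S3 version is shown here only because it fits in a few lines. Bucket and key are placeholders:

```python
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "voice-audio-bucket", "Key": "audio/welcome-message-123.mp3"},
    ExpiresIn=900,  # link expires after 15 minutes
)
print(url)
```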
Architecture Snapshot
[Frontend] → [API Gateway]
     ↓             ↓
[Auth Layer]  →  [SQS]
                   ↓
                 [EKS]
                 ↓    ↓
        [S3 Audio]  [CloudWatch Logs]
Success Metrics That Matter
- ✅ Avg response time
- ✅ Batch jobs processed within SLA window
- ✅ Cost per voice file
- ✅ API success rate
If you’re not measuring these, you’re flying blind.
Final Thoughts
Scaling a voice AI platform isn’t about tossing more compute at the problem. It’s about:
- Understanding what type of workload you’re running
- Decoupling smartly
- Tuning services like an engine, not a hammer
- Building enough observability to know when things go sideways
The best part? With AWS, you can build something that scales to millions — and still fits in a startup budget. If you design it right.