Scaling an AI Voice Platform: Lessons in Performance and Cost Optimization on AWS

Todd Bernson, CTO of BSC Analytics and USMC veteran, shares real-world strategies for scaling an AI voice cloning platform on AWS. Learn when to use Lambda, EKS, or SageMaker for inference, how to queue jobs with SQS and Step Functions, and the CloudWatch metrics that actually matter. Todd breaks down performance tuning, cost optimization, and workload types, making this a tactical guide for any engineering team deploying AI voice solutions at scale.

Todd Bernson

2025-06-18

By Todd Bernson, CTO of BSC Analytics, USMC Veteran, and Guy Who Tunes Inference and Deadlifts


Building an AI-powered voice cloning platform is fun. Watching it get crushed under load because you didn’t scale it properly? Not so much.

In this post, we’re talking about real-world lessons from scaling a voice cloning solution that generates and serves thousands of audio messages — personalized, on-demand, and secured in AWS. Not in theory. In production. With logs to prove it.


TL;DR

You’ll learn:

  • When to use Lambda, EKS, or SageMaker for inference
  • How to batch workloads and queue intelligently
  • Cost control levers that keep your CFO from panicking
  • Why CloudWatch is your best friend and worst critic

The Problem

Generating voice responses isn’t like querying a database. Every request involves:

  • Model inference (heavy compute)
  • Audio storage (and sometimes conversion)
  • Input validation
  • Possibly authentication

Multiply that by tens of thousands of requests per day, and things start to sweat.

So how do you scale?


Step 1: Know Your Workload Types

Not all voice generation is equal.

Lightweight:

  • Short responses (“Your appointment is confirmed.”)
  • Real-time generation (user is waiting)
  • Low concurrency

Use: AWS Lambda

Heavyweight:

  • Longform responses
  • Background jobs (e.g., batch generation of 5,000 voicemails)
  • High concurrency

Use: EKS (spot for batch, on-demand for latency-sensitive)

GPU-Intensive:

  • Complex voices, multi-speaker, multi-language synthesis
  • Real-time delivery with near-zero latency
  • High fidelity outputs

Use: SageMaker endpoints (with multi-model containers if needed)
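
To make the three buckets above concrete, here is a minimal routing sketch in Python. The `VoiceJob` fields, the `choose_backend` function, and the 200-character cutoff are all hypothetical names and numbers invented for illustration; tune them against your own latency measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceJob:
    text: str
    realtime: bool       # is a user waiting on the response?
    needs_gpu: bool      # e.g. multi-speaker or multi-language synthesis
    batch_size: int = 1  # number of messages in this job


# Hypothetical cutoff for a "short" response; not a measured value.
SHORT_TEXT_CHARS = 200


def choose_backend(job: VoiceJob) -> str:
    """Map a job to one of the three workload buckets."""
    if job.needs_gpu:
        return "sagemaker"
    if job.realtime and job.batch_size == 1 and len(job.text) <= SHORT_TEXT_CHARS:
        return "lambda"
    # Longform, batch, and everything else goes to the EKS workers.
    return "eks"
```

The point is less the exact thresholds and more that the routing decision lives in one small, testable place.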


Step 2: Queue Everything

Even the fastest systems benefit from decoupling.

  • API Gateway writes to SQS → EKS workers poll the queue (SQS does not push to pods)
  • Use Step Functions for batch orchestration
  • Prioritize workloads (e.g., VIP client messages jump the queue)

This buys you buffer time, allows retry logic, and improves overall system health.
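
SQS standard queues have no built-in priority, so one common way to let VIP messages jump the queue is to run two queues and have workers drain the VIP one first. The sketch below only builds the request parameters; the queue URLs, the `tier` attribute, and the function names are placeholders, not the production setup.

```python
import json

# Hypothetical queue URLs (placeholder account ID); replace with your own.
QUEUE_URLS = {
    "vip": "https://sqs.us-east-1.amazonaws.com/123456789012/voice-jobs-vip",
    "standard": "https://sqs.us-east-1.amazonaws.com/123456789012/voice-jobs",
}


def build_send_params(job_id: str, text: str, vip: bool = False) -> dict:
    """Build keyword arguments for the SQS send_message call."""
    tier = "vip" if vip else "standard"
    return {
        "QueueUrl": QUEUE_URLS[tier],
        "MessageBody": json.dumps({"job_id": job_id, "text": text}),
        "MessageAttributes": {
            "tier": {"DataType": "String", "StringValue": tier},
        },
    }


def poll_order() -> list:
    """Order in which a worker should check queues: VIP first."""
    return [QUEUE_URLS["vip"], QUEUE_URLS["standard"]]
```

With boto3 installed, the producer side would be `boto3.client("sqs").send_message(**build_send_params("job-1", "Hello", vip=True))`, and a worker would call `receive_message` on each URL from `poll_order()` in turn.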


Step 3: Watch the Watchers (aka CloudWatch)

What to monitor:

  • EKS CPU/memory % over time
  • Lambda duration and cold start counts
  • API Gateway 5xx and latency percentiles
  • SQS queue length (spikes = backlog = unhappy customers)

Set alarms. Send alerts. Watch for cost and scale patterns.
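
As one example of "set alarms", here is a sketch of a backlog alarm on SQS queue length. It builds the arguments for CloudWatch's `put_metric_alarm`; the queue name, threshold, and SNS topic ARN are assumptions you would swap for your own, and the period and evaluation counts are starting points rather than recommendations.

```python
def queue_backlog_alarm(queue_name: str, threshold: int, sns_topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    Fires when the visible-message count stays above `threshold`
    for two consecutive 5-minute periods.
    """
    return {
        "AlarmName": f"{queue_name}-backlog",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 2,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

Applying it would look like `boto3.client("cloudwatch").put_metric_alarm(**queue_backlog_alarm("voice-jobs", 1000, topic_arn))`, where `topic_arn` is an SNS topic you already own.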


Step 4: Storage Strategy

Don't just dump audio into S3 and forget it. Be strategic.

  • Use S3 Standard for recently accessed files
  • Transition to S3 Standard-IA (Infrequent Access) after 30 days
  • Lifecycle delete after 90–180 days unless marked otherwise

Bonus: tag files by use case (e.g., welcome-message, alert, promo) and optimize access patterns.
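
The tiering above maps onto a single S3 lifecycle rule. This sketch assumes a hypothetical `retention=standard` object tag to scope the rule, so anything "marked otherwise" (tagged differently or untagged) is left alone; the rule ID and tag names are made up for illustration.

```python
def audio_lifecycle_rules(expire_days: int = 180) -> dict:
    """Build a LifecycleConfiguration for put_bucket_lifecycle_configuration.

    Objects tagged retention=standard move to Standard-IA at 30 days
    and are deleted after `expire_days`.
    """
    if expire_days <= 30:
        raise ValueError("expiration must come after the 30-day transition")
    return {
        "Rules": [
            {
                "ID": "voice-audio-tiering",
                "Status": "Enabled",
                # Hypothetical tag; only objects carrying it are affected.
                "Filter": {"Tag": {"Key": "retention", "Value": "standard"}},
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": expire_days},
            }
        ]
    }
```

With boto3 this would be applied as `s3.put_bucket_lifecycle_configuration(Bucket="my-audio-bucket", LifecycleConfiguration=audio_lifecycle_rules(90))`, where the bucket name is again a placeholder.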


Step 5: Cost Optimization Tactics

EKS

  • Spot nodes for batch jobs (up to 90% cheaper)
  • Tune pod CPU/memory requests to match actual model requirements
  • Use CloudWatch metrics to scale up/down containers
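
One simple way to turn a CloudWatch metric into a scaling decision is to size the worker pool from queue backlog. The formula below is a rough sketch, not the production autoscaler: `jobs_per_worker_per_min` and `drain_target_min` are numbers you would measure and choose yourself, and the min/max bounds are arbitrary defaults.

```python
import math


def desired_workers(
    backlog: int,
    jobs_per_worker_per_min: float,
    drain_target_min: float,
    min_workers: int = 1,
    max_workers: int = 50,
) -> int:
    """Workers needed to drain `backlog` within `drain_target_min` minutes."""
    if jobs_per_worker_per_min <= 0 or drain_target_min <= 0:
        raise ValueError("throughput and drain target must be positive")
    needed = math.ceil(backlog / (jobs_per_worker_per_min * drain_target_min))
    # Clamp so a spike cannot scale you into a surprise bill.
    return max(min_workers, min(max_workers, needed))
```

For the 5,000-voicemail batch mentioned earlier, at an assumed 10 jobs per worker per minute and a 30-minute target, `desired_workers(5000, 10, 30)` comes out to 17.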

API Gateway

  • If you exceed 10M calls/month, consider replacing API Gateway with an ALB in front of Lambda, or with Lambda function URLs

CloudFront

  • Cache voice files when possible
  • Use signed URLs for access control (not public-read S3)
  • What I actually did instead of signed URLs: mounted S3 directly into the EKS pod to simplify permissions.

Architecture Snapshot

[Frontend] → [API Gateway]
     ↓             ↓
 [Auth Layer] → [SQS]
                  ↓
                [EKS]
             ↓          ↓
        [S3 Audio]   [CloudWatch Logs]

Success Metrics That Matter

  • ✅ Avg response time
  • ✅ Batch jobs processed within SLA window
  • ✅ Cost per voice file
  • ✅ API success rate

If you’re not measuring these, you’re flying blind.
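
Three of these numbers fall out of data you already have. A minimal sketch, assuming you can pull a billing total, a file count, request and error counts, and a list of latencies for a reporting period (the `PeriodStats` shape and field names are invented for this example):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PeriodStats:
    total_cost_usd: float
    files_generated: int
    api_requests: int
    api_errors: int
    latencies_ms: List[float] = field(default_factory=list)


def _ratio(numerator: float, denominator: float) -> Optional[float]:
    """Divide, returning None instead of raising on an empty period."""
    return numerator / denominator if denominator else None


def summarize(p: PeriodStats) -> dict:
    return {
        "avg_response_ms": _ratio(sum(p.latencies_ms), len(p.latencies_ms)),
        "cost_per_file_usd": _ratio(p.total_cost_usd, p.files_generated),
        "api_success_rate": _ratio(p.api_requests - p.api_errors, p.api_requests),
    }
```

The batch SLA metric is left out on purpose, since it depends on how you define the window for each job type.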


Final Thoughts

Scaling a voice AI platform isn’t about tossing more compute at the problem. It’s about:

  • Understanding what type of workload you’re running
  • Decoupling smartly
  • Tuning services like an engine, not a hammer
  • Building enough observability to know when things go sideways

The best part? With AWS, you can build something that scales to millions — and still fits in a startup budget. If you design it right.
