Problem
An enterprise customer running workloads on Amazon ECS/Fargate hit a critical reliability issue: their tasks used Active Directory (AD) DNS as their only resolver. Whenever AD DNS was down or slow, DNS lookups stalled or failed, and the delays cascaded into service calls. Even workloads with no AD dependency were held hostage to the health of AD DNS, which made it a single point of failure and an unnecessary operational risk.
Solution
BSC Analytics engineers re-architected the customer’s DNS strategy to eliminate the bottleneck. Instead of funneling all queries through AD DNS, we made the VPC’s built-in Route 53 Resolver (AmazonProvidedDNS) the primary DNS resolver for all ECS/Fargate workloads.
From there, we configured Route 53 Resolver outbound endpoints with conditional forwarding rules:
- AD-owned zones (such as corp.local and _msdcs.corp.local) were forwarded directly to the AD DNS servers.
- All other queries (public internet domains, Route 53 private hosted zones, and Cloud Map service discovery) were resolved natively by Route 53 Resolver.
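The split described above can be sketched as a small Python model. It is an illustration only, not the customer's configuration and not an AWS API call: it assumes Route 53 Resolver applies the forwarding rule whose domain is the most specific suffix match for the query, and falls back to native resolution otherwise. The ForwardRule type, the RULES list, and the AD DNS server addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForwardRule:
    domain: str        # zone suffix covered by the rule
    target_ips: tuple  # hypothetical AD DNS server addresses


# Hypothetical rule set: a single AD-owned zone forwarded to two AD DNS servers.
RULES = [
    ForwardRule("corp.local", ("10.0.0.10", "10.0.1.10")),
]


def _labels(name):
    """Split a DNS name into lowercase labels, ignoring a trailing dot."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def select_rule(qname, rules):
    """Return the rule with the longest label-wise suffix match, or None."""
    query = _labels(qname)
    best = None
    for rule in rules:
        zone = _labels(rule.domain)
        if len(zone) <= len(query) and query[len(query) - len(zone):] == zone:
            if best is None or len(zone) > len(_labels(best.domain)):
                best = rule
    return best


def describe(qname, rules=RULES):
    """Say where a query would go under this simplified model."""
    rule = select_rule(qname, rules)
    if rule is None:
        return "route53-resolver"  # resolved natively, AD DNS not involved
    return "forward:" + ",".join(rule.target_ips)
```

Matching is done per label rather than per character, so a name such as notcorp.local is not mistaken for a member of corp.local. A rule for corp.local also covers names under _msdcs.corp.local in this model.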
To further modernize the architecture, we also introduced ECS Service Discovery (backed by AWS Cloud Map) as an optional feature, allowing applications to resolve each other directly without AD in the resolution path. This shift dramatically reduced reliance on AD DNS and provided a more resilient, cloud-native foundation.
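As a rough illustration of app-to-app resolution, the sketch below shows how one task might look up a peer by its service discovery name using only the Python standard library. It assumes the service is registered with DNS records in a private namespace. The service name "orders", the namespace "apps.internal", and both helper functions are hypothetical.

```python
import socket


def discovery_name(service, namespace):
    """Build the DNS name a peer would use, e.g. orders.apps.internal."""
    return f"{service}.{namespace}".lower().rstrip(".")


def resolve_peer(service, namespace, port, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses behind a discovery name.

    The resolver argument defaults to the system resolver and can be
    swapped for a stub in tests.
    """
    host = discovery_name(service, namespace)
    infos = resolver(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    return sorted({info[4][0] for info in infos})
```

A call such as resolve_peer("orders", "apps.internal", 8080) from inside the VPC would go through the VPC resolver, so in the design described here AD DNS is not involved in that lookup.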
Benefits
The impact of this change was immediate and substantial:
- Isolated failure domains: Loss of AD DNS no longer crippled general name resolution. Only AD-specific lookups were impacted, minimizing blast radius.
- Faster responses: Route 53 handled the majority of queries natively, reducing latency and avoiding cascading timeouts.
- Improved resiliency: ECS workloads could now resolve dependencies without waiting on AD DNS availability.
- Future-proof architecture: ECS Service Discovery introduced a direct, scalable option for app-to-app resolution without external dependencies.
Because the change was rolled out with a short, controlled DNS cutover, validated through outbound rule checks and canary service testing, the customer was able to transition smoothly with minimal risk.
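A canary check of this kind could look roughly like the following sketch, which resolves a list of names and flags any that fail or take too long. It is not the tooling used in the engagement. The check_names function, the example names, and the 2-second threshold are assumptions for illustration.

```python
import socket
import time


def check_names(names, resolver=socket.gethostbyname, max_seconds=2.0,
                clock=time.monotonic):
    """Resolve each name and report True only if it succeeded quickly.

    This measures elapsed time after the lookup returns. It does not
    interrupt a lookup that hangs; the resolver's own timeout still applies.
    """
    results = {}
    for name in names:
        start = clock()
        try:
            resolver(name)
            resolved = True
        except OSError:  # socket.gaierror is a subclass of OSError
            resolved = False
        elapsed = clock() - start
        results[name] = resolved and elapsed <= max_seconds
    return results
```

Running it from a canary task with one AD-owned name and one non-AD name, for example check_names(["dc01.corp.local", "example.com"]), gives a quick signal on whether both the forwarded path and the native path answer after the cutover.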
With this solution in place, and backed by BSC Analytics’ 24/7 monitoring and NOC support, our customer gained both immediate reliability improvements and a long-term DNS strategy better aligned with cloud-native best practices.