AWS Lake Formation: Part 2 Advanced S3 Configurations

Todd Bernson
2024-06-28

This part of the AWS Lake Formation series focuses on optimizing Amazon S3 configurations for data lakes using Terraform.

We'll explore advanced techniques for structuring S3 buckets and implementing security and lifecycle management features.
Optimal S3 Bucket Structuring for Data Lakes
When configuring S3 buckets for data lakes, the structure and organization of the buckets are crucial for performance, cost efficiency, and ease of data management.
Clone the project here.
Best Practices for Bucket Structuring:
- Prefix and Folder Strategy:
- Tagging for Governance and Cost Management:
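As a rough illustration of the tagging practice, here is a minimal sketch of a tagged bucket. The variable name, the tag keys and values, and the prefix names in the comments are all hypothetical and not taken from the project; cost allocation tags also have to be activated in the AWS Billing console before they appear in cost reports.

```hcl
# Hypothetical input: bucket names are globally unique, so none is hard-coded.
variable "bucket_name" {
  type        = string
  description = "Name for the data lake bucket (illustrative)"
}

resource "aws_s3_bucket" "data_lake" {
  bucket = var.bucket_name

  # Illustrative tag keys; use whatever your governance policy defines.
  tags = {
    Environment = "dev"
    DataDomain  = "web-crawl"
    CostCenter  = "analytics"
  }
}

# One possible prefix layout (an assumption, not the project's layout):
#   raw/      data as ingested
#   curated/  cleaned and partitioned data
#   logs/     access and job logs
```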
Here is the folder structure of the bucket we crawled:
aws s3 ls s3://commoncrawl/
.test/
cc-index/
contrib/
crawl-001/
crawl-002/
crawl-analysis/
crawl-data/
hive_analysis/
index2012/
mapred-temp/
meanpath/
parse-output-test/
parse-output/
projects/
static/
stats-output/
wikipedia/
Implementing Lifecycle Policies
S3 lifecycle policies automate storage-class transitions and retention, reducing costs as data ages without manual cleanup.
Key Lifecycle Actions:
Transition to Infrequent Access:
Transitioning objects to the S3 Standard-Infrequent Access (Standard-IA) storage class is a cost-effective strategy for data that is accessed less often but still needs rapid retrieval when requested. Standard-IA has a lower storage price than S3 Standard, in exchange for a per-GB retrieval fee and a 30-day minimum storage duration.
- Ideal for backups, disaster recovery files, or older project files that are not needed regularly but must be readily accessible.
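A transition like this could be sketched with the AWS provider's standalone lifecycle resource (provider v4 or later). The variable, the rule ID, and the `backups/` prefix are assumptions made for illustration. A bucket holds a single lifecycle configuration, so in practice this rule would sit alongside any other rules in the same resource.

```hcl
# Hypothetical input: the name of an existing data lake bucket.
variable "data_lake_bucket" {
  type        = string
  description = "Name of the S3 bucket that holds the data lake (assumed to exist)"
}

resource "aws_s3_bucket_lifecycle_configuration" "infrequent_access" {
  bucket = var.data_lake_bucket

  rule {
    id     = "move-backups-to-standard-ia"
    status = "Enabled"

    # Only objects under this (hypothetical) prefix are affected.
    filter {
      prefix = "backups/"
    }

    # S3 requires objects to be at least 30 days old before a
    # lifecycle transition to STANDARD_IA.
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
  }
}
```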
Expiration:
Automatically deleting old or obsolete data is crucial for managing storage costs and compliance. Expiration policies can be set to automatically remove objects after a defined period, ensuring that your data lake does not store unnecessary data.
- Useful for logs, temporary files, or datasets with a predetermined relevance period, such as event-driven data or promotional materials.
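An expiration rule scoped to a single prefix might look like the sketch below. The variable, the `logs/` prefix, and the 90-day and 30-day periods are illustrative assumptions. On a versioned bucket, expiring the current version only adds a delete marker, which is why the sketch also expires noncurrent versions.

```hcl
# Hypothetical input: the name of an existing data lake bucket.
variable "data_lake_bucket" {
  type        = string
  description = "Name of the S3 bucket that holds the data lake (assumed to exist)"
}

resource "aws_s3_bucket_lifecycle_configuration" "expire_logs" {
  bucket = var.data_lake_bucket

  rule {
    id     = "expire-logs"
    status = "Enabled"

    # Only objects under this (hypothetical) prefix are affected.
    filter {
      prefix = "logs/"
    }

    # Expire current versions 90 days after creation.
    expiration {
      days = 90
    }

    # Permanently delete versions 30 days after they become noncurrent.
    noncurrent_version_expiration {
      noncurrent_days = 30
    }

    # Clean up multipart uploads that never completed.
    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}
```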
Terraform Code Snippet for Lifecycle Policies:
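lifecycle_rule = [
  {
    id      = "cleanup"
    enabled = true

    # Abort multipart uploads that have not completed after 7 days.
    abort_incomplete_multipart_upload_days = 7

    # Move objects to Intelligent-Tiering 7 days after creation
    # (this snippet uses Intelligent-Tiering rather than Standard-IA).
    transition = [
      {
        days          = 7
        storage_class = "INTELLIGENT_TIERING"
      }
    ]

    # Expire current object versions after 366 days.
    expiration = {
      days = 366
    }

    # Permanently delete noncurrent versions after 366 days.
    noncurrent_version_expiration = [
      {
        days = 366
      }
    ]
  }
]

# Noncurrent versions only exist on versioned buckets, so versioning is enabled here.
versioning = {
  enabled = true
}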
Implementing Encryption at Rest
Encrypting data at rest in S3 is critical for ensuring the security of sensitive data within a data lake.
Encryption Options:
SSE-S3:
SSE-S3 provides an automated encryption solution where Amazon manages the encryption keys. It uses one of the strongest block ciphers available, AES-256, to encrypt your data at rest.
- Advantages: Simplicity and ease of use, as manual key management is not required.
- Use Cases: This solution is suitable for general-purpose data that needs to be secured at rest, where custom key management practices are not required.
SSE-KMS:
SSE-KMS enhances the security features of S3 encryption by allowing you to manage encryption keys using AWS KMS. This method gives you control over the keys and a detailed audit trail of their use.
- Advantages: Increased security control, key rotation, and audit capabilities. It also supports the creation of customer-managed keys with specific policies.
- Use Cases: Ideal for sensitive or regulated data, where compliance requirements dictate key management practices and access logging.
Terraform Code Snippet for Enabling Encryption:
server_side_encryption_configuration = {
  rule = {
    apply_server_side_encryption_by_default = {
      # "AES256" selects SSE-S3; "aws:kms" would select SSE-KMS.
      sse_algorithm = "AES256"
    }
  }
}
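The snippet above configures SSE-S3. For the SSE-KMS option described earlier, a minimal sketch using the AWS provider's standalone resources (provider v4 or later) might look like the following. The variable, resource names, and key settings are assumptions for illustration, not part of the project.

```hcl
# Hypothetical input: the name of an existing data lake bucket.
variable "data_lake_bucket" {
  type        = string
  description = "Name of the S3 bucket that holds the data lake (assumed to exist)"
}

# Customer managed key with automatic rotation enabled.
resource "aws_kms_key" "data_lake" {
  description             = "Customer managed key for data lake objects"
  enable_key_rotation     = true
  deletion_window_in_days = 30
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = var.data_lake_bucket

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.data_lake.arn
    }

    # S3 Bucket Keys reduce the number of requests S3 makes to KMS.
    bucket_key_enabled = true
  }
}
```

With a customer managed key, readers and writers need permissions on the key as well as on the bucket, so the key policy and IAM policies have to be planned together.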
Optimizing your S3 buckets with proper structuring, lifecycle policies, and encryption using Terraform enhances the efficiency and security of your data lake and aligns with best practices for cost management and compliance. By implementing these advanced configurations, you establish a robust foundation for your data lake infrastructure on AWS.