BLOG

Data Lake Monitoring & S3 Observability Blog

Technical posts on S3 monitoring, Delta Lake health, lakehouse observability, Athena tuning, and data pipeline debugging. For data engineers and platform teams.

FEATURED

Data Lake Health · 8 min read

Iceberg Orphan Files: Detection, Cleanup, and the Safe Order of Maintenance Operations

Orphan files in Apache Iceberg are Parquet files that live in S3 but are no longer referenced by any live snapshot, and they accumulate silently. Here is how to detect them, understand how much storage they waste, and run maintenance in the order that does not break time-travel queries.

May 6, 2026

ALL POSTS

Data Lake Health · 6 min read

How to Expire Iceberg Snapshots Without Breaking Time-Travel Queries

Running expire_snapshots too aggressively can break time-travel queries your downstream consumers depend on. Here is how to set the retention window correctly and verify expiry ran without corrupting your table.

Apr 14, 2026

Data Lake Health · 7 min read

Manifest Bloat in Apache Iceberg: How to Detect It and When to Run rewrite_manifests

Iceberg manifest files are the index your query engines read before touching a single data file. When manifests bloat to thousands of small manifest files per snapshot, query planning slows significantly before any data is read. Here is how to detect manifest bloat and when rewrite_manifests actually helps.

Mar 24, 2026

Data Lake Health · 9 min read

The Small Files Problem in Iceberg, Delta Lake, and Hudi: Compaction Strategies Compared

Small files inflate query times and S3 request costs across every open table format. The symptoms look the same: slow scans, high GET counts, growing object counts. But the fix is format-specific. Here is how to detect the small files problem in Iceberg, Delta Lake, and Hudi, and which compaction procedure to run.

Mar 3, 2026

Pipeline Observability · 8 min read

Detecting Silent Writers in S3-Backed Data Lakes: Firehose, Kinesis, MSK, Glue, and Spark Streaming

A silent writer is a pipeline that stops committing data to S3 without raising an error. Glue reports success, Firehose shows no failures, Airflow marks the task green. But the table is not being updated. Here is how to detect silent writers across every major S3-writing service using S3 access logs.

Feb 10, 2026

Data Lake Health · 7 min read

Athena Cost per Table: Attribution Using S3 Access Logs and CloudTrail

Athena charges per byte scanned. But the Athena console only tells you the total per-query scan size, not which table caused it, which team runs the most expensive queries, or which partition is scanned cold every time. S3 access logs give you that attribution layer.

Jan 20, 2026

Data Lake Health · 7 min read

Delta Lake _delta_log Bloat: Why Your Checkpoints Grow to 170 TB and How to Fix It

The Delta Lake transaction log lives in the _delta_log/ prefix of every Delta table. Without proper checkpoint and VACUUM configuration, this log accumulates JSON files, Parquet checkpoints, and orphaned data files that inflate storage costs and slow every reader that opens the table.

Dec 30, 2025

Data Lake Health · 6 min read

Hudi MOR Compaction Lag: How to Monitor It Without Touching the Writer

Apache Hudi MOR (Merge-on-Read) tables accumulate delta log files between compaction runs. As the log-to-base-file ratio grows, read amplification increases, because every query must merge more log files before returning results. Here is how to monitor compaction lag without instrumenting your Spark or Flink writers.

Dec 9, 2025

Data Lake Health · 8 min read

AWS S3 Tables vs Self-Managed Iceberg: Compaction Cost, Observability, and When to Switch

AWS S3 Tables is a managed Iceberg service that handles compaction, snapshot expiry, and orphan file removal automatically. But it comes with different pricing, reduced observability, and trade-offs for teams with complex table formats or existing tooling. Here is how to evaluate whether to switch.

Nov 18, 2025

Data Lake Health · 7 min read

Trino Query Cost Attribution: Joining Event-Listener Logs with S3 Access Logs

Trino does not expose per-query S3 costs natively. But every Trino query that reads Iceberg, Delta, or Hudi data generates S3 GET requests that appear in your access logs under the Trino connector's IAM role. Here is how to join Trino event-listener logs with S3 access logs to attribute query cost per table, per user, and per team.

Oct 28, 2025

Data Lake Health · 9 min read

Delta Lake Health Monitoring: What to Track and How to Find Issues Without Your Query Engine

Most Delta Lake health problems are invisible until they show up as slow queries or failed jobs. Here's how S3 access logs surface compaction lag, orphaned files, and checkpoint failures before they escalate.

Oct 7, 2025

S3 Monitoring · 8 min read

S3 Monitoring Beyond CloudWatch: Object-Level Visibility for Data Engineering Teams

CloudWatch tells you how much S3 storage you have. It doesn't tell you which tables are degrading, which pipelines have stopped writing, or which IAM roles are behaving unexpectedly. Here's what object-level visibility actually looks like.

Sep 16, 2025

Data Lake Health · 10 min read

Lakehouse Monitoring in 2026: Covering Iceberg, Delta Lake, and Hudi From One Place

Iceberg, Delta Lake, and Hudi all have different metadata models, compaction patterns, and failure modes. Here's how to monitor all three from a single place without running queries against each catalog.

Aug 26, 2025

Data Lake Health · 7 min read

Athena Monitoring From S3 Access Logs: Cold Partitions, Stale Results, and Scan Waste

Athena scan costs scale with how much data each query touches. S3 access logs reveal exactly which partitions are being hit, how often, and whether your results cache is doing anything useful.

Aug 5, 2025

Pipeline Observability · 8 min read

Data Pipeline Observability Without Instrumentation: How S3 Write Patterns Tell the Story

Adding observability to every ETL job takes time your team doesn't have. S3 write patterns already contain the signal you need to detect dead pipelines, cadence drift, and checkpoint failures.

Jul 15, 2025

Engineering · 7 min read

IAM Monitoring for AWS Data Teams: Who Is Accessing What and With What SDK

IAM roles, SDK versions, access frequency, and bucket boundaries: most data teams have no visibility into this layer until something goes wrong. Here's how to build that picture from S3 access logs.

Jun 24, 2025

S3 Monitoring · 9 min read

Data Lake Monitoring in 2026: What Good Looks Like Across S3 Environments

We analyzed petabytes of S3 usage across hundreds of data lake workloads to define what efficient S3 storage looks like in 2026, and where most teams still fall short.

Jun 3, 2025

S3 Monitoring · 7 min read

What S3 Access Logs Reveal About Your Data Lake That CloudWatch Hides

CloudWatch surfaces bucket-level metrics. S3 access logs tell you which tables are growing, which pipelines have stopped, and which roles are crossing boundaries. The difference matters.

May 13, 2025

Data Lake Health · 6 min read

S3 Data Transfer Monitoring: What Engineers Actually Need to Know

Data transfer costs are one of the most unpredictable parts of your AWS bill. Cross-region replication, CDN charges, and internal service calls all add up faster than you think.

Apr 22, 2025

Pipeline Observability · 8 min read

S3 Request Monitoring: How GET and PUT Patterns Expose Pipeline Problems

GET, PUT, and LIST requests are billed per operation on S3, and they add up fast at scale. More importantly, unusual request patterns are often the first sign a pipeline is broken.

Apr 1, 2025

S3 Monitoring · 5 min read

Object-Level S3 Monitoring vs Bucket Metrics: What the Difference Reveals

Bucket metrics tell you total spend. Object-level visibility tells you which tables, prefixes, and access patterns are driving it. Here's what you can and can't see at each level.

Mar 11, 2025

Data Lake Health · 9 min read

Delta Lake Health and S3 Lifecycle: What Your Monitoring Stack Misses

S3 lifecycle policies and Delta Lake compaction interact in ways most monitoring tools don't surface. Here's where the gaps are and how to close them.

Feb 18, 2025

Data Lake Health · 7 min read

S3 Storage Class Monitoring: Finding Cold Data Without Touching Your Catalog

S3 Intelligent-Tiering automates transitions but doesn't tell you what's cold, why, or whether it matches your access patterns. Object-level visibility gives you that picture first.

Jan 28, 2025

Engineering · 5 min read

Bucket-Level vs Object-Level S3 Monitoring: Why Engineers Need Both

Bucket-level metrics are fast and cheap. Object-level visibility is where the real signal lives. Here's when each matters and how to combine them effectively.

Jan 7, 2025

Data Lake Health · 6 min read

Small Files in S3: How Data Lake Monitoring Surfaces the Real Performance Cost

Every GET and LIST request on S3 has a price. Small files multiply your request count while frequent access patterns compound the cost. Here's how to see and fix it.

Dec 17, 2024

Pipeline Observability · 7 min read

How to Monitor S3 Access Patterns and Catch Pipeline Failures Early

S3 API call costs can quietly drain your budget. But more importantly, access patterns are one of the earliest signals of pipeline failure, schema drift, and data quality issues.

Nov 26, 2024

Pipeline Observability · 6 min read

Why Your AWS S3 Lifecycle Policies Might Be Costing You More Than You Think

Misconfigured lifecycle rules can end up costing more than doing nothing. Here's how to audit your existing policies and fix the patterns that silently inflate your S3 spend.

Nov 5, 2024