Technical posts on S3 monitoring, Delta Lake health, lakehouse observability, Athena tuning, and data pipeline debugging. For data engineers and platform teams.
Running expire_snapshots too aggressively can break time-travel queries your downstream consumers depend on. Here is how to set the retention window correctly and verify expiry ran without corrupting your table.
Iceberg manifest files are the index your query engines read before touching a single data file. When manifests bloat into thousands of small manifest files per snapshot, query planning slows significantly before any data is read. Here is how to detect manifest bloat and when rewrite_manifests actually helps.
Small files inflate query times and S3 request costs across every open table format. The symptoms look the same: slow scans, high GET counts, growing object counts. But the fix is format-specific. Here is how to detect the small files problem in Iceberg, Delta Lake, and Hudi, and which compaction procedure to run.
A silent writer is a pipeline that stops committing data to S3 without raising an error. Glue reports success, Firehose shows no failures, Airflow marks the task green. But the table is not being updated. Here is how to detect silent writers across every major S3-writing service using S3 access logs.
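The silent-writer check comes down to comparing each prefix's most recent PUT against the cadence you expect from it. A minimal sketch, assuming you have already reduced your S3 access logs to one "last PUT" timestamp per table prefix. The function name, the `grace` multiplier, and the prefixes are hypothetical and not taken from any AWS API:

```python
from datetime import datetime, timedelta, timezone


def find_silent_writers(last_put_by_prefix, expected_interval, now, grace=2.0):
    """Return prefixes whose latest PUT is older than grace * expected interval.

    last_put_by_prefix: dict of prefix -> datetime of the most recent PUT
                        (assumed to be pre-aggregated from S3 access logs).
    expected_interval:  dict of prefix -> timedelta between expected commits.
    grace:              hypothetical tolerance multiplier to absorb jitter.
    """
    silent = []
    for prefix, last_put in last_put_by_prefix.items():
        interval = expected_interval.get(prefix)
        if interval is None:
            continue  # no known cadence, so there is nothing to compare against
        if now - last_put > interval * grace:
            silent.append(prefix)
    return sorted(silent)


if __name__ == "__main__":
    now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    last_put = {
        "warehouse/orders/": now - timedelta(hours=5),
        "warehouse/clicks/": now - timedelta(minutes=20),
    }
    cadence = {
        "warehouse/orders/": timedelta(hours=1),
        "warehouse/clicks/": timedelta(minutes=15),
    }
    print(find_silent_writers(last_put, cadence, now))
```

The check deliberately ignores what the orchestrator reports: a task can be green in Airflow and still show up here, because only the write timestamps are consulted.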
Athena charges per byte scanned. But the Athena console only tells you the total per-query scan size, not which table caused it, which team runs the most expensive queries, or which partition is scanned cold every time. S3 access logs give you that attribution layer.
The Delta Lake transaction log lives in the _delta_log/ prefix of every Delta table. Without proper checkpoint and VACUUM configuration, this log accumulates JSON files, Parquet checkpoints, and orphaned data files that inflate storage costs and slow every reader that opens the table.
Apache Hudi MOR (Merge-on-Read) tables accumulate delta log files between compaction runs. As the log-to-base-file ratio grows, read amplification increases. Every query must merge more log files before returning results. Here is how to monitor compaction lag without instrumenting your Spark or Flink writers.
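The log-to-base-file ratio can be approximated from object keys alone. A rough sketch, assuming log files carry `.log.` in their file name and base files end in `.parquet`; check this naming against your own table layout before relying on it. The function name and sample keys are hypothetical:

```python
from collections import defaultdict


def log_to_base_ratio(keys):
    """Return partition path -> (log file count / base file count).

    Assumption: names containing ".log." are delta log files and names
    ending in ".parquet" are base files. Keys under the table's .hoodie/
    metadata directory are skipped.
    """
    counts = defaultdict(lambda: {"log": 0, "base": 0})
    for key in keys:
        if "/.hoodie/" in key:
            continue  # timeline and metadata files, not table data
        partition, _, name = key.rpartition("/")
        if ".log." in name:
            counts[partition]["log"] += 1
        elif name.endswith(".parquet"):
            counts[partition]["base"] += 1
    return {
        partition: (c["log"] / c["base"] if c["base"] else float("inf"))
        for partition, c in counts.items()
    }


if __name__ == "__main__":
    sample_keys = [
        "tbl/dt=1/a_0_100.parquet",
        "tbl/dt=1/.a_100.log.1_0",
        "tbl/dt=1/.a_100.log.2_0",
        "tbl/dt=2/b_0_100.parquet",
        "tbl/.hoodie/100.deltacommit",
    ]
    for partition, ratio in sorted(log_to_base_ratio(sample_keys).items()):
        print(partition, ratio)
```

A ratio that keeps climbing between runs for the same partition suggests compaction is falling behind. The absolute threshold worth alerting on depends on your workload.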
AWS S3 Tables is a managed Iceberg service that handles compaction, snapshot expiry, and orphan file removal automatically. But it comes with different pricing, reduced observability, and trade-offs for teams with complex table formats or existing tooling. Here is how to evaluate whether to switch.
Trino does not expose per-query S3 costs natively. But every Trino query that reads Iceberg, Delta, or Hudi data generates S3 GET requests that appear in your access logs under the Trino connector's IAM role. Here is how to join Trino event-listener logs with S3 access logs to attribute query cost per table, per user, and per team.
Most Delta Lake health problems are invisible until they show up as slow queries or failed jobs. Here's how S3 access logs surface compaction lag, orphaned files, and checkpoint failures before they escalate.
CloudWatch tells you how much S3 storage you have. It doesn't tell you which tables are degrading, which pipelines have stopped writing, or which IAM roles are behaving unexpectedly. Here's what object-level visibility actually looks like.
Iceberg, Delta Lake, and Hudi all have different metadata models, compaction patterns, and failure modes. Here's how to monitor all three from a single place without running queries against each catalog.
Athena scan costs scale with how much data each query touches. S3 access logs reveal exactly which partitions are being hit, how often, and whether your results cache is doing anything useful.
Adding observability to every ETL job takes time your team doesn't have. S3 write patterns already contain the signal you need to detect dead pipelines, cadence drift, and checkpoint failures.
IAM roles, SDK versions, access frequency, and bucket boundaries: most data teams have no visibility into this layer until something goes wrong. Here's how to build that picture from S3 access logs.
We analyzed petabytes of S3 usage across hundreds of data lake workloads to define what efficient S3 storage looks like in 2026, and where most teams still fall short.
CloudWatch surfaces bucket-level metrics. S3 access logs tell you which tables are growing, which pipelines have stopped, and which roles are crossing boundaries. The difference matters.
Data transfer costs are one of the most unpredictable parts of your AWS bill. Cross-region replication, CDN charges, and internal service calls all add up faster than you think.
GET, PUT, and LIST requests are billed per operation on S3, and they add up fast at scale. More importantly, unusual request patterns are often the first sign a pipeline is broken.
Bucket metrics tell you total spend. Object-level visibility tells you which tables, prefixes, and access patterns are driving it. Here's what you can and can't see at each level.
S3 lifecycle policies and Delta Lake compaction interact in ways most monitoring tools don't surface. Here's where the gaps are and how to close them.
S3 Intelligent-Tiering automates transitions but doesn't tell you what's cold, why, or whether it matches your access patterns. Object-level visibility gives you that picture first.
Bucket-level metrics are fast and cheap. Object-level visibility is where the real signal lives. Here's when each matters and how to combine them effectively.
Every GET and LIST request on S3 has a price. Small files multiply your request count while frequent access patterns compound the cost. Here's how to see and fix it.
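The way small files multiply request counts is simple arithmetic. A back-of-the-envelope sketch, assuming one GET per file per full scan (real Parquet readers usually issue several ranged GETs per file, so treat this as a lower bound). The price is a parameter you supply, and the 0.0004 figure in the example is an assumed value, not a quoted rate:

```python
import math


def get_request_cost(total_bytes, file_size_bytes, price_per_1000_gets):
    """Estimate GET count and cost for one full scan of a dataset.

    Assumes every file has the same size and is fetched with a single GET.
    Returns (number_of_gets, cost in the currency of the price you pass in).
    """
    gets = math.ceil(total_bytes / file_size_bytes)
    return gets, gets / 1000 * price_per_1000_gets


if __name__ == "__main__":
    one_tib = 2**40
    assumed_price = 0.0004  # assumed price per 1,000 GETs; check your region's rates
    for label, size in [("1 MiB files", 2**20), ("128 MiB files", 128 * 2**20)]:
        gets, cost = get_request_cost(one_tib, size, assumed_price)
        print(f"{label}: {gets} GETs per scan, cost {cost:.4f}")
```

Under these assumptions the same terabyte takes 128 times as many requests when stored as 1 MiB files as it does in 128 MiB files, and that multiplier applies again on every scan.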
S3 API call costs can quietly drain your budget. But more importantly, access patterns are one of the earliest signals of pipeline failure, schema drift, and data quality issues.
Misconfigured lifecycle rules can end up costing more than doing nothing. Here's how to audit your existing policies and fix the patterns that silently inflate your S3 spend.