Every GET and LIST request on S3 has a price. Small files multiply your request count, and frequent access patterns compound the cost. But the real cost of small files isn't the API bill; it's query performance. Here's how to see and fix it.

Why small files are expensive

A partition with 10,000 files of 1 MB each and a partition with 10 files of 1 GB each contain the same amount of data. But the first partition generates roughly 1,000× more GET requests when scanned by a query engine, and requires roughly 1,000× more file handles, metadata lookups, and task scheduling overhead in Spark or Athena.
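The arithmetic behind that comparison can be sketched in a few lines of Python. This is a rough illustration, not a billing calculator: it assumes one GET per object per scan (real engines may issue several range reads per file), and the per-request price is an assumed figure you should check against current S3 pricing for your region.

```python
MB = 10 ** 6
GB = 10 ** 9

# Assumption: illustrative GET price per 1,000 requests; verify for your region.
ASSUMED_GET_PRICE_PER_1000 = 0.0004


def object_count(total_bytes: int, file_bytes: int) -> int:
    """Number of objects needed to hold total_bytes at the given file size."""
    return -(-total_bytes // file_bytes)  # ceiling division on integers


def get_cost_per_scan(num_objects: int, price_per_1000: float = ASSUMED_GET_PRICE_PER_1000) -> float:
    """GET charge for one full scan, assuming one GET per object."""
    return num_objects / 1000 * price_per_1000


partition_bytes = 10 * GB
small_layout = object_count(partition_bytes, 1 * MB)  # 10,000 files of 1 MB
large_layout = object_count(partition_bytes, 1 * GB)  # 10 files of 1 GB

print(f"small files: {small_layout} objects, ${get_cost_per_scan(small_layout):.6f} per scan")
print(f"large files: {large_layout} objects, ${get_cost_per_scan(large_layout):.6f} per scan")
print(f"request ratio: {small_layout // large_layout}x")
```

Even at 1,000× the request count, the per-scan GET charge stays tiny in absolute terms, which is why the overhead that matters is on the query side.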

For Athena specifically, small files can increase query execution time by 5-10× for the same data volume, even though the scan cost stays the same regardless of file size (you're charged for bytes scanned, not files touched).

How small files accumulate in data lakes

Streaming pipelines with micro-batch writes: Spark Structured Streaming writing every minute produces thousands of files per partition per day
Frequent incremental loads without compaction: daily ETL appending small batches to partitions without running OPTIMIZE or COMPACT
Failed writes that are retried: partial write attempts leave small incomplete files that accumulate without vacuum
INSERT OVERWRITE with small source tables: each overwrite produces a set of small files sized to the input, not the optimal output size

Detecting small file problems from S3 monitoring

S3 inventory data gives you object count and size per prefix. The key metrics for small file detection are:

Median object size per prefix: below 64 MB indicates a small file problem worth investigating
Object count growth rate vs byte volume growth rate: if object count is growing faster than byte volume, small files are accumulating
GET request rate to byte read ratio: high GET count relative to bytes indicates query engines are touching many small files
List operation frequency: frequent LIST requests against a prefix with many objects indicates metadata overhead from small file counts
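The first two metrics above can be sketched from inventory-style records. This is a minimal example under stated assumptions: the records are plain (key, size in bytes) pairs that you have already extracted from an inventory report, each partition lives under its own prefix ending at the last "/", and the function names are hypothetical helpers rather than part of any S3 tooling.

```python
from collections import defaultdict
from statistics import median

SMALL_FILE_THRESHOLD_BYTES = 64 * 1024 * 1024  # the 64 MB guideline from the text


def partition_prefix(key: str) -> str:
    """Everything up to and including the last '/' (assumed one prefix per partition)."""
    head, sep, _ = key.rpartition("/")
    return head + sep


def median_size_by_prefix(records):
    """Map each prefix to the median object size of the records under it."""
    sizes = defaultdict(list)
    for key, size in records:
        sizes[partition_prefix(key)].append(size)
    return {prefix: median(values) for prefix, values in sizes.items()}


def flag_small_file_prefixes(records, threshold=SMALL_FILE_THRESHOLD_BYTES):
    """Prefixes whose median object size falls below the threshold."""
    medians = median_size_by_prefix(records)
    return sorted(p for p, m in medians.items() if m < threshold)


def objects_outpacing_bytes(prev, curr):
    """True if object count grew faster than byte volume between two
    (object_count, total_bytes) snapshots of the same prefix."""
    prev_count, prev_bytes = prev
    curr_count, curr_bytes = curr
    if prev_count == 0 or prev_bytes == 0:
        return False  # no baseline to compare against
    return curr_count / prev_count > curr_bytes / prev_bytes


records = [
    ("events/dt=2024-01-01/part-0.parquet", 1_000_000),
    ("events/dt=2024-01-01/part-1.parquet", 2_000_000),
    ("events/dt=2024-01-02/part-0.parquet", 512 * 1024 * 1024),
]
print(flag_small_file_prefixes(records))
print(objects_outpacing_bytes((1000, 10 * 10**9), (5000, 12 * 10**9)))
```

The median is used rather than the mean so that one large file in a prefix does not hide thousands of tiny ones.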

The fix: compaction and lifecycle coordination

For Delta Lake tables: run OPTIMIZE on tables with small file ratios above your threshold. For Iceberg tables: use compaction procedures to bin-pack small files into larger targets. For Hudi MOR tables: ensure compaction is configured to run before the log-to-base-file ratio becomes too high.
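As a rough sketch, the Delta Lake and Iceberg cases can be reduced to building the right SQL statement per table. The helper below is hypothetical; the 512 MB target size and the default catalog name are assumptions to adjust for your environment. Hudi MOR compaction is driven by writer and table configuration rather than a single statement like these, so it is deliberately left out.

```python
def compaction_sql(table_format: str, table: str,
                   catalog: str = "spark_catalog",
                   target_file_bytes: int = 512 * 1024 * 1024) -> str:
    """Return a Spark SQL statement that compacts small files for one table.

    catalog and target_file_bytes are illustrative defaults (assumptions).
    """
    fmt = table_format.lower()
    if fmt == "delta":
        # Delta Lake: OPTIMIZE bin-packs small files into larger ones.
        return f"OPTIMIZE {table}"
    if fmt == "iceberg":
        # Iceberg: rewrite_data_files procedure with the bin-pack strategy.
        return (
            f"CALL {catalog}.system.rewrite_data_files("
            f"table => '{table}', strategy => 'binpack', "
            f"options => map('target-file-size-bytes', '{target_file_bytes}'))"
        )
    raise ValueError(f"no compaction statement defined for format: {table_format}")


print(compaction_sql("delta", "analytics.events"))
print(compaction_sql("iceberg", "analytics.events", catalog="lake"))

# With a SparkSession configured for the relevant table format, the statement
# would be executed as: spark.sql(compaction_sql("delta", "analytics.events"))
```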

The challenge is knowing which tables need compaction and how urgently. Without object-level monitoring, teams either run compaction on a fixed schedule everywhere (wasteful) or reactively after performance problems appear. With monitoring, you run compaction exactly where and when it's needed.
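One way to turn monitoring data into a compaction queue is to score each table by its small file ratio and compact only those above a threshold, worst first. This is a sketch with hypothetical names and thresholds: the 64 MB cutoff follows the guideline earlier in this post, and the 0.5 ratio threshold is an arbitrary placeholder for whatever your team settles on.

```python
SMALL_BYTES = 64 * 1024 * 1024  # objects below this count as "small"


def small_file_ratio(sizes, small_bytes=SMALL_BYTES):
    """Share of objects smaller than small_bytes (0.0 for an empty table)."""
    if not sizes:
        return 0.0
    return sum(1 for s in sizes if s < small_bytes) / len(sizes)


def compaction_queue(tables, ratio_threshold=0.5):
    """Names of tables whose small file ratio exceeds the threshold,
    ordered from the highest ratio to the lowest."""
    scored = [(small_file_ratio(sizes), name) for name, sizes in tables.items()]
    return [name for ratio, name in sorted(scored, reverse=True) if ratio > ratio_threshold]


MB = 10 ** 6
GB = 10 ** 9
tables = {
    "clickstream": [1 * MB] * 10,             # all small
    "orders": [1 * MB] * 9 + [1 * GB],        # mostly small
    "dim_customers": [1 * GB] * 10,           # healthy
}
print(compaction_queue(tables))
```

Tables that never cross the threshold are simply skipped, which is the difference from a fixed schedule that compacts everything.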

Preventing small file accumulation

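For streaming workloads, the goal is fewer, larger files per micro-batch: lengthen the trigger interval and reduce the number of output partitions before each write. Note that spark.sql.files.maxRecordsPerFile only caps the number of records written to a single file, so on its own it guards against oversized files rather than tiny ones; pair it with a target file size configuration where your table format offers one. For batch workloads, running OPTIMIZE as part of the pipeline's post-write step keeps file sizes managed without a separate compaction job.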
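For sizing writes, a simple approach is to derive the number of output files from the estimated batch size and a target file size, then repartition to that count before writing. The helper below is a hypothetical sketch: the 128 MB target is an assumption, and the batch size is whatever estimate of written bytes you have available.

```python
import math

ASSUMED_TARGET_FILE_BYTES = 128 * 1024 * 1024  # illustrative target, tune per workload


def target_output_files(batch_bytes: int,
                        target_file_bytes: int = ASSUMED_TARGET_FILE_BYTES) -> int:
    """Number of output files that keeps each file near the target size.

    Always returns at least 1 so that small batches produce a single file.
    """
    return max(1, math.ceil(batch_bytes / target_file_bytes))


print(target_output_files(1024 ** 3))        # a 1 GiB batch
print(target_output_files(10 * 1024 ** 2))   # a 10 MiB batch

# In a PySpark job the result would feed the write step, for example:
#   df.repartition(target_output_files(estimated_bytes)).write...
```

A batch that is far smaller than the target still yields one small file, so this reduces the file count per write but does not replace compaction for high-frequency micro-batches.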

