The Delta Lake transaction log lives in the _delta_log/ prefix of every Delta table. Without proper checkpoint, log retention, and VACUUM configuration, the log accumulates JSON commit files and Parquet checkpoints while orphaned data files pile up next to the table data. Together they inflate storage costs and slow every reader that opens the table.
How the Delta transaction log works
Delta Lake records every write, delete, update, and schema change as a JSON commit file in _delta_log/. By default, every 10 commits Delta also writes a Parquet checkpoint file that summarizes the table state up to that version. Readers load the most recent checkpoint and then replay any JSON commits written after it to reconstruct the current table state. The more JSON commits pile up after the last checkpoint, the more a reader must replay before it can access data, and the more files sit in the prefix, the longer it takes to list them.
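The read path described above can be sketched in a few lines of Python. This is a simplified illustration, not Delta's actual reader: it assumes the standard 20-digit, zero-padded file names, ignores the _last_checkpoint hint file that real readers consult first, and does not verify that every part of a multi-part checkpoint is present. The names plan_snapshot_read and log_name are hypothetical helpers.

```python
import re

# Delta log file names start with a 20-digit, zero-padded table version.
COMMIT_RE = re.compile(r"^(\d{20})\.json$")
# Matches single-file and multi-part checkpoints, for example:
#   00000000000000000010.checkpoint.parquet
#   00000000000000000010.checkpoint.0000000001.0000000002.parquet
CHECKPOINT_RE = re.compile(r"^(\d{20})\.checkpoint(?:\.\d+\.\d+)?\.parquet$")


def plan_snapshot_read(file_names):
    """Return (latest_checkpoint_version, commit_versions_to_replay)."""
    commits = []
    checkpoints = []
    for name in file_names:
        m = COMMIT_RE.match(name)
        if m:
            commits.append(int(m.group(1)))
            continue
        m = CHECKPOINT_RE.match(name)
        if m:
            checkpoints.append(int(m.group(1)))
    latest = max(checkpoints) if checkpoints else None
    # With no checkpoint, every commit from version 0 must be replayed.
    start = -1 if latest is None else latest
    replay = sorted(v for v in commits if v > start)
    return latest, replay


def log_name(version, suffix):
    """Build a log file name such as 00000000000000000010.json."""
    return f"{version:020d}.{suffix}"


# Hypothetical listing: commits 0..23 with checkpoints at versions 10 and 20.
listing = [log_name(v, "json") for v in range(0, 24)]
listing += [log_name(10, "checkpoint.parquet"), log_name(20, "checkpoint.parquet")]

checkpoint, replay = plan_snapshot_read(listing)
print(checkpoint, replay)  # 20 [21, 22, 23]
```

With a fresh checkpoint the reader replays three commits; remove both checkpoint files from the listing and the same call returns all 24.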
How _delta_log bloat happens
Log bloat develops when checkpoints are missing or stale (every reader then replays every JSON commit written since the last checkpoint), when expired log files and orphaned data files are never cleaned up or are retained longer than necessary, or when streaming jobs commit many times per hour and create thousands of JSON files per day. The usual causes:
- High-frequency streaming writes: a Spark Structured Streaming job committing every 30 seconds produces 2,880 JSON log entries per day per table
- Missing checkpoints: Delta creates checkpoints every 10 commits by default, but the interval can be set too high or otherwise misconfigured
- No cleanup policy: without VACUUM, orphaned data files remain in the table's data prefix indefinitely; expired log files in _delta_log/ are removed only when a checkpoint is written, according to delta.logRetentionDuration (30 days by default)
- Retention too long: VACUUM with a 30-day retention window keeps every data file that was removed from the table in the last 30 days, not just the files in the current snapshot
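The arithmetic behind the streaming figure above is easy to reproduce. The sketch below assumes one single-file checkpoint per checkpoint interval and a 30-day log retention window, and it ignores auxiliary files such as .crc checksums, so treat the result as a rough lower bound. The function name log_files_per_day is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def log_files_per_day(commit_interval_s, checkpoint_interval=10):
    """Estimate daily JSON commits and checkpoints for one table."""
    commits = SECONDS_PER_DAY // commit_interval_s
    checkpoints = commits // checkpoint_interval
    return commits, checkpoints


commits, checkpoints = log_files_per_day(30)
print(commits, checkpoints)  # 2880 288

# Assumed 30-day log retention: files kept in _delta_log/ at steady state.
retained = (commits + checkpoints) * 30
print(retained)  # 95040
```

Under these assumptions a single table committing every 30 seconds already sits well above the 10,000-object mark, even when checkpoints are written on schedule.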
Detecting _delta_log bloat from S3
S3 Inventory lists every object with its key and size, so you can aggregate object count and total size for the _delta_log/ prefix separately from the data prefix. A healthy Delta table has a _delta_log/ that is a small fraction of its total size. As a rule of thumb, bloat is significant when _delta_log/ represents more than 5% of total table size, or when the log prefix holds more than 10,000 objects.
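As a rough sketch of that check, the Python below aggregates an S3 Inventory CSV report per table and applies the two thresholds. It assumes the report was configured with exactly three columns in the order bucket, key, size, with no header row and URL-encoded keys; your report may include more fields, so adjust the column indexes to match its manifest. The bucket name, table prefix, and function names are hypothetical.

```python
import csv
import io
from urllib.parse import unquote


def summarize_inventory(csv_text, table_prefixes):
    """Aggregate log object count, log bytes, and total bytes per table prefix."""
    stats = {p: {"log_objects": 0, "log_bytes": 0, "total_bytes": 0}
             for p in table_prefixes}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        # Assumed column order: bucket, key, size. Keys are URL-encoded.
        key, size = unquote(row[1]), int(row[2])
        for prefix, s in stats.items():
            if key.startswith(prefix):
                s["total_bytes"] += size
                if key.startswith(prefix + "_delta_log/"):
                    s["log_objects"] += 1
                    s["log_bytes"] += size
                break
    return stats


def is_bloated(s, max_ratio=0.05, max_objects=10_000):
    """Apply the 5% size ratio and 10,000-object rules of thumb."""
    ratio = s["log_bytes"] / s["total_bytes"] if s["total_bytes"] else 0.0
    return ratio > max_ratio or s["log_objects"] > max_objects


# Tiny hypothetical report: 900 bytes of data and 100 bytes of log.
sample = "\n".join([
    '"my-bucket","lake/orders/part-0000.parquet","900"',
    '"my-bucket","lake/orders/_delta_log/00000000000000000000.json","60"',
    '"my-bucket","lake/orders/_delta_log/00000000000000000001.json","40"',
])

stats = summarize_inventory(sample, ["lake/orders/"])
print(stats["lake/orders/"])
print(is_bloated(stats["lake/orders/"]))  # True (log is 10% of total size)
```

For real reports, which are usually gzip-compressed and split across several files, the same aggregation can be run file by file and the per-table counters summed.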
How to fix it
- Run VACUUM with a realistic retention window to remove orphaned data files: VACUUM tablename RETAIN 168 HOURS (7 days is the default, and Delta rejects shorter windows unless its retention safety check is disabled)
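- Ensure checkpoint frequency is set correctly: the delta.checkpointInterval table property defaults to 10; consider increasing it for high-frequency tables where checkpoints are expensive to write, keeping in mind that readers then replay more JSON commits between checkpoints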
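- Enable log compaction for streaming tables (Delta 3.1+): log compaction merges runs of JSON commits into fewer files, so readers replay less between checkpoints; check your Delta version's documentation for the exact setting. Note that delta.enableChangeDataFeed is not a compaction option: Change Data Feed records row-level changes and writes additional files of its own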
- Monitor _delta_log/ size vs data size ratio and alert when it exceeds 5%
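The first two fixes can be scripted from PySpark. The sketch below is one possible shape, assuming an existing SparkSession with Delta Lake configured and a table that already exists; apply_log_maintenance is a hypothetical helper, and the table name passed to it is up to you. It issues only two statements: an ALTER TABLE that sets the delta.checkpointInterval table property and a VACUUM with an explicit retention window.

```python
def apply_log_maintenance(spark, table, retain_hours=168, checkpoint_interval=10):
    """Set the checkpoint interval and vacuum one Delta table.

    `spark` is an existing SparkSession with Delta Lake enabled (assumed).
    Returns the SQL statements that were executed, in order.
    """
    if retain_hours < 168:
        # Delta rejects windows under 7 days unless its safety check is
        # disabled; this sketch refuses instead of disabling the check.
        raise ValueError("retain_hours below 168 risks deleting files still in use")
    statements = [
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.checkpointInterval' = '{checkpoint_interval}')",
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]
    for sql in statements:
        spark.sql(sql)
    return statements
```

A call such as apply_log_maintenance(spark, "sales.orders") would run both statements against a hypothetical sales.orders table. Scheduling it daily per table is a design choice, not a requirement; large tables may prefer a weekly VACUUM to limit listing costs.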
What reCost tracks
- _delta_log/ object count and size per table, updated from S3 Inventory
- Checkpoint file age: time since the last .checkpoint.parquet file was written
- Alert when checkpoint age exceeds 24 hours for tables with more than 1,000 log entries written since the last checkpoint
- VACUUM recommendation with correct RETAIN parameter based on your table's write history
Connect reCost to your S3 environment in 5 minutes
No agents, no code changes. Just your S3 access logs and a complete picture of your data lake health.