Data Lake Health · 8 min read · May 2026

Iceberg Orphan Files: Detection, Cleanup, and the Safe Order of Maintenance Operations

reCost Team
May 2026

Orphan files in Apache Iceberg accumulate silently: Parquet files that live in S3 but are no longer referenced by any live snapshot. Here is how to detect them, understand how much storage they waste, and run maintenance in the order that does not break time-travel queries.

What are Iceberg orphan files?

Every write to an Apache Iceberg table creates new data files and a new snapshot entry. When a write fails mid-commit, the files it already uploaded are never referenced by any snapshot. When old snapshots are expired without the file cleanup that normally accompanies expiry, the data files that only those snapshots referenced are left behind. In both cases the files are orphaned: they still exist in S3 but are no longer reachable from any live snapshot in the metadata tree.
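The definition above can be made concrete with a toy model. This is only an illustrative sketch: the snapshot names, file paths, and helper functions are hypothetical, and real Iceberg metadata is far richer than a dictionary of sets.

```python
# Toy model (hypothetical names): each live snapshot maps to the set of
# data files it references.
snapshots = {
    "snap-1": {"data/a.parquet", "data/b.parquet"},
    "snap-2": {"data/b.parquet", "data/c.parquet"},
}

# Everything physically present under the table prefix, including a file
# left behind by a write that failed before it could commit.
objects_in_s3 = {
    "data/a.parquet",
    "data/b.parquet",
    "data/c.parquet",
    "data/failed-write.parquet",
}


def reachable_files(live_snapshots):
    """Union of the data files referenced by any live snapshot."""
    files = set()
    for referenced in live_snapshots.values():
        files |= referenced
    return files


def orphans(objects, live_snapshots):
    """Objects that exist in storage but that no live snapshot references."""
    return objects - reachable_files(live_snapshots)


# With both snapshots live, only the failed write is orphaned.
before_expiry = orphans(objects_in_s3, snapshots)

# Simulate dropping snap-1 from the metadata without deleting its files:
# a.parquet was referenced only by snap-1, so it becomes an orphan too.
remaining = {name: files for name, files in snapshots.items() if name != "snap-1"}
after_expiry = orphans(objects_in_s3, remaining)
```

Note that b.parquet stays reachable after snap-1 is dropped because snap-2 still references it; a file is orphaned only when no remaining snapshot points at it.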

Why orphan files accumulate silently

Orphan files are invisible to query engines. Athena, Trino, and Spark read only the files referenced by the snapshot being queried. An orphaned Parquet file sitting at s3://your-bucket/warehouse/db/table/data/part-00000-abc.parquet looks like any healthy file in an S3 listing, yet no tool that reads through the catalog will ever surface it. It will not appear in Iceberg's files metadata table or in any other catalog-driven listing. Only by cross-referencing the full S3 object list against the manifest tree can you find them.

How to detect orphan files without a query engine

S3 Inventory gives you a complete object listing for your bucket, including every Parquet file by key and size. Cross-referencing that list against the manifest files for all live snapshots reveals any file present in S3 but absent from the manifest tree. This does not require running a Spark job or touching the catalog. It is a pure object-layer operation.

  • Pull S3 Inventory for the table prefix and list all .parquet and .avro files
  • Fetch the manifest-list file of every live snapshot (not only the current one) from the Iceberg metadata directory
  • Walk each manifest to collect the set of data file paths referenced by live snapshots
  • Any file in the inventory set but not in the manifest set is an orphan candidate
  • Exclude any files newer than a minimum age threshold (to avoid flagging in-flight writes that have not committed yet)
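The steps above reduce to a set difference with an age filter. The following is a minimal sketch, assuming the inventory rows and the referenced data file paths have already been loaded into memory; InventoryObject, to_key, and find_orphan_candidates are hypothetical names, not part of any S3 or Iceberg API. The three-day default is chosen to match the default older_than of Iceberg's remove_orphan_files procedure and should be tuned to your longest-running write.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class InventoryObject:
    """One row of an S3 Inventory report (hypothetical, simplified shape)."""
    key: str
    size_bytes: int
    last_modified: datetime


def to_key(path, bucket):
    """Manifests usually store full URIs; inventory rows store bucket-relative keys."""
    prefix = f"s3://{bucket}/"
    return path[len(prefix):] if path.startswith(prefix) else path


def find_orphan_candidates(inventory, referenced_keys, now, min_age=timedelta(days=3)):
    """Return data files present in the inventory but absent from the manifest set."""
    cutoff = now - min_age
    candidates = []
    for obj in inventory:
        if not obj.key.endswith(".parquet"):
            continue  # this sketch checks data files only, not metadata files
        if obj.key in referenced_keys:
            continue  # still referenced by a live snapshot
        if obj.last_modified > cutoff:
            continue  # too new: may belong to an in-flight write
        candidates.append(obj)
    return candidates


now = datetime(2026, 5, 10, tzinfo=timezone.utc)
old = datetime(2026, 5, 1, tzinfo=timezone.utc)
inventory = [
    InventoryObject("warehouse/db/table/data/part-0.parquet", 100, old),
    InventoryObject("warehouse/db/table/data/part-1.parquet", 250, old),
    InventoryObject("warehouse/db/table/data/part-2.parquet", 400, now - timedelta(hours=1)),
    InventoryObject("warehouse/db/table/metadata/snap-1.avro", 10, old),
]
# Paths collected by walking the manifests of every live snapshot.
referenced = {"s3://your-bucket/warehouse/db/table/data/part-0.parquet"}
referenced_keys = {to_key(path, "your-bucket") for path in referenced}

candidates = find_orphan_candidates(inventory, referenced_keys, now)
wasted_bytes = sum(obj.size_bytes for obj in candidates)
```

Here only part-1.parquet is flagged: part-0 is referenced, part-2 is newer than the cutoff, and the .avro file is skipped because the sketch looks at data files only. Treat the result as candidates to review, not as a delete list.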

The safe order for Iceberg maintenance

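The correct maintenance order is: (1) expire_snapshots first, (2) remove_orphan_files second, (3) rewrite_manifests third. Orphan cleanup belongs after expiry because a file counts as orphaned only relative to the snapshots that remain, so the set of live snapshots should be settled first. Any cleanup that evaluates files against fewer snapshots than are actually live, for example a script that checks only the current snapshot, risks deleting data files that old-but-still-live snapshots reference, breaking time-travel queries for the retention window you promised your users.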

  • expire_snapshots: removes snapshot entries older than your retention period (default 5 days in Iceberg)
  • remove_orphan_files: deletes files unreachable from any remaining snapshot. Safe only after expiry runs.
  • rewrite_manifests: compacts manifest files for faster planning. Run last so that it operates on the cleaned-up state.
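As a rough illustration of that order, the sketch below builds the three Spark SQL procedure calls in sequence. The function name maintenance_statements, the catalog name my_catalog, and the table name db.table are placeholders, and the retention values are assumptions to replace with your own policy. It only produces strings; it does not connect to Spark.

```python
from datetime import datetime, timedelta, timezone


def maintenance_statements(
    catalog,
    table,
    now,
    snapshot_retention=timedelta(days=5),  # assumed retention window
    orphan_min_age=timedelta(days=3),      # assumed safety margin for in-flight writes
):
    """Return the maintenance calls in the order: expire, remove orphans, rewrite."""

    def ts(moment):
        return moment.strftime("%Y-%m-%d %H:%M:%S")

    expire_before = ts(now - snapshot_retention)
    orphan_before = ts(now - orphan_min_age)
    return [
        # 1. Settle which snapshots are still live.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{expire_before}')",
        # 2. Delete files that no remaining snapshot references.
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', older_than => TIMESTAMP '{orphan_before}')",
        # 3. Compact manifests once the table is in its cleaned-up state.
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
    ]


statements = maintenance_statements(
    "my_catalog", "db.table", datetime(2026, 5, 10, tzinfo=timezone.utc)
)
for statement in statements:
    print(statement)
```

In a Spark session with the Iceberg SQL extensions enabled, each string would typically be passed to spark.sql(...). The timestamps are interpreted in the session time zone, so confirm that matches the clock used to compute them. A cautious first run can add dry_run => true to the remove_orphan_files call to list what would be deleted without deleting anything.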

What reCost surfaces automatically

  • Orphan file count and wasted storage in GB per table, updated every 15 minutes
  • Time since last expire_snapshots run per table
  • Alert when orphan file volume exceeds a configurable threshold
  • Recommended CALL system.remove_orphan_files command with correct older_than parameter

SEE IT IN YOUR ENVIRONMENT

Connect reCost to your S3 environment in 5 minutes

No agents, no code changes. Just your S3 access logs and a complete picture of your data lake health.
