DATA LAKE OBSERVABILITY

Catch broken pipelines and bad tables before queries break

S3-native data lake observability for Iceberg, Delta Lake, and Apache Hudi. reCost reads S3 access logs and S3 Inventory to score table health, catch stale writers, and attribute storage and query cost to the table that caused it. No agents, no catalog access.

THE PROBLEM

Your queries are getting slower and you don't know why

Tables silently degrade

Snapshots accumulate, manifests bloat, small files multiply. By the time queries slow, you've been bleeding cost for weeks.

Pipelines fail invisibly

Glue jobs report success but write zero rows. Firehose stops. Nobody notices until an analyst opens a stale dashboard.

Cost is unattributable

You see $40K in Athena scans but can't tell which table, query, or team caused it.

"We had 42,015 snapshots on a single Iceberg table. Expiry had never run. Query planning was costing us on every scan."

Staff Data Engineer

Fintech, Series C

WHAT RECOST SHOWS YOU

Three lenses. One S3 data source.

Table Health

Table                                         Score   Action
events.raw_clickstream (42,015 snapshots)     12      expire_snapshots
billing.transactions_v2 (128 MB avg file)     96      -
users.profiles_iceberg (2,847 small files)    61      rewrite_manifests
ml.feature_store_delta (6,412 orphan files)   44      remove_orphan_files

Iceberg, Delta, and Hudi health from metadata

  • Snapshot growth, manifest-list size, and small-file counts per table
  • Detects when expire_snapshots, remove_orphan_files, rewrite_manifests, or VACUUM is overdue
  • Orphan-file detection: finds Parquet files not referenced by any live snapshot, with wasted-storage dollar amount
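
As a rough illustration of the orphan-file bullet above, the check reduces to a set difference: objects listed in S3 Inventory minus the files that live snapshots still reference. The sketch below is not reCost's implementation. `InventoryObject`, `find_orphans`, `wasted_dollars`, and the `usd_per_gb_month` price parameter are all hypothetical names, and the referenced-key set is assumed to have been extracted from table metadata already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryObject:
    key: str         # S3 object key as listed in an inventory report
    size_bytes: int  # object size in bytes


def find_orphans(inventory, referenced_keys):
    """Return Parquet objects that no live snapshot references."""
    referenced = set(referenced_keys)
    return [
        obj for obj in inventory
        if obj.key.endswith(".parquet") and obj.key not in referenced
    ]


def wasted_dollars(orphans, usd_per_gb_month):
    """Monthly storage cost of the orphans at an assumed price per GiB-month."""
    total_gib = sum(obj.size_bytes for obj in orphans) / 1024**3
    return total_gib * usd_per_gb_month
```

In practice the comparison would also need a grace period for files written by in-flight commits, which this sketch leaves out.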
Safe maintenance order
expire_snapshots first, then remove_orphan_files, then rewrite_manifests. Reversing the order can corrupt time-travel history.
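
For Iceberg tables, the order above maps onto Iceberg's Spark stored procedures. The helper below is a minimal sketch that only builds the `CALL` statements in that order. `maintenance_statements` is a hypothetical function, and the catalog name, table name, and cutoff timestamp are placeholders you would replace with your own.

```python
def maintenance_statements(catalog, table, older_than):
    """Build Iceberg Spark SQL CALL statements in the order described above.

    catalog, table, and older_than are caller-supplied placeholders;
    older_than is a timestamp string such as '2024-01-01 00:00:00'.
    """
    return [
        # 1. Drop old snapshots first so their files become unreferenced.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}')",
        # 2. Then delete files that no remaining snapshot references.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
        # 3. Finally compact the manifests that are left.
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
    ]
```

With a Spark session configured for the catalog, each statement could be passed to `spark.sql(...)` in sequence. Delta Lake and Hudi use different commands (for example `VACUUM`), so this sketch applies to Iceberg only.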

Last-write tracking and silent-writer alerts

  • Last-write timestamp per table, broken down by writer identity
  • Supports Firehose, Kinesis, MSK, Glue, Spark Streaming, Flink, Airflow, dbt
  • Pages you when a writer misses its SLO, before downstream consumers notice
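
The silent-writer check described above can be pictured as a comparison between each writer's last-write age and its SLO. This is a sketch under assumptions, not the product's logic: `silent_writers` is a hypothetical function, and the last-write timestamps are assumed to have been derived from S3 access logs beforehand.

```python
from datetime import datetime, timedelta, timezone


def silent_writers(last_writes, slos, now):
    """Return (writer, overdue_by) pairs for writers past their SLO.

    last_writes maps writer name -> datetime of its most recent write.
    slos maps writer name -> timedelta allowed between writes.
    Writers with no recorded write are reported with overdue_by=None.
    """
    overdue = []
    for writer, slo in slos.items():
        last = last_writes.get(writer)
        if last is None:
            overdue.append((writer, None))
            continue
        age = now - last
        if age > slo:
            overdue.append((writer, age - slo))
    return overdue
```

For example, a writer with a 26h SLO whose last write was 38h ago would be reported as 12h overdue, while one with an 8h SLO that wrote 2h ago would not appear.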

Pipeline Cadence

Writer           Source · SLO          Last write
glue-etl-prod    Glue · SLO 8h         2h ago
kinesis-events   Kinesis · SLO 1h      45m ago
airflow-daily    Airflow · SLO 26h     38h ago (silent)
firehose-logs    Firehose · SLO 30m    12m ago
spark-batch      Spark · SLO 12h       6h ago
dbt-transform    dbt · SLO 24h         72h ago (silent)

Query Cost Breakdown

Table                      Engine · GET requests     Scan     Cost
events.raw_clickstream     Athena · 2.1M GET         847 GB   $84.70
analytics.sessions_delta   Trino · 1.8M GET          621 GB   $62.10
billing.transactions_v2    Spark · 980K GET          412 GB   $41.20
users.profiles_iceberg     Glue · 710K GET           298 GB   $29.80
ml.feature_store_delta     Databricks · 440K GET     187 GB   $18.70

Query observability across engines

  • Athena, Trino, Glue, Spark, EMR, Databricks: scan size, GET requests, and bytes read per query
  • Per-table cost attribution mapped to writer, query, and team
  • File-size and partition-skew detection that flags compaction debt before it compounds
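
One way to picture per-table attribution is to map each logged GET to a table by its S3 key prefix and sum the bytes per table and engine. The sketch below rests on several assumptions: the record fields (`key`, `engine`, `bytes_sent`), the prefix map, and the `usd_per_gb` rate are hypothetical inputs, and identifying the engine from a log line is assumed to happen upstream.

```python
from collections import defaultdict


def table_for_key(key, table_prefixes):
    """Return the table whose prefix is the longest match for key, or None."""
    best = None
    for table, prefix in table_prefixes.items():
        if key.startswith(prefix):
            if best is None or len(prefix) > len(table_prefixes[best]):
                best = table
    return best


def attribute_scans(records, table_prefixes, usd_per_gb):
    """Sum GET count and bytes per (table, engine) and price them."""
    totals = defaultdict(lambda: {"gets": 0, "bytes": 0})
    for rec in records:
        table = table_for_key(rec["key"], table_prefixes)
        if table is None:
            continue  # key does not belong to a known table
        bucket = totals[(table, rec["engine"])]
        bucket["gets"] += 1
        bucket["bytes"] += rec["bytes_sent"]
    return {
        group: {**t, "cost_usd": t["bytes"] / 1e9 * usd_per_gb}
        for group, t in totals.items()
    }
```

The rate is left as a parameter because pricing differs by engine and billing model; a single flat rate is a simplification.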

HOW IT WORKS

01 Connect
Read-only IAM role. 5-min setup. No agents.

02 Scan
S3 access logs + S3 Inventory + table metadata.

03 Score
Health score per table, alert per writer.

04 Act
Slack/webhook on issues, recommended commands to run.

CASE STUDY
Featured

15.6 TB of Orphaned Files in S3, Invisible for 8 Months

How a Series C fintech discovered 15.6 TB of unreferenced Parquet files accumulating silently in their Apache Iceberg tables, and fixed it in a day.

Read Full Case Study

Stop flying blind on your data lake

5-minute setup. No agents. Works with your existing AWS stack.

Book a Demo