Adding observability to every ETL job takes engineering time your team doesn't have. Custom metrics, log aggregation, alerting pipelines: each one is a maintenance burden. But for pipelines that write to S3, there's a simpler approach: S3 write patterns already contain the signal you need.

The instrumentation problem

Most data pipeline observability guides recommend instrumenting each job with custom metrics: record counts, processing time, error rates. This works well for pipelines you control end-to-end. But in practice, data stacks include Glue jobs written by different teams, third-party connectors, custom ETL scripts, and streaming jobs, many of which have no instrumentation at all.

The result is a monitoring blind spot: you have visibility into the pipelines someone took the time to instrument, and nothing for the rest. And it's often the unmonitored pipelines that fail silently.

What S3 write patterns reveal

Any pipeline that writes to S3 leaves a write pattern: the frequency, volume, and timing of PUT and COPY operations against specific prefixes. For a healthy pipeline this pattern is predictable: a daily ETL job that runs at 2am produces a recognizable write signature in your S3 access logs.
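As a rough illustration of how a write pattern can be pulled out of access logs, the sketch below parses S3 server access log lines and summarizes successful writes per prefix. The regular expression covers only the leading fields of the log format and assumes none of them contain spaces. The names parse_write and summarize_writes, the depth parameter, and the choice to count only REST.PUT.OBJECT and REST.COPY.OBJECT (ignoring multipart uploads) are assumptions made for this example. Keys appear URL-encoded in the logs and are left as-is here.

```python
import re
from collections import defaultdict
from datetime import datetime

# Leading fields of an S3 server access log line (trailing fields are ignored).
LOG_RE = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request_uri>[^"]*)" (?P<status>\S+) (?P<error>\S+) '
    r'(?P<bytes_sent>\S+) (?P<object_size>\S+)'
)

# Assumption: only simple PUTs and COPY destinations count as writes here.
WRITE_OPS = {"REST.PUT.OBJECT", "REST.COPY.OBJECT"}


def parse_write(line, depth=2):
    """Return (prefix, timestamp, size) for a successful write, else None."""
    m = LOG_RE.match(line)
    if not m or m["operation"] not in WRITE_OPS or not m["status"].startswith("2"):
        return None
    ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    # Use up to `depth` leading "directory" segments of the key as the prefix.
    dirs = m["key"].split("/")[:-1][:depth]
    prefix = "/".join(dirs) + "/" if dirs else ""
    size = int(m["object_size"]) if m["object_size"].isdigit() else 0
    return prefix, ts, size


def summarize_writes(lines, depth=2):
    """Aggregate write count, byte volume, and last write time per prefix."""
    summary = defaultdict(lambda: {"count": 0, "bytes": 0, "last_write": None})
    for line in lines:
        parsed = parse_write(line, depth)
        if parsed is None:
            continue  # not a write, not successful, or not parseable
        prefix, ts, size = parsed
        entry = summary[prefix]
        entry["count"] += 1
        entry["bytes"] += size
        if entry["last_write"] is None or ts > entry["last_write"]:
            entry["last_write"] = ts
    return dict(summary)
```

Feeding one day of log lines through summarize_writes gives a per-prefix snapshot that can be compared against earlier days.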

Deviations from this baseline are observable without instrumentation:

No writes in a prefix that normally receives writes every day: dead pipeline
Write volume significantly below baseline: partial failure or upstream data loss
Writes arriving at wrong intervals: schedule drift or dependency issue
Write bursts with no corresponding reads: data is being written but not consumed
Checkpoint files stopped updating in a streaming job prefix: streaming checkpoint failure
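Of the deviations listed above, schedule drift is probably the least obvious to check, so here is a minimal sketch of one way to detect it. It assumes you already have one timestamp per pipeline run for a prefix (for example, the first write in each run window). The function name interval_drift and the 25% tolerance are hypothetical choices for this example, not a recommended threshold.

```python
from datetime import datetime, timedelta
from statistics import median


def interval_drift(run_times, tolerance=0.25):
    """Compare the most recent gap between runs with the typical gap.

    Returns (expected_seconds, latest_seconds, drifted), or None when there
    are fewer than three runs and no baseline can be formed.
    """
    if len(run_times) < 3:
        return None
    times = sorted(run_times)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    expected = median(gaps[:-1])   # baseline from all gaps except the latest
    latest = gaps[-1]
    drifted = abs(latest - expected) > tolerance * expected
    return expected, latest, drifted
```

A daily job that suddenly lands 36 hours after its previous run would be flagged, while ordinary jitter of a few minutes would not.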

Cadence-aware monitoring

The key insight is that pipeline health can be inferred from write cadence, not just write presence. A pipeline that is scheduled to write once a day but has been writing twice a day for the past month is showing a behavioral change that might indicate a configuration issue or an upstream change. A pipeline that normally writes 500MB per run but wrote 50MB yesterday might indicate a data quality problem upstream.

reCost establishes write cadence baselines per prefix and alerts on deviations: absolute (no writes in N days) and relative (write volume below X% of baseline).
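To make the two alert types concrete, here is a minimal sketch of an absolute and a relative check for a single prefix. It is not reCost's implementation. The PrefixBaseline fields, the check_prefix function, and the default thresholds (1.5 days of silence, 50% of baseline volume) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PrefixBaseline:
    prefix: str
    baseline_bytes: float    # typical bytes written per run (hypothetical field)
    last_write: datetime     # timestamp of the most recent write
    last_run_bytes: int      # bytes written by the most recent run


def check_prefix(b, now, max_silent_days=1.5, min_volume_pct=50.0):
    """Return the kinds of alert ("absolute", "relative") that apply."""
    alerts = []
    # Absolute: no writes at all for longer than the allowed silence.
    if now - b.last_write > timedelta(days=max_silent_days):
        alerts.append("absolute")
    # Relative: the latest run wrote far less than the baseline volume.
    if b.baseline_bytes > 0:
        pct = 100.0 * b.last_run_bytes / b.baseline_bytes
        if pct < min_volume_pct:
            alerts.append("relative")
    return alerts
```

With the figures from the example above, a prefix with a 500MB baseline whose last run wrote 50MB sits at 10% of baseline and would raise a relative alert even though writes are still arriving on time.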

The case for zero-instrumentation observability

In one of our case studies, four pipelines had been silent for 7 to 23 days. No alarm had fired because none of the pipelines were instrumented. The issue was only discovered when downstream dashboards were already serving 3-week-old data. S3 write pattern monitoring would have caught all four within hours of the first missed write.

What to monitor at the S3 layer for pipeline health

Write cadence per prefix: last write timestamp, write frequency baseline, deviation from baseline
Write volume per run: object count and byte volume per write window
Checkpoint file freshness for streaming jobs (Spark Structured Streaming, Flink checkpoints)
Idle prefix detection: prefixes that have moved from active to zero-write for more than N days
Multi-step pipeline flow: verify downstream prefixes receive writes after upstream writes complete
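The multi-step flow check in the list above can be sketched as follows, assuming you maintain a map of last write timestamps per prefix and a list of (upstream, downstream) prefix pairs. The function name stalled_downstreams, the edge list, and the max_lag parameter are hypothetical and exist only for this example.

```python
from datetime import datetime, timedelta, timezone


def stalled_downstreams(last_write, edges, max_lag, now):
    """Return (upstream, downstream) pairs where the downstream prefix has
    not been written since the upstream's latest write and the allowed lag
    has already passed."""
    stalled = []
    for upstream, downstream in edges:
        up = last_write.get(upstream)
        if up is None:
            continue  # upstream never wrote; an idle-prefix check covers this
        down = last_write.get(downstream)
        overdue = now - up > max_lag
        if overdue and (down is None or down < up):
            stalled.append((upstream, downstream))
    return stalled
```

The lag window keeps the check from firing while a downstream job is still legitimately running. Size it per pipeline rather than globally.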
