A silent writer is a pipeline that stops committing data to S3 without raising an error. Glue reports success, Firehose shows no failures, Airflow marks the task green. But the table is not being updated. Here is how to detect silent writers across every major S3-writing service using S3 access logs.

Why silent writer detection is hard

Most pipeline monitoring is job-centric: did the Glue job complete? Did the Airflow task succeed? But job success does not imply that data was written. A Glue ETL job that reads from an empty upstream partition finishes with a SUCCEEDED run state yet writes zero rows to the target Iceberg or Delta table. Firehose flushes its buffer when either the buffer size or the buffer interval is reached, so a stream whose producers have stopped sending delivers nothing and still reports no failures. These are the hardest failures to catch because every monitoring signal says everything is fine.

How S3 write patterns reveal silent failures

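Every successful write to S3, whether it originates from Firehose, a Kinesis Data Streams consumer, MSK Connect, Glue, Spark Streaming, Flink, or a job triggered by Airflow or dbt, is normally recorded in the S3 server access log as a PUT request (or, for large objects, as a multipart upload). The last PUT timestamp per prefix per writer identity is a direct measure of pipeline liveness. When the time since the last PUT exceeds your SLO, the pipeline is silent. S3 delivers access logs on a best-effort basis, usually within a few hours, so log delivery delay sets a floor on how quickly silence can be confirmed.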
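As a rough illustration of the idea, the sketch below parses the leading fields of S3 server access log records and keeps the latest successful write per bucket, prefix, and requester. The choice of operations that count as a write (`WRITE_OPS`), the prefix depth, the sample role ARN, and the `_rec` helper that fabricates sample records are all assumptions made for this example, not part of any particular product.

```python
import re
from datetime import datetime

# Leading fields of an S3 server access log record, up to the HTTP status.
LOG_RE = re.compile(
    r'^\S+ (?P<bucket>\S+) \[(?P<time>[^\]]+)\] \S+ (?P<requester>\S+) \S+ '
    r'(?P<op>\S+) (?P<key>\S+) (?:"[^"]*"|-) (?P<status>\S+) '
)

# Assumption: these operations are treated as "an object was committed".
WRITE_OPS = {"REST.PUT.OBJECT", "REST.POST.UPLOAD", "REST.COPY.OBJECT"}


def prefix_of(key, depth=2):
    """Directory part of the key, truncated to `depth` components."""
    dirs = key.split("/")[:-1][:depth]
    return "/".join(dirs) + "/" if dirs else ""


def last_writes(lines, depth=2):
    """Map (bucket, prefix, requester) -> latest successful write time."""
    latest = {}
    for line in lines:
        m = LOG_RE.match(line)
        if not m or m["op"] not in WRITE_OPS or not m["status"].startswith("2"):
            continue  # skip reads, failed requests, and unparseable lines
        ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        k = (m["bucket"], prefix_of(m["key"], depth), m["requester"])
        if k not in latest or ts > latest[k]:
            latest[k] = ts
    return latest


# Hypothetical writer identity and sample records, for illustration only.
ROLE = "arn:aws:sts::111122223333:assumed-role/etl-role/run-1"


def _rec(time, op, status):
    return (
        f'owner my-lake [{time} +0000] 192.0.2.3 {ROLE} REQ1 {op} '
        f'warehouse/events/data/a.parquet "PUT /x HTTP/1.1" {status} '
        '- - 1024 20 10 "-" "sdk" -'
    )


SAMPLE = [
    _rec("06/Feb/2024:00:00:38", "REST.PUT.OBJECT", 200),
    _rec("06/Feb/2024:00:05:00", "REST.PUT.OBJECT", 200),
    _rec("06/Feb/2024:00:09:00", "REST.PUT.OBJECT", 403),  # failed write
    _rec("06/Feb/2024:00:10:00", "REST.GET.OBJECT", 200),  # read, ignored
]

for key, ts in last_writes(SAMPLE).items():
    print(key, ts.isoformat())
```

Only the 00:05:00 record survives as the last write here, because the later PUT failed and the GET is not a write. A real deployment would read log objects from the logging target bucket rather than an in-memory list.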

Amazon Data Firehose: delivers to your target prefix when its buffer size or its configured buffer interval (up to 900 seconds) is reached, whichever comes first. Missing PUT events beyond 2x the buffer interval indicate a delivery failure or a stalled source.
Amazon Kinesis Data Streams: consumer writes (via KCL or Spark Streaming) appear as PUT requests from the consumer's IAM role. Cadence deviation signals consumer lag or failure.
MSK and MSK Connect: Kafka sink connectors write via their configured IAM role. S3 access logs identify the connector by role ARN.
AWS Glue ETL: Glue writes appear under the IAM role assigned to the job, which shows up in the access log as an assumed-role ARN. Zero PUT requests after the scheduled run window means the job either did not run or ran but wrote nothing.
Apache Spark Streaming / Structured Streaming: writes appear under the EMR or Databricks cluster role. Checkpoint file writes (to the checkpoint prefix) must also occur regularly.
Apache Flink: writes via Flink's S3 sink connector under the Flink application's IAM role.
Airflow and dbt: trigger Spark or Glue jobs whose writes appear under those engines' roles.
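The per-service rules above can be sketched as a few small checks. This is a minimal illustration, assuming the requester field holds an STS assumed-role ARN of the form `arn:aws:sts::ACCOUNT:assumed-role/ROLE/SESSION`. The function names and the sample role are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def role_name(requester):
    """Extract ROLE from an assumed-role ARN, or None for other requesters."""
    resource = requester.split(":", 5)[-1]
    parts = resource.split("/")
    if len(parts) >= 2 and parts[0] == "assumed-role":
        return parts[1]
    return None


def firehose_overdue(last_put, now, buffer_interval_s):
    """True when the gap since the last PUT exceeds 2x the buffer interval."""
    return (now - last_put) > timedelta(seconds=2 * buffer_interval_s)


def wrote_in_window(put_times, window_start, window_end):
    """True when a batch writer produced at least one PUT in its run window."""
    return any(window_start <= t <= window_end for t in put_times)


now = datetime(2024, 2, 6, 12, 0, tzinfo=timezone.utc)
print(role_name("arn:aws:sts::111122223333:assumed-role/orders-etl/run-7"))
print(firehose_overdue(now - timedelta(minutes=15), now, buffer_interval_s=300))
print(wrote_in_window([], now - timedelta(hours=2), now))
```

Grouping log records by `role_name` rather than by the full requester string keeps one writer's history together even though each run gets a different session name.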

Setting SLOs per writer

Effective silent writer detection requires per-writer SLOs: how long is it acceptable for this prefix to receive no writes? A Firehose stream delivering clickstream data might have a 60-second SLO. A daily Glue batch job might have a 26-hour SLO (24 hours plus a 2-hour buffer). reCost learns write cadence baselines from historical access log patterns and alerts when the silence window exceeds the inferred SLO.
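A per-writer SLO table can be as simple as a dictionary keyed by prefix and writer. The sketch below uses the two example SLOs from the paragraph above. The prefixes, role names, and the shape of the `last_write` map are assumptions for illustration, and a writer that has never been seen is reported as silent.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLO table: (prefix, writer role) -> maximum tolerated silence.
SLOS = {
    ("clickstream/raw/", "firehose-delivery-role"): timedelta(seconds=60),
    ("warehouse/orders/", "glue-orders-etl"): timedelta(hours=26),
}


def silent_writers(last_write, slos, now):
    """Return (key, silence, slo) for every writer whose silence exceeds its SLO."""
    breaches = []
    for key, slo in slos.items():
        seen = last_write.get(key)
        silence = None if seen is None else now - seen
        if silence is None or silence > slo:
            breaches.append((key, silence, slo))
    return breaches


now = datetime(2024, 2, 6, 12, 0, tzinfo=timezone.utc)
last_write = {
    ("clickstream/raw/", "firehose-delivery-role"): now - timedelta(seconds=30),
    ("warehouse/orders/", "glue-orders-etl"): now - timedelta(hours=27),
}

for key, silence, slo in silent_writers(last_write, SLOS, now):
    print(f"{key} silent for {silence}, SLO {slo}")
```

In this example only the Glue writer is flagged: 27 hours of silence exceeds its 26-hour SLO, while the Firehose stream wrote 30 seconds ago.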

What reCost detects

Last-write timestamp per table per writer identity (Glue role, Firehose ARN, Kinesis consumer role, etc.)
Inferred write cadence baseline per prefix over a 14-day rolling window
Alert when the time since the last write exceeds 1.5x the inferred cadence SLO
Separate alerts for streaming writers (Firehose, Kinesis, Flink) vs batch writers (Glue, Airflow, dbt)
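The source does not describe how reCost infers a baseline, so the following is only one plausible stand-in: take the median gap between consecutive writes in the last 14 days as the cadence, and flag the writer when the current silence exceeds 1.5x that value. The function names and the minimum of three observations are assumptions.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def inferred_cadence(sorted_times):
    """Median gap between consecutive writes, or None with too little history."""
    if len(sorted_times) < 3:
        return None
    gaps = [(b - a).total_seconds() for a, b in zip(sorted_times, sorted_times[1:])]
    return timedelta(seconds=median(gaps))


def is_silent(put_times, now, factor=1.5, window=timedelta(days=14)):
    """True when silence since the last write exceeds factor x inferred cadence."""
    recent = sorted(t for t in put_times if now - window <= t <= now)
    cadence = inferred_cadence(recent)
    if cadence is None:
        return False  # not enough history to judge
    return (now - recent[-1]) > factor * cadence


# Hypothetical history: one write per hour for ten hours.
base = datetime(2024, 2, 6, 0, 0, tzinfo=timezone.utc)
history = [base + timedelta(hours=h) for h in range(10)]

print(inferred_cadence(history))
print(is_silent(history, now=history[-1] + timedelta(hours=1)))
print(is_silent(history, now=history[-1] + timedelta(hours=2)))
```

A batch job that emits many PUTs in one burst would produce a misleadingly small median gap, so a practical version would likely collapse each burst into a single run timestamp before computing gaps.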

SEE IT IN YOUR ENVIRONMENT

Connect reCost to your S3 environment in 5 minutes

No agents, no code changes. Just your S3 access logs and a complete picture of your data lake health.


Detecting Silent Writers in S3-Backed Data Lakes: Firehose, Kinesis, MSK, Glue, and Spark Streaming
