Your ETL Pipeline is Wasting Time (And It's Not Your Fault) — Part 1

Waste abounds when we build and rebuild blindly.

(This is Part 1 of a series on inefficiency in modern ETL. This post lays out the problem. Future parts will explore solutions. How many parts? Let’s see where this goes.)

ETL today often feels broken — not because we’re doing it wrong, but because our tools evolved to survive on foundations that force us to forget what actually changed.

📐 Who is this post for?

  • Data engineers sick of hours-long reruns for trivial changes.
  • Platform teams stuck juggling brittle partitions and cascading dependencies.
  • Architects watching compute costs explode.
  • Tech leaders wondering if constant backfills are a symptom, not a solution.

If you’ve ever asked, “Why are we rebuilding all this again when barely anything changed?” — you’re not imagining things. The problem is real.

(Previously, I argued that tools like dbt aren’t truly declarative because they operate on mutable foundations. Today, let’s explore how this leads to waste in modern ETL.)


TL;DR: The Problem Diagnosis

  • 🗑️ ETL is wasting compute. We rebuild data even when nothing meaningful changed.
  • 🤷 Why? Idempotent full rewrites became the safest pattern on databases that forget history.
  • 💸 Consequences:
    • Slow pipelines, high costs.
    • Fragile partition logic.
    • Risky rollbacks and inaccurate history.
  • 🧠 The root cause: Our data platforms don’t remember how data was made. No context, no lineage, no time semantics. They forget.

This inefficiency is systemic. It’s designed in. And to fix it, we need to change the foundation.


📚 What This Post Covers

  1. ⏳ ETL’s Defensive Evolution
  2. 🛡️ Idempotency: From Safety to Overhead
  3. 💸 Why We Rebuild More Than We Should
  4. 💣 Backfills: Risky, Inaccurate, Expensive
  5. 🧠 The Core Problem: Foundations That Forget
  6. 🚀 What’s Next: Toward Remembering Systems

⏳ ETL’s Defensive Evolution

Modern ETL didn’t emerge from elegant design. It emerged from trauma.

Early pipelines on mutable databases suffered:

  • Half-finished writes left tables corrupted.
  • Concurrency issues introduced inconsistencies.
  • UPDATE/MERGE logic caused subtle bugs no one caught for weeks.

To survive, teams adopted defensive patterns — not because they were optimal, but because they were predictable. Over time, those workarounds calcified into best practices.


🛡️ Idempotency: From Safety to Overhead

| Pain Point | Defensive Pattern | Result |
| --- | --- | --- |
| Partial updates | Drop and recreate tables/partitions | Massive recomputation |
| Merge bugs | Insert-only + overwrite | Data duplication, complexity |
| Uncertain dependencies | Rerun entire jobs | Unnecessary recompute |
| No safe rollback | Backfill from raw data | Painfully slow recovery |

Out of these defensive patterns, one principle rose to prominence: idempotency.

But we achieved it the lazy way: nuke and pave. Instead of surgically updating data, we drop and recreate it.

We don’t bulldoze because we want to; we bulldoze because our systems forgot how to diff.
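To make the pattern concrete, here's a minimal sketch of nuke-and-pave for a single day. It assumes a DB-API-style connection and a hypothetical `daily_sales` table fed from `raw_sales_events`; the names and schema are illustrative, not any particular system's.

```python
from datetime import date

def rebuild_partition(conn, day: date) -> None:
    """Idempotent the lazy way: throw the whole day away and recompute it
    from source, even if only one row actually changed."""
    with conn.cursor() as cur:
        # 1. Nuke: drop everything previously computed for this day.
        cur.execute("DELETE FROM daily_sales WHERE sales_date = %s", (day,))
        # 2. Pave: recompute the entire day from the raw events.
        cur.execute(
            """
            INSERT INTO daily_sales (sales_date, store_id, total_amount)
            SELECT event_date, store_id, SUM(amount)
            FROM raw_sales_events
            WHERE event_date = %s
            GROUP BY event_date, store_id
            """,
            (day,),
        )
    conn.commit()
```

Rerunning it is always safe, which is the whole point. But every rerun pays for the full aggregation, whether one row changed or none did.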


💸 Why We Rebuild More Than We Should

This strategy brings big trade-offs:

We Abandon Efficient Ops

We avoid UPDATE, MERGE, and DELETE because they mutate data without a trail. Instead, we rewrite entire partitions using INSERT, often in staging tables before atomic swaps.

SQL gives us surgical tools. We use bulldozers because our platforms forgot the patient’s chart.
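The staging-and-swap variant of the same idea might look like this sketch (again a DB-API-style connection and hypothetical names; the exact swap mechanism, table rename, partition exchange, or INSERT OVERWRITE, is engine-specific):

```python
def rebuild_via_staging_swap(conn) -> None:
    """Recompute the whole target into a staging copy, then swap it in so
    readers never observe a half-written state."""
    with conn.cursor() as cur:
        cur.execute("DROP TABLE IF EXISTS daily_sales_staging")
        cur.execute(
            """
            CREATE TABLE daily_sales_staging AS
            SELECT event_date AS sales_date, store_id, SUM(amount) AS total_amount
            FROM raw_sales_events
            GROUP BY event_date, store_id
            """
        )
        # The swap is the only cheap step; everything above is a full rewrite.
        cur.execute("ALTER TABLE daily_sales RENAME TO daily_sales_old")
        cur.execute("ALTER TABLE daily_sales_staging RENAME TO daily_sales")
        cur.execute("DROP TABLE daily_sales_old")
    conn.commit()
```

No UPDATE, MERGE, or DELETE in sight: every row gets rewritten precisely so that nothing ever has to be diffed.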

Partitioning Becomes a Crutch

Partitioning helps limit the blast radius, but introduces:

  • Complex duplication (like re-copying dimensions).
  • Headaches with overlapping time windows (e.g. trailing 7-day averages).
    Change one day? Now you have to track down every window that includes it.
  • Heuristic-driven reruns (often inaccurate, hard to validate).

We don’t partition because it’s elegant. We do it because we’re scared to touch what’s already there.
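As a toy illustration of that blast radius (pure Python, hypothetical dates): with a trailing 7-day average, correcting one input day invalidates every output day whose window touches it.

```python
from datetime import date, timedelta

WINDOW_DAYS = 7  # trailing 7-day average

def invalidated_output_days(changed_day: date) -> list[date]:
    """Every output day whose trailing window contains the changed day
    must be recomputed, not just the changed day itself."""
    return [changed_day + timedelta(days=offset) for offset in range(WINDOW_DAYS)]

# Fixing one day of input (April 10th) dirties a full week of output,
# April 10th through April 16th.
print(invalidated_output_days(date(2024, 4, 10)))
```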


💣 Backfills: Risky, Inaccurate, Expensive

Full backfills are the natural conclusion of this model. They happen when we find a bug and rerun everything downstream to be “safe.”

But backfills are dangerous:

They Rewrite the Past With Today’s Logic

If your business logic has changed since the error crept in, backfilling overwrites what the data correctly meant at the time with whatever today's logic says.

Recomputing April 10th data using April 17th logic? That’s not a fix. That’s a lie about what you knew back then.
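A toy illustration of the problem (hypothetical revenue logic, plain Python): unless the platform records which version of the logic produced which version of the data, the backfill silently replaces the old answer with the new one.

```python
# Hypothetical: the definition of "revenue" changed on April 15th.
def revenue_v1(row):                     # the logic that actually ran on April 10th
    return row["amount"]

def revenue_v2(row):                     # the logic deployed today
    return row["amount"] - row["refunds"]

row = {"amount": 100.0, "refunds": 20.0}

print(revenue_v1(row))  # 100.0: what you reported back then
print(revenue_v2(row))  # 80.0:  what a backfill today writes over it
```

Both numbers are correct under their own definitions. The trouble is that only one of them survives the backfill.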

They’re Hard to Target

Tracking affected downstream partitions is hard. Orchestrators like Airflow don’t know which logic ran when, or what changed between runs.

“Clear downstream” might re-run with the wrong code. Or skip something that needed rerunning. Either way: fragile.
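For intuition, here's a minimal sketch (plain Python, hypothetical dataset names) of why "clear downstream" is the only safe move when the platform records nothing about which logic produced which partition: all you can do is walk the DAG and mark every descendant dirty.

```python
def everything_downstream(dag: dict[str, list[str]], changed: str) -> set[str]:
    """Without per-partition provenance, the only safe answer to 'what do I
    rerun?' is the full set of descendants: the over-approximation behind
    'clear downstream' style backfills."""
    dirty, stack = set(), [changed]
    while stack:
        node = stack.pop()
        for child in dag.get(node, []):
            if child not in dirty:
                dirty.add(child)
                stack.append(child)
    return dirty

# Hypothetical DAG: one upstream fix dirties every dataset that might depend on it.
dag = {
    "raw_sales": ["daily_sales"],
    "daily_sales": ["weekly_avg", "finance_report"],
    "weekly_avg": ["exec_dashboard"],
}
print(everything_downstream(dag, "raw_sales"))
# {'daily_sales', 'weekly_avg', 'finance_report', 'exec_dashboard'}
```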

They Burn Time and Money

Recomputing multi-day slices across a deep DAG? Painfully slow, dangerously disruptive.

Also, these massive backfills often blow past established processing windows.

A job that normally finishes by 6am might run until noon during a backfill, delaying critical downstream reports and dashboards, potentially causing a cascade of operational alerts and missed SLAs across the organization.

They Waste Developer Time

This constant rebuilding isn’t just costly in cloud credits:

It’s incredibly expensive in developer time.

When a simple logic change requires hours of recomputation before it can be validated, the feedback loop grinds to a halt.

Innovation slows down, bug fixes take longer, and engineers spend frustrating hours just waiting for pipelines that are redoing work they already did yesterday.

They Block The Business

Every hour spent rebuilding unchanged data is an hour users wait for fresh insights.

If your critical daily sales report takes six hours to generate via full partition rewrites, the business is making decisions on data that’s already significantly stale by the time it arrives.

They Impact the Entire System

Furthermore, these compute-heavy rewrite jobs can monopolize cluster resources.

This can throttle other pipelines or even slow down interactive queries for analysts trying to use the platform concurrently.

The waste from one pipeline impacts the performance of the entire system.


🧠 The Core Problem: Foundations That Forget

These aren’t surface-level inefficiencies. They’re symptoms of a deeper design flaw:

Our data platforms forget.

They store what is, but forget how it came to be.

Most systems lack:

  1. Durable Expressions: The logic used to create a dataset version.
  2. Provenance: The input versions, parameters, and context.
  3. Bitemporal Awareness: Separation of system time (when it ran) and logical time (what time it represents).
  4. Semantic Equivalence Recognition: Knowing that “same inputs + same logic = same result.”

Without these, even tiny changes become big problems. ETL frameworks have no choice but to rebuild broadly. Precision is impossible when the substrate forgets.
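As a rough sketch of what "remembering" would have to capture per dataset version, mapped onto the four items above (field names are illustrative, not any particular product's schema):

```python
from dataclasses import dataclass
from datetime import date, datetime
import hashlib
import json

@dataclass(frozen=True)
class DatasetVersion:
    logic_hash: str        # 1. durable expression: hash of the exact transform that ran
    input_versions: tuple  # 2. provenance: which upstream dataset versions fed it
    params: tuple          # 2. provenance: parameters and run context
    logical_date: date     # 3. bitemporal: what time the data represents
    computed_at: datetime  # 3. bitemporal: when the system actually produced it

    def cache_key(self) -> str:
        """4. Semantic equivalence: same logic + same inputs + same params
        for the same logical date hash to the same key, so the result can be
        reused instead of rebuilt."""
        payload = json.dumps(
            [self.logic_hash, list(self.input_versions), list(self.params), str(self.logical_date)]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a record like this, "did anything meaningful change?" becomes a lookup instead of a guess.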


🚀 What’s Next: Toward Remembering Systems

What if we flipped the model?

What if our systems remembered the logic, the context, and the time semantics, and could reason about change?

You’d get:

  • Targeted recomputation — update only what truly changed.
  • Bitemporal corrections — fix logic from the past, without rewriting everything.
  • 🧩 Deterministic reuse — if it already ran, don’t rerun it. Cache by structure (sketched just after this list).
  • 🔍 Deep lineage — audit and debug by traversing the computation graph.
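Using the DatasetVersion sketch from the previous section, deterministic reuse is just a key lookup before scheduling any work (the dict here stands in for whatever store the platform actually uses):

```python
# Hypothetical reuse check: structurally identical runs are skipped.
completed: dict[str, str] = {}  # cache_key -> location of the materialized result

def materialize(version, build_fn):
    """Only rebuild when this (logic, inputs, params, logical date)
    combination has never been computed before."""
    key = version.cache_key()
    if key in completed:
        return completed[key]           # reuse: nothing meaningful changed
    completed[key] = build_fn(version)  # recompute: something actually changed
    return completed[key]
```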

That’s where we’re heading in Part 2.

We’ll break down what a system built for semantic memory actually looks like, and how it helps ETL grow up.


Written by Jenny Kwan, co-founder and CTO of AprioriDB.

Follow me on Bluesky and LinkedIn.


What do you think of this diagnosis?

  • Does this resonate with the ETL frustrations you experience?
  • What’s the most costly or painful source of ETL waste you deal with regularly?
  • Are there other root causes you think are important?

Share your thoughts and war stories in the comments below! 👇