(Tools like dbt aren’t truly declarative because they operate on mutable foundations. Today, let’s explore how this leads to waste in modern ETL.)
This inefficiency is systemic. It’s designed in. And to fix it, we need to change the foundation.
Modern ETL didn’t emerge from elegant design. It emerged from trauma.
Early pipelines on mutable databases suffered:
- `UPDATE`/`MERGE` logic caused subtle bugs no one caught for weeks (a sketch follows below).

To survive, teams adopted defensive patterns, not because they were optimal but because they were predictable. Over time, those workarounds calcified into best practices.
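For a concrete flavor of that pain, here’s a sketch of a classic `MERGE` pitfall, with hypothetical table and column names. It’s only correct while the source feed has at most one row per key; when a glitch delivers duplicates, the SQL standard calls for an error, but some engines silently apply one of the conflicting updates.

```sql
-- Illustrative MERGE: correct only while daily_customer_updates has
-- at most one row per customer_id. With duplicate keys, a target row
-- matches twice; some engines raise a cardinality error, others
-- silently pick one update, and the corruption surfaces weeks later.
MERGE INTO customers AS t
USING daily_customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
    UPDATE SET email = s.email
WHEN NOT MATCHED THEN
    INSERT (customer_id, email)
    VALUES (s.customer_id, s.email);
```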
| Pain Point | Defensive Pattern | Result |
|---|---|---|
| Partial updates | Drop and recreate tables/partitions | Massive recomputation |
| Merge bugs | Insert-only + overwrite | Data duplication, complexity |
| Uncertain dependencies | Rerun entire jobs | Unnecessary recompute |
| No safe rollback | Backfill from raw data | Painfully slow recovery |
Out of these defensive patterns, one principle rose to prominence: idempotency.
But we achieved it the lazy way: nuke and pave. Instead of surgically updating data, we drop and recreate it.
We don’t bulldoze because we want to; we bulldoze because our systems forgot how to diff.
This strategy brings big trade-offs:
We avoid `UPDATE`, `MERGE`, and `DELETE` because they mutate data without a trail. Instead, we rewrite entire partitions with `INSERT`, often into staging tables before atomic swaps.
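Here’s a minimal sketch of that pattern in Spark/Hive-style SQL, with hypothetical table names: instead of patching the rows that changed, rebuild an entire day’s partition from raw data. Rerunning it is safe, which is the whole point, but every run redoes all of the work.

```sql
-- "Nuke and pave" idempotency (illustrative Spark/Hive-style SQL):
-- rebuild the whole partition from scratch instead of patching rows.
-- Running it twice gives the same result; running it at all
-- recomputes everything, changed or not.
INSERT OVERWRITE TABLE sales_daily
PARTITION (order_date = '2024-04-10')
SELECT
    customer_id,
    SUM(amount) AS revenue
FROM raw_orders
WHERE order_date = '2024-04-10'
GROUP BY customer_id;
```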
SQL gives us surgical tools. We use bulldozers because our platforms forgot the patient’s chart.
Partitioning helps limit the blast radius, but it introduces complexity of its own.
We don’t partition because it’s elegant. We do it because we’re scared to touch what’s already there.
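To make that defensive layout concrete, here’s a hypothetical Hive-style DDL for the table above, partitioned by day so a rewrite can only clobber one day at a time.

```sql
-- Illustrative Hive-style DDL: partition by day so each overwrite
-- touches a bounded slice. The layout exists to contain damage,
-- not because the data model asks for it.
CREATE TABLE sales_daily (
    customer_id BIGINT,
    revenue     DECIMAL(18, 2)
)
PARTITIONED BY (order_date DATE)
STORED AS PARQUET;
```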
Full backfills are the natural conclusion of this model. They happen when we find a bug and rerun everything downstream to be “safe.”
But backfills are dangerous:
- If your business logic changed since the error, backfilling can overwrite the correct old meaning with new logic. Recomputing April 10th data using April 17th logic? That’s not a fix. That’s a lie about what you knew back then. (See the sketch after this list.)
- Tracking affected downstream partitions is hard. Orchestrators like Airflow don’t know which logic ran when, or what changed between runs. “Clear downstream” might re-run with the wrong code, or skip something that needed rerunning. Either way: fragile.
- Recomputing multi-day slices across a deep DAG? Painfully slow, dangerously disruptive.
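Here’s a hypothetical sketch of that first danger, reusing the illustrative tables from earlier: a backfill is the same overwrite replayed over a date range, compiled from whatever logic is in the repo today. (Exact dynamic-partition overwrite semantics vary by engine and configuration.)

```sql
-- Illustrative backfill: the same overwrite, replayed over a range.
-- Whatever revenue logic is in the repo *today* is applied to every
-- historical day, erasing what the pipeline actually computed then.
-- (Dynamic partition overwrite; semantics vary by engine/config.)
INSERT OVERWRITE TABLE sales_daily
PARTITION (order_date)
SELECT
    customer_id,
    SUM(amount) AS revenue,  -- April 17th logic...
    order_date
FROM raw_orders
WHERE order_date BETWEEN '2024-04-01' AND '2024-04-17'  -- ...rewriting April 10th
GROUP BY customer_id, order_date;
```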
Also, these massive backfills often blow past established processing windows.
A job that normally finishes by 6am might run until noon during a backfill, delaying critical downstream reports and dashboards, potentially causing a cascade of operational alerts and missed SLAs across the organization.
This constant rebuilding isn’t just costly in cloud credits:
It’s incredibly expensive in developer time.
When a simple logic change requires hours of recomputation before it can be validated, the feedback loop grinds to a halt.
Innovation slows down, bug fixes take longer, and engineers spend frustrating hours just waiting for pipelines that are redoing work they already did yesterday.
Every hour spent rebuilding unchanged data is an hour users wait for fresh insights.
If your critical daily sales report takes six hours to generate via full partition rewrites, the business is making decisions on data that’s already significantly stale by the time it arrives.
Furthermore, these compute-heavy rewrite jobs can monopolize cluster resources.
This can throttle other pipelines or even slow down interactive queries for analysts trying to use the platform concurrently.
The waste from one pipeline impacts the performance of the entire system.
These aren’t surface-level inefficiencies. They’re symptoms of a deeper design flaw:
Our data platforms forget.
They store what is, but forget how it came to be.
Without memory of the logic, the context, and the time semantics, even tiny changes become big problems. ETL frameworks have no choice but to rebuild broadly. Precision is impossible when the substrate forgets.
What if we flipped the model?
What if our systems remembered the logic, the context, the time semantics, and could reason about change?
You’d get:

- Recomputation that touches only what actually changed
- Backfills that respect the logic that actually ran at the time
- Pipelines that stop redoing yesterday’s work
That’s where we’re heading in Part 2.
We’ll break down what a system built for semantic memory actually looks like, and how it helps ETL grow up.
Written by Jenny Kwan, co-founder and CTO of AprioriDB.
Follow me on Bluesky and LinkedIn.
What do you think of this diagnosis?
Share your thoughts and war stories in the comments below! 👇