dbt isn't truly declarative and ETL wastes compute, because current foundations are forgetful.)
⚠️ Stop settling for time travel with an expiration date. Demand control over history's meaning and guarantee its permanence.
Delta Lake, Apache Iceberg, and Apache Hudi brought crucial order to data lakes: ACID transactions, schema evolution, and basic `AS OF TIMESTAMP` queries. Being able to query yesterday's state or roll back bad writes is a vital safety net. Let's give credit where credit is due. But this "time travel" remains incomplete for managing the full complexity and permanence of history.
Today's "time travel" is just a fancy archive. You don't actually travel anywhere.
The fundamental issue: lakehouse versioning tracks physical files. The log points to sets of Parquet files representing the table's state after each transaction.
It versions the result while forgetting the semantic cause: the exact logic, input versions, parameters, and effective time that created it. Knowing which files existed isn't the same as knowing the business logic that produced them. Crucially, storing endless terabytes of old data files is often prohibitively expensive. This gap (physical state vs. semantic cause) and storage cost drive the limitations.
Viewing recent physical history is easy. But correcting the meaning of past data reveals the pain points:
Nightmare scenario: you discover a bug in logic used two weeks ago. You need to represent what 2025-04-10 data should have been.

In this situation, a responsible data platform has two goals:

- Preserve the audit trail: anyone must still be able to see the buggy 2025-04-10 data exactly as it was originally recorded.
- Correct the record: consumers going forward should see what 2025-04-10 data should have been.

This seems like a paradox. How can you have both?
It depends on how you define "past".
In other words, when you say "AS OF 2025-04-10", do you mean "the data as of when the transaction was written" or "the data as of when the fact was true"?
The first shows what the data looked like in the system on 2025-04-10. The second shows what data about 2025-04-10 looks like now.
This distinction is subtle, and without it, there are subtle bugs: if today the data you have about 2025-04-10 is wrong, you need to be able to fix it going forward. This correction mess stems from conflating two distinct times:

- System time: when a fact was recorded in the database.
- Effective time: when the fact was true in the real world.
Handling corrections auditably requires bitemporality: modeling both explicitly.
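To make "modeling both explicitly" concrete, here's a minimal sketch (hypothetical names, not AprioriDB's actual API): each fact carries both timestamps, and a correction writes a new record with a later system time while leaving the effective time untouched.

```python
from dataclasses import dataclass
from datetime import date

# A hypothetical bitemporal fact: the same figure carries two times.
@dataclass(frozen=True)
class Fact:
    effective_date: date   # when the fact was true in the real world
    system_date: date      # when the fact was recorded in the database
    value: float

# Recorded on 2025-04-10: the day's sales were (buggily) computed as 100.0.
original = Fact(date(2025, 4, 10), date(2025, 4, 10), 100.0)

# Two weeks later the bug is fixed: sales for 2025-04-10 were really 120.0.
# Only the system date moves; the effective date stays anchored to the fact.
correction = Fact(date(2025, 4, 10), date(2025, 4, 24), 120.0)

assert original.effective_date == correction.effective_date
assert original.system_date < correction.system_date
```

Because the original record is never overwritten, both the audit view and the corrected view remain queryable.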
When you correct the past ("My understanding of Q1 sales as of today is Y, revising the X I recorded back then"), you create a logical fork.
A bitemporal system can represent both original and corrected views cleanly, anchored to the correct effective time (Q1) but distinguished by system time (when the correction was made).
- `AS OF SYSTEM TIME [2 weeks ago] AS OF EFFECTIVE TIME 2025-04-10` correctly returns the buggy data about 2025-04-10 from two weeks ago. This is useful for audits.
- `AS OF SYSTEM TIME [today] AS OF EFFECTIVE TIME 2025-04-10` correctly returns the corrected data about 2025-04-10. This is useful for fixes.

Lakehouses only have system time, leaving it up to users to implement effective time correctly. Most people don't. And many users conflate the two, leading to bugs.
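The semantics of those two queries can be sketched in a few lines (an illustrative toy, not any real lakehouse syntax): a bitemporal lookup returns the latest record for the requested effective time whose system time is at or before the requested system time.

```python
from datetime import date

# Each record: (effective_date, system_date, value).
# The 2025-04-10 figure was recorded buggy on 04-10, corrected on 04-24.
records = [
    (date(2025, 4, 10), date(2025, 4, 10), "buggy"),
    (date(2025, 4, 10), date(2025, 4, 24), "corrected"),
]

def as_of(records, system_time, effective_time):
    """Latest record for `effective_time` visible as of `system_time`."""
    visible = [r for r in records
               if r[0] == effective_time and r[1] <= system_time]
    return max(visible, key=lambda r: r[1])[2] if visible else None

# Audit view: what did we believe two weeks ago about 2025-04-10?
assert as_of(records, date(2025, 4, 11), date(2025, 4, 10)) == "buggy"
# Current view: what do we believe now about 2025-04-10?
assert as_of(records, date(2025, 4, 25), date(2025, 4, 10)) == "corrected"
```

Note that nothing is deleted or overwritten: both answers come from the same append-only set of records, distinguished only by the two time axes.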
Treating history solely as file snapshots has major consequences beyond corrections:
Finite history (the `VACUUM` nightmare): this is often the biggest pain. Storing endless large data files is expensive, so lakehouses require cleanup (`VACUUM`, `expire_snapshots`) that permanently deletes old snapshots and their underlying data files. Your time-travel ability shrinks to your retention window (e.g., 30 days). Need data from 18 months ago for an audit? If the snapshots expired, it's gone forever. "Time travel" is finite by design.

Versioning physical files forces a harsh trade-off: pay ever more for storage, or permanently delete history.
What if we versioned the lightweight logic and context instead of bulky result files?
The Payoffs: Safe corrections, true long-term reproducibility, deep semantic lineage, and durable history without forced deletion of the past.
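As a rough illustration of the idea (a simplified sketch, not AprioriDB's actual design), imagine recording each run's logic version, input versions, and parameters, then recomputing any past state on demand instead of retaining every output file:

```python
from dataclasses import dataclass

# A hypothetical run record: a few bytes of provenance instead of
# gigabytes of retained output files.
@dataclass(frozen=True)
class RunRecord:
    logic_version: str     # e.g. a commit of the transformation code
    input_versions: tuple  # versioned upstream inputs the run read
    params: tuple          # parameters the run used

# A registry of pure transformations, keyed by logic version.
LOGIC = {
    "v1": lambda xs, factor: [x * factor for x in xs],      # original logic
    "v2": lambda xs, factor: [x * factor + 1 for x in xs],  # corrected logic
}

# Versioned input data (in practice, versioned source tables).
INPUTS = {("sales", "s1"): [1, 2]}

def reproduce(run: RunRecord):
    """Rebuild a past table state from its recorded semantic cause."""
    xs = INPUTS[run.input_versions[0]]
    return LOGIC[run.logic_version](xs, *run.params)

# Any historical state is recomputable, with no retention window.
old_run = RunRecord("v1", (("sales", "s1"),), (10,))
assert reproduce(old_run) == [10, 20]
```

The trade-off, of course, is compute on demand instead of storage; the point is that history's meaning (logic, inputs, parameters) is tiny compared to its materialized results.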
Trying to bolt advanced time travel and safe corrections onto foundations not designed for them leads to the limitations and workarounds we see today.
At AprioriDB, we believe these capabilities require rethinking the database engine itself.
We're building a system where semantic understanding, correction, reproducibility, and auditable history are native:
These principles allow AprioriDB to model history correctly and support true bitemporal querying, safe logical forks for corrections, and potentially permanent semantic history, going far beyond simple snapshots.
We're tackling these hard problems head-on, laying the groundwork for data infrastructure that is both rigorous and forgiving. But we can't do it alone.
We are actively seeking passionate engineers, database architects, and distributed systems experts to join us in building AprioriDB.
If you share the frustration with current limitations, if you're excited by the challenge of creating truly trustworthy data systems, and if you want to build the foundation for the next generation of data infrastructure where time travel is more than a fancy archive, then we want to hear from you.
It's 2025. It's time our tools caught up. Let's build it together.
🚀 Explore the vision, dive into the technical challenges, and connect with us at https://aprioridb.com.
Written by Jenny Kwan, co-founder and CTO of AprioriDB.
Let's connect! Find me on Bluesky and LinkedIn.
What do you think?
Has `VACUUM` / `expire_snapshots` retention caused problems? Share your experiences below! 👇