Lakehouse Time Travel Deletes History. You Need Forks & Permanence.

Lakehouse time travel lets you look at the past. Neat. But try to change its meaning retroactively? Or guarantee you can see it forever? Suddenly, you're fighting the system.
If I were a real time traveler, I could change things, create alternate timelines, and my history wouldn't vanish after 30 days because of a VACUUM command.
That's the core problem with today's data lakehouse "time travel". It offers a glimpse, but it's read-only when it comes to past meaning, and worse, that history is designed to expire. Real history needs agency, corrections, and permanence.
Who is this post for?
- Architects & Engineers tired of history being immutable or deleted by VACUUM/expire_snapshots.
- Data Engineers facing painful backfills because old snapshots are gone.
- Anyone who believes "time travel" should let you fix the past reliably and trust it's still there later.
If you've hit the wall with AS OF TIMESTAMP or lost history to cleanup jobs, you're not alone. The snapshot model has fundamental limits.
(This builds on why dbt isn't truly declarative and ETL wastes compute: current foundations are forgetful.)
TL;DR - The Quick Take
- Lakehouse Time Travel (Delta, Iceberg): Essential first step (ACID, viewing recent past). ✅
- Big Limitations: Mostly read-only for fixing past meaning; history is ephemeral (deleted by cleanup).
- Why? Versions physical data files, forgets the logic + context. Storing infinite file snapshots is costly, forcing deletion. Mixes up system/effective time.
- Real Historical Control Needs:
- Bitemporality: Separating system time ("when recorded") from effective time ("when true").
- Logical Forks: Modeling corrections as alternative semantic histories.
- Durable Logic: Storing the lightweight how and why, allowing reconstruction even if old files are pruned.
- Snapshot Pain Points: Hard/risky fixes, lost long-term reproducibility, obscured lineage, permanently deleted history by design.
- Better Foundation (the AprioriDB way): Combine bitemporality with versioned logic. Enables safe corrections (forks) & potentially infinite history.
⚠️ Stop settling for time travel with an expiration date. Demand control over history's meaning and guarantee its permanence.
What We'll Cover
- Lakehouses: Necessary, But Not Sufficient
- The Core Problem: Files vs. Meaning
- The Correction Headache: Why Fixing Past Meaning Hurts
- Bitemporality & Logical Forks: The Missing Tools
- The High Cost & Impermanence of File Versioning
- A Better Approach: Manageable, Durable History
- AprioriDB: Building True History Management
Lakehouses: Necessary, But Not Sufficient
Delta Lake, Apache Iceberg, and Hudi brought crucial order to data lakes: ACID transactions, schema evolution, and basic AS OF TIMESTAMP queries. Being able to query yesterday's state or roll back a bad write is a vital safety net. Let's give credit where credit is due. But this "time travel" remains incomplete for managing the full complexity and permanence of history.
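That first step looks like this in practice (table names are illustrative; exact syntax varies by engine and version):

```sql
-- Delta Lake (Spark SQL): read the table as of a timestamp or a version
SELECT * FROM sales TIMESTAMP AS OF '2025-04-10';
SELECT * FROM sales VERSION AS OF 42;

-- Apache Iceberg (Spark 3.3+) supports the same clauses
SELECT * FROM catalog.db.sales TIMESTAMP AS OF '2025-04-10 00:00:00';
```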
Today's "time travel" is just a fancy archive. You don't actually travel anywhere.
The Core Problem: Files vs. Meaning
The fundamental issue: Lakehouse versioning tracks physical files. The log points to sets of Parquet files representing the table's state after each transaction.
It versions the result, forgetting the semantic cause: the exact logic, input versions, parameters, and effective time that created it. Knowing which files existed isn't the same as knowing the business logic that produced them. Crucially, storing endless terabytes of old data files is often prohibitively expensive. This gap (physical state vs. semantic cause) and storage cost drive the limitations.
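A sketch of that gap in Delta-style SQL (the table, staging source, and columns are hypothetical):

```sql
-- Two weeks ago, the pipeline ran this logic:
MERGE INTO sales AS t
USING staging_sales AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.amount_usd = s.amount * s.fx_rate
WHEN NOT MATCHED THEN INSERT *;

-- The transaction log durably keeps, in essence, only the file-level delta:
--   add:    part-00017.parquet, part-00018.parquet
--   remove: part-00009.parquet
-- Not the full MERGE expression, not the version of staging_sales it read,
-- not the fx_rate values in force at the time.
```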
The Correction Headache: Why Fixing Past Meaning Hurts
Viewing recent physical history is easy. But correcting the meaning of past data reveals the pain points:
Nightmare Scenario: You discover a bug in logic used two weeks ago. You need to represent what 2025-04-10 data should have been.
In this situation, a responsible data platform has two goals:
- Correct the past: Represent what 2025-04-10 data should have been.
- Keep the past: Keep the record of the original mistake. Users have been building reports and dashboards on the bug for the past two weeks. It would be gaslighting to pretend it didn't happen.
This seems like a paradox. How can you have both?
It depends on how you define "past".
In other words, when you say "AS OF 2025-04-10", do you mean "the data as of when the transaction was written" or "the data as of when the fact was true"?
The first shows what the data looked like in the system on 2025-04-10. The second shows what data about 2025-04-10 looks like now.
This distinction is subtle, and glossing over it invites subtle bugs:
- You shouldn't be able to correct what the system showed in the past. That record is a fact about the system, and users may have acted on it. If they have questions about reports or dashboards they generated from the system at that point in time, you have to keep the original record to answer them.
- You should be able to correct what the data about 2025-04-10 looks like now. If the data you have today about 2025-04-10 is wrong, you need to be able to fix it going forward.
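To see the conflation bite, here is the naive in-place fix in Delta-style SQL (table and column names are hypothetical): it serves the first goal but leaves the second at the mercy of the retention window.

```sql
-- Naive fix: rewrite the buggy rows in place
UPDATE sales
SET amount_usd = amount * fx_rate          -- the corrected logic
WHERE order_date = DATE '2025-04-10';

-- Goal 1 is met: current queries now see the corrected data.
-- Goal 2 hangs by a thread: the pre-fix state survives only as an old
-- snapshot, readable via time travel until VACUUM expires it.
```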
Bitemporality & Logical Forks: The Missing Tools
This correction mess stems from conflating two distinct times:
- System Time (Transaction Time): When data was written to the database. (Log timestamp).
- Effective Time (Valid Time): When the fact was true in the real world. (Business time).
Handling corrections auditably requires bitemporality: modeling both explicitly.
When you correct the past ("My understanding of Q1 sales as of today is Y, revising the X I recorded back then"), you create a logical fork.
A bitemporal system can represent both original and corrected views cleanly, anchored to the correct effective time (Q1) but distinguished by system time (when the correction was made).
- AS OF SYSTEM TIME [2 weeks ago] AS OF EFFECTIVE TIME 2025-04-10 correctly returns the buggy data about 2025-04-10 from two weeks ago. This is useful for audits.
- AS OF SYSTEM TIME [today] AS OF EFFECTIVE TIME 2025-04-10 correctly returns the corrected data about 2025-04-10. This is useful for fixes.
Lakehouses only have system time, leaving it to users to implement effective time correctly. Most don't, and many conflate the two, which leads straight to the bugs above.
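Lakehouse SQL has no AS OF EFFECTIVE TIME clause, but a hand-rolled sketch shows the shape of it (schema and values are illustrative, not any engine's built-in feature):

```sql
CREATE TABLE sales_bitemporal (
  order_id     BIGINT,
  amount_usd   DECIMAL(12, 2),
  effective_dt DATE,       -- effective time: when the fact was true
  sys_from     TIMESTAMP,  -- system time: when this row was recorded
  sys_to       TIMESTAMP   -- system time: when superseded ('9999-12-31' = current)
);

-- The correction is a fork, not an overwrite: close the buggy row...
UPDATE sales_bitemporal
SET sys_to = TIMESTAMP '2025-04-24 09:00:00'
WHERE order_id = 1001
  AND effective_dt = DATE '2025-04-10'
  AND sys_to = TIMESTAMP '9999-12-31 00:00:00';

-- ...and record the corrected fact about the same effective date.
INSERT INTO sales_bitemporal
VALUES (1001, 117.50, DATE '2025-04-10',
        TIMESTAMP '2025-04-24 09:00:00', TIMESTAMP '9999-12-31 00:00:00');

-- Audit: what did we believe about 2025-04-10, two weeks ago? (buggy row)
SELECT * FROM sales_bitemporal
WHERE effective_dt = DATE '2025-04-10'
  AND sys_from <= TIMESTAMP '2025-04-14 00:00:00'
  AND sys_to   >  TIMESTAMP '2025-04-14 00:00:00';

-- Fix: what do we believe about 2025-04-10 now? (corrected row)
SELECT * FROM sales_bitemporal
WHERE effective_dt = DATE '2025-04-10'
  AND sys_to = TIMESTAMP '9999-12-31 00:00:00';
```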
The High Cost & Impermanence of File Versioning
Treating history solely as file snapshots has major consequences beyond corrections:
- Lost Long-Term Reproducibility: Retrieving old files doesn't guarantee you can reproduce the exact result by rerunning the original logic under the original conditions, especially once dependencies or environments have changed.
- Obscured Lineage: Understanding why data exists requires tracing logic. File history obscures this semantic lineage.
- History That Vanishes (VACUUM Nightmare): This is often the biggest pain. Storing endless large data files is expensive, so lakehouses require cleanup (VACUUM, expire_snapshots) that permanently deletes old snapshots and their underlying data files. Your time travel ability shrinks to your retention window (e.g., 30 days). Need data from 18 months ago for an audit? If the snapshots expired, it's gone forever. "Time travel" is finite by design.
Versioning physical files forces a harsh trade-off: pay ever-growing storage bills to keep every snapshot, or permanently delete history.
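Both major formats make that deletion a routine maintenance command (retention values here are examples):

```sql
-- Delta Lake: remove data files no longer referenced by snapshots
-- within the retention window (720 hours = 30 days)
VACUUM sales RETAIN 720 HOURS;

-- Apache Iceberg (Spark procedure): expire old snapshots and their files
CALL catalog.system.expire_snapshots(
  table => 'db.sales',
  older_than => TIMESTAMP '2025-03-15 00:00:00'
);

-- After either runs, AS OF queries older than the window fail permanently.
```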
A Better Approach: Manageable, Durable History
What if we versioned the lightweight logic and context instead of bulky result files?
- Bitemporal Core: Track System Time and Effective Time explicitly.
- Version Semantics: Durably store the expression (logic) and provenance (context: input versions, parameters, time). This metadata is tiny compared to data files.
- Native Logical Forks: Model corrections as explicit, queryable forks in effective history.
- Potential for Infinite History: Since the core semantic record is small, history can potentially be kept forever. Even if large data artifacts (result files) are eventually pruned for cost, they can always be deterministically recomputed from the stored logic/context when needed. Trade cheap metadata storage + potential recompute for deep history, avoiding permanent deletion.
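What might that durable semantic record look like? A deliberately simplified sketch (this schema is our illustration, not AprioriDB's actual design):

```sql
CREATE TABLE semantic_log (
  txn_id         BIGINT,
  system_time    TIMESTAMP,  -- when this was recorded
  effective_dt   DATE,       -- when it was true
  expression     STRING,     -- the exact logic that ran (e.g. the full MERGE)
  input_versions STRING,     -- pinned versions of every input table
  parameters     STRING      -- runtime parameters, configs, constants
);

-- Each row is kilobytes, not gigabytes. Old result files can be pruned
-- for cost and deterministically recomputed from this record on demand.
```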
The Payoffs: Safe corrections, true long-term reproducibility, deep semantic lineage, and durable history without forced deletion of the past.
AprioriDB: Building True History Management
Trying to bolt advanced time travel and safe corrections onto foundations not designed for them leads to the limitations and workarounds we see today.
At AprioriDB, we believe these capabilities require rethinking the database engine itself.
Weâre building a system where semantic understanding, correction, reproducibility, and auditable history are native:
- Semantic Transactions: Recording the evaluable logic and context.
- Native Bitemporality: Tracking system and effective time explicitly throughout the engine.
- Preserved Logic & Provenance: Enabling safe amendment and replay of historyâs meaning.
- Enforced Determinism: Guaranteeing reliable, repeatable operations for trustworthy reconstruction.
These principles allow AprioriDB to model history correctly and support true bitemporal querying, safe logical forks for corrections, and potentially permanent semantic history, going far beyond simple snapshots.
We're tackling these hard problems head-on, laying the groundwork for data infrastructure that is both rigorous and forgiving. But we can't do it alone.
We are actively seeking passionate engineers, database architects, and distributed systems experts to join us in building AprioriDB.
If you share the frustration with current limitations, if you're excited by the challenge of creating truly trustworthy data systems, and if you want to build the foundation for the next generation of data infrastructure, where time travel is more than a fancy archive, we want to hear from you.
It's 2025. It's time our tools caught up. Let's build it together.
Explore the vision, dive into the technical challenges, and connect with us at https://aprioridb.com.
Written by Jenny Kwan, co-founder and CTO of AprioriDB.
Let's connect! Find me on Bluesky and LinkedIn.
What do you think?
- Biggest headache correcting historical lakehouse data?
- Has VACUUM/expire_snapshots retention caused problems?
- What's missing from current "time travel"?
Share your experiences below!