Lakehouse Time Travel Deletes History. You Need Forks & Permanence.

Viewing history is one thing. Safely changing its meaning and ensuring it lasts forever requires a fundamentally different approach.

Lakehouse time travel lets you look at the past. Neat. But try to change its meaning retroactively? Or guarantee you can see it forever? Suddenly, you’re fighting the system.

If I were a real time traveler, I could change things, create alternate timelines, and my history wouldn’t vanish after 30 days because of a VACUUM command.

That’s the core problem with today’s data lakehouse “time travel”. It offers a glimpse, but it’s read-only when it comes to past meaning, and worse, that history is designed to expire. Real history needs agency, corrections, and permanence.


📐 Who is this post for?

  • Architects & Engineers tired of history being immutable or deleted by VACUUM / expire_snapshots.
  • Data Engineers facing painful backfills because old snapshots are gone.
  • Anyone who believes “time travel” should let you fix the past reliably and trust it’s still there later.

If you’ve hit the wall with AS OF TIMESTAMP or lost history to cleanup jobs, you’re not alone. The snapshot model has fundamental limits.

(This builds on why dbt isn’t truly declarative and ETL wastes compute – because current foundations are forgetful.)


TL;DR - The Quick Take

  • Lakehouse Time Travel (Delta, Iceberg): Essential first step (ACID, viewing recent past). ✅
  • Big Limitations: Mostly read-only for fixing past meaning; history is ephemeral (deleted by cleanup).
  • Why? They version physical data files and forget the logic + context. Storing endless file snapshots is costly, forcing deletion. They conflate system and effective time.
  • Real Historical Control Needs:
    • Bitemporality: Separating system time (“when recorded”) from effective time (“when true”).
    • Logical Forks: Modeling corrections as alternative semantic histories.
    • Durable Logic: Storing the lightweight how and why, allowing reconstruction even if old files are pruned.
  • Snapshot Pain Points: Hard/risky fixes, lost long-term reproducibility, obscured lineage, permanently deleted history by design.
  • Better Foundation (AprioriDB way): Combine bitemporality with versioning logic. Enables safe corrections (forks) & potentially infinite history.

⚠️ Stop settling for time travel with an expiration date. Demand control over history’s meaning and guarantee its permanence.


📚 What We’ll Cover

  1. 🙏 Lakehouses: Necessary, But Not Sufficient
  2. 🧠 The Core Problem: Files vs. Meaning
  3. 🛑 The Correction Headache: Why Fixing Past Meaning Hurts
  4. 🕰️ Bitemporality & Logical Forks: The Missing Tools
  5. 🗑️ The High Cost & Impermanence of File Versioning
  6. 🚀 A Better Approach: Manageable, Durable History
  7. 🔧 AprioriDB: Building True History Management

🙏 Lakehouses: Necessary, But Not Sufficient

Delta Lake, Apache Iceberg, and Hudi brought crucial order to data lakes: ACID transactions, schema evolution, and basic AS OF TIMESTAMP queries. Being able to query yesterday’s state or roll back a bad write is a vital safety net. Let’s give credit where credit is due. But this “time travel” remains incomplete for managing the full complexity and permanence of history.
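For reference, this is roughly what that looks like in practice. Syntax varies by engine and version, and the table name here is a placeholder:

```sql
-- Delta Lake: query an earlier state of the table
SELECT * FROM sales VERSION AS OF 42;
SELECT * FROM sales TIMESTAMP AS OF '2025-04-10';

-- Delta Lake: roll back a bad write by restoring an earlier version
RESTORE TABLE sales TO VERSION AS OF 41;

-- Apache Iceberg (Spark SQL): query an earlier snapshot
SELECT * FROM sales FOR SYSTEM_VERSION AS OF 1234567890;
SELECT * FROM sales FOR SYSTEM_TIME AS OF '2025-04-10 00:00:00';
```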

Today’s “time travel” is just a fancy archive. You don’t actually travel anywhere.


🧠 The Core Problem: Files vs. Meaning

The fundamental issue: Lakehouse versioning tracks physical files. The log points to sets of Parquet files representing the table’s state after each transaction.

It versions the result, forgetting the semantic cause – the exact logic, input versions, parameters, and effective time that created it. Knowing which files existed isn’t the same as knowing the business logic that produced them. Crucially, storing endless terabytes of old data files is often prohibitively expensive. This gap (physical state vs. semantic cause) and storage cost drive the limitations.
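You can see the gap directly. Delta’s DESCRIBE HISTORY (table name is a placeholder) lists versions, timestamps, and which operation ran, but nothing about the upstream pipeline logic, the versions of its inputs, or the effective time it was about:

```sql
-- Delta Lake: inspect the transaction log's view of history
DESCRIBE HISTORY sales;
-- Returns rows roughly like:
--   (version, timestamp,           operation, operationParameters, ...)
--   (42,      2025-04-10 02:13:07, 'MERGE',   {"predicate": "..."}, ...)
-- It records which files changed and what command ran,
-- not the logic, input versions, or effective time behind the change.
```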


🛑 The Correction Headache: Why Fixing Past Meaning Hurts

Viewing recent physical history is easy. But correcting the meaning of past data reveals the pain points:

Nightmare Scenario: You discover a bug in logic used two weeks ago. You need to represent what 2025-04-10 data should have been.

In this situation, a responsible data platform has two goals:

  1. Correct the past: Represent what 2025-04-10 data should have been.
  2. Keep the past: Keep the record of the original mistake. Users have been building reports and dashboards using the bug for the past two weeks. It would be gaslighting to pretend it didn’t happen.

This seems like a paradox. How can you have both?

It depends on how you define “past”.

In other words, when you say “AS OF 2025-04-10”, do you mean “the data as of when the transaction was written” or “the data as of when the fact was true”?

The first shows what the data looked like in the system on 2025-04-10. The second shows what data about 2025-04-10 looks like now.

This distinction is subtle, and glossing over it invites subtle bugs:

  • You shouldn’t be able to rewrite what the system looked like in the past. That record is a fact about the system itself, and users built reports and dashboards from it. To answer questions about what they saw at that point in time, you have to keep it.
  • You should be able to correct what data about 2025-04-10 looks like now. If the data you have today about 2025-04-10 is wrong, you need to be able to fix it going forward (sketched below).
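Here’s a minimal sketch of how those two questions collapse into one when all you have is snapshot time travel. Table, column names, and dates are illustrative, and the first query only works while the old snapshot survives cleanup:

```sql
-- Question 1: what did the system show on 2025-04-10 (the buggy numbers users saw)?
SELECT * FROM sales TIMESTAMP AS OF '2025-04-10';

-- Question 2: what do we believe now about 2025-04-10 (after fixing the bug)?
SELECT * FROM sales WHERE sale_date = DATE '2025-04-10';

-- Both questions hang off a single clock. The first answer disappears the
-- moment the snapshot is vacuumed; the second carries no record that it
-- ever differed from the first.
```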

🕰️ Bitemporality & Logical Forks: The Missing Tools

This correction mess stems from conflating two distinct times:

  1. System Time (Transaction Time): When data was written to the database. (Log timestamp).
  2. Effective Time (Valid Time): When the fact was true in the real world. (Business time).

Handling corrections auditably requires bitemporality: modeling both explicitly.

When you correct the past (“My understanding of Q1 sales as of today is Y, revising the X I recorded back then”), you create a logical fork.

A bitemporal system can represent both original and corrected views cleanly, anchored to the correct effective time (Q1) but distinguished by system time (when the correction was made).

  • AS OF SYSTEM TIME [2 weeks ago] AS OF EFFECTIVE TIME 2025-04-10 - correctly returns the buggy data about 2025-04-10 from 2 weeks ago. This is useful for audits.
  • AS OF SYSTEM TIME [today] AS OF EFFECTIVE TIME 2025-04-10 - correctly returns the corrected data about 2025-04-10. This is useful for fixes.

Lakehouses only have system time, leaving users to implement effective time themselves. Most don’t, and many conflate the two, leading to bugs.
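When teams do implement it, it usually looks like hand-rolled columns plus a query pattern every consumer must get right. A minimal sketch, with illustrative table and column names:

```sql
-- Effective time and system time carried as ordinary columns.
CREATE TABLE sales_bitemporal (
  sale_id        BIGINT,
  amount         DECIMAL(18, 2),
  effective_date DATE,       -- when the fact was true in the business
  recorded_at    TIMESTAMP   -- when this row was written (system time)
);

-- "What do we believe today about 2025-04-10?" (latest record per sale)
SELECT *
FROM sales_bitemporal s
WHERE effective_date = DATE '2025-04-10'
  AND recorded_at = (SELECT MAX(recorded_at)
                     FROM sales_bitemporal
                     WHERE sale_id = s.sale_id
                       AND effective_date = s.effective_date);

-- "What did we believe about 2025-04-10 before the fix?" Add a system-time
-- cutoff, e.g. recorded_at <= TIMESTAMP '2025-04-11 00:00:00', to both the
-- outer query and the subquery.
```

This works, but every query has to repeat the pattern correctly, corrections still aren’t first-class (there’s no native notion of a fork), and the lakehouse’s own snapshot history keeps expiring underneath it.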


🗑️ The High Cost & Impermanence of File Versioning

Treating history solely as file snapshots has major consequences beyond corrections:

  • Lost Long-Term Reproducibility: Retrieving old files doesn’t mean you can regenerate them. If the code, dependencies, or environment have drifted, rerunning the original logic under the original conditions may no longer be possible.
  • Obscured Lineage: Understanding why data exists requires tracing logic. File history obscures this semantic lineage.
  • History That Vanishes (VACUUM Nightmare): This is often the biggest pain. Storing endless large data files is expensive, so lakehouses require cleanup (VACUUM, expire_snapshots) that permanently deletes old snapshots and their underlying data files. Your time travel ability shrinks to your retention window (e.g., 30 days). Need data from 18 months ago for an audit? If snapshots expired, it’s gone forever. “Time travel” is finite by design.

Versioning physical files forces a harsh trade-off: pay ever-growing storage bills to keep every snapshot, or permanently delete history.
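For concreteness, these are the cleanup commands in question. The retention values are illustrative, and exact syntax depends on your engine:

```sql
-- Delta Lake: delete data files no longer referenced by versions
-- inside the retention window (default is 7 days = 168 hours).
VACUUM sales RETAIN 168 HOURS;

-- Apache Iceberg (Spark procedure): expire snapshots older than a cutoff.
CALL my_catalog.system.expire_snapshots(
  table => 'db.sales',
  older_than => TIMESTAMP '2025-03-25 00:00:00'
);

-- After either runs, time travel to anything older than the cutoff is gone for good.
```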


🚀 A Better Approach: Manageable, Durable History

What if we versioned the lightweight logic and context instead of bulky result files?

  1. Bitemporal Core: Track System Time and Effective Time explicitly.
  2. Version Semantics: Durably store the expression (logic) and provenance (context: input versions, parameters, time). This metadata is tiny compared to data files.
  3. Native Logical Forks: Model corrections as explicit, queryable forks in effective history.
  4. Potential for Infinite History: Since the core semantic record is small, history can potentially be kept forever. Even if large data artifacts (result files) are eventually pruned for cost, they can always be deterministically recomputed from the stored logic/context when needed. Trade cheap metadata storage + potential recompute for deep history, avoiding permanent deletion.

The Payoffs: Safe corrections, true long-term reproducibility, deep semantic lineage, and durable history without forced deletion of the past.
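To make that concrete, here is a purely illustrative sketch (not AprioriDB’s actual schema) of how small the durable semantic record can be compared to the data files it can regenerate:

```sql
-- Hypothetical illustration only: the durable record is the logic and its context.
CREATE TABLE semantic_transactions (
  txn_id         BIGINT,
  system_time    TIMESTAMP,  -- when this was recorded
  effective_time TIMESTAMP,  -- when the facts it produced were true
  expression     STRING,     -- the logic that ran (e.g. the query text)
  input_versions STRING,     -- the exact versions of every input it read
  parameters     STRING,     -- the parameters it ran with
  supersedes_txn BIGINT      -- set on a correction: the logical fork point
);

-- A few kilobytes per transaction, so it can be kept forever. The bulky result
-- files can be pruned for cost and later recomputed deterministically from
-- (expression, input_versions, parameters).
```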


🔧 AprioriDB: Building True History Management

Trying to bolt advanced time travel and safe corrections onto foundations not designed for them leads to the limitations and workarounds we see today.

At AprioriDB, we believe these capabilities require rethinking the database engine itself.

We’re building a system where semantic understanding, correction, reproducibility, and auditable history are native:

  • Semantic Transactions: Recording the evaluable logic and context.
  • Native Bitemporality: Tracking system and effective time explicitly throughout the engine.
  • Preserved Logic & Provenance: Enabling safe amendment and replay of history’s meaning.
  • Enforced Determinism: Guaranteeing reliable, repeatable operations for trustworthy reconstruction.

These principles allow AprioriDB to model history correctly and support true bitemporal querying, safe logical forks for corrections, and potentially permanent semantic history, going far beyond simple snapshots.

We’re tackling these hard problems head-on, laying the groundwork for data infrastructure that is both rigorous and forgiving. But we can’t do it alone.

We are actively seeking passionate engineers, database architects, and distributed systems experts to join us in building AprioriDB.

If you share the frustration with current limitations, if you’re excited by the challenge of creating truly trustworthy data systems, and if you want to build the foundation for the next generation of data infrastructure where time travel is more than a fancy archive – we want to hear from you.

It’s 2025. It’s time our tools caught up. Let’s build it together.

👉 Explore the vision, dive into the technical challenges, and connect with us at https://aprioridb.com.


Written by Jenny Kwan, co-founder and CTO of AprioriDB.

Let’s connect! Find me on Bluesky and LinkedIn.


What do you think?

  • Biggest headache correcting historical lakehouse data?
  • Has VACUUM / expire_snapshots retention caused problems?
  • What’s missing from current “time travel”?

Share your experiences below! 👇