AprioriDB: A Manifesto for Reproducible, Trustworthy Data Systems
This document is publicly disclosed as prior art to establish the novelty of the system architecture and underlying mechanisms.
v.0.2.0 (2025-05-09)
© 2025 Jennifer Kwan. All rights reserved.
Prologue: Why AprioriDB
Before I explain what AprioriDB is — and how it works — I want to tell you why it matters to me.
This didn’t come out of nowhere.
It’s 2025 now. Almost two decades ago, I was the technical lead of a business intelligence (BI) platform. Back then, “analytics” wasn’t a buzzword yet. I wasn’t a manager. I didn’t have a degree. I didn’t even have formal education past high school.
I wouldn’t have been hired into central IT.
But I found another way in — through shadow IT.
The BI platform I ran belonged to a business unit inside a non-tech megacorp, far from central IT’s control. At the time, “business intelligence” and “management reporting” were new ideas, not yet part of corporate orthodoxy. Central IT didn’t support it. But the megacorp was big enough that each unit had its own tech budget. Little “shadow IT” teams formed — fast-moving, local to the business, a little rogue.
And, of course, there were politics. Always politics.
I was a self-taught engineer with a deep sense of ownership. Somehow, I ended up responsible for a Kimball-style data warehouse fronted by a popular ROLAP engine. (If those words mean nothing to you — don’t worry. They barely mean anything now.)
I wrote and maintained the ETL. I managed the metadata layer for a SQL-generating BI tool — a kind of early Looker, before Looker.
Life was good.
The Story of Bob and Jim
One month, I produced a set of reports — the same ones I generated every month.
Budget season was underway. One VP didn’t like the numbers. Let’s call him Bob.
Another VP did. Let’s call him Jim.
Jim used those numbers to make the case that his unit should grow — ideally at Bob’s expense.
“Look at the data,” he said. “Bob’s underperforming.”
Bob didn’t like that. But he didn’t have the political capital to push back directly.
So he did the next best thing: he questioned the numbers themselves.
After all, he’s Bob. There’s no way he was doing that badly.
So the burden fell on me. Prove that the numbers were correct.
Prove that Bob really was underperforming.
Prove that the report wasn’t broken.
I was the ant in the middle of two giants.
Jim doubled down on me. He became invested — deeply — in the correctness of my work.
“She’s never been wrong before,” he said.
And you’d think, as the technical lead of a BI platform, I could defend a report.
Any competent analyst should be able to explain where their numbers came from, right?
Long story short: I couldn’t.
Our ETL ran nightly.
Storage was expensive back then — no table partitioning, no backups, no snapshots.
Just a mutable database that changed every night.
We kept the raw input files, but our code wasn’t under version control — we had no reliable record of what logic was run, or when.
- Git didn’t exist yet. Subversion was still new. Perforce was expensive.
- We couldn’t say with certainty what code had been run, or in what order.
- And even if we could, the ETL engine didn’t enforce serial execution.
There was no isolation level.
Just side effects and best-effort scheduling.
So even if I tried to rerun the code, I wouldn’t get the same results.
And that meant I couldn’t reproduce the report — not exactly.
I couldn’t say for sure why the number was what it was.
I couldn’t trace it.
I couldn’t defend it.
If I’d had the original code, with a real execution trace and no side effects? I could’ve spent a few days and tied everything down to the penny. But I didn’t.
Jim was furious. Bob, triumphant. “There’s no proof,” he said. “It would be unfair to cut my budget.”
Both of them got headcount. And to prevent something like this from happening again, they each built their own analyst teams — completely decentralized, each doing their own data work on Excel and Access.
And my team?
We were still there.
But something broke.
We lost executive sponsorship.
The platform stalled out.
That scenario, born of early-2000s technology, still echoes today.
Imagine a critical A/B test result being questioned before a product launch, an ML model’s surprising prediction triggering a compliance audit, or a financial forecast differing wildly from expectations.
When the stakes are high, the question remains the same: can you prove exactly where your numbers came from? Can you replay the logic deterministically? Too often, buried in layers of modern tooling, the answer is still a painful ‘no’.
Optimism and Faith
“But it’s been twenty years,” you say. “Surely the technology is better now.”
“No,” I say. “The tech is flashier, yes — we have cloud data warehouses, complex orchestration tools like Airflow or Dagster, machine learning pipelines generating features and predictions. But the fundamental problem of trustworthy, reproducible history often hasn’t changed. In fact, it’s arguably worse.”
Why? The scale is bigger. The pipelines are longer, stitching together SaaS APIs, microservices, streaming data, feature stores, and ML models. Our old Kimball warehouse might be a sprawling data lake now, our simple ETL replaced by complex DAGs glued together with best guesses and bash scripts. There are simply more steps, more systems, more places for history to get lost or corrupted.
There are more opportunities to get burned.
“But no one expects reports to tie to the penny.”
That depends on what’s at stake.
In compliance, health, or finance — sometimes a single penny does matter.
Sometimes your report is the basis for a fine, a budget cut, or someone’s bonus.
“But I won’t work anywhere that political.”
Ah. You have more faith in humanity than I do.
I trust machines. Machines that are properly designed, built, and maintained.
Which brings us to AprioriDB.
1. What Most Databases Forget
The core reason I couldn’t defend Bob’s numbers, the reason that trust evaporated, stems from a fundamental design choice in most databases: they primarily care about the present. They forget.
You can ask them what a table looks like right now.
You can ask for a summary, a count, a join — as long as it’s about the current state.
But if you ask:
- What did this row look like yesterday? (Essential for audits and trend analysis)
- What was true before this update? (Critical for debugging unexpected changes)
- What changed between then and now? (Needed to understand system evolution)
Most systems will shrug.
That information is gone — overwritten the moment something new arrived, leaving critical gaps in traceability.
This isn’t a bug.
It’s how databases have always worked.
They’re built for durability and availability.
You store values. You retrieve values.
And unless you’ve built something on top to preserve history, the past is lost by default. This forces organizations into complex and often brittle add-on layers — manual snapshots, elaborate audit tables, separate event sourcing systems — just to reclaim a partial, often inconsistent view of history. It’s an afterthought, not a foundation.
And in many use cases, that’s fine.
But not in ours.
If you care about traceability and reproducibility—if you need to understand how a result was produced, not just what it is now—then knowing only the current state isn’t enough.
You need to see the steps.
You need to know what changed, when it changed, and why the result looks the way it does now.
You need a way to look back.
So let’s start small.
What if a database simply… remembered everything?
Not just the final state, but every change.
Every insert, update, delete — recorded in order.
That idea has a name.
It’s called a transaction log.
And it’s where we begin.
2. Rebuilding the Past from a Log
Imagine if your database didn’t just hold tables — it kept a log.
Every change, in order.
Every insert. Every update. Every delete.
Nothing overwritten. Nothing lost.
If you had that — and if you had the original starting state — you could rebuild the database exactly as it looked at any point in time. You’d just replay the log, step by step, until you reached the moment you care about.
That’s the basic idea behind time travel in data systems.
It’s simple. And powerful.
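To make the replay idea concrete, here is a minimal sketch in Python of a statement log applied, in order, over a toy in-memory key-value "table". The LogEntry shape and the apply function are illustrative assumptions for this document, not AprioriDB's actual interfaces.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEntry:
    txn_id: int        # position in the serial order
    statement: str     # semantic operation: "INSERT", "UPDATE", or "DELETE"
    key: str
    value: object = None

def apply(state: dict, entry: LogEntry) -> dict:
    """Apply one logged statement to a copy of the state (the input is never mutated)."""
    new_state = dict(state)
    if entry.statement in ("INSERT", "UPDATE"):
        new_state[entry.key] = entry.value
    elif entry.statement == "DELETE":
        new_state.pop(entry.key, None)
    return new_state

def replay(log, initial_state: dict, up_to_txn: int) -> dict:
    """Rebuild the database as of a given transaction by replaying the log in order."""
    state = initial_state
    for entry in sorted(log, key=lambda e: e.txn_id):
        if entry.txn_id > up_to_txn:
            break
        state = apply(state, entry)
    return state

# Usage: the state "as of" any transaction is always recoverable from the log alone.
log = [
    LogEntry(1, "INSERT", "revenue:2024-01", 100),
    LogEntry(2, "UPDATE", "revenue:2024-01", 120),
    LogEntry(3, "DELETE", "revenue:2024-01"),
]
assert replay(log, {}, up_to_txn=2) == {"revenue:2024-01": 120}
assert replay(log, {}, up_to_txn=3) == {}
```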
But only if the log actually works the way you think it does.
Let’s say the log records each transaction, one after another.
Each transaction describes:
- What it did,
- What time it happened,
- And maybe even what query was used.
So far so good. But here’s where it starts to get tricky.
📝 Side note: What kind of log are we talking about?
Not a low-level log of disk writes (like a Write-Ahead Log). Not binary diffs. Not raw data pages or block IDs. While those are essential for crash recovery, they don’t capture the meaning of the changes in a way that’s easily reusable for reproducible analysis.
Instead, we mean a log of statements — the actual operations the system executed, like INSERT, UPDATE, COPY FROM. Each transaction captures the intent, not just the physical outcome: what was queried, what logic was applied. It’s a log of meaning, not mechanics. This focus on semantic meaning, rather than physical storage details, is what allows us to reason about reproducibility and, eventually, lineage at a higher level. As a practical benefit, such a semantic log is often significantly smaller and lighter-weight than a detailed physical log.
What If Two Things Happen at Once?
Let’s say two users make changes to the same table at the same time:
- One writes a new row.
- One reads the table and generates a report.
What should the reader see?
Ideally, they either:
- See the table before the new row is added, or
- See the table after the row is fully written.
But there’s a third, more dangerous case:
They see part of the write. The row halfway through.
An in-between state that never truly existed — not before, not after — just a glitch in timing.
The system allowed two actions to overlap in an unsafe way.
This is called a race condition — a bug caused by uncontrolled timing.
It’s what happens when the outcome depends on the timing of events — and that timing isn’t controlled.
It’s one of the hardest bugs to detect, and one of the easiest ways to lose trust in a system.
If your log records that the report ran at 2:03 PM, and the write was committed at 2:02 PM, you might assume everything is fine.
But if the report saw a corrupted intermediate state — then that report is now a phantom.
You’ll never get the same result again, even if you replay the exact same log.
This is the risk of concurrency.
Modern databases allow many things to happen at once.
That’s what makes them fast.
But speed comes at a cost:
If you don’t define how those actions should interact — and enforce that rule consistently — you can’t trust what the system tells you.
And if you can’t trust the replay, you’ve lost your way to look back.
The First Rule: A Clear Order
So the first real rule is this:
Every transaction must behave as if it happened in a clear, well-defined order — one at a time, with no ambiguity.
This property is called serializability.
It means that even if operations happen in parallel, the outcome is the same as if they had been run one after the other — cleanly, without overlap, without peeking into each other’s work.
Think about accessing a shared bank account balance. Imagine the current balance is $100.
- Process A wants to deposit $50. It reads the balance ($100).
- At the same time, Process B wants to withdraw $70. It also reads the balance ($100).
Now, without serializability, they might interleave dangerously:
- Process A calculates its new balance: $100 + $50 = $150.
- Process B sees it has enough funds (based on its initial read of $100) and calculates its new balance: $100 - $70 = $30.
- Process A writes $150 to the balance.
- Process B writes $30 to the balance, overwriting A’s deposit.
- The final balance is $30. $50 has “disappeared”.
Another type of error (“double spend”) could occur if both processes checked the balance, saw $100, authorized separate $70 withdrawals based on that outdated information, and the system somehow allowed both actions to proceed, potentially leading to an unexpected overdraft.
Serializability prevents this. It guarantees that the outcome is the same as if one entire transaction finished before the other started. So, either:
- (A completes, then B starts): Balance becomes $150, then B reads $150, withdraws $70, leaving $80.
- (B completes, then A starts): Balance becomes $30, then A reads $30, deposits $50, leaving $80.
In both valid serializable outcomes, the final balance is $80, and no operations are lost. The effect is strictly sequential, preserving the integrity of the account balance, even if the system executed parts in parallel for performance.
📝 Side note: For AprioriDB, this serial order doesn’t need to reflect real-time during initial execution. It just needs to be a valid serializable order — one that the database could have used.
But once that order is chosen and written to the log, it becomes the system’s truth.
From that point on, replaying the log must honor that exact order, with strict serializability. That’s how we preserve trust and reproducibility.
Serializability Isn’t the Whole Story
There’s a stricter form of this rule — one that also respects real time.
If transaction B starts after transaction A finishes, then B must appear to happen after A in the serial order.
This version is called strict serializability. (It is closely related to linearizability, which gives the same real-time guarantee for operations on a single object.)
It combines the ordering guarantees of serializability with a sense of real-time cause and effect.
Why does that matter?
Because if someone runs a report after a change was committed, they expect to see that change.
Not eventually. Not probably. Not “if the system caught up.”
If you want trust — if you want a reproducible system of record — you need strict serializability enforced during replay. That’s the guarantee AprioriDB provides, ensuring that replaying history faithfully reflects the ‘happened-before’ relationship recorded in the log.
And that’s the bar we set for AprioriDB.
In the next section, we’ll unpack what that means for system design — and what goes wrong when that bar isn’t met.
3. Replay Is a Contract
If we’re going to rebuild the past from a log, we need to get something very clear:
Replay isn’t just a tool. It’s a contract.
If you read a table at system time T, and then read it again tomorrow by replaying the log up to time T, you should get the same answer. Always.
This sounds obvious. But it places some very real constraints on how the system is allowed to behave.
Let’s start with a simple question:
What exactly are we replaying?
Not bytes. Not blocks. Not diffs.
We’re replaying statements — high-level operations like INSERT, UPDATE, or complex transformations.
The actual queries and transformations that were originally run.
These statements aren’t just actions — they’re expressions of meaning.
If we’re going to trust them, we need to make sure they always evaluate the same way, given the same inputs.
That means: no hidden state.
No randomness.
No calls out to external systems.
No mutations of global state that live outside the log.
This is a big deal.
Why Side Effects Break Replay
Let’s say a transaction runs a query that calls a web API to fetch some data.
It stores the result in a table.
The log captures the query as it was written.
Now, weeks later, you replay the log.
That same query runs again. It makes the same API call.
But the external system gives you a different answer this time.
Now your replay has diverged.
The system is lying to you — and it’s doing so by accident.
The problem isn’t with the query itself.
It’s with the fact that the query had a side effect — something that lived outside the log, that changed between then and now.
The log can only preserve truth if everything that affects the result is either in the log or fully controlled by it.
This Leads Us to Determinism
For replay to be trustworthy, every transaction must behave deterministically.
That means:
- Given the same inputs, it produces the same outputs,
- It doesn’t depend on time, randomness, or any uncontrolled external system,
- Everything it needs to compute its result is either stored in the log, or explicitly referenced by it in a versioned way.
A simple example:
If a transaction uses a function like RAND() or UUID(), then each time the statement is evaluated, it produces a different result.
That’s fine during normal execution — but if the same statement shows up in the log and gets replayed later, it won’t behave the same way.
Unless we also store the original output of those functions as part of the log entry, the transaction becomes non-reproducible.
This is why AprioriDB enforces a strict rule:
No hidden behavior.
No clock access generating different values on replay. No randomness seeded unpredictably. No side effects reaching outside the system’s control.
Only logic + inputs = outputs.
This strict determinism has significant implications for how functions and procedures can be implemented within AprioriDB, ensuring that any potential source of variability is explicitly managed and logged.
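One plausible way to honor this rule, sketched below: record the values produced by non-deterministic functions alongside the log entry at original execution time, and substitute those recorded values on replay. The function names and the substitution map here are illustrative assumptions, not AprioriDB's actual mechanism.

```python
import random
import uuid
from typing import Optional, Tuple

def execute_statement(params: dict, recorded: Optional[dict] = None) -> Tuple[dict, dict]:
    """
    Evaluate a statement that uses RAND()/UUID()-style functions.
    First execution: generate fresh values and record them.
    Replay: substitute the recorded values so the output is identical.
    """
    record = {} if recorded is None else dict(recorded)

    def rand() -> float:
        record.setdefault("rand", random.random())
        return record["rand"]

    def new_uuid() -> str:
        record.setdefault("uuid", str(uuid.uuid4()))
        return record["uuid"]

    # The "statement": tag a row with a generated id and a random sample weight.
    row = {"id": new_uuid(), "weight": rand(), **params}
    return row, record   # `record` is persisted as part of the log entry

# Original execution: values are generated and captured in the log.
row_then, recorded = execute_statement({"name": "alice"})
# Replay weeks later: recorded values are substituted, so the result matches exactly.
row_now, _ = execute_statement({"name": "alice"}, recorded=recorded)
assert row_then == row_now
```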
Later in this paper, we’ll detail exactly how determinism is achieved in AprioriDB.
For now, let’s talk about data inputs.
Inputs Must Be Versioned Too
If a transaction references an external input — say, a file on disk or a blob in cloud storage — then that input must be versioned and retained. A statement like COPY FROM 's3://bucket/file.csv' is not enough by itself. The file might change. It might disappear.
If we want to replay the transaction accurately, we need to know exactly what content that path referred to at the time it was used.
That’s why AprioriDB captures not just the semantic statements, but also the inputs they reference — potentially by storing a content hash or even the data itself, ensuring it’s versioned and immutable.
We store what the system saw, not just what it was told to do.
This is what makes replay truthful.
The logic is deterministic.
The inputs are fixed (or their specific historical version is referenced).
The outputs can always be re-evaluated reliably.
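A minimal sketch of one way this can work, using a content-addressed blob store: at ingestion time the external file's exact bytes are pinned under their hash, the hash is recorded in the log entry, and replay reads the pinned content rather than the live path. The store layout and function names are illustrative assumptions.

```python
import hashlib

# A toy content-addressed store: bytes are kept under their own SHA-256 digest,
# so the same reference always resolves to exactly the same content.
BLOB_STORE: dict = {}

def ingest_external_input(path: str, content: bytes) -> dict:
    """Called at original execution time: pin the input's exact content."""
    digest = hashlib.sha256(content).hexdigest()
    BLOB_STORE[digest] = content
    # This descriptor is what gets written into the log entry for the COPY FROM.
    return {"statement": "COPY FROM", "path": path, "content_sha256": digest}

def resolve_for_replay(log_entry: dict) -> bytes:
    """Called at replay time: read the pinned content, never the live path."""
    return BLOB_STORE[log_entry["content_sha256"]]

# Usage: even if 's3://bucket/file.csv' changes or disappears later,
# replay still sees the bytes that were actually loaded.
entry = ingest_external_input("s3://bucket/file.csv", b"id,amount\n1,100\n")
assert resolve_for_replay(entry) == b"id,amount\n1,100\n"
```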
But now you might be wondering something else:
Are you seriously planning to replay the entire log every time someone runs a query?
And that brings us to performance — and a different way to think about database state.
4. The Database Is Just a Cache
Let’s say we’ve done everything right so far.
- We have a log of semantic statements.
- We’ve versioned every input.
- We’ve eliminated side effects.
- We can replay any transaction, at any time, and get the same result.
That’s powerful.
But it also sounds… expensive.
If we’re building a system for real workloads, we can’t afford to start from scratch every time someone runs a query.
We need a way to reuse work.
This brings us to a core idea in AprioriDB:
The database isn’t the source of truth.
It’s just a cache.
Replay Is Always Correct — But Not Always Necessary
Let’s be clear about something.
Replay is always available.
You can always rebuild the state of the database at any system time by walking the log and applying each transaction, step by step.
That’s the contract.
But in most cases, you don’t need to do that.
You’ve already run those statements before.
You’ve already seen those results.
And if the inputs haven’t changed, then the outputs haven’t either.
So why compute them again?
Instead, AprioriDB remembers the results of past evaluations — potentially materializing intermediate tables or query results — and uses them to shortcut replay when it’s safe to do so.
This doesn’t break the model.
It accelerates it.
The database becomes a materialized view of the log — cached, incrementally updated where possible, and safely discardable.
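A toy sketch of the cache idea: materialized states are memoized by log position, can be discarded at any time, and are rebuilt from the log on demand, starting from the nearest cached ancestor. The structures and cache policy here are illustrative only.

```python
# A toy materialization cache over an append-only statement log.
# The log is the source of truth; cached states are disposable accelerators.

LOG = [
    ("INSERT", "k1", 1),
    ("INSERT", "k2", 2),
    ("UPDATE", "k1", 10),
]

_materialized: dict = {}   # log position -> cached state (always safe to clear)

def state_as_of(position: int) -> dict:
    """Return the database state after `position` log entries, reusing cached work."""
    if position in _materialized:
        return dict(_materialized[position])          # cache hit: no replay needed
    # Start from the nearest cached ancestor (or the empty state) and replay forward.
    start = max((p for p in _materialized if p < position), default=0)
    state = dict(_materialized.get(start, {}))
    for op, key, value in LOG[start:position]:
        if op in ("INSERT", "UPDATE"):
            state[key] = value
        elif op == "DELETE":
            state.pop(key, None)
    _materialized[position] = dict(state)             # remember for next time
    return state

assert state_as_of(3) == {"k1": 10, "k2": 2}
_materialized.clear()                                 # "delete the database"
assert state_as_of(3) == {"k1": 10, "k2": 2}          # rebuilt identically from the log
```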
A Different Kind of Architecture
This is where AprioriDB starts to diverge from traditional databases.
Most databases treat their current state as canonical. AprioriDB treats it as optional.
The log is canonical. The state is a performance layer.
This means:
- You can delete your materialized database entirely — and rebuild it from the log.
- You can cache intermediate results — safely, deterministically.
- You can store multiple materializations — optimized for different queries, or different timelines.
In this diagram, DB Version 1 and DB Version 2 can be deleted to save storage, and reconstituted at any time.
It’s a bit like Git: the commit history is the truth.
Your working directory is just one possible checkout.
This Architecture Changes What’s Possible
By treating the database as a cache, AprioriDB can:
- Support time travel without the overhead of traditional snapshots,
- Discard and rematerialize partitions or entire datasets on demand, potentially saving storage costs,
- Share intermediate results across queries and across time, improving query efficiency,
- Optimize storage and compute for what’s queried most, not just blindly storing the latest state.
You gain speed — without losing correctness.
You gain flexibility — without giving up determinism.
You gain clarity — because the log is always the ground truth.
It took me years to fully realize: the database wasn’t the truth. It was just a convenience — a cache layered over a deeper reality.
We’ve built something rare: a system that remembers perfectly and replays exactly.
But knowing what happened isn’t the same as understanding why it happened.
In the next section, we’ll pause to reflect on what we’ve gained — and what’s still missing.
5. Reproducibility Is Just the Beginning
Let’s take a breath.
So far, we’ve laid down a system with rock-solid guarantees:
- Every change is recorded, semantically and deterministically.
- Every input is versioned.
- Every output is reproducible.
We can discard our database and rebuild it from scratch — and it will come back exactly as it was.
That’s not normal.
That’s rare.
That’s something to be proud of.
It means we’ve built a database that doesn’t just store facts — it remembers how those facts came to be.
That alone would be enough to fix most of the trust problems in modern data systems.
But it’s not the end of the story.
From What Happened to Why It Happened
Reproducibility means you can see what happened.
But what if you want to ask why?
- Why is this value the way it is?
- Where did it come from?
- What inputs influenced it?
- Which logic path assigned it?
With what we’ve built so far, those answers are technically available — if you’re willing to replay the log, trace every statement, and reason through the outputs manually.
That’s not lineage.
That’s archaeology.
We want more than that.
We’re Going for the Gold: Cell-Level Lineage
We don’t just want to know what the system did.
We want to know why every cell in the system exists in its current form.
That’s lineage — not at the table level, not even at the column level, but at the level of individual values.
It means being able to click on any value and ask:
“Why are you here?”
And get a real answer — one that names its inputs, its expression, its place in the system’s history.
But we’re not there yet.
The Next Step Isn’t What You Expect
You might think the next step would be to add a lineage API.
Or a graph view.
Or some kind of metadata overlay.
But no.
Those would all be external layers — add-ons that try to explain a system after the fact.
We want something deeper.
We want to build a system that can change its mind — that can go back in time and revise its own history in a way that’s still safe, still reproducible, and still auditable.
That means we need to solve a harder problem first.
We need to solve undo.
Undo Isn’t a Button — It’s a Foundation
Undo sounds like a UI feature.
A nice-to-have.
But in a system built on a log, undo is something deeper.
It’s a form of truthful revision. It asks:
- What happens if I want to remove something from the past?
- What if a piece of data was wrong — not logically, but ethically, legally, or contractually?
- Can I erase it in a way that preserves the rest of the system?
Undo is the first sign that we’re working with a living history — not a static archive.
But this isn’t just about elegant architecture.
It’s about what happens at 2am.
When a bad update slips through.
When a mistaken merge gets deployed.
When something breaks and you have no idea what just changed — or why.
If you’ve ever been on-call, you know this moment.
And that’s where we’ll go next.
Interlude: What 2AM Taught Me About Undo
I don’t remember exactly what broke.
It was sometime in the early 2010s.
I was at home.
We didn’t have PagerDuty yet — it was probably Nagios that woke me.
The orchestration was Jenkins, of all things.
The job was ETL.
Of course it was.
I wasn’t a morning person.
I still had sleep in my eyes.
The alert wasn’t gentle.
I don’t remember whether the job itself failed or if a downstream data test triggered afterward.
I don’t remember exactly what I did next.
But I remember the feeling.
Half-awake, half-panicked.
Hoping it wasn’t serious.
Hoping no one important was waiting on the result.
Hoping this wasn’t one of the mornings that started with dread and ended with a data patch.
It wasn’t uncommon back then.
These things happened.
The system was duct-taped together — not because we were sloppy, but because it was the best we had.
I don’t think I got in trouble.
I don’t think the damage was severe.
But I remember the sense that every fix was a gamble.
When you’re bleary-eyed and the pipeline is on fire, your choices are limited:
- You rerun things and hope they land right.
- You monkey-patch tables to make dashboards work.
- You touch production data directly and pray you don’t make it worse.
Undo wasn’t an option.
There was no button.
No protocol.
No clean rollback.
Just the uncomfortable knowledge that the only way to fix the past was to rewrite the present — and carry the mess forward.
I don’t miss those mornings.
And I don’t want anyone else to live in a system where undo is a myth — where correction is dangerous, and trust is brittle.
That’s why undo matters.
Not just for lineage.
Not just for reproducibility.
But for the human beings who carry the burden when things go wrong.
In the next section, we’ll talk about what undo really looks like in a system like AprioriDB — and what it takes to revise the past without breaking the future.
6. Revising the Past Without Breaking the Future
Undo means something different in AprioriDB.
In most systems, undo is a rollback. You revert a change, you bring things “back,” and you try to pretend it never happened.
That’s not how it works here.
AprioriDB doesn’t forget.
Every transaction lives in the log. Every change is part of the story.
But just because we keep history doesn’t mean we’re stuck with it.
Undo, in our model, means:
“I want to revise the past — and preserve the integrity of everything that came after.”
This isn’t just about mistakes. It’s about evolution. Correction. Repair.
We Can’t Just Delete Things
Because the log is the source of truth, we can’t just delete a bad transaction and pretend it was never there.
That would invalidate every transaction that depended on it.
You can’t safely erase the middle of a story unless you also rewrite the ending.
But we don’t want to rewrite the ending. We want to preserve what’s still good, and correct what went wrong, all while keeping the full story intact.
That means undo has to work differently here. It’s not deletion. It’s branching.
Undo as Branch: A Softer Kind of Correction
Undo in AprioriDB is a forward-moving operation.
It adds a new transaction to the log — one that overrides the effects of a previous one.
It doesn’t mutate the past. It doesn’t erase what was done. It acknowledges that the past happened — and then moves forward from it in a new direction.
This creates a new logical branch of the database state. Visually, as we’ll see in the next section, this looks like the main timeline continuing via the log, but with the ‘current’ state effectively pointing back to a pre-mistake database version.
- The original timeline (including the mistake) remains available for historical queries.
- The correction lives on as a new transaction in the log.
- The current state reflects the undo — but the historical state still tells the full truth of what occurred.
This is undo as revision — not as denial.
We Tell the Truth About the In-Between
If someone asks what the system looked like during the period when the bad data was active, AprioriDB can still answer truthfully.
That state was real.
It existed.
Reports were run based on it.
Dashboards reflected it.
People may have acted on it.
We don’t pretend it didn’t happen.
What undo gives us is a way to move forward without corrupting that memory.
It allows us to say:
“Yes, that state was true for a time — but here’s the correction, and here’s when it took effect in the system log.”
Why This Matters
Undo is often thought of as a convenience.
But in log-based systems striving for perfect reproducibility, undo is a foundational operation — a way to preserve trust in a system that evolves over time.
It gives us:
- Safe correction without data loss or history rewriting,
- A clean path forward when mistakes are made, preserving operational stability,
- The ability to inspect the system before and after the correction — without losing either perspective, critical for audits and debugging.
This is the gentler kind of undo — the kind that acknowledges the past, but lets us take responsibility for it through traceable revisions.
In a later section, we’ll explore a stricter form of revision designed not just for correction, but for compliance needs like data removal.
That’s redaction, and it plays by different rules within this framework.
But for now, this is where we start:
Undo that lets us keep moving — without pretending nothing went wrong.
In the next section, we’ll look at what it means to live in a branching world — and how time itself changes shape when we model history this way, illustrated with diagrams.
7. Illustrating Undo
Let’s make the concept of UNDO visual. We’ll trace the state of the system, its log, and its database versions through a series of simple operations.
The Initial State (Time 1)
We start with an initial state, perhaps after database creation. We’ll call this Time 1. The first log entry records this initial setup and points to the first database version.
- The Log contains one entry (Log Entry 1).
- There is one Database Version (DB Version 1).

Log Entry 1 is associated with the creation of DB Version 1.
After INSERT (Time 2)
We then make a change: an INSERT of a new row. This adds a new entry to the log and creates a new database version derived from the previous one. This is Time 2.
- A new log entry (Entry 2) is appended, recording the INSERT.
- A new database version (DB Version 2) is created, reflecting the state after the insert. It derives from DB Version 1.

Entry 2 points to the new version DB Version 2 it produced.
At this point, we already have basic “time travel”:
- Querying the present (latest log entry Entry 2) gives DB Version 2.
- Querying the past (as of Entry 1) gives DB Version 1.
(Note: For time travel queries, the system needs to associate commit timestamps with log entries, allowing queries like “AS OF timestamp X”. This detail isn’t shown in the diagram for simplicity but is crucial for implementation.)
After UPDATE (Time 3)
Next, we perform an UPDATE on the previously inserted row. This follows the same pattern: append to the log, create a new derived database version. This is Time 3.
- Log entry Entry 3 records the UPDATE.
- Database version DB Version 3 reflects the updated state, derived from DB Version 2.

Entry 3 points to DB Version 3.

Now:
- Querying the current state (Entry 3) gives DB Version 3.
- Querying as of Entry 2 gives DB Version 2.
- Querying as of Entry 1 gives DB Version 1.
UNDO (Time 4)
Now, the key step: we decide the UPDATE (recorded in Entry 3) was a mistake.
- We issue an UNDO command targeting Entry 3.
- This adds an UNDO entry to the log but crucially does not create a new database version derived from DB Version 3.
- Instead, it points the system’s current state effectively back to the version before the mistake. This is Time 4.

- A new log entry (Entry 4) records the UNDO operation itself. The log remains append-only.
- No new database version is created.
- The UNDO log entry (Entry 4) now points to DB Version 2 (dbv2), which represents the state before the undone UPDATE (Entry 3).
Now, observe the effect on queries:
- Querying the present (the state associated with the latest log entry, Entry 4) now returns results based on DB Version 2 — the state we wanted to restore.
- Querying the past, as of Entry 3, still shows DB Version 3. The UPDATE truly existed at that point in the system’s history, and we haven’t erased that fact.
The UNDO did not delete history. It added a new entry to the log that changed which database version represents the “current” state, effectively rolling back the effect of the mistaken transaction while preserving the full record of what actually occurred.
Key Takeaways
- Each transaction’s log entry records the operation and is associated with the database version reflecting the state after that operation (except for UNDO).
- Undo doesn’t mutate past log entries or database versions — it moves forward by adding a new log entry that points the current state back to an earlier database version.
- The log remains strictly append-only.
- Past states, including mistakes, remain visible via time travel queries — but the system truthfully records the correction and makes the corrected state the current one.
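The walkthrough above can be condensed into a small sketch: an append-only log whose entries point at immutable database versions, an undo that appends a new entry re-pointing at the pre-mistake version, and time travel that simply reads whichever version a given entry points to. Identifiers and structures are illustrative, not AprioriDB's storage format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Version:
    id: str
    data: dict              # immutable snapshot (copied, never mutated)
    parent: Optional[str]   # provenance: the version this one derives from

versions = {"dbv1": Version("dbv1", {}, None)}
log = [{"entry": 1, "op": "INITIAL", "points_to": "dbv1"}]

def commit(op: str, new_data: dict) -> None:
    """A data-modifying transaction: one new log entry plus one new derived version."""
    parent = log[-1]["points_to"]
    new_id = f"dbv{len(versions) + 1}"
    versions[new_id] = Version(new_id, dict(new_data), parent)
    log.append({"entry": len(log) + 1, "op": op, "points_to": new_id})

def undo(target_entry: int) -> None:
    """UNDO: append a log entry pointing back to the version before the mistake."""
    undone = next(e for e in log if e["entry"] == target_entry)
    before = versions[undone["points_to"]].parent
    log.append({"entry": len(log) + 1, "op": "UNDO", "points_to": before})

def as_of(entry: int) -> dict:
    """Time travel: the state is whatever version that log entry pointed to."""
    e = next(e for e in log if e["entry"] == entry)
    return versions[e["points_to"]].data

commit("INSERT", {"row": "v1"})     # Time 2 -> dbv2
commit("UPDATE", {"row": "v2"})     # Time 3 -> dbv3
undo(3)                             # Time 4 -> log grows, current points back to dbv2
assert as_of(4) == {"row": "v1"}    # the present reflects the pre-mistake state
assert as_of(3) == {"row": "v2"}    # the mistaken UPDATE remains visible in history
```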
Interlude: What Salesforce Taught Me About Undo
I used to run a data team for a business unit where one of our primary data sources was Salesforce.
We were a major downstream dependency - everyone knew it.
Or at least, they were supposed to.
One year, the Salesforce team upgraded their system.
It included major schema changes - field renames, datatype shifts, the works.
They forgot to tell us.
(How often are data teams forgotten by operational and transaction processing teams?
How often are we assumed to be invisible, invincible, or irrelevant - until something breaks?)
By that time, we’d matured our data warehouse practices.
We partitioned tables.
We kept prior versions online — at least for the expensive core fact tables.
But data marts?
The friendly views that reporting tools used?
Those were still big blobs of current state —
flattened by the limitations of SQL generation engines,
optimized for speed at the cost of traceability.
When did the upgrade happen?
Annual budget season.
Exactly when executives were scrutinizing numbers.
Exactly when accuracy mattered most.
The result?
- Data marts were broken.
- Core warehouse fact tables were mangled — compromises we’d made for performance came back to haunt us.
- Critical dashboards showed numbers no one could trust.
What did we do?
We restored from backup.
We adjusted ETL to accommodate the new Salesforce schema.
We reran everything we could.
It took a week.
A week during which:
- Budget planning for the entire business unit stalled,
- Executives made decisions on bad or missing data,
- Some reports quietly froze in place, never to be reconciled.
And worst of all:
Even after restoring,
even after fixing,
there were discrepancies we couldn’t explain.
Numbers that had been visible during the outage —
but that we could no longer reproduce once “fixed.”
All we could do was shrug,
blame the Salesforce upgrade,
and move on.
Another set of ghosts — quietly dismissed.
What I would have given for real undo.
Not restore-from-backup.
Not brute force overwrite.
Not panic and prayer.
Real undo:
- The ability to surgically rewind just the affected transactions,
- The ability to preserve trustworthy past states,
- The ability to show exactly what had changed, when, and why.
Undo not as denial -
but as truthful revision.
Undo as a way to keep faith with the past -
even when the world forgets you’re there.
In the next section, we’ll look at the deeper implications of this model:
How UNDO affects our model of time itself — and why AprioriDB needs to think about time in two different dimensions.
8. The Shape of Time
The UNDO operation we just illustrated raises a serious question.
Did we move forward in time — or backward?
At first glance, it feels like we rewound history.
After all, the UNDO log entry (Entry 4) resulted in the system’s current state reflecting an earlier database version (DB Version 2).
But if you look more closely, you’ll see something important:
We didn’t erase the past.
We didn’t delete the UPDATE log entry (Entry 3).
We didn’t pretend that a report generated between the UPDATE and the UNDO never happened (it would have been based on DB Version 3).
The transaction log grew strictly forward. Entry 4 (the UNDO) is newer than Entry 3 (the UPDATE).
The history recorded in the log is still strictly append-only.
So in one sense, system time moved forward — simply recording the corrective action.
In another sense, the meaning of the “current state” effectively moved backward to a previous point in the data’s evolution.
This seeming paradox forces us to treat time carefully.
It turns out there isn’t just one kind of time inside AprioriDB.
There are two.
And to model them correctly, we need to introduce a concept called bitemporality.
Bitemporality: Two Clocks Ticking
Bitemporality is a well-known concept in temporal database systems, essential for handling complex histories.
It recognizes that there are two different clocks:
- System Time: When was a fact recorded or transaction committed in the system log? This clock always moves forward.
- Effective Time: When was a fact considered true or valid within the modeled world of the database’s state? This timeline can branch and revisit past states due to corrections like UNDO.
In AprioriDB:
- System Time corresponds to the commit timestamp associated with each log Entry.
- Effective Time relates to the state represented by a specific DB Version and its lineage.
These two dimensions of time are related — but crucially distinct.
📝 Example Revisited:
After the UNDO at Time 4:
- Querying at a System Time after Entry 4 was committed, looking at the current state, shows data based on DB Version 2. The Effective Time perspective of the current database is one where the UPDATE never occurred (INITIAL -> INSERT).
- Querying at a System Time between Entry 3’s commit and Entry 4’s commit shows data based on DB Version 3. In that slice of system history, the UPDATE was effective.
Thus, two perspectives coexist, managed by bitemporality:
- What the system log recorded about the sequence of events (System Time).
- What any given database version’s lineage believes happened to its own state (Effective Time).
Why Bitemporality Matters
Without rigorously managing bitemporality:
- Undo operations might appear to “erase” history, creating irreproducible states and breaking audit trails.
- Queries across different system times could silently diverge without a clear explanation.
- There would be no reliable way to explain precisely how and why data changed, especially when corrections occurred.
With bitemporality built-in:
- Every query operates against a consistent view defined by both System Time and Effective Time.
- We can reason precisely about corrections, branching histories, and even speculative future states.
- We can reliably trace both the story of what happened to the system (the log) — and what the database believed was happening at any point in its evolution (the version lineage).
On Consistency
AprioriDB chooses a strong approach here:
- Every database version represents a globally consistent snapshot of data at a specific point in its effective timeline.
- Queries against the system always receive a coherent view tied to a specific system time and a consistent effective timeline. You never get a mixed or partial view resulting from concurrent operations bleeding into each other inappropriately.
- Every query against the system obeys a single, valid, internally consistent timeline based on the requested system time.
This is unusual.
Many temporal systems, or systems retrofitting time-travel, might relax consistency guarantees (e.g., offering snapshot isolation instead of strict serializability during replay) for performance reasons.
AprioriDB prioritizes verifiable trust. We believe that for use cases involving auditing, compliance, and high-stakes analytics, unwavering consistency must be the foundation.
Performance Considerations
Does this design impose a cost?
Yes — but the cost is placed intentionally:
- Writes incur a modest overhead to maintain the log, version inputs if necessary, and ensure consistency. This is the price of perfect history.
- Reads, especially time-travel queries, can potentially be faster than in traditional systems retrofitting history, because:
- Database versions, once materialized, represent clean, deterministic snapshots.
- Queries operate over these coherent snapshots without needing to manually reconstruct history or deal with complex temporal predicates across changing data.
- Time travel is inherent to the model, not a layer added on top requiring complex diffing or snapshot management during query execution.
For many real-world workloads — analytics, regulatory reporting, data science, financial modeling — this trade-off (slightly more deliberate writes for faster, more trustworthy reads and history analysis) isn’t just acceptable. It’s often necessary.
Forking Time: Branching After Undo
Now, let’s continue our example from Section 7. We are currently in Time 4, where Entry 4 performed an UNDO pointing the current state back to DB Version 2.
What happens if we now perform a new data modification, like a DELETE, targeting the current state? Let’s call this Time 5.
Look closely at the Database Versions:
- Entry 5 (the DELETE) occurred after Entry 4 (the UNDO) in System Time.
- The DELETE operation applies logically to the state that Entry 4 pointed to, which was DB Version 2.
- Therefore, Entry 5 produces a new database version, DB Version 3a, which is derived from DB Version 2.
- Crucially, DB Version 3a exists on a different branch of effective time history than DB Version 3 (which resulted from the UPDATE).
Now, we have two distinct futures branching from the same past state (DB Version 2):
- One future (DB Version 3) represents the timeline where the UPDATE occurred.
- Another future (DB Version 3a) represents the timeline where the UPDATE was undone, and then a DELETE occurred instead.
This is branching effective time.
Note that in the effective timeline leading to DB Version 3a (DB Version 1 -> DB Version 2 -> DB Version 3a), the UPDATE transaction effectively never happened from the perspective of the data’s state evolution along that path.
Trees of Time, Lines of History
This leads to a crucial insight about how AprioriDB models time:
At the system level, considering all possible database versions ever created, the structure of effective time forms a tree.
- Branches can split off from any historical database version whenever an operation like UNDO followed by new modifications occurs.
- Alternate potential histories and futures can coexist within the system’s stored versions.

However, from the perspective of any single database version, its history (walking backward via the becomes links) remains strictly linear.
- Each database version has exactly one parent version it was derived from.
- Following the lineage backward always traces a straight line back to the initial state.
This distinction is critical:
- The system’s log and version store manage the complexity of branching possibilities (the tree/DAG).
- Any individual state or query result exists within a simple, coherent, linear effective timeline.
It’s how AprioriDB supports powerful concepts like correction, exploration, and branching — without sacrificing logical consistency or creating chaos for users querying the data.
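To see both halves of this claim in miniature, here is a sketch of the version store after the sequence above (INITIAL, INSERT, UPDATE, UNDO, DELETE): the store as a whole is a tree, because DB Version 2 has two children, yet walking any single version's parent links yields a strictly linear lineage. Identifiers are illustrative.

```python
# Parent links in the version store after: INITIAL, INSERT, UPDATE, UNDO, DELETE.
parents = {
    "dbv1": None,       # initial state
    "dbv2": "dbv1",     # after INSERT
    "dbv3": "dbv2",     # after UPDATE  (the timeline where the UPDATE happened)
    "dbv3a": "dbv2",    # after DELETE following the UNDO (sibling branch of dbv3)
}

def lineage(version: str) -> list:
    """Walk the `becomes` links backward: always a single, linear path to the root."""
    path = []
    while version is not None:
        path.append(version)
        version = parents[version]
    return list(reversed(path))

# The store is a tree (dbv2 has two children), but each lineage is a straight line.
assert lineage("dbv3") == ["dbv1", "dbv2", "dbv3"]      # the UPDATE happened here
assert lineage("dbv3a") == ["dbv1", "dbv2", "dbv3a"]    # here it effectively never did
```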
In the next section, we’ll explore how we can harness this branching capability proactively:
How we can use branching effective time not just to fix mistakes, but to stage, validate, and safely publish speculative futures without disrupting the present.
9. Speculative Futures
Undo is powerful. It gives us a way to fix mistakes without corrupting the past.
But wouldn’t it be even better… if we could avoid mistakes in the first place?
Imagine:
- Executing complex transactions, like large ETL jobs or schema migrations, in a dedicated staging environment, completely isolated from production.
- Running extensive validation checks against the results in staging.
- Only publishing those changes atomically to production after we’re fully confident they are correct.
This wouldn’t just save face after a mistake. It would significantly improve service availability and data integrity:
- No database locks are needed on production tables for long-running staging jobs.
- No risk of production users seeing half-finished, inconsistent, or incorrect data during processing.
- No disruption to production reads while complex work happens quietly behind the scenes.
Instead of touching production data directly and risking errors, we work safely off to the side — preparing the future in advance. When everything looks good, we publish the new state — cleanly, instantly, and safely.
Introducing Multiple Databases
To make this possible, AprioriDB elevates the concept of branching history (seen with UNDO) into a core feature:
A single AprioriDB system can host multiple live, named databases concurrently.
Each named database:
- Can be read from independently.
- Can be written to independently (transactions target a specific named database).
- Can evolve on its own effective timeline by accumulating database versions.
- Can branch off from any existing database version at any point in the system’s history.
This unlocks a natural and safe workflow:
- Branch: Create a new database (e.g., Staging) as a copy of another (e.g., Production) at a specific point in time.
- Modify: Apply changes (ETL, updates, etc.) to the Staging database. This creates new database versions linked only to Staging.
- Validate: Run checks, queries, and comparisons against Staging without impacting Production.
- Publish: If validation passes, atomically update Production to point to the final, validated database version created in Staging.
📝 Note: If you’re familiar with CLONE or zero-copy cloning features in systems like Snowflake or ZFS, this might seem similar. However, in AprioriDB, this branching isn’t merely a storage optimization — it’s a first-class citizen deeply integrated with the bitemporal history model and the append-only log. It’s fundamental to how time, versions, and state are managed.
What Is a Database Name vs. a Database Version?
Let’s be precise about two critical concepts:
A Database Version (like DB Version 1, DB Version 2 in previous diagrams) is an immutable, structured, consistent snapshot of data. It represents the state resulting from a specific sequence of transactions in an effective timeline. Database Versions form the nodes in the branching tree/DAG of effective time history we saw earlier. They are analogous to commit objects in Git.

A Database Name (like Production or Staging) is a mutable label or pointer. At any given point in System Time, this label points to one specific, immutable Database Version. Transactions target a Database Name, and operations like CREATE DATABASE or PUBLISH primarily work by creating or updating these pointers. You can think of the history of where a Database Name like Production has pointed over time as being conceptually similar to Git’s reflog for a branch like main — it tracks the sequence of commits the branch label referred to, even if those commits aren’t direct ancestors.
Illustrating Staging and Publishing
Let’s walk through an example using this pointer concept.
(Note: For clarity, diagrams show the sequential states of the log and the ‘Database Names’ container. Each new log entry corresponds to a new state of the ‘Database Names’ container, reflecting pointer changes or additions.)
Time 1: Initial State
We begin with a single database named Production. The system is freshly initialized. Entry 1 records this, creating the first Database Names state (Names @ T1) where Production points to the initial DB Version 1.
Time 2: After an INSERT into Production
An INSERT targets the Production database. Entry 2 records this.
It results in a new DB Version 2 derived from DB Version 1.
It also leads to a new Database Names state (Names @ T2) where the Production pointer is updated to point to this new version, DB Version 2.
Time 3: Creating the Staging Database
We issue a CREATE DATABASE Staging FROM Production command. Entry 3 records this.
This operation does not create a new database version. It only creates a new Database Names state (Names @ T3) where a new pointer named Staging is added. This new Staging pointer is initialized to point to the same database version that Production currently points to (DB Version 2).
Now, both Production and Staging resolve to the same data state (DB Version 2), but they are independent pointers ready to diverge.
Time travel is accurate. At Time 2, Staging did not yet exist.
Time 4: Making Changes in Staging
We now execute an UPDATE targeting the Staging database. Log Entry 4 records this.
This creates a new database version (DB Version 3) derived from the version Staging was pointing to (DB Version 2).
It also leads to a new Database Names state (Names @ T4). In Names @ T4, the Staging pointer is updated to point to the new DB Version 3, while the Production pointer remains unchanged, still pointing to DB Version 2.
- Production users continue to see the state represented by DB Version 2.
- Staging users (or validation processes) now see the state represented by DB Version 3.
- The two databases have diverged safely.
Time 5: Publishing Staging to Production
After validating the changes in Staging (which currently points to DB Version 3), we decide to publish them.
We issue a PUBLISH Production FROM Staging command. Entry 5 records this.
This operation, like CREATE DATABASE, does not create a new database version. It only creates a new Database Names state (Names @ T5). In Names @ T5, the Production pointer is simply updated to point to the same database version that Staging currently points to (DB Version 3).
Now, both Production and Staging point to DB Version 3. The changes made in staging are now live in production.
Crucially, this publish operation is instantaneous and atomic.
Why? Because it only involves changing where the Production label points within the latest Database Names state — no large-scale data copying or modification is required at the moment of publish. It’s just a metadata update recorded in the log.
No locks were needed on production during the staging work. No intermediate states were exposed. The transition was clean.
Why This Matters
This model of named databases as pointers over a shared, immutable version history unlocks powerful and safe workflows:
Safe ETL/ELT:
- Run complex data transformations or ingestions in a forked Staging database.
- Perform thorough validation, quality checks, and even schema evolution tests in isolation.
- Compare Staging directly against Production if needed.
- Publish the validated results to Production atomically when ready.
Hotfixes:
- Branch Production from an earlier, pre-problem database version into a Hotfix database.
- Apply corrections safely in the Hotfix branch.
- Test the fix thoroughly.
- Publish the corrected version back to Production without impacting live traffic reading the (uncorrected) state until the moment of publish.
Scenario Modeling:
- Create multiple branches (ScenarioA, ScenarioB) from Production.
- Explore alternative data manipulations or “what-if” analyses in each branch without risking production data.
Continuous Availability:
- Production users stay online, reading from a stable database version.
- Reads remain consistent throughout potentially long-running background processes.
- Complex changes are prepared safely in parallel branches.
The benefits compound:
- Minimal to zero downtime for publishing updates.
- Avoidance of locking contention on production data.
- No accidental exposure of half-finished or unvalidated work.
When you’re ready, publishing validated changes is just a simple, safe, atomic pointer update recorded in the log.
In the next section, we’ll delve into a subtle but critical distinction that arises from this model: the difference between the history of these pointers and database versions (like the reflog), and the concept of data provenance — understanding the causal flow of data, which is essential for true cell-level lineage.
10. History vs. Provenance: Disentangling Timelines
By now, you’ve seen how AprioriDB can:
- Faithfully record every operation in an append-only log.
- Reconstruct database states from any point in system time.
- Manage corrections gracefully using UNDO.
- Stage changes safely using named database branches (pointers).
- Publish updates atomically without disruption.
These capabilities rely on carefully managing different kinds of history. But to achieve the ultimate goal – understanding why a specific piece of data has the value it does (cell-level lineage) – we need to make a subtle but critical distinction.
Not all histories track the flow of data itself.
When we talk about “history” in AprioriDB, we’re actually dealing with three distinct timelines, evolving based on the append-only log:
1. System History (Log Entry Sequence):
   - This is the straightforward, linear sequence of Log Entries recorded over System Time.
   - It represents what operations the system performed, in what order they were committed.
   - This history is strictly causal: Log Entry <N+1> happens after and depends on the system state resulting from Log Entry <N>.
2. Database Name History (Pointer History / “Reflog”):
   - This tracks the sequence of Database Versions that a specific Database Name (like Production or Staging) pointed to over System Time.
   - It reflects how labels were assigned and reassigned (e.g., via INSERT, UPDATE, CREATE DATABASE, PUBLISH, UNDO).
   - As we saw with PUBLISH and UNDO, this history is not necessarily causal from a data derivation perspective. A name can abruptly jump to point to an unrelated or older database version.
   - This is analogous to Git’s reflog: it shows the history of the pointer (branch name), not necessarily a direct line of data ancestry.
3. Database Version History (Data Provenance):
   - This traces the lineage of Database Versions themselves, following the becomes links back through Effective Time.
   - It represents the actual flow and transformation of data from one consistent state to the next.
   - This history is causal: Database Version <N+1> is directly derived by applying a specific data-modifying transaction (like INSERT, UPDATE, DELETE) to its predecessor Database Version <N>.
Understanding the difference, especially between Name History (#2) and Version History (#3), is one of the keys to unlocking true lineage.
Following the Arrows: Causality Matters
Let’s revisit the diagrams from Section 9, keeping these distinctions in mind.
- The Log shows System History: a linear sequence Entry 1 -leads-to-> Entry 2 -leads-to-> … -leads-to-> Entry 5. Causal.
- The Database Versions show Data Provenance: DB Version 1 -becomes-> DB Version 2 -becomes-> DB Version 3. This path is causal. If we had branched further (like in Section 8’s DB Version 3a), that would be another causal path: DB Version 1 -becomes-> DB Version 2 -becomes-> DB Version 3a.
- The Database Names History tracks the Database Name History: following Production ("Production" @ T1 -becomes-> "Production" @ T2 -becomes-> … -becomes-> "Production" @ T5) shows it pointing sequentially to DB Version 1, then DB Version 2, DB Version 2, DB Version 2, and finally jumping to DB Version 3 due to the PUBLISH.

This jump ("Production" @ T4 to "Production" @ T5) reflects a pointer change, not direct data derivation from the Production perspective at Time 4.
Why is causality important?
When history is causal (System History, Version History/Provenance), we can trace origins and dependencies directly. We can confidently determine how data came to be. This is provenance. Provenance provides the raw material needed to construct lineage information (e.g., which source rows contributed to this aggregate value?).
When history is merely sequential but not causal (Database Name History), it tells us what state was associated with a label at a given system time, which is essential for time travel queries (“Show me `Production` as of last Tuesday”), but it doesn’t directly tell us the data flow leading to that state in cases involving pointer jumps like `PUBLISH` or `UNDO`.
Time 6 with History: Undoing the Publish
Let’s add one final step to drive the point home. Suppose we decide the `PUBLISH` at Time 5 was premature and issue a command to `UNDO` that specific operation (`Log Entry 5`).
`Log Entry 6` records this `UNDO`. Like the `UNDO` in Section 7, this operation does not create a new Database Version. It creates a new `Database Names` state (`Names @ Time 6`) where the `Production` pointer (`"Production" @ Time 6`) is simply moved back to point to the database version it pointed to before the undone `PUBLISH` operation, which was `Database Version 2`. The `Staging` pointer (`"Staging" @ Time 6`) remains unaffected, still pointing to `Database Version 3`.
Observe at Time 6:
- System History: Advanced linearly to `Entry 6`.
- Database Version History: Unchanged. No new versions were created by the `UNDO` pointer operation. The causal paths `DB Version 1 -> DB Version 2` and `DB Version 1 -> DB Version 2 -> DB Version 3` still exist.
- Database Name History: The `Production` pointer now points to `DB Version 2`, while `Staging` still points to `DB Version 3`. The history of the `Production` name shows it pointed to `DB Version 1`, then `DB Version 2` (through Time 4), then `DB Version 3`, and now `DB Version 2` again. This pointer history is recorded, but the jump from `DB Version 3` back to `DB Version 2` isn’t a data derivation step.
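Mechanically, this `UNDO` of a `PUBLISH` is nothing more than an appended log entry plus a pointer move. The sketch below (illustrative Python; the function name `undo_publish` and the tuple layouts are assumptions, not AprioriDB’s API) makes that explicit: the set of Database Versions is never touched.

```python
# Toy model: log entries are (system_time, description); name assignments are
# (system_time, name, version_id). A sketch of the mechanism, not the real engine.

def undo_publish(log, name_history, name, publish_time, now):
    # 1. System History: the UNDO itself is recorded as a new, append-only log entry.
    log.append((now, f"UNDO of PUBLISH at T{publish_time}"))

    # 2. Find what the name pointed to just before the undone PUBLISH.
    prior = max(
        (a for a in name_history if a[1] == name and a[0] < publish_time),
        key=lambda a: a[0],
    )

    # 3. Name History: append a new assignment pointing back to that earlier version.
    name_history.append((now, name, prior[2]))
    # 4. Version History: deliberately untouched -- no version is created or removed.

# Rough mirror of Times 1-6: Production jumps back from DB Version 3 to DB Version 2,
# while Staging keeps pointing at DB Version 3.
log = [(t, f"Entry {t}") for t in range(1, 6)]
name_history = [(1, "Production", 1), (2, "Production", 2), (3, "Staging", 3),
                (5, "Production", 3)]                       # T5 was the PUBLISH
undo_publish(log, name_history, "Production", publish_time=5, now=6)
assert name_history[-1] == (6, "Production", 2)             # the pointer moved back
```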
What Is History?
When a user asks, “What is the history of `Production`?”, the answer depends on context:
- Are they asking for the Name History (the sequence `DB Version 1`, `DB Version 2`, `DB Version 3`, `DB Version 2` that the label `Production` pointed to over system time)? This is like the `reflog`.
- Or are they asking for the Data Provenance of the current state of `Production` (which at Time 6 is `DB Version 2`, whose provenance is `DB Version 1` -> `DB Version 2`)? This is the causal data flow.
Disambiguating these is essential for accurate lineage and understanding system evolution.
📝 UI/UX Challenge: Getting the user interface and user experience (UI/UX) correct to intuitively expose and navigate these distinct-but-related histories (System Log, Name History/Reflog, Version History/Provenance) is a significant challenge. It requires careful design to avoid overwhelming the user while providing the necessary depth for different use cases (simple time travel vs. deep lineage tracing).
Why It Matters
Most data systems implicitly conflate these different timelines or, more commonly, lose them entirely (especially Name History and Version History). Even advanced systems offering time travel or basic lineage often don’t rigorously separate the history of labels/pointers from the causal provenance of data versions.
Comparison with Project Nessie
For instance, even advanced versioning systems like Project Nessie, which bring Git-like branching and commit semantics to the data lake (akin to AprioriDB’s Database Name and Version histories), primarily focus on the state of branches at specific commit points (effective time). Nessie lacks a first-class, unified System History that records the absolute sequence of all operations (including branch creations, merges, administrative actions) across the entire repository over wall-clock time.
Without this complete System Time perspective integrated with the versioning, reconstructing the absolute sequence of all events that influenced the system state, or rigorously reasoning about causality across complex branch interactions, can become ambiguous.
Comparison with Delta Lake
The biggest difference with Delta Lake is the level of granularity: Delta Lake operates at the table level. I consider this loss of cross-table consistency to be a fatal flaw. Reads are not really reproducible, because there is no central system clock to produce transaction timestamps.
In addition, branches are also defined at the table level. This greatly limits the ability to do complex scenario modeling.
Finally, name-to-object mappings (like table names to tables) are not tracked separately. There are no discrete `RENAME` events.
Comparison with Iceberg
Apache Iceberg suffers from the same limitations as Delta Lake.
AprioriDB’s explicit modeling and preservation of all three histories is fundamental:
- System History provides the undeniable, append-only record of operations.
- Database Name History enables accurate point-in-time queries (“show `Production` AS OF time T”) even across complex branching and publishing operations, reflecting what label pointed where.
- Database Version History (Provenance) provides the causal data flow necessary for true cell-level lineage, answering “Where did this specific value actually come from?”
By keeping these distinctions clean, explicit, and durable, AprioriDB lays the necessary groundwork for achieving high-fidelity traceability, auditable correction chains (like the `UNDO` examples), secure and provable data redaction (a future topic), and ultimately, trustworthy data systems.
Interlude: Look How Far We’ve Come
Let’s pause for a moment.
Before we dive even deeper,
before we layer on even more complexity,
let’s recognize what we’ve already built.
This isn’t a toy.
This isn’t just another database with prettier marketing.
We’ve built:
- A system that remembers everything — without loss.
- A system that replays exactly — with no ambiguity.
- A system that undoes safely — without lying about the past.
- A system that branches — staging futures without breaking the present.
We’ve created a foundation that treats truth as a first-class citizen.
We’ve refused to cut corners.
We’ve chosen trust over expediency.
Correctness over convenience.
Honor over half-measures.
Causality is the key to lineage.
We now have a better understanding of causality: anything related to time is related to causality.
We still need to work toward an ironclad model of time. We’re close, but we’re not done yet.
11. Beyond UNDO: Redo and Amendment
`UNDO` is a powerful tool, but it’s not the only tool we need.
You can `UNDO` a transaction immediately after it’s committed. But it only works if you catch the mistake right away, before performing any other transactions.
Sometimes we don’t realize that we’ve made a mistake until well after the fact.
Maybe it’s been days, or even weeks. But time has marched on. People have since made changes to the data.
Let’s start with the next simplest example: `UNDO` of something multiple transactions ago.
Advanced UNDO
We begin with a system that has undergone an `INSERT` followed by an `UPDATE`.
We realize that the `INSERT` was a mistake. We want to `UNDO` it. But we’ve already made the `UPDATE`, and we want to keep it.
How do we proceed?
Here’s a simplified diagram, showing just the transaction log:
We add a new transaction: `UNDO Entry 2`.
The system log now looks like this:
How do we reason about what the inside of the database looks like at this point?
A natural approach is to consider doing two separate `UNDO`s, followed by the original `UPDATE`.
Let’s extend the diagram to show what the inside of the system looks like at each point on the system timeline.
Each of the `Histories` reflects the logical sequence of additive-only operations at each system time.
- At `History @ T1`, only `INITIAL` has occurred.
- At `History @ T2`, `INITIAL` and `INSERT Production` have occurred.
- At `History @ T3`, `INITIAL`, `INSERT Production`, and `UPDATE Production` have occurred.
- At `History @ T4`, only `INITIAL` and `UPDATE Production` have occurred.
Note how the history at time 4 is equivalent to having removed the `INSERT` entirely.
Time travel is preserved. We can still see the state at time 3.
Materializing the state leads us naturally to this forked view of transaction history:
Implicitly, we are creating a fork of the log.
Unlike the top-level system transaction log, `UNDO` and other amendments don’t exist in this `Amended Log`.
`State @ Time 4` is equal to a system that has undergone only `INITIAL` and `UPDATE`, without `INSERT` in between. This is represented by `Entry 2a` in the `Amended Log`.
Let’s fill in the rest of the diagram:
Note that there is now a `Names @ T2a` which has as its history `Names @ T1`. The retroactive `UNDO` has taken effect.
Amended Log
The creation of an `Amended Log` gives us a clear framework for thinking about these operations that attempt to manipulate system time.
Remember that system time is linear. `UNDO` operations are added to the end of the transaction log.
This preserves system time travel.
To reason about this, we introduce the notion of first-order transactions vs. second-order meta-transactions.
- First-Order: regular transactions that manipulate the system directly, like `INSERT`, `UPDATE`, or `DELETE`.
- Second-Order: meta-transactions that manipulate system history instead of directly working on the system.
Because these second-order meta-transactions manipulate system history, we introduce an `Amended Log` that consists of only first-order regular transactions.
`UNDO`, as a second-order operation, manipulates the `Amended Log`, as shown in our previous example.
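Here is one way the derivation of the Amended Log could look, sketched in Python. The names (`Entry`, `undo_of`, `amended_log`) are hypothetical; this is an illustration of the first-order/second-order split described above, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    seq: int
    statement: str
    undo_of: Optional[int] = None   # set only on second-order UNDO entries

def amended_log(system_log):
    """Derive the Amended Log: only first-order transactions, with undone ones removed.
    The system log itself is never rewritten, so system time travel is preserved."""
    undone = {e.undo_of for e in system_log if e.undo_of is not None}
    return [e for e in system_log if e.undo_of is None and e.seq not in undone]

# The Advanced UNDO example: INITIAL, INSERT, UPDATE, then "UNDO Entry 2".
system_log = [
    Entry(1, "INITIAL"),
    Entry(2, "INSERT Production"),
    Entry(3, "UPDATE Production"),
    Entry(4, "UNDO", undo_of=2),
]
print([e.statement for e in amended_log(system_log)])
# ['INITIAL', 'UPDATE Production']  -- i.e. Entry 1 followed by Entry 2a, the history at T4
```

The system log keeps every entry, including the `UNDO` itself; the Amended Log is just a derived reading of it.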
Beyond Bitemporality
At this point, our system is not merely bitemporal. There’s a linear, append-only system timeline, which renders and produces databases with their own effective timelines.
But what is the timeline of database names?
In the previous example, the `Production` database name at T3 is part of a `Names @ T3` collection, which has as its prior name history the `Names @ T2` collection.
But the `Production` database name at T4 is part of a `Names @ T2a` collection, which has as its prior name history the `Names @ T1` collection.
If we ask the question, “what is in the past of `Production` at T4,” do we mean what `Production` resolves to at system time T3, or are we referring to the history of `Names @ T2a`?
Time Within Time
In this case, we have something that is not quite system time and not quite effective time.
We have a branching time dimension that sits between.
We’ve graduated beyond mere bitemporality into a framework and process for thinking about meta-time.
Time is change. Astronomical time is measured by the movement of celestial bodies relative to our position here on Earth. Atomic time is measured by counting the oscillations of reference atoms, such as cesium. We only know time has passed because something in the world has changed.
AprioriDB, as a virtual world, is no different. Time doesn’t move forward unless a write transaction side-effects the system into a new state.
(We’ll come to the complexities of wall-clock time usage, like the `NOW()` function, later. A sophisticated model of time is necessary for understanding things like scheduled operations and alerts.)
With bitemporality, we see a clear subordinate relationship between effective time and system time:
Effective time is contained within system time.
What we have exposed is that system time can contain another dimension of time - the vector of change of `Database Names History` - which then contains and refers to effective time - the vector of change of `DB Versions`.
This setup requires us to craft a formal model of time within time, one that can be recursively nested to an arbitrary depth.
Thus far, we haven’t gotten any deeper than `DB Versions` as a whole.
But databases contain schemas, which have names, and schemas contain tables, which have names. Each of these has its own vector of change.
Having a robust model of meta-time at each level allows us to very precisely understand provenance and lineage, especially as we get to cell-level lineage.
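To give “time within time” a concrete shape, here is a deliberately tiny sketch in Python. The names `Timeline` and `Tick` are mine, not AprioriDB’s; the point is only that a timeline is a vector of change whose entries may each carry a nested timeline, and that the nesting can recurse to arbitrary depth: system time containing name history containing effective time, and eventually schema- and table-level vectors of change.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Tick:
    label: str                            # e.g. "T2" or "Names @ T2a"
    state: Any                            # whatever changed at this point
    inner: Optional["Timeline"] = None    # a nested vector of change, if any

@dataclass
class Timeline:
    name: str                             # e.g. "system time", "names history", "effective time"
    ticks: list = field(default_factory=list)

    def depth(self) -> int:
        """How many layers of time-within-time hang off this timeline."""
        inner = [t.inner.depth() for t in self.ticks if t.inner is not None]
        return 1 + (max(inner) if inner else 0)

# System time -> names history -> effective time (DB Versions), nested three levels deep.
effective = Timeline("effective time", [Tick("DB Version 1", "v1"), Tick("DB Version 2", "v2")])
names = Timeline("names history", [Tick("Names @ T2", {"Production": 2}, inner=effective)])
system = Timeline("system time", [Tick("T2", "Entry 2", inner=names)])
print(system.depth())   # 3
```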
Let’s pause here to really take in this conceptual shift.
We’re entering abstract territory that is difficult to illustrate, even with detailed diagrams, because they are rendered in two dimensions.
As we proceed, we’ll go slowly to make sure we each have a firm grasp on the concepts of time, meta-time, change, and change of change in order to get all the way to cell-level lineage.
For now, let’s finish up this discussion of system time manipulation by talking about amendments.
Amendment
All of the same principles that allowed for advanced, retroactive `UNDO` also allow for `AMEND`.
To amend is to replace a prior operation with a different operation.
For example, we can replace the `INSERT` `Entry 2` transaction in the previous example with a transaction that consists of a `DELETE` followed by a different `UPDATE`.
Applying our principle of meta-time to produce a new `History`, we can easily figure out what that new system version looks like, in a way that preserves system history for the purposes of time travel along the system time dimension.
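Continuing the same toy model as the Amended Log sketch above (again with hypothetical names, not the real engine), an `AMEND` differs from a retroactive `UNDO` only in what happens to the target entry when the amended view is derived: it is replaced rather than removed.

```python
def apply_amendments(system_log, amendments):
    """system_log: list of (seq, statement); amendments: {target_seq: [replacement, ...]}.
    Derives an amended view; the underlying system log is never modified."""
    result = []
    for seq, statement in system_log:
        if seq in amendments:
            result.extend(amendments[seq])   # AMEND: substitute the replacement statements
        else:
            result.append(statement)
    return result

system_log = [(1, "INITIAL"), (2, "INSERT Production"), (3, "UPDATE Production")]
# AMEND Entry 2: replace the INSERT with a DELETE followed by a different UPDATE.
amendments = {2: ["DELETE ...", "UPDATE ... (revised)"]}
print(apply_amendments(system_log, amendments))
# ['INITIAL', 'DELETE ...', 'UPDATE ... (revised)', 'UPDATE Production']
```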
12. Merge, Rebase, Cherry-Picking, and Effective Time Manipulation
`UNDO` and `AMEND` both target the system timeline.
They manipulate system time itself by setting up new `Histories` that each correspond to `Amended Log` entries.
But the familiar Git-like operations of merge, rebase, and cherry-picking don’t manipulate system time.
Instead, these take two effective time branches, in the form of `DB Versions`, and attempt to create a new `DB Version` that combines all of the effective time history of their inputs.
Git MERGE Illustrated
Let’s step back and consider what `MERGE` means in the context of Git:
- Branch `main` is pointing to commit `abc`.
- We want to merge branch `develop` into branch `main`.
- Branch `develop` is pointing to commit `def`.
- Both commits `abc` and `def` branched off at a common effective history point represented by commit `123`.
Let’s do the `MERGE`:
- Git takes commit `abc` (branch `main`’s current reference) and commit `def` (branch `develop`’s current reference), performs a three-way merge using commit `123` as the common ancestor, and creates a new merge commit `ghi` that combines the changes from both sides. The new commit `ghi` records both `abc` and `def` as its parents.
What do we see now?
- Commit `abc` still exists.
- Commit `def` still exists.
- Only new commits are created.
- The `main` branch now points to commit `ghi`.
- The `develop` branch still exists, and still points to commit `def`.
The word `MERGE` isn’t quite right… Merging implies the destruction of the raw ingredients.
But here, we’ve only created new commits.
The Git reflog contains the history of the `main` branch reference, which before the `MERGE` was pointing to commit `abc`.
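A small sketch of that bookkeeping in Python (illustrative only, not Git’s actual storage format): the merge produces exactly one new commit with two parents, advances only the `main` pointer, and leaves every existing commit and the `develop` pointer where they were. The reflog keeps the pointer’s own history, much like AprioriDB’s Database Name History.

```python
from dataclasses import dataclass, field

@dataclass
class Commit:
    id: str
    parents: list = field(default_factory=list)

commits = {
    "123": Commit("123"),                      # common ancestor (merge base)
    "abc": Commit("abc", parents=["123"]),     # tip of main
    "def": Commit("def", parents=["123"]),     # tip of develop
}
branches = {"main": "abc", "develop": "def"}
reflog = {"main": ["abc"], "develop": ["def"]}

def merge(into, other, new_id):
    """Create a merge commit with both tips as parents and advance only the target branch."""
    commits[new_id] = Commit(new_id, parents=[branches[into], branches[other]])
    branches[into] = new_id
    reflog[into].append(new_id)                # the reflog remembers where main used to point

merge("main", "develop", "ghi")

assert "abc" in commits and "def" in commits           # nothing was destroyed
assert branches == {"main": "ghi", "develop": "def"}   # only main moved
assert commits["ghi"].parents == ["abc", "def"]        # the new commit ties both lines together
assert reflog["main"] == ["abc", "ghi"]                # pointer history, like the Name History
```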
AprioriDB MERGE Illustrated
We want to do the exact same thing, except with `DB Versions`.
TBD.
13. Transactional Language
Up until now, we’ve taken single statement transactions for granted.
Each statement advances the system timeline by exactly one step.
In a traditional relational database, explicit transactions are available. Before we proceed, we need a clear understanding of how transactions work and are modeled in the system.
Explicit Transactions
The user issues a `BEGIN` statement, which does not advance the system timeline. Instead, it creates a little bubble universe where statements can be accumulated. Only with a `COMMIT` statement is the bubble universe merged into the system timeline as one atomic step.
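A minimal sketch of the bubble universe idea, in hypothetical Python (the class and method names are mine, not the engine’s): `BEGIN` opens a buffer that accumulates statements without touching the system log, and `COMMIT` folds the whole buffer into the log as one atomic step.

```python
class System:
    def __init__(self):
        self.log = []                # each log entry is one atomic step (possibly many statements)

    def begin(self):
        return Bubble(self)          # does NOT advance the system timeline

class Bubble:
    """A bubble universe: statements accumulate here until COMMIT."""
    def __init__(self, system):
        self.system = system
        self.statements = []

    def execute(self, statement):
        self.statements.append(statement)        # staged, not yet visible to the system

    def commit(self):
        self.system.log.append(self.statements)  # merged into system time as ONE step
        self.statements = []

db = System()
tx = db.begin()
tx.execute("INSERT Production ...")
tx.execute("UPDATE Production ...")
tx.commit()
print(len(db.log))   # 1 -- the system timeline advanced by exactly one step
```

Auto-commit single statement transactions (described below) are then just the degenerate case: a one-statement bubble that commits immediately.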
[Illustrate transaction with UML diagram, showing how statement time is handled as a fork from the system timeline.]
Concurrent Transactions
Conflict Resolution
Single Statement Transactions
Single statement transactions are created in auto-commit mode when a statement is issued to the system outside of an explicit transaction. Internally, the same bubble universe mechanism is used.
14. UNDO and AMEND as Statements
We’ve shown `UNDO` and `AMEND` as new kinds of statement.
Because they are new and so different, we have to define their precise semantics.
UNDO as a Single Statement Transaction
`UNDO` as a single statement transaction does the obvious thing - it undoes the previous transaction.
It does this by advancing the system timeline by one step, creating a new bubble universe, and then executing the `UNDO` in that bubble universe.
What happens if we say `UNDO` twice in a row?
Does the second `UNDO` undo the first `UNDO`? Or does it undo the transaction that came before the transaction undone by the first `UNDO`?
[Illustrate both alternatives.]
UNDO inside of an Explicit Transaction
Remember that single statement transactions use the same bubble universe mechanism as explicit transactions.
But what about `UNDO` inside of an explicit transaction?
Well, if the UNDO
is the first statement in the transaction, it reaches into the
In the previous example, we used AMEND
The work ahead — cell-level lineage, causal provenance, secure redaction — will be even deeper.
But what we’ve already accomplished?
It’s rare.
It’s real.
And it’s worth celebrating.
Take a breath.
Feel it.
Because this —
this is the kind of foundation you can build a world on.