
Building a Formal Verification Tool from Scratch: Part 2 (The Virtual Machine)



In Part 1, we built a solver that used breadth-first search to systematically generate and evaluate every possible state transition for the water jug puzzle. We defined the rules of the system (fill, empty, pour) and watched the solver discover the solution by exploring all valid sequences of actions. The breakthrough was seeing how the solution emerged from the rules themselves rather than being explicitly programmed.

But we cheated.

We built a Water Jug Solver, not a Model Checker. If we wanted to verify a different system—say, whether a distributed lock manager can deadlock or whether a two-phase commit protocol maintains consistency—we would have to delete our code and rewrite the State struct and the next_states function from scratch.

To build a real formal verification tool, we need an engine that stays exactly the same regardless of the problem we feed it. We need to move from hardcoded logic to interpreted data—exactly how a virtual machine executes bytecode without knowing what program you’ll run on it. This separation between the engine (which stays constant) and the model (which changes) is what makes formal verification powerful, and it’s what we’re building now.1

In this post, we’re tearing out the hardcoded logic from Part 1 and replacing it with a custom Virtual Machine (VM) implemented in Odin. We’ll discover why we need it, build it in phases, and encounter some subtle traps about memory that can plague systems programmers. Along the way, you’ll see how thinking like a language designer changes the architecture of your code.

Here’s what we’ll cover:

  • Why hardcoded states won’t scale and how register files solve this
  • Building an instruction set that can express conditional logic
  • The difference between value identity and memory identity
  • How tagged unions let us safely mix types in a statically typed language
  • Putting it all together: running the same breadth-first search (BFS) engine with interpreted bytecode

A language that doesn’t affect the way you think about programming is not worth knowing. — Alan Perlis

Why Hardcoded States Won’t Scale

To understand why we need a virtual machine, let’s examine what we built in Part 1. Our state representation looked like this:

// The "Hardcoded" Way
State :: struct {
    small: Gallons,
    big:   Gallons,
}

This is fast and readable, but completely inflexible. A generic solver doesn’t know what “small” or “big” means. It doesn’t know if a system has two variables or twenty.

If we wanted to model a different system tomorrow (say, a traffic light controller with states for red, amber, and green lights), we’d need to delete this struct and write a new one:

State :: struct {
    northSouth: LightColor,
    eastWest:   LightColor,
    timer:      int,
}

And then rewrite the entire next_states function, the solver, and everything else that touches the state.

This defeats the entire purpose. TLA+ doesn’t require you to recompile the checker for every new specification you write. Neither should we.

Making It Generic: The Register File

The solution is to replace named fields with numbered slots, a pattern borrowed from CPU architecture. Instead of a struct with small and big fields, we’ll use an array where slot 0 holds the small jug’s value and slot 1 holds the big jug’s value. The solver doesn’t need to understand what these slots represent; it just executes operations on whatever data they contain.

In a CPU, you don’t have variables named “counter” or “userAge”. You have numbered registers: R0, R1, R2. Programs reference these registers by index, not by name. The CPU doesn’t care what you’re storing; it just knows “read from R3, add 5, write to R7.”2

We can do the same thing. Instead of named fields like small and big, we use a flat array of numbered slots. Slot 0 is the small jug. Slot 1 is the big jug. If we were modelling a distributed system, slot 0 might be “ServerA’s timestamp” and slot 1 might be “ServerB’s lock status”.

The VM doesn’t care. It just sees slots.

But here’s the problem: what type should those slots hold? In TLA+, a variable can be an integer, a boolean, or a set. In a statically typed language like Odin, we can’t just throw mixed types into an array and hope for the best.

This is where tagged unions come in.

Tagged Unions: Safe Type Mixing

A tagged union (also called a discriminated union) lets us store different types in the same memory location whilst remembering which type is currently stored. Think of it as a typed box: it can hold either an integer or a boolean, and there’s a label telling you which one is inside.

// The Atomic Unit of Data
Value :: union {
    int,
    bool,
}

When you store a value, the union “tags” it: “This is an int.” When you read it back, you can check the tag before using it. If you try to treat a boolean as an integer, the program crashes with a clear error instead of silently corrupting memory.
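
Here is that behaviour in miniature (a quick sketch; the comma-ok form is Odin’s non-crashing variant of the type assertion):

v: Value = 3            // the union now holds an int, and remembers it
n, ok := v.(int)        // comma-ok form: n == 3, ok == true
b, is_bool := v.(bool)  // is_bool == false; no crash, b is just false
// bad := v.(bool)      // the bare assertion form would panic: the tag says int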

Now we can build our generic state:

// The Generic State
State :: struct {
    registers: [dynamic]Value,
}

The [dynamic]Value syntax means we’re creating a dynamically-sized array (it can grow or shrink as needed) where each element is of type Value. This gives us the flexibility to model systems with two variables or two hundred.

With this single change (replacing named fields with a typed register file), our solver becomes universal. The water jug problem needs two registers. A distributed consensus algorithm might need twenty. The solver doesn’t care; it executes operations on whatever registers exist.
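
On the model side we can still give those slot numbers names; the engine never sees them, only raw indices. Here are the constants the later snippets assume (values from the puzzle: a 3-gallon jug, a 5-gallon jug, and a target of 4):

SMALL_REG     :: 0 // slot 0 holds the small jug's contents
BIG_REG       :: 1 // slot 1 holds the big jug's contents
SMALL_JUG_CAP :: 3
BIG_JUG_CAP   :: 5
TARGET        :: 4 // goal: the big jug holds exactly 4 gallons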

Representing State Changes: The Instruction Set

We have a generic way to represent state (the register file), but we still need a generic way to represent changes to that state.

In Part 1, we hardcoded the changes:

// Hardcoded: Fill the big jug
append(&states, State{s.small, Gallons(BIG_JUG_CAP)})

This won’t work anymore. Our VM doesn’t know what “BIG_JUG_CAP” means. It doesn’t know there’s a “big jug” at all.

We need to express the same idea using only register indices and generic operations:

  • Human version: “Set the big jug to 5 gallons”
  • VM version: “Set register 1 to the value 5”

That second form is an instruction, a generic command that works regardless of what the registers represent. It’s the fundamental unit of change in our system. We need a way to define and execute instructions.

Building the VM: First Attempt

We started simple with three instructions: ASSIGN, ADD, SUB. That’s enough to move numbers around and do basic arithmetic.

OpCode :: enum {
    ASSIGN, // register[target] = constant
    ADD,    // register[target] = register[source] + constant
    SUB,    // register[target] = register[source] - constant
}

Instruction :: struct {
    op:     OpCode,
    target: int,
    source: int,
    amount: int,
}
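
With this encoding, the VM version of the rule from earlier (“set register 1 to the value 5”) is just a struct literal, with fields in order op, target, source, amount:

// "Set register 1 to the value 5" (source is unused by ASSIGN)
set_big := Instruction{ .ASSIGN, 1, 0, 5 }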

The execute function is straightforward: take a state pointer and an instruction, modify the state.

execute :: proc(s: ^State, instruction: Instruction) {
    switch instruction.op {
    case .ASSIGN:
        s.registers[instruction.target] = instruction.amount
    case .ADD:
        val := s.registers[instruction.source].(int)
        s.registers[instruction.target] = val + instruction.amount
    case .SUB:
        val := s.registers[instruction.source].(int)
        s.registers[instruction.target] = val - instruction.amount
    }
}

That .(int) syntax extracts the integer from the tagged union. If the value isn’t actually an integer, the program crashes with a clear error rather than silently corrupting memory.3

I wrote a quick test to verify the implementation so far: set register 0 to 3, set register 1 to 5.

state := State{ make([dynamic]Value, 2) }
state.registers[0] = 0 // a fresh union slot is nil, not 0, so seed
state.registers[1] = 0 // both registers with integer zeroes first
execute(&state, Instruction{ .ASSIGN, 0, 0, 3 })
execute(&state, Instruction{ .ASSIGN, 1, 0, 5 })
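
The printing code isn’t shown here; a small helper along these lines produces the output below (a sketch, assuming import "core:fmt"):

print_registers :: proc(label: string, s: State) {
    fmt.printf("%s [", label)
    for v, i in s.registers {
        if i > 0 do fmt.print(",")
        fmt.printf(" %v", v)
    }
    fmt.println(" ]")
}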

Running it produced exactly what I expected:

Initial: [ 0, 0 ]
Running the program...
Final:   [ 3, 5 ]

It worked, and I could see the state changing as expected.4

I stared at the output for a moment, feeling satisfied, before worrying about the next step.

The Problem: It Just Follows Orders

This VM can execute instructions, but it can’t express conditional logic: the “only do this if the state allows it” rules that make state transitions meaningful.

Here’s what it does: you give it instruction 1, it executes. You give it instruction 2, it executes. It blindly follows commands. There’s no way to say “only fill the small jug if it’s not already full.”

Remember Part 1’s next_states function? It had implicit conditionals baked into the logic:

// Rule 1: We could fill the small jug
append(&states, State{Gallons(SMALL_JUG_CAP), s.big})

We’re really saying: “IF the small jug isn’t full, THEN fill it.” But we never wrote that explicitly. The rule only made sense when the jug wasn’t already full, so we just didn’t call it in those cases.

In bytecode, there’s no if statement. There’s just a sequence of instructions. If we translate “Fill Small Jug” directly into our current instruction set, we get:

{ .ASSIGN, 0, 0, 3 }  // Set register 0 to 3

That will always set the jug to 3, even if it’s already full. Even if it already contains 3. That’s not a rule about what’s possible. That’s just a command to blindly execute.

The breadth-first search (BFS) engine from Part 1 could explore possibilities, but this VM can’t generate them. It can’t ask questions about the state. It can’t decide whether an action is valid.

We need a way for instructions to fail.

The Breakthrough: Instructions That Can Fail

In TLA, every action has two parts: guards (preconditions that must pass) and effects (the changes that happen if the guards succeed). An action like “withdraw money from ATM” has a guard: balance ≥ withdrawal amount. If that check fails, the action doesn’t execute. You don’t end up with negative money in your account.

This isn’t unique to TLA. It’s fundamental to how we think about state machines. We need our VM to work the same way.

So let’s add three new opcodes:

OpCode :: enum {
    // Guards (Checkers) - If these fail, the action aborts
    LT,     // Less Than
    EQ,     // Equal
    NEQ,    // Not Equal

    // Effects (Mutators) - Change the state
    ASSIGN,
    ADD,
    SUB,
}

Now the execute function could return a boolean: true means “keep going,” false means “abort this action.”

execute :: proc(s: ^State, instruction: Instruction) -> bool {
    switch instruction.op {
    case .LT:
        val := s.registers[instruction.target].(int)
        if !(val < instruction.amount) do return false

    case .ASSIGN:
        s.registers[instruction.target] = instruction.amount

    // ... other opcodes (sketched just below)
    }
    return true
}
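
For completeness, the elided cases follow the same two patterns: guards compare and may bail out, effects mutate and always return true (a sketch consistent with the first version above):

    case .EQ:
        val := s.registers[instruction.target].(int)
        if val != instruction.amount do return false

    case .NEQ:
        val := s.registers[instruction.target].(int)
        if val == instruction.amount do return false

    case .ADD:
        val := s.registers[instruction.source].(int)
        s.registers[instruction.target] = val + instruction.amount

    case .SUB:
        val := s.registers[instruction.source].(int)
        s.registers[instruction.target] = val - instruction.amount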

Now we could translate the rule “Fill the Small Jug” into bytecode that actually makes sense:

Human Rule: “If the small jug holds less than 3 gallons, make it hold 3 gallons.”

Bytecode:

make_action("FillSmall", []Instruction{
    { .LT,     0, 0, 3 }, // Guard: Is register 0 less than 3?
    { .ASSIGN, 0, 0, 3 }, // Effect: Set register 0 to 3
})
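
We haven’t defined make_action; a minimal version just pairs a name with its instruction list (a sketch; footnote 8 notes the final code stores instructions in fixed-size arrays via core:container/small_array instead):

Action :: struct {
    name: string,
    code: []Instruction,
}

// assumes: import "core:slice"
make_action :: proc(name: string, code: []Instruction) -> Action {
    return Action{ name, slice.clone(code) }
}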

The BFS loop from Part 1 could now try every action. If the guard failed, that action was impossible from this state. If the guard passed, the effect would execute, creating a new state.

We have successfully separated the model (the list of instructions) from the engine (the code that runs them).

One More Thing: The POUR Intrinsic

To handle the water jug pouring logic cleanly, we added one more opcode:

POUR,   // Pour from source → target, capped by capacity

The implementation calculates how much water can actually be transferred (the minimum of what’s in the source jug and how much space is in the destination):

case .POUR:
    value_source := s.registers[instruction.source].(int)
    value_target := s.registers[instruction.target].(int)

    space_in_target := instruction.amount - value_target
    transfer := min(value_source, space_in_target)

    s.registers[instruction.source] = value_source - transfer
    s.registers[instruction.target] = value_target + transfer
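
For example, pouring a full big jug into an empty small jug means target = SMALL_REG, source = BIG_REG, amount = 3 (the small jug’s capacity): space_in_target = 3 - 0 = 3, transfer = min(5, 3) = 3, and the registers go from [0, 5] to [3, 2]. No overflow, no negative water.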

This handles the water transfer calculation cleanly as a built-in operation.5

The Design Decision: Fingerprinting States

Before we run this new VM with the BFS solver, we have to solve a problem I’ve been thinking about since we started building with dynamic arrays.

In Part 1, our State was a simple struct of two integers. If you generated two states both containing [0, 0], Odin treated them as equal. Simple. The solver could detect “I’ve been here before” by checking if that state existed in the visited map.

But now our State contains a [dynamic]Value. Internally, a dynamic array is a struct containing three things: a pointer to the data, a length, and a capacity:

// This is roughly what [dynamic]Value looks like internally
// (the real runtime struct also carries an allocator)
struct {
    data: ^Value, // Pointer to the elements in memory
    len:  int,
    cap:  int,
}

The trap here is that if we generate state S1 with contents [0, 0] and later generate state S2 with contents [0, 0], they will allocate their data at different memory addresses.

  • S1.data points to address 0xAAA
  • S2.data points to address 0xBBB

If we try to use State directly as a key in our visited map, Odin compares the raw struct fields. Since our state contains a dynamic array (which internally holds a pointer to memory), the map compares the addresses where the data lives, not the values inside.

The map sees 0xAAA ≠ 0xBBB and concludes these are different states.

But they’re not different. They both contain [0, 0]. The solver fails to detect that we’ve been here before. It goes into an infinite loop, re-exploring the same states over and over, slowly consuming all available memory until the operating system kills the process.

This is the distinction between value identity and memory identity. We care about what’s in the jugs, not where the jug data lives in RAM.

The Solution: Hashing Contents

The solution is called fingerprinting.6 We walk through the register array and hash the actual contents into a single number—a 64-bit fingerprint that uniquely represents “what’s in the state.”

Fingerprint :: u64

fingerprint :: proc(s: State) -> Fingerprint {
    h := u64(0xcbf29ce484222325) // FNV-1a offset basis

    for v in s.registers {
        switch val in v {
        case int:
            x := u64(val)
            h = (h ~ x) * 0x100000001b3
        case bool:
            x := u64(val ? 1 : 0)
            h = (h ~ x) * 0x100000001b3
        }
    }
    return Fingerprint(h)
}

For now, we’re using FNV-1a, a fast, simple hash function that mixes bits well to avoid collisions.7

Now, instead of storing map[State]State, we store map[Fingerprint]State. Even if two states live at different addresses in memory, if they contain [0, 0], they generate the same fingerprint, and the cycle is detected. This is one of those subtle bugs that’s invisible until it destroys your program. By thinking through the memory model upfront, we avoid hours of debugging later.
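
A quick sanity check of the property we care about (a sketch): two states with different allocations but identical contents must fingerprint identically.

s1 := State{ make([dynamic]Value, 2) }
s2 := State{ make([dynamic]Value, 2) }
s1.registers[0] = 0; s1.registers[1] = 0
s2.registers[0] = 0; s2.registers[1] = 0

assert(raw_data(s1.registers) != raw_data(s2.registers)) // memory identity differs
assert(fingerprint(s1) == fingerprint(s2))               // value identity matches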

Putting It All Together

With the data structures in place, the core of our tool becomes remarkably simple. The BFS loop from Part 1 stays almost identical. We just swap out State for Fingerprint in the maps:

// Core BFS remains the same
visited_hashes := make(map[Fingerprint]Fingerprint)
state_storage := make(map[Fingerprint]State)

start_hash := fingerprint(initial_state)
visited_hashes[start_hash] = start_hash
state_storage[start_hash] = initial_state

queue := make([dynamic]Fingerprint)
append(&queue, start_hash)
for len(queue) > 0 {
    current_hash := queue[0]
    ordered_remove(&queue, 0)
    current_state := state_storage[current_hash]

    // Check if we've reached the goal
    if current_state.registers[BIG_REG].(int) == TARGET {
        // Found it!
        break
    }

    // Try every action
    for action in actions {
        next_state := clone_state(current_state)
        possible := true

        // Run all instructions
        for instruction in action.code {
            if !execute(&next_state, instruction) {
                possible = false
                break
            }
        }

        if possible {
            h := fingerprint(next_state)
            if h not_in visited_hashes {
                visited_hashes[h] = current_hash
                state_storage[h] = next_state
                append(&queue, h)
            }
        }
    }
}
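
One subtlety: clone_state must deep-copy the registers. Copying the State struct alone copies the dynamic array’s internal pointer, so parent and child would mutate the same memory. A minimal sketch:

clone_state :: proc(s: State) -> State {
    registers := make([dynamic]Value, len(s.registers))
    copy(registers[:], s.registers[:])
    return State{ registers }
}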

For each state, we try every action. The VM executes the instructions and if any guard fails, we abort that action and try the next one. If all instructions succeed, we’ve generated a new valid state.8

Notice what’s absent from this loop: there’s no logic specific to water jugs. No mention of gallons, capacities, or pouring. The actions are just data being fed into the engine. This is the transformation we were aiming for.

Time For a Test Drive

When I ran the new Generic Solver, I got exactly the same output as in Part 1:

Generic Solver Started (Optimised)...
SOLVED!
[ 0, 0 ]
   |
   v
[ 0, 5 ]
   |
   v
[ 3, 2 ]
   |
   v
[ 0, 2 ]
   |
   v
[ 2, 0 ]
   |
   v
[ 2, 5 ]
   |
   v
[ 3, 4 ]

Same solution. Same path. Same seven states from [0, 0] to [3, 4].

But look at what changed in the code.

The only place water jugs exist in this codebase is in the actions array, the data we feed into the VM:

actions := [?]Action {
    make_action("FillSmall", []Instruction{
        {.LT, SMALL_REG, 0, SMALL_JUG_CAP},
        {.ASSIGN, SMALL_REG, 0, SMALL_JUG_CAP},
    }),
    make_action("FillBig", []Instruction{
        {.LT, BIG_REG, 0, BIG_JUG_CAP},
        {.ASSIGN, BIG_REG, 0, BIG_JUG_CAP},
    }),
    // ... 4 more actions
}
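
The remaining four actions (emptying each jug, pouring in each direction) follow the same guard-then-effect shape. Emptying and pouring might look like this (a sketch; the exact guards are my assumption, not necessarily the repo’s bytecode):

    make_action("EmptyBig", []Instruction{
        {.NEQ, BIG_REG, 0, 0},    // Guard: big jug isn't already empty
        {.ASSIGN, BIG_REG, 0, 0}, // Effect: empty it
    }),
    make_action("PourBigToSmall", []Instruction{
        {.NEQ, BIG_REG, 0, 0},                      // Guard: something to pour
        {.POUR, SMALL_REG, BIG_REG, SMALL_JUG_CAP}, // Effect: pour, capped by small's capacity
    }),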

Everything else (the BFS loop, the execute function, the fingerprinting logic) is completely generic.

Tomorrow, we could delete those six actions and replace them with actions for a distributed lock manager. We could model a two-phase commit protocol, and the engine would stay the same.

We’ve successfully separated the what from the how. The water jug problem isn’t hardcoded in procedures anymore. It’s just data flowing through a general-purpose engine.

This is what Leslie Lamport meant when he said to “think above the code”. We’re not writing solutions anymore. We’re defining problems and letting the engine discover solutions or find bugs.

What’s Next

We have a powerful backend, but constructing these models manually is tedious. Writing {.LT, 0, 0, 3}, {.ASSIGN, 0, 0, 3} for every action isn’t user-friendly. Imagine trying to model a Raft consensus algorithm like this. You’d need hundreds of lines of bytecode, and it would be nearly impossible to read or maintain.

In Part 3, we’ll build the compiler. We’ll design a custom Domain-Specific Language (DSL) for TLA, write a lexer and parser, and teach our tool to translate high-level text into the bytecode we defined today.

Instead of this bytecode:

make_action("FillSmall", []Instruction{
    { .LT, 0, 0, 3 },      // Guard: register 0 less than 3?
    { .ASSIGN, 0, 0, 3 },  // Effect: set register 0 to 3
}),

We’ll write readable specifications:

action FillSmall {
    when small < 3
    then small := 3
}

The compiler will handle the translation—mapping small to register 0, converting < to the .LT opcode, and transforming := into .ASSIGN instructions. That’s when this stops being an experiment and starts being a tool you’d actually want to use.

You can check out the full code for this article on GitHub.


If you’re enjoying this deep dive into systems programming and formal methods, subscribe below to follow along as we build the compiler next. Between articles, I share shorter updates and experiments on Twitter. Many thanks to my friends Eze and Ola for their feedback and review of this article.

Notes & References

Footnotes

  1. Think of the Java Virtual Machine. It executes Java bytecode without knowing whether you’re running a banking system or an Android app. The JVM stays the same; only the bytecode changes. WebAssembly works in a similar way. Browsers execute Wasm bytecode from any language (Rust, C++, Go) without modification. We’re building a similar architecture for formal verification. ↩

  2. In CPU architecture, the register file is the array of registers that temporarily hold data during processing. ↩

  3. This is Odin’s type assertion syntax for safely extracting values from unions. ↩

  4. You can check out the complete code for this simple VM on GitHub. ↩

  5. In a mature compiler, POUR could be decomposed into MIN, ADD, and SUB instructions. We’re treating it as an intrinsic for now to keep the VM simple. We will tackle instruction decomposition in a future post when we build the compiler. ↩

  6. In computing, fingerprinting is the technique of creating a compact, unique representation of a larger piece of data. Just as human fingerprints uniquely identify people, data fingerprints (hashes) uniquely identify data structures. It’s used everywhere from detecting duplicate files to verifying downloaded software hasn’t been tampered with. ↩

  7. The “magic numbers” in the code (0xcbf29ce484222325 and 0x100000001b3) are mathematical constants from the FNV algorithm. ↩

  8. See the complete BFS implementation with fingerprinting and state storage. Note that we use core:container/small_array to store actions in fixed-size stack arrays (16 instructions max) rather than heap-allocating. This improves cache locality—the CPU can fetch the whole instruction list in one go—and removes allocation overhead. When executing millions of instructions, it adds up. If an action needs more than 16 instructions, we’d increase the size or switch to dynamic arrays. ↩
