
Building a Formal Verification Tool from Scratch: Part 3 (Rethinking the Foundation)



In Part 2, we successfully built a Virtual Machine. It could execute instructions, handle conditional guards, and solve the Water Jug problem by exploring the state space. It was logically correct, flexible, and easy to read.

Then I sat down to start building the compiler.

I looked at the code and realised my default choice of dynamic arrays and tagged unions could lead to performance pitfalls or complex logic later. Every time we cloned a state, we allocated new heap memory. Every time we compared two states, we chased pointers and checked types. Every time we popped from the queue, we shifted an array. Some of those were deliberate choices to get us moving.

So before we go further, we’re going to refactor. We’re going to revisit the shortcuts we took in Part 2, apply some lessons I learned from watching a talk on Data-Oriented Design (DOD)1, and establish clearer constraints before the complexity compounds.

Why Refactor Now?

You might wonder: if the current code works, why change it?

Because complexity is cumulative.

Right now, we have a working VM. Next, we’re adding a compiler (lexer, parser, type checker). After that, we’ll eventually add optimisations (state caching, symmetry reduction). Each of these builds on the foundation we’ve laid. If that foundation is needlessly complexβ€”pointer chasing, runtime type checking, O(N) operations lurking in loopsβ€”every layer we add could make debugging harder and the program slower.

By simplifying now, we make future implementation (and articles) easier to write and understand. The VM becomes something you can hold entirely in your head.

Think about the data layout and transformation, and the code often simplifies itself. β€” core tenet of Data-Oriented Programming

The Problem: A Web of Pointers

Let’s start with the State data structure we used in Part 2:

Value :: union { int, bool }

State :: struct {
    registers: [dynamic]Value,
}

This works, but every time we interact with state, we’re dealing with three layers of indirection:

  1. The State struct contains a pointer to the array data (registers).
  2. The array contains tagged unions (Value).
  3. Each union needs a type check before we can use the value (e.g. registers[0].(int)).

When I clone a state, I need to allocate new heap memory. When I compare two states, I need to walk through pointers and check types. These tiny allocations are wasted time in a solver running thousands or millions of iterations.2
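To make the cost concrete, here is roughly what cloning and comparing looked like under that layout (clone_state and states_equal are illustrative helpers, not the exact code from Part 2):

clone_state :: proc(s: State) -> State {
    // Every clone asks the allocator for a fresh backing array on the heap.
    cloned: State
    cloned.registers = make([dynamic]Value, len(s.registers))
    copy(cloned.registers[:], s.registers[:])
    return cloned
}

states_equal :: proc(a, b: State) -> bool {
    if len(a.registers) != len(b.registers) do return false
    for v, i in a.registers {
        // Each comparison has to match the union tag before touching the payload.
        switch x in v {
        case int:
            y, ok := b.registers[i].(int)
            if !ok || x != y do return false
        case bool:
            y, ok := b.registers[i].(bool)
            if !ok || x != y do return false
        }
    }
    return true
}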

Furthermore, every time we generated a new state in our solver, we cloned this array, which meant asking the memory allocator for a new chunk of heap space. In a loop exploring thousands of states, that’s thousands of trips to the allocator, leading to memory fragmentation that looks somewhat like this:

Heap (fragmented):
[State]    [State]        [State]  [State]     [State]
   ↓          ↓              ↓        ↓           ↓
  0x100     0x250          0x4A0    0x5E0       0x7C0

Each allocation lives at a random address.
Each jump to a random address risks a "cache miss",
stalling the processor while it fetches data from RAM.

What I want is something simple and predictable:

Heap (contiguous):
[State][State][State][State][State]
   ↓
  0x100

All states live in sequence. The CPU can pre-fetch the data it needs.
It is predictable and easier to reason about.

The Solution: Fixed-Size Registers

Here’s what I learned from that Data-Oriented Design talk: data structures should match their actual usage patterns, not theoretical possibilities.

Our VM doesn’t need registers that can grow to any size. It needs a fixed number of slots that hold integers or booleans. So let’s just… make that explicit:

REGISTER_SIZE :: 8

State :: struct {
    registers: [REGISTER_SIZE]i64,
}

We’ve made three deliberate choices here:

  1. Fixed Size: We replaced [dynamic] with [8].3
  2. Raw Integers: We replaced the Value union with i64.
  3. No Pointers: This struct contains no pointers. It is just Plain Old Data (POD).

This is simpler in every way:

No pointers. The array is embedded directly in the struct. No heap allocation, no indirection. When you have a State, you have the actual data.

No type tags. We use i64 for everything. A boolean is just 0 or 1. This removes an entire layer of runtime type checking.

Fixed size. If you need to clone a state, you copy 64 bytes. If you need to compare two states, you compare 64 bytes. No walking through dynamic structures.

Why 8 registers? An i64 is 8 bytes, so 8 registers Γ— 8 bytes = 64 bytes. This happens to fit perfectly into a standard CPU cache line,4 but more importantly, it’s easier to understand. A State is just 8 numbers. That’s it.
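Because State is now plain old data, cloning and comparing collapse into built-in operations. Here is a minimal sketch (the compile-time #assert and the demo procedure are just illustrations):

// A State is 8 registers Γ— 8 bytes each; check it at compile time.
#assert(size_of(State) == 64)

copy_and_compare_demo :: proc() {
    a: State
    a.registers[0] = 3

    b := a                              // value copy: 64 bytes, no heap, no pointers
    assert(a.registers == b.registers)  // flat comparison of all 8 registers
}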

Dropping the Tagged Union for Raw Bytes

You might notice we’re using i64 for everythingβ€”integers, booleans, and whatever else we might need. Isn’t this throwing away type safety?

I don’t think so. The VM shouldn’t be the layer that enforces types. That’s the compiler’s job.

The VM is a dumb, fast engine that moves bytes between registers. Just like a real CPU, it doesn’t know whether a value in a register is a character or a number; it just operates on bits. The compiler will ensure we never try to add a boolean to an integer, or pour water into a non-existent jug. By the time the compiler generates bytecode, it has already verified that the types are correct.

This separation of concerns means:

  • The VM stays simple and debuggable
  • The compiler handles all the complex type checking
  • We can easily inspect state during debugging since it’s just bytes

We’re trading runtime type safety for compile-time type safety. Since we’re building both systems, this is a deliberate choice.5

The most significant benefit, however, is value semantics. Unlike arrays in JavaScript, fixed arrays in Odin are values, not references. When we clone a state now, we’re simply copying 64 bytes on the stack:

// This is a memory copy of 64 bytes.
// No allocator. No heap. No pointers.
next_state := current_state

Rearranging Instruction Fields for Memory Efficiency

While refactoring State, I noticed the Instruction struct was loosely organised, with fields added in whatever order we thought of them:

Instruction :: struct {
    op:     OpCode,
    target: int,
    source: int,
    amount: int,
}

The Data-Oriented Design video mentioned that struct field order matters. Not just for performance (though that helps), but for memory layout efficiency. The way you arrange fields affects how the compiler pads the struct to meet alignment requirements.

Here’s what the old layout probably looks like in memory (assuming a 64-bit architecture):

Old Layout (from Part 2):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ OpCode  β”‚ pad (empty) β”‚ target (int)    β”‚ source  β”‚ amount  β”‚ TOTAL   β”‚
β”‚ 1 byte  β”‚ 7 bytes     β”‚ 8 bytes         β”‚ 8 bytes β”‚ 8 bytes β”‚ 32 bytesβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↑ wasted space to align the next 8-byte integer field

The compiler has to insert 7 bytes of padding after the 1-byte OpCode because int requires 8-byte alignment. We’re wasting 7 bytes per instruction.
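You don’t have to take the diagram’s word for it; two compile-time checks against the Part 2 definition above confirm the waste (a small sketch, assuming the 1-byte OpCode shown in the table):

// 1-byte OpCode + 7 bytes of padding + three 8-byte ints = 32 bytes.
#assert(size_of(Instruction) == 32)
#assert(offset_of(Instruction, target) == 8)  // pushed past the padding after `op`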

I rearranged it to minimise waste:

OpCode :: enum u8 { ... }

Instruction :: struct {
    amount: i64,    // 8 bytes
    op:     OpCode, // 1 byte
    target: u8,     // 1 byte
    source: u8,     // 1 byte
    // 5 bytes of implicit padding
}

New layout in memory:

New Layout:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ amount (i64)    β”‚ OpCode  β”‚ target β”‚ source β”‚ pad   β”‚ TOTAL   β”‚
β”‚ 8 bytes         β”‚ 1 byte  β”‚ 1 byte β”‚ 1 byte β”‚ 5     β”‚ 16 bytesβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                               ↑ padding at the end

By putting the largest field first, then packing the smaller fields together, we:

  • Halved the struct size (32 bytes β†’ 16 bytes)
  • Moved all padding to the end
  • Made the layout intentional rather than accidental

Now the struct is exactly 16 bytes.6 Additionally, 16 bytes being a power of two means:

  • Alignment: The compiler can optimise array indexing using bit-shifts (index << 4) rather than multiplication instructions
  • Cache Density: A block of 4 instructions fits perfectly into a single 64-byte cache line:
Cache Line (64 bytes):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Instruction 1  β”‚ Instruction 2  β”‚ Instruction 3  β”‚ Instruction 4  β”‚
β”‚ 16 bytes       β”‚ 16 bytes       β”‚ 16 bytes       β”‚ 16 bytes       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
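If you want to verify the new layout rather than trust the diagram, Odin’s compile-time assertions make that cheap (a small sketch using the size_of, align_of, and offset_of built-ins):

// Compile-time checks that the rearranged Instruction has the layout we expect.
#assert(size_of(Instruction)  == 16)
#assert(align_of(Instruction) == 8)
#assert(offset_of(Instruction, amount) == 0)  // largest field first
#assert(offset_of(Instruction, op)     == 8)  // small fields packed after it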

Upgrading the Hash Function

I also revisited the fingerprinting function from Part 2. We used FNV-1a, a simple hash function that worked fine for getting started. But now that we’re refactoring the rest of the code, it seemed like a good time to upgrade to a better algorithm.

I recently discovered xxHash, a fast non-cryptographic hash algorithm7. I saw its benchmark results in the SMHasher suite, and the numbers were impressive. Fortunately, Odin ships the different xxHash variants in its core library, and they use SIMD too.8

Given that our State is a solid block of 64 bytes with no pointers, we can take advantage of this:

import "core:hash/xxhash"
import "core:mem"

Fingerprint :: u64

fingerprint :: proc(s: ^State) -> Fingerprint {
    // Treat the struct as a raw byte slice
    data := mem.ptr_to_bytes(s)

    // Hash the bytes with the XXH3 variant
    return xxhash.XXH3_64(data)
}

We point the hash function at the struct and say β€œhash this.” No iteration, no type checking, no branches. This is what happens when your data structure matches your algorithm. The code becomes obvious, and better algorithms become easier to adopt.
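For context, here’s roughly where the fingerprint ends up in the solver; the visited map and surrounding names are illustrative, not the final code:

// The visited set stores 8-byte fingerprints, never whole states.
visited: map[Fingerprint]bool
defer delete(visited)

fp := fingerprint(&current_state)
if fp not_in visited {
    visited[fp] = true
    // ... this is a state we haven't explored yet ...
}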

Upgrading the Queue

The last thing I wanted to address was a shortcut from Part 1. We used a dynamic array and ordered_remove(&queue, 0) to pop states from our breadth-first search queue:

current := queue[0]
ordered_remove(&queue, 0)  // Shifts all elements left by one

This is fine for the water jug puzzle (7 states total). But I know that real verification problems can generate a lot more states. If we wait until then to fix this, we’ll be debugging β€œwhy is my solver slow?” when the answer is β€œyou’re shifting a 100,000-element array on every iteration.”

In this dynamic array, ordered_remove() works by removing the first item and shifting every other item down by one slot. If the queue has 10,000 states, every step of the solver performs 9,999 copy operations. This is an O(N) operation inside our main loop, turning what should be a linear solver into an accidentally quadratic one.

It’s better to use the right data structure now. Odin’s standard library provides core:container/queue, which implements a ring buffer.9 This lets us push to the back and pop from the front in O(1) time:

import "core:container/queue"

// ... inside main ...
bfs_queue: queue.Queue(Fingerprint)
queue.init(&bfs_queue)
defer queue.destroy(&bfs_queue)

// O(1) operation to dequeue from the front
current_hash := queue.pop_front(&bfs_queue)

The code looks nearly identical, but we’ve removed a future bottleneck before it becomes a problem.
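Putting the pieces together, the skeleton of the search loop now looks roughly like this (the names are illustrative, not the final solver code):

// BFS skeleton over the ring buffer; every push and pop is O(1).
queue.push_back(&bfs_queue, fingerprint(&initial_state))

for queue.len(bfs_queue) > 0 {
    current_hash := queue.pop_front(&bfs_queue)  // advances the head index, no shifting

    // ... look up the state for current_hash, apply each action,
    //     fingerprint the successors, and enqueue the ones we haven't seen ...
}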

The Refactored Engine: Same Logic, Clearer Design

We haven’t changed the logic of our Water Jug solver. It still outputs the exact same solution steps. But we have fundamentally changed how we think about the data:

  1. State: Fits in a cache line, no pointers, no indirection
  2. Allocation: Zero allocation in the hot loop
  3. Hashing: Upgraded to xxHash with SIMD support
  4. Queue: Constant time operations with ring buffer

Here is the logic for the new execute procedure. Notice that we no longer need type assertions (val.(int)). We’re operating on raw bytes:

execute :: proc(s: ^State, instruction: Instruction) -> bool {
    switch instruction.op {
    case .LT:
        // Direct array access, no pointers
        if !(s.registers[instruction.target] < instruction.amount) {
            return false
        }
    case .ASSIGN:
        s.registers[instruction.target] = instruction.amount
    case .ADD:
        s.registers[instruction.target] += instruction.amount
    // ...
    }
    return true
}
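To see how the engine is driven, here is a rough sketch of applying one action’s instruction list to a state (apply_action is an illustrative name, not a finalised API):

// Applying an action: run its instructions in order, abort if a guard fails.
apply_action :: proc(s: State, instructions: []Instruction) -> (next: State, ok: bool) {
    next = s  // plain 64-byte copy, no allocation
    for inst in instructions {
        if !execute(&next, inst) {
            return s, false  // a guard rejected the transition; state unchanged
        }
    }
    return next, true
}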

We now have a high-performance engine. It’s raw, it’s fast, and it has absolutely no safety rails. If we access registers[9], we crash. If we treat a boolean as an integer, the VM happily obliges.

But this is exactly where I want us to be. The engine should be fast and minimal; safety is the job of the compiler.

What’s Next

Now we can build the compiler on solid ground. In the next article, we’ll finally build that safety layer. We’re not just building a parser; we’re designing a language that makes formal verification feel natural. We need syntax that is familiar yet expressive for TLA, error messages that guide, and a type system that catches bugs before they reach the VM.

Remember that unsafe registers[9] access? The compiler will catch it at parse time, not runtime. We’re going to build a language where the obvious code is also the correct code… hopefully πŸ˜‰.

Instead of writing bytecode like this:

make_action("FillSmall", []Instruction{
    { .LT, 0, 0, 3 },      // Guard: register 0 less than 3?
    { .ASSIGN, 0, 0, 3 },  // Effect: set register 0 to 3
}),

We might write readable specifications like this:

action FillSmall {
    when small < 3
    then small := 3
}

The compiler will handle the translationβ€”mapping small to register 0, converting < to the .LT opcode, and transforming := into .ASSIGN instructions. That’s when this stops being an experiment and starts being a tool you’d actually want to use.


You can check out the complete code for this article on GitHub

If you’re enjoying this deep dive into systems programming and formal methods, subscribe below to follow along as we build the compiler next. Between articles, I share shorter updates and experiments on Twitter.

Notes & References

Footnotes

  1. Data-Oriented Design is a programming paradigm that prioritises the layout and access patterns of data to match how CPUs actually work. Rather than organising code around objects and inheritance (Object-Oriented Design), DOD organises data for efficient processing and minimal memory indirection. Mike Acton’s CppCon talk β€œData-Oriented Design and C++” is the canonical introduction to these ideas. There are more resources at this link: Data-Oriented Design Resources. ↩

  2. β€œPointer chasing” refers to the pattern where accessing your data requires following multiple pointers through memory. For example: you have a struct β†’ it contains a pointer to an array β†’ the array contains pointers to objects β†’ those objects contain your actual data. Each arrow is a memory fetch that could miss the cache. The more pointers you chase, the slower your code becomes. ↩

  3. I am currently prototyping with a constant REGISTER_SIZE :: 8, but I am considering making the State generic: State :: struct($N: int). This could allow the compiler to generate specialised code for models of different sizes (e.g., State(16) or State(32)). I’m still weighing the trade-offs, but since our Instruction struct uses u8 for indexing, we know the hard limit will be 256 registers regardless of how we implement the struct. ↩

  4. A cache line is the smallest unit of data transfer between main memory and the CPU cache. On most modern processors, this is 64 bytes. When you access a single byte of memory, the CPU automatically loads the entire 64-byte block containing that byte into the cache, betting that you’ll need the neighbouring bytes soon (spatial locality). If your data structure fits in one cache line, accessing any part of it loads the whole thing. ↩

  5. Why i64 instead of u64? In formal verification, detecting underflow (e.g., a bank balance going below zero) is critical. Signed integers allow us to detect negative values in our invariants, whereas unsigned integers would wrap around to a large positive number, potentially hiding bugs. ↩

  6. TLA itself doesn’t enforce variable limits, but in practice, specifications with hundreds of variables are often poorly designed. Leslie Lamport advocates for modelling abstractions (e.g., β€œa server is either leader or follower”) rather than concrete state (e.g., β€œserver_1_vote_count = 42, server_2_vote_count = 38…”). If you’re going to hit the 256 register limit, you’re probably modelling at the wrong level of abstraction. ↩

  7. xxHash is an extremely fast non-cryptographic hash algorithm. It’s used in production by Netflix, Microsoft, and the Linux kernel for hash tables and checksums where you need fast hashing but don’t need cryptographic guarantees. Here’s the link to its proposal at the IETF: xxHash fast digest algorithm. ↩

  8. SIMD (Single Instruction, Multiple Data) allows the CPU to process multiple values in parallel using a single instruction. Modern x86 CPUs support SIMD instruction sets (like SSE & AVX) that operate on wide registers. When the compiler knows the input is a fixed-size block (e.g., 64 bytes), it can unroll loops and use wide loads, and in some cases auto-vectorize the hashing logic to process multiple parts of the block per instruction rather than byte-by-byte. ↩

  9. A ring buffer (also called a circular buffer) is a fixed-size array that β€œwraps around” when it reaches the end. Instead of shifting all elements when you remove from the front, it just advances a pointer. When the write pointer reaches the end of the array, it wraps back to the beginning. This makes both enqueue and dequeue O(1) operations. ↩
