When Threads Share Memory: Understanding Race Conditions and Atomics
This is Part 2 of a 4-part series on lock-free data structures:
- Building a Circular Buffer Queue
- Understanding Race Conditions and Atomics (this article)
- Building a Lock-Free SPSC Queue
- Eliminating False Sharing: Cache-Aware Lock-Free Programming (coming soon)
In the previous article, we built a circular buffer queue that handles items efficiently in constant time. The implementation was elegant: two indices chasing each other around a ring of memory, never needing to shift data or allocate new space. But we made an implicit assumption: only one thread would ever touch it.
That elegant circular buffer? It's about to break in ways you can't see.
Your Queue Just Broke
Imagine you've deployed your queue to production. One thread produces work items, another consumes them. Your single-threaded tests passed perfectly. The logic is sound. The code is clean.
Then the bug reports start trickling in. Items go missing. Sometimes the consumer reads garbage data that was never enqueued. Occasionally, the whole thing crashes. You stare at the code, add logging, run it again, and it works fine.
This is a classic Heisenbug: a bug that vanishes when you try to observe it. The logging you added changes the timing just enough to hide the problem. The debugger's breakpoints synchronise the threads in ways that mask the race. Your tests pass because they run too fast to trigger the exact interleaving that breaks everything.
This is the nightmare of concurrency. Your code isn't wrong in any obvious way. The race condition hiding inside it only reveals itself when the timing is exactly wrong, which might be once every million operations.
This article explains what's happening beneath the surface when threads share memory. We won't build anything yet; that comes in the next article. But without understanding these concepts, any lock-free code you write will be subtly broken in ways that only appear in production at 3am on a Saturday.
By the end, you'll understand why sharing memory between threads is dangerous, what atomic operations actually do, and how Acquire/Release memory ordering makes concurrent code correct. These concepts apply to any language (Zig, Rust, C, C++, or Go) because they reflect how modern CPUs actually work.
I'll refer to the previous article as Article 1.
A Race Condition in Slow Motion
Let's revisit our circular buffer from Article 1. Here's the core structure, stripped down to the essentials:
// From Article 1 - our simple queue
data: [capacity + 1]i32,
front: usize = 0, // Consumer reads from here
back: usize = 0, // Producer writes here
The producer adds items by writing to data[back] and then incrementing back. The consumer reads from data[front] and increments front. Simple enough.
Now add a second thread. The producer runs on one CPU core, the consumer on another. Both are touching the same memory. Here's what the producer does:
// Producer's enqueue (simplified)
self.data[self.back] = value;
self.back = next_back;
And the consumer:
// Consumer's dequeue (simplified)
if (self.front != self.back) {
    const value = self.data[self.front];
    self.front = next_front;
    return value;
}
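If it helps to see those two fragments in one place, here is a rough sketch of the unsynchronised queue the scenarios below assume. The Queue generic, the modulo wrap-around, and the omitted full check are my shorthand for this article; Article 1 has the real implementation.
// Sketch only: the unsynchronised queue, stitched together from the fragments above
fn Queue(comptime capacity: usize) type {
    return struct {
        data: [capacity + 1]i32 = undefined,
        front: usize = 0, // Consumer reads from here
        back: usize = 0,  // Producer writes here

        fn enqueue(self: *@This(), value: i32) void {
            const next_back = (self.back + 1) % self.data.len;
            self.data[self.back] = value; // Write the data
            self.back = next_back;        // Then advance the index (full check omitted)
        }

        fn dequeue(self: *@This()) ?i32 {
            if (self.front == self.back) return null; // Looks empty
            const next_front = (self.front + 1) % self.data.len;
            const value = self.data[self.front];
            self.front = next_front;
            return value;
        }
    };
}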
This looks fine. The consumer checks whether there's data available by comparing front and back. If they're different, there's something to read. What could go wrong?
Everything. Let me walk you through one specific disaster, step by step:
Time   Producer (Core 1)        Consumer (Core 2)
────   ─────────────────        ──────────────────
T1     Reads back = 3
T2                              Reads back = 3
T3                              Reads front = 3
T4                              Compares: 3 == 3, queue empty
T5     Writes data[3] = 42
T6     Writes back = 4
T7                              ... misses the item entirely
The consumer checked back before the producer updated it. By the time the producer finished writing, the consumer had already decided the queue was empty and moved on. The item is lost forever.
But it gets worse. Consider this scenario:
Time   Producer (Core 1)        Consumer (Core 2)
────   ─────────────────        ──────────────────
T1     Writes data[3] = 42
T2                              Reads back = 4 (sees the update!)
T3                              Reads front = 3
T4                              Compares: 3 != 4, has data
T5                              Reads data[3] = ??? (garbage!)
T6     Writes back = 4
Wait, how can the consumer see back = 4 at T2 when the producer doesn't write it until T6? This seems impossible, but it happens. The CPU or compiler might reorder instructions, executing the back update before the data write, so from another thread's perspective the timeline is scrambled. The problem isn't just about timing; it's about visibility and ordering. Writes from one thread aren't instantly visible to other threads, and they might become visible in a different order than they were written.
This is where most developers' mental model breaks down completely.
Why Your Writes Are Invisible
Here's the mental model most of us carry around: there's one memory, all threads see the same memory, and when I write a value, it's immediately there for everyone to read.
This model is wrong. Comfortingly wrong, but wrong.
Modern CPUs have a hierarchy of caches between each core and main memory. When your code writes a value, it goes to the core's local cache first, not to main memory. Other cores have their own caches, which might contain stale copies of the same data.
┌──────────────────┐          ┌──────────────────┐
│      Core 1      │          │      Core 2      │
│    (Producer)    │          │    (Consumer)    │
│  ┌────────────┐  │          │  ┌────────────┐  │
│  │   L1/L2    │  │          │  │   L1/L2    │  │
│  │   Cache    │  │          │  │   Cache    │  │
│  │            │  │          │  │            │  │
│  │  back = 4  │  │          │  │  back = 3  │  │
│  │ data[3]=42 │  │          │  │ data[3]=?  │  │
│  └──────┬─────┘  │          │  └──────┬─────┘  │
└─────────┼────────┘          └─────────┼────────┘
          │                             │
          └──────────────┬──────────────┘
                         │
                 ┌───────▼────────┐
                 │  Main Memory   │
                 │                │
                 │  back = 3      │  ← Not yet updated!
                 │  data[3] = 0   │
                 └────────────────┘
Each core sees its own version of memory. The producer has written back = 4 and data[3] = 42, but those writes are sitting in Core 1's cache. Core 2 still sees the old values. Eventually the caches will synchronise, but "eventually" might be microseconds later, an eternity in CPU time.
There's another problem lurking here: reordering. Compilers and CPUs reorder operations to squeeze out more performance. Your code says:
self.data[self.back] = value; // Write A
self.back = next_back; // Write B
But the CPU might execute Write B before Write A. On a single thread, this doesn't matter; the end result is the same. With multiple threads, it's catastrophic. Another thread might see the updated back but read uninitialised garbage from data because that write hasn't happened yet.
This is the reality we're working with. Writes are invisible until something forces them to become visible. Operations happen in a different order than we wrote them. Our nice sequential mental model is a lie.
So how do we fix this? We need special instructions that give us control over visibility and ordering. These are called atomic operations.
Atomic Operations: Writes That Can't Be Ignored
The word "atomic" comes from the Greek atomos, meaning indivisible. An atomic operation completes fully or not at all. No other thread can observe it half-finished.
Regular variables have no such guarantee. A 64-bit write on some architectures might appear to another thread as two separate 32-bit writes. The other thread could read a value that's half old and half new. This is called a torn read: you observe a value that never actually existed in your program. Atomic operations prevent this.
In Zig, you could use the standard library's atomic wrapper:
const std = @import("std");
// Instead of a plain variable:
var counter: usize = 0;
// Use an atomic wrapper:
var counter = std.atomic.Value(usize).init(0);
// Read the value:
const value = counter.load(.acquire);
// Write the value:
counter.store(new_value, .release);
Two things changed. First, we wrapped the variable in std.atomic.Value(T), which tells the compiler to use atomic instructions. Second, we're using .load() and .store() with a second parameter instead of direct assignment.
That second parameter (.acquire, .release) is where the real magic happens. It controls not just whether the operation is atomic, but when the change becomes visible to other threads and what ordering guarantees we get.
If we just make back atomic but use the wrong memory ordering, we still have problems. The consumer might see the updated back before seeing the data that was written. Atomicity alone doesn't fix our race condition. We need ordering guarantees too.
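To make that concrete, here is a sketch of that still-broken version, reusing the simplified fields and the next_back placeholder from the earlier snippets (idx stands for the producer's current slot, and .monotonic is Zig's spelling of relaxed ordering, covered below). The index back is now atomic, but nothing stops the data write and the index update from becoming visible in the wrong order:
// Still broken: atomic, but with no ordering guarantees (sketch, not the fix)
// Producer
self.data[idx] = value;                  // Write A: the data
self.back.store(next_back, .monotonic);  // Write B: may become visible before Write A

// Consumer
const back = self.back.load(.monotonic); // Might observe Write B early...
if (self.front != back) {
    const value = self.data[self.front]; // ...and still read garbage here
}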
This is where it gets subtle, and where most concurrent code goes wrong.
Memory Ordering: The Rules of Visibility
Memory ordering is the part of concurrent programming that trips up even experienced developers (myself included). Most tutorials either go too deep into formal memory models and "happens-before" relations, or too shallow with advice like "just use SeqCst everywhere." I'll aim for the practical middle ground.
I want to give you a mental model that's accurate enough to write correct code, without drowning in academic formalism.
The Checkpoint Analogy
Think of memory ordering in terms of checkpoint flags at a race.
Release means planting a checkpoint flag. When a thread performs a Release store, it's saying: "Everything I wrote before this point is finished and ready to be seen. I'm planting a flag here."
Thread A's timeline:
──────────────────────────────────────────────────────────►
  write(x)    write(y)    write(z)    RELEASE store(back)
  ├────────────────────────────────┤
   All these writes are "published"       🚩 checkpoint
   when the Release happens
Acquire means reading the checkpoint flag. When another thread performs an Acquire load, it's saying: "I want to see everything that was finished before the Release that published this value."
Thread B's timeline:
──────────────────────────────────────────────────────────►
  ACQUIRE load(back)    read(x)    read(y)    read(z)
  🚩──────────────────────────────────────────────────
  │  After acquiring, Thread B is guaranteed to
  │  see all writes that happened before the Release
The key insight is that Acquire and Release work as a pair. The Release "publishes" a set of writes. The Acquire "subscribes" to see those writes. If you only use one without the other, you don't get the synchronisation you need.
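Here is the checkpoint analogy as a tiny, self-contained Zig program you can run; a sketch rather than anything production-worthy. The data and ready globals, the busy-wait loop, and the function names are invented for this illustration, and the lowercase ordering names assume a reasonably recent Zig standard library:
const std = @import("std");

var data: i32 = 0;                               // The payload, written before the flag
var ready = std.atomic.Value(bool).init(false);  // The checkpoint flag

fn producerThread() void {
    data = 42;                    // Plain write: finish the work first
    ready.store(true, .release);  // Plant the flag, publishing the write above
}

fn consumerThread() void {
    while (!ready.load(.acquire)) {} // Spin until we see the flag (Acquire)
    // The Acquire load observed the Release store, so the write to data
    // is guaranteed to be visible here.
    std.debug.print("data = {}\n", .{data});
}

pub fn main() !void {
    const producer = try std.Thread.spawn(.{}, producerThread, .{});
    const consumer = try std.Thread.spawn(.{}, consumerThread, .{});
    producer.join();
    consumer.join();
}
Swap the .release and .acquire for relaxed orderings and the guarantee disappears: the consumer could leave the loop and still not be guaranteed to see 42.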
Applied to Our Queue
Let's fix our circular buffer using this pattern:
// Producer (uses Release when updating back)
const back = self.back.load(.monotonic);  // Our own index; a relaxed read is enough
self.data[back] = value;                  // Write the data first
self.back.store(next_back, .release);     // Then publish with Release
// Consumer (uses Acquire when reading back)
const back = self.back.load(.acquire);    // Subscribe with Acquire
if (self.front != back) {
    const value = self.data[self.front];  // Now guaranteed to see the data
    // ...
}
Here's what happens with proper synchronisation:
Producer                                Consumer
────────                                ────────
write data[3] = 42
        │
        ▼
store back = 4 (.release)  ───────────► load back (.acquire)
        🚩                                      │
                                                ▼
                                        read data[3]
                                        └─ sees 42, guaranteed
The Release on the producer side ensures that the data write is visible before back is updated. The Acquire on the consumer side ensures that when it sees back = 4, it also sees all the writes that happened before that Release. The race condition is gone.
Other Memory Orderings
You'll encounter other orderings in documentation, so let me briefly explain them:
Relaxed (.monotonic in Zig): No ordering guarantees at all. The operation is atomic (indivisible), but you get no promises about when other threads will see it or what other writes will be visible. Use this only for independent counters where you don't need synchronisation with other data.
SeqCst (.seq_cst in Zig): Sequential consistency, the strongest guarantee. All threads see all SeqCst operations in the same global order. This requires expensive memory barriers on most architectures. It's often overkill; Acquire/Release is usually sufficient and faster.
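To show the one place Relaxed is genuinely enough, here is a sketch of an independent counter shared by two threads. The names and counts are invented for the example; the point is only that fetchAdd with .monotonic never loses an increment, even though it promises nothing about ordering relative to other data:
const std = @import("std");

var hits = std.atomic.Value(usize).init(0);

fn worker() void {
    var i: usize = 0;
    while (i < 1_000_000) : (i += 1) {
        // Indivisible increment, no ordering guarantees: fine for a lone counter
        _ = hits.fetchAdd(1, .monotonic);
    }
}

pub fn main() !void {
    const t1 = try std.Thread.spawn(.{}, worker, .{});
    const t2 = try std.Thread.spawn(.{}, worker, .{});
    t1.join();
    t2.join();
    // Prints 2000000 every run; a plain hits += 1 from both threads would not
    std.debug.print("hits = {}\n", .{hits.load(.monotonic)});
}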
For a single-producer, single-consumer queue, the rule is straightforward:
- Acquire when reading the other thread's position
- Release when updating your own position
- Relaxed when reading your own position (you wrote it, so you already know what's there)
The Simple Rule for Producer-Consumer Patterns
Let me distil everything into a pattern you can memorise and apply:
- Producer writes data first, then updates position with Release
- Consumer reads position with Acquire, then reads data
- The Acquire/Release pair creates a "happens-before" relationship
The symmetry is elegant:
- When you update something that another thread reads → use Release
- When you read something that another thread updates → use Acquire
- When you access your own data that only you modify → use Relaxed
Here's the pattern visualised for our queue:
┌──────────────────────────────────────────────────────────┐
│                         PRODUCER                         │
│                                                          │
│  1. Write data to buffer                                 │
│  2. Update 'back' with RELEASE                           │
│     (publishes the data write)                           │
└─────────────────────────┬────────────────────────────────┘
                          │
                    synchronises
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                         CONSUMER                         │
│                                                          │
│  1. Read 'back' with ACQUIRE                             │
│     (subscribes to producer's writes)                    │
│  2. Read data from buffer                                │
│     (guaranteed to see producer's write)                 │
└──────────────────────────────────────────────────────────┘
Common mistakes to avoid:
- Using Relaxed everywhere "because it's faster": you'll have race conditions that only appear under load
- Using SeqCst everywhere "to be safe": unnecessary overhead, and you won't understand why your code works
- Forgetting that you need both Acquire and Release; they're a pair, like lock and unlock
What About Mutexes?
We've covered how to use atomics with Acquire/Release semantics to synchronise threads. But there's an elephant in the room: if atomics are so tricky to get right, why not just use mutexes everywhere?
Mutexes use atomics internally, but they add coordination overhead on top. Here's a simplified view of what a mutex does:
// Simplified mutex concept (flag is a std.atomic.Value(bool);
// cmpxchgWeak returns null on success, the current value on failure)
lock:   while (flag.cmpxchgWeak(false, true, .acquire, .monotonic) != null) { wait(); }
unlock: flag.store(false, .release);
The Acquire in the lock ensures you see all writes made before the previous unlock. The Release in the unlock ensures your writes are visible before the next lock. Mutexes give you Acquire/Release semantics automatically, without you having to think about it.
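For contrast, here is roughly what the same kind of protection looks like when you let std.Thread.Mutex do the Acquire/Release work for you. The SharedCounter struct and helper names are just an illustration I made up; the lock/unlock calls are the real API:
const std = @import("std");

const SharedCounter = struct {
    mutex: std.Thread.Mutex = .{},
    value: u64 = 0,

    fn increment(self: *SharedCounter) void {
        self.mutex.lock();          // Acquire: see everything written before the previous unlock
        defer self.mutex.unlock();  // Release: publish our write to whoever locks next
        self.value += 1;            // Ordinary code in between; no orderings to choose
    }
};

fn hammer(counter: *SharedCounter) void {
    var i: usize = 0;
    while (i < 100_000) : (i += 1) counter.increment();
}

pub fn main() !void {
    var counter = SharedCounter{};
    const t1 = try std.Thread.spawn(.{}, hammer, .{&counter});
    const t2 = try std.Thread.spawn(.{}, hammer, .{&counter});
    t1.join();
    t2.join();
    std.debug.print("value = {}\n", .{counter.value}); // 200000, every time
}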
But mutexes add overhead:
- Kernel involvement when the lock is contested (one thread has to wait)
- Context switches between threads
- Blocking: a thread waiting for a mutex can't do anything else
When to consider using mutexes:
- Multiple writers accessing the same data
- Complex operations that can't be expressed as a single atomic
- When you need blocking behaviour (wait until resource is available)
- When simplicity matters more than maximum performance
When to consider raw atomics:
- Single-producer, single-consumer patterns (like our queue)
- Simple counters and flags
- When you absolutely need maximum performance
- Real-time systems where blocking is unacceptable
For the vast majority of concurrent code, mutexes are probably a fine choice. They're easier to use correctly, and the performance difference often doesn't matter. Raw atomics are a specialised tool for the cases where it does matter, and you should reach for them only when you have a specific need and understand the trade-offs.
Real-world example: Zig's new Async I/O interface uses mutexes internally. The std.Io.Threaded implementation uses std.Thread.Mutex for synchronising access to its internal state, as do the IoUring and Kqueue implementations. Even high-performance systems reach for mutexes when the trade-offs make sense.
Ready to Build
We started with a single-threaded queue and watched it fall apart under concurrent access. The culprits: CPU caches, instruction reordering, and invisible writes that conspire against our sequential mental model. Atomic operations give us indivisibility, but we needed memory ordering, specifically Acquire/Release semantics, to control when changes become visible across threads.
The core insight: Release plants a checkpoint flag that publishes your writes. Acquire reads that flag and subscribes to see them. Together, they let threads coordinate safely through shared memory, without locks.
Whatβs Next
Remember our circular buffer from the previous article? It's about to become the backbone of a lock-free queue.
We've learned the theory: race conditions, CPU caches, reordering, atomics, Acquire/Release. In the next article, we'll put it all together and build something you can actually use in production: a complete single-producer, single-consumer lock-free queue.
You'll see exactly where each memory ordering goes and why. We'll walk through every line of code, explaining the reasoning behind each .acquire and .release. By the end, you'll have a queue that can pass millions of messages per second between threads without a single mutex.
If the Acquire/Release pattern doesn't feel natural yet, that's completely normal. You'll internalise it by seeing it applied line by line in real code. For now, the checkpoint analogy (Release plants a flag, Acquire reads it) should be enough to follow along.
Subscribe below to be notified when the next article is published. And if you want to go deeper into memory ordering theory, I recommend Jeff Preshing's excellent series on Acquire and Release semantics; it's the clearest explanation I've found anywhere.
See you in the next one, where we build something real.