When Threads Share Memory: Understanding Race Conditions and Atomics
This is Part 2 of a 4-part series on lock-free data structures:
- Building a Circular Buffer Queue
- Understanding Race Conditions and Atomics (this article)
- Building a Lock-Free SPSC Queue
- Eliminating False Sharing: Cache-Aware Lock-Free Programming (coming soon)
In the previous article, we built a circular buffer queue that handles items efficiently in constant time. The implementation was elegant: two indices chasing each other around a ring of memory, never needing to shift data or allocate new space. But we made an implicit assumption: only one thread would ever touch it.
That elegant circular buffer? It's about to break in ways you can't see.
Your Queue Just Broke
Imagine you've deployed your queue to production. One thread produces work items, another consumes them. Your single-threaded tests passed perfectly. The logic is sound. The code is clean.
Then the bug reports start trickling in. Items go missing. Sometimes the consumer reads garbage data that was never enqueued. Occasionally, the whole thing crashes. You stare at the code, add logging, run it again, and it works fine.
This is a classic Heisenbug: a bug that vanishes when you try to observe it. The logging you added changes the timing just enough to hide the problem. The debugger's breakpoints synchronise the threads in ways that mask the race. Your tests pass because they run too fast to trigger the exact interleaving that breaks everything.
This is the nightmare of concurrency. Your code isn't wrong in any obvious way. The race condition hiding inside it only reveals itself when the timing is exactly wrong, which might be once every million operations.
This article explains what's happening beneath the surface when threads share memory. We won't build anything yet; that comes in the next article. But without understanding these concepts, any lock-free code you write will be subtly broken in ways that only appear in production at 3am on a Saturday.
By the end, you'll understand why sharing memory between threads is dangerous, what atomic operations actually do, and how Acquire/Release memory ordering makes concurrent code correct. These concepts apply to any language (Zig, Rust, C, C++, or Go) because they reflect how modern CPUs actually work.
I'll refer to the previous article as Article 1.
A Race Condition in Slow Motion
Let's revisit our circular buffer from Article 1. Here's the core structure, stripped down to the essentials:
// From Article 1 - our simple queue
data: [capacity + 1]i32,
front: usize = 0, // Consumer reads from here
back: usize = 0, // Producer writes here
The producer adds items by writing to data[back] and then incrementing back. The consumer reads from data[front] and increments front. Simple enough.
Now add a second thread. The producer runs on one CPU core, the consumer on another. Both are touching the same memory. Here's what the producer does:
// Producer's enqueue (simplified)
self.data[self.back] = value;
self.back = next_back;
And the consumer:
// Consumer's dequeue (simplified)
if (self.front != self.back) {
    const value = self.data[self.front];
    self.front = next_front;
    return value;
}
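If it helps to see those two fragments in one place, here is a rough sketch of the unsynchronised queue the scenarios below assume. The Queue generic, the modulo wrap-around, and the omitted full check are my shorthand for this article; Article 1 has the real implementation.
// Sketch only: the unsynchronised queue, stitched together from the fragments above
fn Queue(comptime capacity: usize) type {
    return struct {
        data: [capacity + 1]i32 = undefined,
        front: usize = 0, // Consumer reads from here
        back: usize = 0,  // Producer writes here

        fn enqueue(self: *@This(), value: i32) void {
            const next_back = (self.back + 1) % self.data.len;
            self.data[self.back] = value; // Write the data
            self.back = next_back;        // Then advance the index (full check omitted)
        }

        fn dequeue(self: *@This()) ?i32 {
            if (self.front == self.back) return null; // Looks empty
            const next_front = (self.front + 1) % self.data.len;
            const value = self.data[self.front];
            self.front = next_front;
            return value;
        }
    };
}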
This looks fine. The consumer checks whether there's data available by comparing front and back. If they're different, there's something to read. What could go wrong?
Everything. Let me walk you through one specific disaster, step by step:
Time   Producer (Core 1)        Consumer (Core 2)
────   ─────────────────        ──────────────────
T1     Reads back = 3
T2                              Reads back = 3
T3                              Reads front = 3
T4                              Compares: 3 == 3, queue empty
T5     Writes data[3] = 42
T6     Writes back = 4
T7                              ... misses the item entirely
The consumer checked back before the producer updated it. By the time the producer finished writing, the consumer had already decided the queue was empty and moved on. The item is lost forever.
But it gets worse. Consider this scenario:
Time   Producer (Core 1)        Consumer (Core 2)
────   ─────────────────        ──────────────────
T1     Writes data[3] = 42
T2                              Reads back = 4 (sees the update!)
T3                              Reads front = 3
T4                              Compares: 3 != 4, has data
T5                              Reads data[3] = ??? (garbage!)
T6     Writes back = 4
Wait, how can the consumer see back = 4 at T2 when the producer doesn't write it until T6? This seems impossible, but it happens. The CPU or compiler might reorder instructions, executing the back update before the data write, so from another thread's perspective the timeline is scrambled. The problem isn't just about timing; it's about visibility and ordering. Writes from one thread aren't instantly visible to other threads, and they might become visible in a different order than they were written.
This is where most developers' mental model breaks down completely.
Why Your Writes Are Invisible
Here's the mental model most of us carry around: there's one memory, all threads see the same memory, and when I write a value, it's immediately there for everyone to read.
This model is wrong. Comfortingly wrong, but wrong.
Modern CPUs have a hierarchy of caches between each core and main memory. When your code writes a value, it goes to the core's local cache first, not to main memory. Other cores have their own caches, which might contain stale copies of the same data.
┌──────────────────┐          ┌──────────────────┐
│      Core 1      │          │      Core 2      │
│    (Producer)    │          │    (Consumer)    │
│  ┌────────────┐  │          │  ┌────────────┐  │
│  │   L1/L2    │  │          │  │   L1/L2    │  │
│  │   Cache    │  │          │  │   Cache    │  │
│  │            │  │          │  │            │  │
│  │  back = 4  │  │          │  │  back = 3  │  │
│  │ data[3]=42 │  │          │  │ data[3]=?  │  │
│  └──────┬─────┘  │          │  └──────┬─────┘  │
└─────────┼────────┘          └─────────┼────────┘
          │                             │
          └──────────────┬──────────────┘
                         │
                 ┌───────▼────────┐
                 │  Main Memory   │
                 │                │
                 │  back = 3      │  ← Not yet updated!
                 │  data[3] = 0   │
                 └────────────────┘
Each core sees its own version of memory. The producer has written back = 4 and data[3] = 42, but those writes are sitting in Core 1's cache. Core 2 still sees the old values. Eventually the caches will synchronise, but "eventually" might be microseconds later, an eternity in CPU time.
There's another problem lurking here: reordering. Compilers and CPUs reorder operations to squeeze out more performance. Your code says:
self.data[self.back] = value; // Write A
self.back = next_back; // Write B
But the CPU might execute Write B before Write A. On a single thread, this doesn't matter; the end result is the same. With multiple threads, it's catastrophic. Another thread might see the updated back but read uninitialised garbage from data because that write hasn't happened yet.
This is the reality we're working with. Writes are invisible until something forces them to become visible. Operations happen in a different order than we wrote them. Our nice sequential mental model is a lie.
So how do we fix this? We need special instructions that give us control over visibility and ordering. These are called atomic operations.
Atomic Operations: Writes That Can't Be Ignored
The word "atomic" comes from the Greek atomos, meaning indivisible. An atomic operation completes fully or not at all. No other thread can observe it half-finished.
Regular variables have no such guarantee. A 64-bit write on some architectures might appear to another thread as two separate 32-bit writes. The other thread could read a value that's half old and half new. This is called a torn read: you observe a value that never actually existed in your program. Atomic operations prevent this.
In Zig, you could use the standard library's atomic wrapper:
const std = @import("std");
// Instead of a plain variable:
var counter: usize = 0;
// Use an atomic wrapper:
var counter = std.atomic.Value(usize).init(0);
// Read the value:
const value = counter.load(.acquire);
// Write the value:
counter.store(new_value, .release);
Two things changed. First, we wrapped the variable in std.atomic.Value(T), which tells the compiler to use atomic instructions. Second, we're using .load() and .store() with a second parameter instead of direct assignment.
That second parameter (.acquire, .release) is where the real magic happens. It controls not just whether the operation is atomic, but when the change becomes visible to other threads and what ordering guarantees we get.
If we just make back atomic but use the wrong memory ordering, we still have problems. The consumer might see the updated back before seeing the data that was written. Atomicity alone doesn't fix our race condition. We need ordering guarantees too.
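To make that concrete, here is a sketch of that still-broken version, reusing the simplified fields and the next_back placeholder from the earlier snippets (idx stands for the producer's current slot, and .monotonic is Zig's spelling of relaxed ordering, covered below). The index back is now atomic, but nothing stops the data write and the index update from becoming visible in the wrong order:
// Still broken: atomic, but with no ordering guarantees (sketch, not the fix)
// Producer
self.data[idx] = value;                  // Write A: the data
self.back.store(next_back, .monotonic);  // Write B: may become visible before Write A

// Consumer
const back = self.back.load(.monotonic); // Might observe Write B early...
if (self.front != back) {
    const value = self.data[self.front]; // ...and still read garbage here
}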
This is where it gets subtle, and where most concurrent code goes wrong.
Memory Ordering: The Rules of Visibility
Memory ordering is the part of concurrent programming that trips up even experienced developers (myself included). Most tutorials either go too deep into formal memory models and "happens-before" relations, or too shallow with advice like "just use SeqCst everywhere." I'll aim for the practical middle ground.
I want to give you a mental model that's accurate enough to write correct code, without drowning in academic formalism.
The Checkpoint Analogy
Think of memory ordering in terms of checkpoint flags at a race.
Release means planting a checkpoint flag. When a thread performs a Release store, it's saying: "Everything I wrote before this point is finished and ready to be seen. I'm planting a flag here."
Thread A's timeline:
──────────────────────────────────────────────────────────►
  write(x)    write(y)    write(z)    RELEASE store(back)
  ├────────────────────────────────┤
   All these writes are "published"       🚩 checkpoint
   when the Release happens
Acquire means reading the checkpoint flag. When another thread performs an Acquire load, it's saying: "I want to see everything that was finished before the Release that published this value."
Thread B's timeline:
──────────────────────────────────────────────────────────►
  ACQUIRE load(back)    read(x)    read(y)    read(z)
  🚩──────────────────────────────────────────────────
  │  After acquiring, Thread B is guaranteed to
  │  see all writes that happened before the Release
The key insight is that Acquire and Release work as a pair. The Release "publishes" a set of writes. The Acquire "subscribes" to see those writes. If you only use one without the other, you don't get the synchronisation you need.
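Here is the checkpoint analogy as a tiny, self-contained Zig program you can run; a sketch rather than anything production-worthy. The data and ready globals, the busy-wait loop, and the function names are invented for this illustration, and the lowercase ordering names assume a reasonably recent Zig standard library:
const std = @import("std");

var data: i32 = 0;                               // The payload, written before the flag
var ready = std.atomic.Value(bool).init(false);  // The checkpoint flag

fn producerThread() void {
    data = 42;                    // Plain write: finish the work first
    ready.store(true, .release);  // Plant the flag, publishing the write above
}

fn consumerThread() void {
    while (!ready.load(.acquire)) {} // Spin until we see the flag (Acquire)
    // The Acquire load observed the Release store, so the write to data
    // is guaranteed to be visible here.
    std.debug.print("data = {}\n", .{data});
}

pub fn main() !void {
    const producer = try std.Thread.spawn(.{}, producerThread, .{});
    const consumer = try std.Thread.spawn(.{}, consumerThread, .{});
    producer.join();
    consumer.join();
}
Swap the .release and .acquire for relaxed orderings and the guarantee disappears: the consumer could leave the loop and still not be guaranteed to see 42.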
Applied to Our Queue
Let's fix our circular buffer using this pattern:
// Producer (uses Release when updating back)
const back = self.back.load(.monotonic);  // Our own index; a relaxed read is enough
self.data[back] = value;                  // Write the data first
self.back.store(next_back, .release);     // Then publish with Release
// Consumer (uses Acquire when reading back)
const back = self.back.load(.acquire);    // Subscribe with Acquire
if (self.front != back) {
    const value = self.data[self.front];  // Now guaranteed to see the data
    // ...
}
Here's what happens with proper synchronisation:
Producer                                Consumer
────────                                ────────
write data[3] = 42
        │
        ▼
store back = 4 (.release)  ───────────► load back (.acquire)
        🚩                                      │
                                                ▼
                                        read data[3]
                                        └─ sees 42, guaranteed
The Release on the producer side ensures that the data write is visible before back is updated. The Acquire on the consumer side ensures that when it sees back = 4, it also sees all the writes that happened before that Release. The race condition is gone.
Other Memory Orderings
You'll encounter other orderings in documentation, so let me briefly explain them:
Relaxed (.monotonic in Zig): No ordering guarantees at all. The operation is atomic (indivisible), but you get no promises about when other threads will see it or what other writes will be visible. Use this only for independent counters where you don't need synchronisation with other data.
SeqCst (.seq_cst in Zig): Sequential consistency, the strongest guarantee. All threads see all SeqCst operations in the same global order. This requires expensive memory barriers on most architectures. It's often overkill; Acquire/Release is usually sufficient and faster.
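To show the one place Relaxed is genuinely enough, here is a sketch of an independent counter shared by two threads. The names and counts are invented for the example; the point is only that fetchAdd with .monotonic never loses an increment, even though it promises nothing about ordering relative to other data:
const std = @import("std");

var hits = std.atomic.Value(usize).init(0);

fn worker() void {
    var i: usize = 0;
    while (i < 1_000_000) : (i += 1) {
        // Indivisible increment, no ordering guarantees: fine for a lone counter
        _ = hits.fetchAdd(1, .monotonic);
    }
}

pub fn main() !void {
    const t1 = try std.Thread.spawn(.{}, worker, .{});
    const t2 = try std.Thread.spawn(.{}, worker, .{});
    t1.join();
    t2.join();
    // Prints 2000000 every run; a plain hits += 1 from both threads would not
    std.debug.print("hits = {}\n", .{hits.load(.monotonic)});
}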
For a single-producer, single-consumer queue, the rule is straightforward:
- Acquire when reading the other thread's position
- Release when updating your own position
- Relaxed when reading your own position (you wrote it, so you already know what's there)
The Simple Rule for Producer-Consumer Patterns
Let me distil everything into a pattern you can memorise and apply:
- Producer writes data first, then updates position with Release
- Consumer reads position with Acquire, then reads data
- The Acquire/Release pair creates a "happens-before" relationship
The symmetry is elegant:
- When you update something that another thread reads → use Release
- When you read something that another thread updates → use Acquire
- When you access your own data that only you modify → use Relaxed
Here's the pattern visualised for our queue:
┌──────────────────────────────────────────────────────────┐
│                         PRODUCER                         │
│                                                          │
│  1. Write data to buffer                                 │
│  2. Update 'back' with RELEASE                           │
│     (publishes the data write)                           │
└─────────────────────────┬────────────────────────────────┘
                          │
                    synchronises
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                         CONSUMER                         │
│                                                          │
│  1. Read 'back' with ACQUIRE                             │
│     (subscribes to producer's writes)                    │
│  2. Read data from buffer                                │
│     (guaranteed to see producer's write)                 │
└──────────────────────────────────────────────────────────┘
Common mistakes to avoid:
- Using Relaxed everywhere "because it's faster": you'll have race conditions that only appear under load
- Using SeqCst everywhere "to be safe": unnecessary overhead, and you won't understand why your code works
- Forgetting that you need both Acquire and Release; they're a pair, like lock and unlock
What About Mutexes?
We've covered how to use atomics with Acquire/Release semantics to synchronise threads. But there's an elephant in the room: if atomics are so tricky to get right, why not just use mutexes everywhere?
Mutexes use atomics internally, but they add coordination overhead on top. Here's a simplified view of what a mutex does:
// Simplified mutex concept (flag is a std.atomic.Value(bool);
// cmpxchgWeak returns null on success, the current value on failure)
lock:   while (flag.cmpxchgWeak(false, true, .acquire, .monotonic) != null) { wait(); }
unlock: flag.store(false, .release);
The Acquire in the lock ensures you see all writes made before the previous unlock. The Release in the unlock ensures your writes are visible before the next lock. Mutexes give you Acquire/Release semantics automatically, without you having to think about it.
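For contrast, here is roughly what the same kind of protection looks like when you let std.Thread.Mutex do the Acquire/Release work for you. The SharedCounter struct and helper names are just an illustration I made up; the lock/unlock calls are the real API:
const std = @import("std");

const SharedCounter = struct {
    mutex: std.Thread.Mutex = .{},
    value: u64 = 0,

    fn increment(self: *SharedCounter) void {
        self.mutex.lock();          // Acquire: see everything written before the previous unlock
        defer self.mutex.unlock();  // Release: publish our write to whoever locks next
        self.value += 1;            // Ordinary code in between; no orderings to choose
    }
};

fn hammer(counter: *SharedCounter) void {
    var i: usize = 0;
    while (i < 100_000) : (i += 1) counter.increment();
}

pub fn main() !void {
    var counter = SharedCounter{};
    const t1 = try std.Thread.spawn(.{}, hammer, .{&counter});
    const t2 = try std.Thread.spawn(.{}, hammer, .{&counter});
    t1.join();
    t2.join();
    std.debug.print("value = {}\n", .{counter.value}); // 200000, every time
}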
But mutexes add overhead:
- Kernel involvement when the lock is contested (one thread has to wait)
- Context switches between threads
- Blocking: a thread waiting for a mutex can't do anything else
When to consider using mutexes:
- Multiple writers accessing the same data
- Complex operations that can't be expressed as a single atomic
- When you need blocking behaviour (wait until resource is available)
- When simplicity matters more than maximum performance
When to consider raw atomics:
- Single-producer, single-consumer patterns (like our queue)
- Simple counters and flags
- When you absolutely need maximum performance
- Real-time systems where blocking is unacceptable
For the vast majority of concurrent code, mutexes are probably a fine choice. They're easier to use correctly, and the performance difference often doesn't matter. Raw atomics are a specialised tool for the cases where it does matter, and you should reach for them only when you have a specific need and understand the trade-offs.
Real-world example: Zig's new Async I/O interface uses mutexes internally. The std.Io.Threaded implementation uses std.Thread.Mutex for synchronising access to its internal state, as do the IoUring and Kqueue implementations. Even high-performance systems reach for mutexes when the trade-offs make sense.
Ready to Build
We started with a single-threaded queue and watched it fall apart under concurrent access. The culprits: CPU caches, instruction reordering, and invisible writes that conspire against our sequential mental model. Atomic operations give us indivisibility, but we needed memory ordering, specifically Acquire/Release semantics, to control when changes become visible across threads.
The core insight: Release plants a checkpoint flag that publishes your writes. Acquire reads that flag and subscribes to see them. Together, they let threads coordinate safely through shared memory, without locks.
Whatβs Next
Remember our circular buffer from the previous article? It's about to become the backbone of a lock-free queue.
We've learned the theory: race conditions, CPU caches, reordering, atomics, Acquire/Release. In the next article, we'll put it all together and build something you can actually use in production: a complete single-producer, single-consumer lock-free queue.
You'll see exactly where each memory ordering goes and why. We'll walk through every line of code, explaining the reasoning behind each .acquire and .release. By the end, you'll have a queue that can pass millions of messages per second between threads without a single mutex.
If the Acquire/Release pattern doesn't feel natural yet, that's completely normal. You'll internalise it by seeing it applied line by line in real code. For now, the checkpoint analogy (Release plants a flag, Acquire reads it) should be enough to follow along.
Subscribe below to be notified when the next article is published. And if you want to go deeper into memory ordering theory, I recommend Jeff Preshing's excellent series on Acquire and Release semantics; it's the clearest explanation I've found anywhere.
See you in the next one, where we build something real.