Building a Lock-Free Single Producer, Single Consumer Queue (FIFO)
This is Part 3 of a 4-part series on lock-free queue data structures:
- Building a Circular Buffer Queue
- Understanding Race Conditions and Atomics
- Building a Lock-Free SPSC Queue (this article)
- Eliminating False Sharing: Cache-Aware Lock-Free Programming (coming soon)
Imagine your web application processes user uploads in the background. Each request adds a job to a queue, and worker threads pick the jobs up and process them. Simple enough. But under heavy load, something strange happens. Workers spend more time waiting for the mutex lock than actually working. Your CPU cores sit idle whilst threads queue up for that single lock. The very thing meant to improve performance has become the bottleneck.
What if threads could coordinate without ever blocking? What if adding a job and grabbing a job could happen simultaneously, on different cores, without locks, without kernel calls, without waiting?
This article will show you how to build a lock-free Queue from scratch.
In Article 2, we discovered why sharing memory between threads is dangerous and how Acquire/Release semantics solve the visibility problem. We watched a perfectly good circular buffer break in subtle ways when two threads touched it simultaneously. We understood the checkpoint analogy—Release plants a flag, Acquire reads it.
Now it’s time to build something real. We’ve got the conceptual toolkit, so let’s use it.
Table of Contents
- Table of Contents
- What We’re Building
- When Should You Use an SPSC Queue?
- Understanding the “Try” Pattern
- How It Works: The Bird’s Eye View
- The Implementation: Breaking Down the Struct
- Initialisation: Enforcing Invariants
- The tryPush Operation: Producer’s Perspective
- The tryPop Operation: Consumer’s Perspective
- The Approximate Methods: Monitoring Without Guarantees
- Testing for Correctness
- A Complete Working Example
- Adapting to Other Languages
- Key Takeaways
- What’s Next
What We’re Building
We’re going to build a lock-free queue where one thread (the producer) pushes items and another thread (the consumer) pops them. No mutexes, no kernel calls, no blocking. This pattern is called Single Producer Single Consumer, or SPSC for short.
We will use the same circular buffer pattern from Article 1—the same power-of-two capacity, the same mask trick for fast indexing. But we will add atomic operations with careful memory ordering to make it thread-safe. The producer and consumer will coordinate through two atomic counters, using the Acquire/Release semantics we learnt in Article 2.
What you’ll learn in this article:
- How to make a circular buffer thread-safe using atomics
- Where to use .acquire and .release memory ordering (and why)
- How unbounded counters simplify full/empty detection
- Why cache line alignment matters (a peek into the next article)
- How to test lock-free code for correctness
By the end, you’ll have a complete, working queue you can use in production.
When Should You Use an SPSC Queue?
Before diving into implementation, let’s be clear about when this pattern makes sense. Lock-free programming isn’t always faster than mutexes. It’s a specialised tool for specific scenarios: when you need predictable latency without worst-case blocking, when you’re dealing with high-contention workloads, or when you absolutely cannot afford kernel involvement.
Some common use cases for SPSC queues include:
- High-throughput API servers where request handlers need to offload CPU-intensive work to background threads without blocking. Your handler pushes the task to a queue and immediately returns a response whilst a worker thread processes it. No waiting, no lock contention.
- WebSocket message broadcasting where you need to fan out messages to thousands of connected clients. The main thread receives a message and pushes it to queues that worker threads consume to send to clients. Lock-free queues keep the main thread’s latency predictable even under heavy load.
- Game engines passing render commands between the game logic thread and the render thread. The game thread produces draw calls at 60fps or higher, and the render thread consumes them. Any blocking here means dropped frames that players will notice.
Lock-free programming breaks down or becomes slow when used incorrectly. For simpler scenarios with occasional or rare contention, mutexes are often fast and definitely easier to get right. It comes down to choosing the right tool for the problem: don’t reach for lock-free programming just because it sounds impressive.
If you need multiple producers or multiple consumers, an SPSC queue won’t work. You’ll need either a more complex lock-free structure (like an MPMC queue) or simply use mutexes. Don’t force the SPSC pattern where it doesn’t fit.
Comparing SPSC Queues to Mutex-Based Alternatives
Here’s what you’re trading when you choose lock-free over locks:
| Feature | Mutex-Based Queue | SPSC Queue |
|---|---|---|
| Kernel calls | Yes, on contention | Never |
| Predictable latency | No (depends on lock holder) | Yes (bounded operation time) |
| Multiple producers | Yes | No (by design) |
| Multiple consumers | Yes | No (by design) |
| Implementation complexity | Simple | Moderate (requires careful ordering) |
| Use case | General purpose | High-performance producer-consumer patterns |
Now that you know when to use this pattern, let’s understand how it works before we write any code.
Understanding the “Try” Pattern
Before we look at any code, there’s an important design decision to understand. Our queue won’t block when it’s full or empty. Instead, our methods return immediately with a success or failure indicator.
This is a common pattern in lock-free programming. When the queue is full, the producer can’t wait around for space to open up; that would require blocking, which defeats the whole point of being lock-free. Similarly, when the queue is empty, the consumer can’t wait for data.
So we use methods with “try” in their names:
- tryPush() attempts to add (enqueue) an item. It returns true on success, false if the queue is full.
- tryPop() attempts to remove (dequeue) an item. It returns the item if available, null if the queue is empty.
The caller decides how to handle these cases. Maybe retry immediately, maybe drop the item and log an error, maybe switch to a different queue. The queue doesn’t make that decision for you.
Now let’s see what using this queue looks like:
const std = @import("std");
// Create a queue with a buffer you provide
var buffer: [1024]Message = undefined;
var queue = SpscQueue(Message).init(&buffer);
// Producer thread
if (queue.tryPush(message)) {
// Success - message is now in the queue
} else {
// Queue is full - handle backpressure
}
// Consumer thread
if (queue.tryPop()) |message| {
// Got a message - process it
} else {
// Queue is empty - nothing to do
}
Notice the pattern? The producer checks the return value to know if the push succeeded. The consumer uses Zig’s optional unwrapping syntax—if (queue.tryPop()) |message|—which both checks for null and unwraps the value in one go.
This design is based on Dmitry Vyukov’s unbounded SPSC queue, a well-established pattern in lock-free programming. We’re implementing the same ideas in Zig.
How It Works: The Bird’s Eye View
Before we dive into the struct definition and implementation details, let’s understand the core design at a high level. This mental model will make everything that follows much easier to grasp.
The Three Core Components
Our queue has three main components:
First, a buffer. This is just an array that holds the actual items. The caller provides this buffer when creating the queue, which means we never allocate memory internally. In embedded systems or real-time code, this is crucial because you can use stack memory or a pre-allocated pool without involving a heap allocator.
Second, two atomic counters. These are the heart of our synchronisation:
- producer_pos tracks where the producer writes next
- consumer_pos tracks where the consumer reads next
Both counters are atomic values with specific memory ordering, which we’ll examine in detail later.
Third, a mask for fast indexing. This is the power-of-two optimisation from Article 1. When the buffer capacity is a power of two, we can use bitwise AND instead of expensive modulo operations to convert positions into array indices.
How Threads Coordinate
Here’s how the threads coordinate. The producer writes data to the buffer, then updates producer_pos with Release ordering to publish the write. The consumer loads producer_pos with Acquire ordering to see that write, then reads the data. The symmetry is beautiful—both threads follow the same pattern in opposite directions.
Why Unbounded Counters Simplify Everything
One design choice that might seem odd at first: the positions grow unbounded. They start at zero and just keep incrementing—1, 2, 3, 4… forever, never wrapping back to zero. This actually simplifies our logic:
- Empty queue: consumer_pos == producer_pos (the consumer has caught up)
- Full queue: producer_pos - consumer_pos >= capacity (we’re one full lap ahead)
We only apply the mask when we need to access the actual buffer slot. The positions themselves just keep counting up.
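To make this concrete, here is a small illustration with made-up numbers (not part of the queue code itself) showing how the unbounded positions, the size calculation, and the mask relate:
// Hypothetical values, purely for illustration.
const capacity: usize = 8;
const mask: usize = capacity - 1; // 7
const producer_pos: usize = 21; // 21 items pushed so far
const consumer_pos: usize = 17; // 17 items popped so far
const size = producer_pos - consumer_pos; // 4 items currently queued
const is_empty = producer_pos == consumer_pos; // false
const is_full = size >= capacity; // false, 4 < 8
const next_write_slot = producer_pos & mask; // 21 & 7 = 5
const next_read_slot = consumer_pos & mask; // 17 & 7 = 1
The positions tell us how many items are in flight; the mask tells us which physical slot to touch.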
Here’s a visual representation of the structure that enables this design:
┌─────────────────────────────────────────────────────────┐
│ SpscQueue struct │
├─────────────────────────────────────────────────────────┤
│ buffer: []T → points to external memory │
│ capacity: usize → e.g., 1024 │
│ mask: usize → e.g., 1023 │
├─────────────────────────────────────────────────────────┤
│ ▼ cache line aligned │
│ consumer_pos: atomic(usize) → e.g., 5,000,042 │
├─────────────────────────────────────────────────────────┤
│ ▼ cache line aligned │
│ producer_pos: atomic(usize) → e.g., 5,000,050 │
└─────────────────────────────────────────────────────────┘
│
▼
┌───┬───┬───┬───┬───┬───┬───┬────┐
│ 0 │ 1 │ 2 │...│...│...│...│1023│ buffer (external)
└───┴───┴───┴───┴───┴───┴───┴────┘
Now that you understand the overall design, let’s see it in code.
The Implementation: Breaking Down the Struct
Here’s the queue structure with the fields. Every field exists for a specific reason, which we’ll examine one after the other.
pub fn SpscQueue(comptime T: type) type {
return struct {
const Self = @This();
const cache_line = std.atomic.cache_line;
buffer: []T,
capacity: usize, // Must be power of 2
mask: usize, // capacity - 1, for fast modulo
// Consumer position (only modified by consumer)
// Aligned to cache line to prevent false sharing
consumer_pos: std.atomic.Value(usize) align(cache_line),
// Producer position (only modified by producer)
// Aligned to cache line to prevent false sharing
producer_pos: std.atomic.Value(usize) align(cache_line),
// ... methods follow
};
}
Notice the @This() builtin. It returns the innermost struct (or enum or union) type it’s called inside, which is how a generic function like SpscQueue can give a name, Self, to the anonymous struct type it’s in the middle of defining.
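If you haven’t seen this pattern before, here’s a tiny standalone sketch (the Box type is just an invented example, not part of the queue) of @This() naming the anonymous struct returned by a generic function:
fn Box(comptime T: type) type {
    return struct {
        const Self = @This(); // refers to the anonymous struct being defined
        value: T,
        pub fn init(value: T) Self {
            return Self{ .value = value };
        }
    };
}
// Usage: Box(u32).init(42) creates a Box holding a u32.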
Now let’s walk through each field and understand why they’re there.
The Buffer and Capacity Fields
buffer: []T stores the queue items. We take a slice rather than allocating internally, which gives the caller complete control over memory lifetime and allocation strategy: stack memory, a static array, or a pre-allocated pool all work without involving a heap allocator. The queue itself never touches an allocator.
capacity: usize tells us how many items the buffer can hold. This must be a power of two—2, 4, 8, 16, 32, and so on. We’ll enforce this constraint in the init function. This requirement isn’t arbitrary; it enables a significant performance optimisation that we’ll see next.
The Mask Field: Fast Modulo Through Bitwise AND
mask: usize is always set to capacity - 1. This unlocks a clever optimisation. When capacity is a power of two, bitwise AND with the mask is mathematically equivalent to modulo, but significantly faster:
// Slow: division operation (tens of cycles)
const index = position % capacity;
// Fast: bitwise AND (single cycle)
const index = position & mask;
// Why it works:
// If capacity = 8 (binary: 1000), mask = 7 (binary: 0111)
// position = 11 (binary: 1011)
// 11 & 7 = 1011 & 0111 = 0011 = 3
// 11 % 8 = 3 ✓
Division is one of the slowest operations a CPU can perform. Modern compilers might optimise a modulo by a constant power-of-two into a mask automatically, but doing it explicitly guarantees the optimisation and makes our intent crystal clear. In a hot loop processing millions of messages, replacing division with bitwise AND makes a measurable difference.
The Synchronisation Fields
Now we reach the heart of the thread coordination.
consumer_pos and producer_pos are where the lock-free magic happens. These are the crucial details to know about these fields:
They’re std.atomic.Value(usize), not plain usize. This wrapper tells the compiler to use atomic instructions, which prevents torn reads (where you see half of an old value and half of a new value) and gives us access to memory ordering controls. Without this wrapper, the compiler might optimise our reads and writes in ways that break thread safety.
They’re aligned to cache lines. You might have spotted the align(cache_line) attribute. We’ll explain this properly in the next article (Article 4), but for now, know that it prevents a performance problem called “false sharing”—where two threads writing to different variables accidentally slow each other down because those variables share a cache line. This single attribute can make the difference between a queue that processes 50 million items per second and one that processes 5 million.
They grow unbounded. Unlike Article 1’s queue where we wrapped positions at capacity boundaries, here we let them grow forever. This is a key insight that dramatically simplifies our full and empty detection:
- Empty: consumer_pos == producer_pos (the consumer has caught up to the producer)
- Full: producer_pos - consumer_pos >= capacity (the producer is a full lap ahead)
We only apply the mask when accessing the actual buffer slot. The positions themselves just keep incrementing.
A Note on Integer Overflow
Because the position counters keep incrementing, they will eventually reach the maximum value of their type (usize). In practice this is a non-issue on 64-bit systems: at one billion items per second, a 64-bit counter takes over five hundred years to overflow. If an overflow does occur, Zig will panic (crash) in Debug and ReleaseSafe modes. If you would rather let the counters wrap around silently (as some lock-free queue designs do), use Zig’s wrapping addition and subtraction operators, +% and -%. Choose the behaviour that fits your use case: explicit panics for safety during development, or silent wrapping for long-lived production queues. A sketch of the wrapping variant follows.
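Here’s a minimal sketch, assuming you opt for silent wrapping, of how the producer-side arithmetic in tryPush (shown in full later in this article) could be rewritten. Modular subtraction still yields the correct distance between the two counters even after they wrap:
// Hypothetical wrapping variant of the first steps of tryPush.
// +% and -% wrap on overflow instead of panicking.
const producer_pos = self.producer_pos.load(.unordered);
const next_pos = producer_pos +% 1;
const consumer_pos = self.consumer_pos.load(.acquire);
if (next_pos -% consumer_pos > self.capacity) {
    return false; // Queue is full
}
The rest of the function, including the Release store of next_pos, stays exactly the same.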
Initialisation: Enforcing Invariants
We need an init function that’ll be used to initialise the queue. This function takes a buffer slice and sets up the internal fields. Crucially, it enforces our invariant that the buffer length must be a power of two.
Here’s the complete init function:
pub fn init(buffer: []T) Self {
// Find smallest power of 2 >= buffer.len
const capacity = std.math.ceilPowerOfTwo(usize, buffer.len) catch unreachable;
// Assert that buffer.len is exactly a power of 2
std.debug.assert(capacity == buffer.len);
return Self{
.buffer = buffer,
.capacity = capacity,
.mask = capacity - 1,
.consumer_pos = std.atomic.Value(usize).init(0),
.producer_pos = std.atomic.Value(usize).init(0),
};
}
Enforcing the Power-of-Two Constraint
We use ceilPowerOfTwo() to find the smallest power of two greater than or equal to the buffer length, then assert it equals the buffer length exactly. If you pass a buffer of size 5, the assertion will fail—you must use sizes like 4, 8, 16, 32, and so on.
You might wonder why we don’t silently round down and only use part of the buffer. We could, but that would waste memory and confuse users about the actual capacity. It’s better to fail fast with a clear error during development. The requirement is simple and easy to satisfy: always use power-of-two buffer sizes.
No Internal Allocation
Notice that init doesn’t allocate anything. It just takes ownership of the buffer slice you provide. This is a deliberate design choice that gives you complete flexibility.
Here’s how you might create queues with different allocation strategies:
// Stack allocation
var buffer: [256]MyType = undefined;
var queue = SpscQueue(MyType).init(&buffer);
// Heap allocation (dynamic size, decided at runtime)
const buffer = try allocator.alloc(MyType, 1024);
defer allocator.free(buffer);
var queue = SpscQueue(MyType).init(buffer);
The queue doesn’t care—it just uses whatever buffer you give it. I believe this pattern is clear and flexible, especially in systems programming where control over memory is paramount.
The tryPush Operation: Producer’s Perspective
The enqueue operation is the core of the producer side. Let’s see the complete function first, then break down every line:
/// Try to push an item to the queue
/// Returns true if successful, false if queue is full
pub fn tryPush(self: *Self, item: T) bool {
// 1. Load our own position - we're the only writer, so unordered is fine
const producer_pos = self.producer_pos.load(.unordered);
const next_pos = producer_pos + 1;
// 2. Load consumer's position - need Acquire to synchronise
const consumer_pos = self.consumer_pos.load(.acquire);
// 3. Check if we would overflow the buffer
if (next_pos - consumer_pos > self.capacity) {
return false; // Queue is full
}
// 4. Write the item to the buffer (not atomic!)
self.buffer[producer_pos & self.mask] = item;
// 5. Publish our write with Release ordering
self.producer_pos.store(next_pos, .release);
return true;
}
This function follows a carefully orchestrated dance 🤓. Each step has a purpose, and the order matters. Let’s examine them one after the other.
Step 1: Reading Our Own Position
const producer_pos = self.producer_pos.load(.unordered);
We start by loading our current position. Notice the .unordered memory ordering (called “relaxed” in some other languages). Why is this safe?
Because we’re the only thread that ever writes to producer_pos. Reading our own position doesn’t need synchronisation—we just want the current value. The .unordered ordering is the cheapest option: it guarantees atomicity (we’ll read the whole value, not a torn read where we see bits from different writes) but nothing else. No memory barriers, no synchronisation overhead.
This is a key optimisation. We only pay for synchronisation when we actually need it, which is when reading the other thread’s position.
Step 2: Calculating the Next Position
const next_pos = producer_pos + 1;
Simple arithmetic. If we successfully push, this will be our new position. Notice we’re not applying the mask yet—we’re working with the unbounded position counter.
Step 3: Checking the Consumer’s Position
const consumer_pos = self.consumer_pos.load(.acquire);
This is where cross-thread synchronisation happens. The consumer updates consumer_pos with .release ordering after it finishes reading an item. Our .acquire load synchronises with that release, ensuring we see any updates the consumer has made.
Remember the checkpoint analogy from Article 2? The consumer plants a flag when it finishes with a slot. Our Acquire load reads that flag, guaranteeing we see the consumer’s completed work.
Step 4: Checking If the Queue Is Full
if (next_pos - consumer_pos > self.capacity) {
return false;
}
This is our full detection logic. If adding one more item would make producer_pos - consumer_pos exceed capacity, the queue is full and we can’t proceed.
Why does plain subtraction work with unbounded positions? Because the producer is always at or ahead of the consumer, so the difference between the two counters is exactly the number of items currently in the queue. If producer_pos = 1005 and consumer_pos = 1001, there are 4 items in the queue. If we’re full, we return false immediately. The caller can decide what to do: retry, drop the message, switch queues, whatever makes sense for their application.
Step 5: Writing the Item to the Buffer
self.buffer[producer_pos & self.mask] = item;
This is a normal memory write, not an atomic operation. Why is this safe?
Because only the positions need to be atomic. The data itself is protected by the protocol: we only write to slots we know are empty (we just checked above), and the consumer only reads slots it knows are filled (it will check producer_pos). No two threads ever access the same buffer slot simultaneously.
Notice the mask in action: & self.mask converts our unbounded position to an actual buffer index. If the capacity is 1024 (mask is 1023) and producer_pos is 5000, we write to slot 5000 & 1023 = 904. The circular buffer wraps around automatically.
Step 6: Publishing Our Write
self.producer_pos.store(next_pos, .release);
This is the critical publication step. The .release ordering guarantees that all our previous writes (specifically, the buffer write we just did) are visible to any thread that subsequently loads this value with .acquire.
This is where we are planting the checkpoint flag. We’re saying “I’ve finished writing item at position 5000, and that write is now published.” When the consumer does an Acquire load of producer_pos and sees 5001, it’s guaranteed to see our buffer write.
Here’s what the synchronisation looks like:
Producer Consumer (later)
──────── ────────
buffer[pos] = item ─┐
│ happens-before relationship
producer_pos.store ────►│
(.release) │
└────────────► producer_pos.load
(.acquire)
│
▼
item = buffer[pos]
"I see the new data"
The tryPop Operation: Consumer’s Perspective
The consumer side mirrors the producer beautifully. It follows the same pattern in reverse. Here’s the complete function:
/// Try to pop an item from the queue
/// Returns null if queue is empty
pub fn tryPop(self: *Self) ?T {
// 1. Load our own position - we're the only writer to consumer_pos
const consumer_pos = self.consumer_pos.load(.unordered);
// 2. Load producer's position - need Acquire to see their writes
const producer_pos = self.producer_pos.load(.acquire);
// 3. Check if queue is empty
if (consumer_pos == producer_pos) {
return null;
}
// 4. Read the item from the buffer
const item = self.buffer[consumer_pos & self.mask];
// 5. Publish that we've consumed this slot
self.consumer_pos.store(consumer_pos + 1, .release);
return item;
}
The symmetry is striking. Let’s walk through it.
Step 1: Reading Our Own Position
const consumer_pos = self.consumer_pos.load(.unordered);
Same pattern as the producer. Reading our own position with .unordered because we’re the only writer to this variable. No synchronisation needed, no performance overhead.
Step 2: Checking the Producer’s Position
const producer_pos = self.producer_pos.load(.acquire);
This .acquire load synchronises with the producer’s .release store. After this load, we’re guaranteed to see the buffer data that the producer wrote before updating its position.
This is the consumer reading the checkpoint flag that the producer planted. If we see producer_pos = 5001, we know position 5000 has been written and is safe to read.
Step 3: Detecting an Empty Queue
if (consumer_pos == producer_pos) {
return null;
}
If we’ve caught up to the producer, there’s nothing to read. We return null and let the caller decide what to do. Maybe spin and retry immediately, maybe sleep, maybe switch to processing other work.
Step 4: Reading the Item
const item = self.buffer[consumer_pos & self.mask];
A normal memory read, not atomic. The .acquire load above guaranteed that this data is visible to us. We apply the mask to convert the unbounded position to a buffer index, just like the producer did when writing.
Step 5: Publishing Our Consumption
self.consumer_pos.store(consumer_pos + 1, .release);
This tells the producer “I’ve finished with this slot.” When the producer does its .acquire load of consumer_pos, it will know this slot is safe to reuse.
The Release ordering ensures that our buffer read completes before we update the position, and it matters more than it might appear. If the store could be reordered ahead of the read, the producer might see the advanced position, treat the slot as free, and overwrite it whilst we’re still reading it. Release prevents that, and it also keeps the protocol symmetric and easy to reason about.
The Complete Synchronisation Picture
Here’s how both threads coordinate through the full cycle:
Producer Consumer
──────── ────────
consumer_pos.load (.unordered)
consumer_pos.load (.acquire) ◄─────── consumer_pos.store (.release)
│ ▲
│ "I see consumer is done with slot X" │
│ │
▼ │
buffer[X] = new_item │
producer_pos.store (.release) ─────► producer_pos.load (.acquire)
│
▼
item = buffer[X]
"I see the new data"
The elegance lies in the symmetry. Both threads follow the same pattern:
- Read own position (unordered)
- Read other thread’s position (acquire)
- Do work (buffer read or write)
- Update own position (release)
This emerges naturally from the Acquire/Release protocol: every time you update something another thread reads, use Release. Every time you read something another thread updates, use Acquire.
Now let’s look at some additional utility methods.
The Approximate Methods: Monitoring Without Guarantees
Sometimes you want to know the queue’s size for monitoring or debugging. Maybe you’re displaying statistics on a dashboard, or logging queue depth for capacity planning. We provide three methods for this, but they come with an important caveat.
/// Returns an approximate count of items currently in the queue.
/// WARNING: This is a point-in-time snapshot and may be stale immediately.
/// Do NOT use for synchronisation or control flow decisions.
pub fn sizeApprox(self: *const Self) usize {
const producer_pos = self.producer_pos.load(.acquire);
const consumer_pos = self.consumer_pos.load(.acquire);
return producer_pos - consumer_pos;
}
/// Returns whether the queue appears to be empty.
/// Use `tryPop() == null` for reliable empty detection in the consumer.
pub fn isEmptyApprox(self: *const Self) bool {
const producer_pos = self.producer_pos.load(.acquire);
const consumer_pos = self.consumer_pos.load(.acquire);
return producer_pos == consumer_pos;
}
/// Returns whether the queue appears to be full.
/// Use `tryPush() == false` for reliable full detection in the producer.
pub fn isFullApprox(self: *const Self) bool {
const producer_pos = self.producer_pos.load(.acquire);
const consumer_pos = self.consumer_pos.load(.acquire);
return (producer_pos + 1) - consumer_pos > self.capacity;
}
These methods are useful, but only in specific contexts.
Good uses:
- Metrics and monitoring dashboards showing “queue depth: ~50 items”
- Debug logging to help understand system behaviour
- Rough capacity planning decisions
- Alerting when queue approaches capacity (with appropriate hysteresis)
Bad uses:
- Control flow decisions in your application logic
- Synchronisation between threads
- Assuming the queue state after checking
Think of these approximate methods like checking a stock price. By the time you see “AAPL: $150,” the price has already moved. You can use that information to understand trends and make decisions, but you can’t rely on that exact price being available when you place your order.
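As a quick example of a legitimate use, a monitoring thread might periodically log the approximate depth. This helper is just a sketch, not part of the queue:
// Periodic metrics sampling. The value can be stale by the time it's
// printed, which is perfectly fine for a dashboard or a log line.
fn reportQueueDepth(queue: *const SpscQueue(u32)) void {
    std.log.info("queue depth: ~{d} items", .{queue.sizeApprox()});
}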
It’s wrong to use them to decide when to enqueue or dequeue items. For example:
// WRONG: Race condition!
if (!queue.isEmptyApprox()) {
const item = queue.tryPop().?; // Might still be null! Crash!
}
// WRONG: Race condition!
if (!queue.isFullApprox()) {
_ = queue.tryPush(item); // Might still fail!
}
The issue is that checking the state and then acting on that check are two separate operations. Between the check and the action, the state can change. Here’s a timeline showing the problem:
Observer thread Producer Consumer
─────────────── ──────── ────────
sizeApprox():
read producer_pos = 100
push item (pos → 101)
pop item (pos → 99)
read consumer_pos = 99
return 100 - 99 = 1
But the actual size was never 1!
- Before producer pushed: 100 - 98 = 2
- After consumer popped: 101 - 99 = 2
The returned value is a fiction.
The queue’s state is constantly changing. These methods give you a glimpse of what the state was at some recent moment, but that moment has already passed. You’re looking at history, not the present.
The only reliable way to know if an operation succeeded is to attempt it and check the result.
Here’s how to do it correctly:
// RIGHT: Let tryPop tell you
if (queue.tryPop()) |item| {
// Definitely have an item
}
// RIGHT: Let tryPush tell you
if (!queue.tryPush(item)) {
// Handle backpressure appropriately
}
Testing for Correctness
Let’s verify our implementation works. We’ll start with single-threaded tests to check basic behaviour, then move to concurrent access to ensure our synchronisation is correct.
Basic Operations Test
test "SpscQueue basic operations" {
const testing = std.testing;
var buffer: [8]u32 = undefined;
var queue = SpscQueue(u32).init(&buffer);
// Initially empty
try testing.expect(queue.isEmptyApprox());
try testing.expect(queue.sizeApprox() == 0);
try testing.expect(queue.tryPop() == null);
// Push items
try testing.expect(queue.tryPush(1));
try testing.expect(queue.tryPush(2));
try testing.expect(queue.tryPush(3));
try testing.expect(!queue.isEmptyApprox());
try testing.expect(queue.sizeApprox() == 3);
// Pop items - verify FIFO order
try testing.expect(queue.tryPop().? == 1);
try testing.expect(queue.tryPop().? == 2);
try testing.expect(queue.tryPop().? == 3);
try testing.expect(queue.tryPop() == null);
try testing.expect(queue.isEmptyApprox());
}
This test verifies the fundamentals. The queue starts empty, items go in and come out in FIFO order, and the queue correctly reports its state. Nothing fancy, just making sure the basics work.
Wraparound Test
This verifies the circular buffer behaviour works correctly when positions wrap around in the underlying buffer:
test "SpscQueue wraparound" {
const testing = std.testing;
var buffer: [4]u32 = undefined;
var queue = SpscQueue(u32).init(&buffer);
// Fill the queue completely
try testing.expect(queue.tryPush(10));
try testing.expect(queue.tryPush(20));
try testing.expect(queue.tryPush(30));
try testing.expect(queue.tryPush(40));
try testing.expect(!queue.tryPush(50)); // Should be full
// Pop some items
try testing.expect(queue.tryPop().? == 10);
try testing.expect(queue.tryPop().? == 20);
// Push more (should wrap around in the underlying buffer)
try testing.expect(queue.tryPush(50));
try testing.expect(queue.tryPush(60));
try testing.expect(!queue.tryPush(70)); // Full again
// Verify remaining items in correct order
try testing.expect(queue.tryPop().? == 30);
try testing.expect(queue.tryPop().? == 40);
try testing.expect(queue.tryPop().? == 50);
try testing.expect(queue.tryPop().? == 60);
try testing.expect(queue.tryPop() == null);
}
This test exercises the mask logic. When we pop items 10 and 20, slots 0 and 1 become free. When we push 50 and 60, they reuse those slots. The positions keep growing (4, 5, 6…), but the mask converts them to buffer indices (0, 1, 2…). This is the circular buffer in action.
The Concurrent Test: Finding Race Conditions
This is the real validation. We spawn two threads and pass thousands of items between them. If our synchronisation is wrong, this test will catch it (eventually).
test "SpscQueue concurrent access" {
const testing = std.testing;
var buffer: [1024]u32 = undefined;
var queue = SpscQueue(u32).init(&buffer);
const num_items = 1000;
// Producer thread function
const producer = struct {
fn run(q: *SpscQueue(u32), n: usize) void {
for (0..n) |i| {
// Keep trying until we successfully push the item
while (!q.tryPush(@intCast(i))) {
// Spin until we can push - hint to CPU we're waiting
std.atomic.spinLoopHint();
}
}
}
}.run;
// Consumer thread function
const consumer = struct {
fn run(q: *SpscQueue(u32), n: usize) void {
var received: usize = 0;
while (received < n) {
if (q.tryPop()) |item| {
// Verify item is within expected range
std.debug.assert(item < n);
received += 1;
} else {
// Queue empty - hint to CPU we're spinning
std.atomic.spinLoopHint();
}
}
}
}.run;
const producer_thread = try std.Thread.spawn(.{}, producer, .{ &queue, num_items });
const consumer_thread = try std.Thread.spawn(.{}, consumer, .{ &queue, num_items });
producer_thread.join();
consumer_thread.join();
try testing.expect(queue.isEmptyApprox());
}
Notice the spinLoopHint() calls. This tells the CPU “I’m in a spin loop waiting for something.” The CPU can use this hint to reduce power consumption or yield resources to hyperthreads. It’s a polite way to busy-wait.
In a real production system, you probably wouldn’t spin like this. You might yield the thread to the scheduler, use a condition variable, or batch work to reduce contention. But for testing, spinning is fine.
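If you do need to wait in production code, a common approach is to escalate gradually: spin briefly, then yield to the scheduler, then sleep. The sketch below is one way to do that; the thresholds and the sleep duration are arbitrary, and the popBlocking helper is not part of the queue itself:
// A blocking pop built on top of tryPop, with escalating backoff.
fn popBlocking(q: *SpscQueue(u32)) u32 {
    var attempts: usize = 0;
    while (true) {
        if (q.tryPop()) |item| return item;
        attempts += 1;
        if (attempts < 64) {
            std.atomic.spinLoopHint(); // stay hot, expect data very soon
        } else if (attempts < 128) {
            std.Thread.yield() catch {}; // give the scheduler a chance
        } else {
            std.time.sleep(100 * std.time.ns_per_us); // back off properly
        }
    }
}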
What this test validates:
- No lost items (consumer receives all 1,000)
- No data corruption (values are within expected range)
- No crashes from race conditions
- Queue ends up empty (all items consumed)
A Complete Working Example
Leaving tests aside, let’s see how to use the queue in a realistic scenario: a logging system where the main thread must never block on I/O.
In high-performance applications, logging is a common bottleneck. You want to record what’s happening, but writing to disk is slow—potentially taking milliseconds. If your main thread blocks on every log statement, your application grinds to a halt.
The solution is to decouple logging from main thread execution. The main thread pushes log messages to a lock-free SPSC queue, and a dedicated writer thread consumes those messages and writes them to disk or stdout asynchronously.
Here’s how we might implement this in Zig:
const std = @import("std");
const LogLevel = enum { debug, info, warn, err };
const LogMessage = struct {
level: LogLevel,
message: [256]u8,
len: usize,
timestamp: i64,
};
pub fn main() !void {
// Set up the queue
var buffer: [256]LogMessage = undefined;
var log_queue = SpscQueue(LogMessage).init(&buffer);
var shutdown = std.atomic.Value(bool).init(false);
// Spawn the log writer thread (consumer)
const writer_thread = try std.Thread.spawn(.{}, logWriter, .{ &log_queue, &shutdown });
// Main application (producer)
logMessage(&log_queue, .info, "Application started");
// Simulate work
for (0..100) |i| {
logMessage(&log_queue, .debug, "Processing item");
// Simulate some work
std.time.sleep(10 * std.time.ns_per_ms);
if (i % 25 == 0) {
logMessage(&log_queue, .info, "Progress checkpoint");
}
}
logMessage(&log_queue, .info, "Application finished");
// Signal shutdown and wait
shutdown.store(true, .release);
writer_thread.join();
}
fn logMessage(queue: *SpscQueue(LogMessage), level: LogLevel, msg: []const u8) void {
var entry: LogMessage = .{
.level = level,
.message = undefined,
.len = @min(msg.len, 256),
.timestamp = std.time.timestamp(),
};
@memcpy(entry.message[0..entry.len], msg[0..entry.len]);
// Non-blocking push - if queue is full, drop the message
// In production, you might want different backpressure strategies
_ = queue.tryPush(entry);
}
fn logWriter(queue: *SpscQueue(LogMessage), shutdown: *std.atomic.Value(bool)) void {
const stdout = std.io.getStdOut().writer();
while (true) {
if (queue.tryPop()) |entry| {
const level_str = switch (entry.level) {
.debug => "DEBUG",
.info => "INFO ",
.warn => "WARN ",
.err => "ERROR",
};
stdout.print("[{d}] {s}: {s}\n", .{
entry.timestamp,
level_str,
entry.message[0..entry.len],
}) catch {};
} else if (shutdown.load(.acquire)) {
// Queue empty and shutdown requested
break;
} else {
std.atomic.spinLoopHint();
}
}
}
The main thread never blocks on I/O. It just pushes to the queue and continues. If the I/O thread falls behind and the queue fills up, we drop messages rather than blocking. This keeps the main thread’s latency predictable.
The dedicated writer thread processes messages at whatever pace it can achieve. If writing to stdout is slow, messages queue up. If it’s fast, the queue stays mostly empty. The two threads coordinate through the queue without either one blocking the other.
This is exactly how high-performance logging libraries work. The “hot path” (your application code) stays fast, whilst the “slow path” (disk I/O) happens on a dedicated thread.
Handling Backpressure
You could extend this pattern to handle backpressure differently. Instead of dropping messages when the queue is full, you might:
- Write to a secondary overflow buffer
- Sample messages (keep every Nth one)
- Block the caller (trading latency for reliability)
The queue gives you the mechanism. The policy is up to you.
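For instance, the third option, blocking the caller, could be a thin wrapper around the logMessage function above. This is a sketch under the assumption that losing messages is worse than stalling the caller:
// Blocking variant: retry until the writer thread frees up space.
// Trades the main thread's latency for guaranteed delivery.
fn logMessageBlocking(queue: *SpscQueue(LogMessage), level: LogLevel, msg: []const u8) void {
    var entry: LogMessage = .{
        .level = level,
        .message = undefined,
        .len = @min(msg.len, 256),
        .timestamp = std.time.timestamp(),
    };
    @memcpy(entry.message[0..entry.len], msg[0..entry.len]);
    while (!queue.tryPush(entry)) {
        std.atomic.spinLoopHint(); // queue full: wait for the consumer
    }
}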
Adapting to Other Languages
The concepts we’ve covered translate directly to other systems languages. The syntax changes, but the fundamental pattern remains the same. Here’s a quick reference:
| Zig | Rust | C++ | Go |
|---|---|---|---|
| std.atomic.Value(T) | AtomicUsize | std::atomic<T> | atomic.Uint64 |
| .acquire | Ordering::Acquire | memory_order_acquire | N/A (Go supports only SeqCst) |
| .release | Ordering::Release | memory_order_release | N/A |
| .unordered | Ordering::Relaxed | memory_order_relaxed | N/A |
Some languages, JavaScript for example, don’t give you fine-grained control over memory ordering at all.
Go’s sync/atomic package only provides sequential consistency, not fine-grained ordering like Acquire/Release. For most Go code, I think using channels is good enough—they’re the idiomatic solution and handle the synchronisation for you.
The key insight is that these memory ordering concepts exist at the hardware level. They’re not language features—they’re CPU features that languages expose. Understanding them in one language translates directly to others.
With that, we’ve built a complete lock-free SPSC queue from the ground up. Let’s summarise the key takeaways.
Key Takeaways
If you remember nothing else from this article, remember these principles:
Unbounded positions with masking simplify full/empty detection. Let counters grow forever, apply the mask only when accessing the buffer. This makes the logic cleaner and easier to reason about.
Acquire/Release is a contract. Release publishes your writes, Acquire subscribes to see them. Every cross-thread update needs this pair. When you update something another thread reads, use Release. When you read something another thread updates, use Acquire.
Only positions need atomics. The data itself is protected by the protocol—we only write to empty slots, consumers only read from filled slots. No two threads ever touch the same buffer slot simultaneously.
Cache line alignment matters. That align(cache_line) attribute prevents false sharing, which can destroy performance.
Try-based APIs give control to callers. Methods return success or failure immediately. The caller decides how to handle backpressure—retry, drop, switch strategies. The queue doesn’t make those decisions.
What’s Next
You saw the align(cache_line) attribute on our position fields. We mentioned it briefly but didn’t explain it properly. That wasn’t an oversight; it’s a topic that deserves its own article.
Sometimes lock-free code ends up slower than mutex-based alternatives, which seems to defeat the purpose. If that happens to you, the culprit is likely something called false sharing.
In the next article, we’ll explore cache coherence, false sharing, and why spatial locality matters in concurrent code. Want to be notified when the next article is published? Subscribe using the form below. And if you want to implement this in your preferred language, the concepts transfer directly—only the syntax changes.
One more thing, if you found this article useful, I’m sure you will value my workshop on mastering real-time and pub/sub messaging systems. In the workshop we will build a pub/sub messaging system from scratch. Sign up at protocol-zero.pmbanugo.me to secure your spot!