The Tokio/Rayon Trap and Why Async/Await Fails Concurrency
Over the last decade, async/await won the concurrency wars because it is exceptionally easy. It allows developers to write asynchronous code that looks virtually identical to synchronous code.
But beneath that familiar syntax lies massive structural complexity. It hides control flow, obscures hardware realities, and ultimately pushes the burden of scheduling back onto the developer.
Rich Hickey articulated this perfectly in his talk Simple Made Easy: “Easy” is what is familiar and close at hand, while “Simple” is what is structurally untangled [1]. async/await is easy to write, but it is fiercely complex to operate.
Rob Pike talked about this architectural shift during his 2023 GopherConAU address:
Compared to goroutines, channels and select, async/await is easier and smaller for language implementers to build… But it pushes some of the complexity back on the programmer, often resulting in what Bob Nystrom has called ‘colored functions’. […] It’s important, though, whatever concurrency model you do provide, you do it exactly once, because an environment providing multiple concurrency implementations can be problematic. [2]
Pike’s remark about “multiple concurrency implementations” and async/await is exactly what is failing in production today.
The Production Trap: Confusing Asynchrony with Concurrency
The fundamental trap of async/await is that it conflates asynchrony (yielding while waiting for I/O) with concurrency (doing multiple things at once).
The syntax is a trap because it disguises interleaved state machines as isolated, sequential threads. Lulled by this illusion, a developer writes an async function exactly as they would blocking code — fetching a database record over the network, then immediately crunching the data. But what happens when that data crunching involves parsing a 10MB JSON payload, traversing a massive collection, or executing a compute-heavy cryptographic proof?
The cooperative executor halts.
In a cooperative runtime like Rust’s Tokio or Node.js, the thread does not yield until it hits an await point. A 50-millisecond CPU-bound task in a function stalls the entire execution thread. Suddenly, thousands of unrelated network requests spike in latency and the system becomes unresponsive. Meanwhile the hardware is barely utilised.
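The stall is easy to reproduce with a toy scheduler. The sketch below is plain std Rust, not Tokio, and every name in it is illustrative: a single thread runs tasks in a run-to-completion loop, roughly the way one executor thread behaves between await points. One 50-millisecond compute slice delays every task queued behind it.

```rust
use std::time::{Duration, Instant};

// A toy cooperative task: one poll() runs a slice to completion, then yields.
struct Task {
    name: &'static str,
    work: Duration, // CPU time burned inside one slice, with no await point
}

// Single-threaded, run-to-completion loop. Returns (task name, finish time).
fn run_round(tasks: &[Task]) -> Vec<(&'static str, Duration)> {
    let start = Instant::now();
    let mut log = Vec::new();
    for t in tasks {
        let slice_start = Instant::now();
        while slice_start.elapsed() < t.work {} // busy-loop: nothing preempts this
        log.push((t.name, start.elapsed()));
    }
    log
}

fn main() {
    let tasks = vec![
        // One CPU-bound slice scheduled ahead of two cheap "network" tasks.
        Task { name: "cpu-bound", work: Duration::from_millis(50) },
        Task { name: "request-1", work: Duration::from_micros(10) },
        Task { name: "request-2", work: Duration::from_micros(10) },
    ];
    for (name, finished_at) in run_round(&tasks) {
        // request-1 and request-2 finish no earlier than ~50 ms in.
        println!("{name:>9} finished at {finished_at:?}");
    }
}
```

The cheap requests are each microseconds of work, yet their observed latency is dominated entirely by the stranger queued in front of them.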
The Broken Promise: The Human-in-the-Loop Scheduler
When these latency spikes occur, the answer is always the same: separate your runtimes. Use Tokio for I/O, and send CPU-bound work to a dedicated thread pool like Rayon.
Recent postmortems highlight the resulting disaster. Engineering teams at PostHog [3] and Meilisearch [4] have documented the painful reality of untangling these complexities in production. Developers must carefully analyse every function to decide if it belongs in the “I/O pool” or the “Compute pool,” and then manually orchestrate the message-passing boundary between them.
If a developer must manually partition I/O and compute, strictly police the boundaries to prevent deadlocks, and ferry data between two different runtimes with two different mental models, the async abstraction has failed. The language feature promised to hide the complexity of concurrency. Instead, it turned the application developer into the human-in-the-loop scheduler.
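That manual orchestration looks roughly like the sketch below, written with std threads and channels rather than Tokio and Rayon so it stays self-contained; `CpuJob` and `spawn_compute_pool` are illustrative names, not a real API. The developer must define the boundary, the message type, and the reply plumbing themselves.

```rust
use std::sync::mpsc;
use std::thread;

// The hand-rolled "compute pool" boundary: CPU work cannot run on the
// cooperative I/O loop, so it is shipped across a channel as a message.
enum CpuJob {
    ParseJson { payload: String, reply: mpsc::Sender<usize> },
}

fn spawn_compute_pool() -> mpsc::Sender<CpuJob> {
    let (tx, rx) = mpsc::channel::<CpuJob>();
    thread::spawn(move || {
        for job in rx {
            match job {
                CpuJob::ParseJson { payload, reply } => {
                    // Stand-in for the expensive parse/traverse step.
                    let fields = payload.matches(':').count();
                    let _ = reply.send(fields);
                }
            }
        }
    });
    tx
}

fn main() {
    let compute = spawn_compute_pool();

    // The "I/O loop" side: fetch a record, then manually ferry the
    // CPU-bound step across the boundary and wait for the answer.
    let record = String::from("{\"id\":1,\"name\":\"a\"}");
    let (reply_tx, reply_rx) = mpsc::channel();
    compute
        .send(CpuJob::ParseJson { payload: record, reply: reply_tx })
        .unwrap();
    let fields = reply_rx.recv().unwrap();
    println!("parsed {fields} fields off the I/O thread");
}
```

Every new kind of CPU work means a new message variant, a new reply channel, and a new opportunity to deadlock or block the wrong pool.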
Unbounded by Default is OOM by Default
The second failure mode of async/await runtimes is how frictionless they make unbounded capacity.
Calling tokio::spawn(...) is cheap. When a downstream database slows down during a traffic spike, the ingress network loop happily continues accepting connections and spawning tasks. Because async tasks and memory allocations are typically unbounded by default in these ecosystems, the system does not push back.
In-flight tasks queue indefinitely. The application consumes RAM until the OS out-of-memory (OOM) killer violently terminates the process. Postmortems from major platforms consistently reveal the same root cause: queues do not fix overload; they simply delay the crash while making it catastrophic. Infinite capacity is a lie, and defaults that pretend otherwise are dangerous.
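The opposite default fits in a few lines. This sketch uses a bounded std channel as a stand-in mailbox: when the consumer (the slow downstream database) drains nothing during a spike, `try_send` reports back-pressure at the call site immediately instead of queueing until the OOM killer arrives.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // A strictly bounded mailbox: capacity fixed at startup, not "infinite".
    let (tx, rx) = sync_channel::<u64>(2);

    let mut accepted = 0;
    let mut shed = 0;
    // Simulated traffic spike: ten requests arrive while the consumer
    // drains nothing.
    for req in 0..10u64 {
        match tx.try_send(req) {
            Ok(()) => accepted += 1,
            // Mailbox full: the caller learns about overload right now,
            // and can shed load, retry, or degrade -- predictably.
            Err(TrySendError::Full(_)) => shed += 1,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    println!("accepted={accepted} shed={shed}");
    drop(rx);
}
```

Two requests are accepted, eight are shed, and the process's memory footprint never grows past the bound it declared at boot.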
The Work-Stealing Myth
When systems hit these bottlenecks, developers often demand smarter, preemptive, work-stealing schedulers to distribute the load. The assumption is that if a core is idle, it should steal tasks from a busy core to guarantee fairness. But at massive scale, fairness is the enemy of throughput. Work-stealing destroys CPU cache locality.
When WhatsApp pushed the Erlang BEAM virtual machine to its limits on 100+ core machines, the system choked. As detailed by Robin Morisset, idle threads trying to steal work spent all their CPU cycles fighting over the global runq_lock [5] — a lock used to synchronise access to a scheduler’s run queue.
Even with optimised locks, moving a state machine to a different CPU core means abandoning the L1 and L2 cache. Fairness does not matter if every stolen task incurs a 100+ nanosecond main memory fetch penalty. If you are already forced to manually partition threads for I/O versus CPU tasks to survive production, the generic work-stealing algorithm has already failed you. You understand your workload’s topology better than the runtime does.
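Encoding that topology knowledge can be as small as one function. The sketch below (std-only Rust; `shard_for` is an illustrative name, and `DefaultHasher` stands in for whatever hash a real system would choose) pins every key to a fixed core-local queue, so a key's state stays hot in one core's cache and never migrates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Static sharding: a key always maps to the same core-local queue.
// No stealing, no migration, no abandoned L1/L2 cache lines.
fn shard_for<K: Hash>(key: &K, num_cores: usize) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % num_cores
}

fn main() {
    let num_cores = 4;
    let mut queues: Vec<Vec<&str>> = vec![Vec::new(); num_cores];
    for key in ["user:1", "user:2", "user:1", "order:9", "user:1"] {
        queues[shard_for(&key, num_cores)].push(key);
    }
    // Every message for "user:1" landed on the same queue; the state
    // that serves it never leaves that core.
    for (i, q) in queues.iter().enumerate() {
        println!("core {i}: {q:?}");
    }
}
```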
The Alternative
I got tired of the complexities and traps of async/await. I wanted the rock-solid fault-tolerance of the BEAM, but without the opaque manoeuvres of garbage collection and global work-stealing. As Leslie Lamport has long argued, state machines are the mathematically sound foundation of concurrent programming. async/await is merely compiler magic that tries to hide the state machine from you, poorly.
Instead of hiding the state machine, why not expose it and give the user better control primitives? The result is Project Tina: an opinionated, shared-nothing, thread-per-core concurrency framework.
Tina embraces strict constraints to guarantee massive throughput and reliability:
- One Primitive. One Mental Model. There is no async or await, no Promises, and no Futures. You write an Isolate — a unit of concurrent work. The handler is a standard, synchronous function that reacts to a message and returns an Effect.
- Thread-Per-Core (Shared Nothing). Tina shards the workload across OS threads. There is no work stealing. Isolates never migrate. All cross-core communication occurs via the messaging subsystem.
- Strictly Bounded. Memory is pre-allocated at process boot. Mailboxes are strictly bounded. If a traffic spike hits and a mailbox is full, the caller is notified immediately. The system sheds load predictably rather than OOM-crashing the process.
- Architectural Determinism. In modern async runtimes, task polling order and thread-pool scheduling are opaque, non-deterministic sources of chaos. You rarely know exactly when or where your task will wake up. Tina strips this away. The scheduler is a strict, visible, single-threaded loop per core. Because the framework explicitly controls execution order, I/O, and the clock, the system’s behaviour is radically predictable. This unlocks Tina’s ultimate superpower: Deterministic Simulation Testing (DST). You can simulate network partitions or dropped messages on a single thread, and the same seed will yield the exact same execution order, every single time.
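To make the DST claim concrete, here is a minimal sketch. It is std-only Rust, not Tina's actual API: the tiny LCG and all names are illustrative stand-ins. A single-threaded loop drains a mailbox while a seeded PRNG injects message drops, and the same seed always replays the same trace.

```rust
// A hand-rolled LCG (Knuth's MMIX constants) so the example needs no
// external crates: the only source of randomness is the explicit seed.
struct Lcg(u64);
impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

fn simulate(seed: u64) -> Vec<String> {
    let mut rng = Lcg(seed);
    let mut trace = Vec::new();
    let mut mailbox: Vec<u32> = (0..5).collect(); // pending messages
    // Strict single-threaded loop: the simulation, not the OS,
    // decides execution order and fault injection.
    while let Some(msg) = mailbox.pop() {
        if rng.next() % 4 == 0 {
            trace.push(format!("drop {msg}")); // injected network fault
        } else {
            trace.push(format!("deliver {msg}"));
        }
    }
    trace
}

fn main() {
    let a = simulate(42);
    let b = simulate(42);
    assert_eq!(a, b); // identical seed, identical execution, every time
    println!("{a:?}");
}
```

Because nothing in the loop reads a wall clock, a thread ID, or OS entropy, a failing trace can be reproduced from nothing but its seed.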
Wrap Up
async/await makes concurrency easy to write, but it makes systems complex to operate. By forcing developers to explicitly manage state transitions, strict memory bounds, and deliberate architectural topologies upfront, Tina replaces runtime magic with structural guarantees. Because:
Predictability beats brevity.
Tina is open source. You can view the architecture, read the design documents, and critique the code on GitHub.
Notes & References

1. Rich Hickey, “Simple Made Easy”, Strange Loop 2011.
2. Rob Pike, “What We Got Right, What We Got Wrong”, GopherConAU 2023.
3. PostHog Engineering, “Untangling Rayon and Tokio”, posthog.com/blog.
4. Louis Dureuill, “Don’t mix Rayon and Tokio”, blog.dureuill.net.
5. Robin Morisset, “Optimizing the BEAM’s Scheduler for Many-Core Machines”, Code BEAM Europe.