Node.js Performance: Processing 14GB Files 78% Faster with Buffer Optimization


It all started on a quiet weekend. My wife and son were away, leaving me with a rare block of uninterrupted time, fueled by curiosity and a slightly obsessive streak. As a freelancer who spends my days helping companies wring more speed out of their web apps, Node.js services, and CI/CD pipelines, I usually tackle performance as part of someone else’s puzzle. But this time, I decided to take on a challenge purely for the thrill.

The Crime Scene: 14.80 GB of weather data
The Evidence: 1 billion rows of temperature measurements
The Victim: My MacBook Pro M1’s dignity
Time of Death: 5 minutes, 49 seconds
My Mission: Make it faster. Much, much faster.

What followed was a series of experiments, dead ends, and small breakthroughs that felt like uncovering clues in a complex mystery. I moved from single-threaded runs to considering Node.js worker threads, juggled data structures in Node.js, and chased down the hidden costs of syscalls. Along the way, I even tested the same code on Deno and Bun, only to find that the promises of faster runtimes were mostly a mirage for this particular workload.

By the end of the weekend, what started as idle curiosity turned into a 78% speedup—but the story of how I got there is far more interesting than the numbers alone.

The Challenge: Processing 1 Billion Rows of Text in Node.js

The data was deceptively simple: a text file containing 1 billion rows of temperature measurements from weather stations around the world. Each row followed a strict format: <string: station name>;<double: measurement>, with the temperature having exactly one fractional digit.
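
For illustration, a few rows in that format might look like this (the station names and values here are invented):

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
Hamburg;34.2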

The task? Read the file, calculate the min, mean, and max temperature for each station, then write the sorted results to a new file.

Simple, right?

Establishing the Baseline: readline() and createReadStream() for Large File Processing

I started with what any reasonable developer would write: a straightforward Node.js script using readline to process the file line by line. Split each line on the semicolon, parse the temperature as a float, update the statistics in a JavaScript object. Clean, readable, obvious.

const fs = require("node:fs");
const readline = require("node:readline");

// Stream the file and hand it to readline line by line
const rl = readline.createInterface({
  input: fs.createReadStream(inputFilePath),
  crlfDelay: Infinity,
});

const stations = {};

rl.on("line", (line) => {
  if (line) {
    const [station, tempStr] = line.split(";");
    const temperature = parseFloat(tempStr);
    if (!stations[station]) {
      stations[station] = {
        min: temperature,
        max: temperature,
        sum: temperature,
        count: 1,
      };
    } else {
      const stationData = stations[station];
      stationData.min = Math.min(stationData.min, temperature);
      stationData.max = Math.max(stationData.max, temperature);
      stationData.sum += temperature;
      stationData.count++;
    }
  }
});

It worked. The results were correct. And it took 5 minutes and 49 seconds to complete.

In performance terms, that’s an eternity. I knew something had to give.

The First Lead: A Theory About Strings

Before reaching for the obvious solution—parallelization with worker threads—I wanted to exhaust the single-threaded approach. Call it professional pride, but I suspected there was more performance hiding in plain sight.

I formed a theory: the file was UTF-8 encoded, but JavaScript strings are UTF-16, so every single line required an expensive encoding conversion. A billion times. Every split operation, every string allocation, every parsed float—they all had a cost.

What if I could work directly with the bytes?

I prompted GitHub Copilot with my hypothesis, and what came back was intriguing. Instead of decoding strings, the new approach parsed raw bytes in a streaming fashion:

const fs = require("node:fs");

const readableStream = fs.createReadStream(inputFilePath);

// Byte values for the characters the parser cares about
const SEMI = 0x3b; // ';'
const NEWLINE = 0x0a; // '\n'
const DASH = 0x2d; // '-'
const DOT = 0x2e; // '.'
const ZERO = 0x30; // '0'

const stations = new Map(); // station name -> StationData

let leftover = null;
let lineStart = 0;
let stationName = null;
let state = 0; // 0 = reading station, 1 = reading temperature
let tempSign = 1;
let tempInt = 0; // Integer tenths accumulator

readableStream.on("data", (chunk) => {
  const buffer = leftover ? Buffer.concat([leftover, chunk]) : chunk;
  const bufferLength = buffer.length;

  for (let byteIndex = lineStart; byteIndex < bufferLength; byteIndex++) {
    const currentByte = buffer[byteIndex];

    if (state === 0) {
      if (currentByte === SEMI) {
        // ';'
        stationName = buffer.toString("utf8", lineStart, byteIndex);
        state = 1;
        tempSign = 1;
        tempInt = 0;
      }
      continue;
    }

    // Temperature parsing with fast-path digit accumulation
    if (currentByte === NEWLINE) {
      const tempTenths = tempSign * tempInt;
      let stationData = stations.get(stationName);
      if (stationData) stationData.update(tempTenths);
      else stations.set(stationName, new StationData(tempTenths));

      stationName = null;
      state = 0;
      lineStart = byteIndex + 1;
      continue;
    }

    if (currentByte === DASH && tempInt === 0) {
      tempSign = -1;
      continue;
    }
    if (currentByte === DOT) continue;

    // Direct byte-to-digit conversion
    tempInt = tempInt * 10 + (currentByte - ZERO);
  }

  // Carry any incomplete trailing line over to the next chunk
  leftover = lineStart < bufferLength ? buffer.subarray(lineStart) : null;
  lineStart = 0;
  state = 0;
});

The clever part? It stored temperatures as integers: 25.3°C became 253 tenths of a degree. No floating-point math until the very end. The parser also switched from using a plain object to a Map for better memory patterns, and added a StationData class for tighter updates.
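
For reference, here’s a minimal sketch of what such a StationData class could look like, keeping everything in integer tenths (the exact shape is my assumption based on how it’s used above):

class StationData {
  constructor(tempTenths) {
    this.min = tempTenths;
    this.max = tempTenths;
    this.sum = tempTenths;
    this.count = 1;
  }

  update(tempTenths) {
    if (tempTenths < this.min) this.min = tempTenths;
    if (tempTenths > this.max) this.max = tempTenths;
    this.sum += tempTenths;
    this.count++;
  }
}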

But the real magic was in this single line:

tempInt = tempInt * 10 + (currentByte - ZERO);

Each digit byte (say, 52 for the character 4) gets converted to its integer value by subtracting the byte value of 0 (which is 48). So 52 - 48 = 4. That digit accumulates into tempInt by multiplying the current value by 10 and adding the new digit. For a temperature like 12.3:

  • '1' (byte 49) → tempInt = 0 * 10 + (49-48) = 1
  • '2' (byte 50) → tempInt = 1 * 10 + (50-48) = 12
  • '.' (byte 46) → skipped
  • '3' (byte 51) → tempInt = 12 * 10 + (51-48) = 123

No string decoding. No floating-point arithmetic. Just raw bytes and integer math.

Time to test the theory.

To speed up iteration, I scaled down to 10 million rows (158MB) for testing. The original naive version took 3.5 seconds on this smaller dataset.

The optimized version? 1.8 seconds.

I stared at the terminal. Nearly 50% faster 🥳.

But a good detective knows when the case isn’t closed. I made a cup of hot chocolate, sat back, and moved on to the next lead.

The Second Lead: A False Trail (or Three)

The speedup was intoxicating, but I couldn’t shake a nagging thought: we were still decoding station names. Every single time we encountered a station, we converted its bytes to a UTF-16 string to use as a Map key. For stations that appeared millions of times, this was wasteful.

What if I could avoid decoding until absolutely necessary?

I chased three leads.

Lead #1: String Interning
Dead end—you have to decode the string first to intern it. That’s like filing evidence you haven’t collected yet.

Lead #2: Buffer Slices as Map Keys
This one seemed promising in theory. Use raw Buffer slices as keys instead of strings, avoiding conversion altogether. But JavaScript’s Map compares object keys by reference, not by content: two identical buffers at different memory addresses are two different keys, which means corrupted data (see the toy example below). I briefly considered using JavaScript Symbols as a workaround—create a unique Symbol for each station and use that as the key. But wait… that still requires decoding the string first to know whether I’d already seen the station. Yet another dead end.
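
Here’s a quick sketch of that reference-equality problem (a toy example, not from the actual parser):

const key1 = Buffer.from("Berlin");
const key2 = Buffer.from("Berlin");
const byBuffer = new Map();
byBuffer.set(key1, 1);
console.log(byBuffer.get(key2)); // undefined: same bytes, different object identity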

Lead #3: Custom Hash Map
I spent hours on this one. Built it from scratch, tested it thoroughly, and… it was slower than plain objects. Sometimes the murder weapon is your own cleverness. 😉

Then it hit me at 2 AM, staring at the code with bleary eyes: what if I hashed the buffer before converting to a string?

The Breakthrough: Hash First, Decode Later

The solution was elegant in its simplicity: compute a 32-bit hash while scanning for the semicolon. Use that hash as the Map key. Only convert the buffer to a string when absolutely necessary—at the very end, when writing output.

But there was a catch. Hash collisions could corrupt the data. Two different station names producing the same hash would merge their data, poisoning the results. I needed a failsafe: a collision resolution strategy.

Enter the linked list. Here’s how the new approach worked:

function processChunk(chunk) {
  const mergedBuffer = carryoverBuffer
    ? Buffer.concat([carryoverBuffer, chunk])
    : chunk;
  const bufferLength = mergedBuffer.length;
  let cursor = 0;

  while (cursor < bufferLength) {
    const stationNameStartIndex = cursor;

    // Compute DJB2 hash while scanning to ';'
    let stationHash = 5381 >>> 0;
    while (cursor < bufferLength) {
      const currentByte = mergedBuffer[cursor];
      if (currentByte === SEMI) break;
      stationHash = ((stationHash << 5) + stationHash + currentByte) >>> 0;
      cursor++;
    }

    if (cursor >= bufferLength) {
      carryoverBuffer = mergedBuffer.subarray(stationNameStartIndex);
      return;
    }
    const stationNameEndIndex = cursor;

    // Lookup by 32-bit hash; resolve collisions via linked list
    let stationEntry = stations.get(stationHash);
    if (stationEntry === undefined) {
      stationEntry = new StationEntry(
        mergedBuffer.subarray(stationNameStartIndex, stationNameEndIndex),
      );
      stations.set(stationHash, stationEntry);
    } else {
      // Traverse collision chain
      let currentNode = stationEntry;
      let previousNode = null;
      while (currentNode !== null) {
        if (
          buffersEqual(
            mergedBuffer,
            stationNameStartIndex,
            stationNameEndIndex,
            currentNode.name,
          )
        ) {
          stationEntry = currentNode;
          break;
        }
        previousNode = currentNode;
        currentNode = currentNode.nextInBucket;
      }
      if (currentNode === null) {
        // Not found; append new entry to chain
        const newStationEntry = new StationEntry(
          mergedBuffer.subarray(stationNameStartIndex, stationNameEndIndex),
        );
        previousNode.nextInBucket = newStationEntry;
        stationEntry = newStationEntry;
      }
    }

    // ... temperature parsing continues ...
  }
}

The buffersEqual() function handled collision resolution by comparing buffer lengths first, then bytes:

function buffersEqual(
  sourceBuffer,
  sliceStartIndex,
  sliceEndIndex,
  referenceBuffer,
) {
  const sliceLength = sliceEndIndex - sliceStartIndex;
  if (sliceLength !== referenceBuffer.length) return false;
  for (let offset = 0; offset < sliceLength; offset++) {
    if (sourceBuffer[sliceStartIndex + offset] !== referenceBuffer[offset])
      return false;
  }
  return true;
}
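
For completeness, a rough sketch of what a StationEntry might look like: the name stays a raw Buffer slice, the running stats are integer tenths, and nextInBucket links entries that share a 32-bit hash. Apart from name and nextInBucket, the field names are my assumptions.

class StationEntry {
  constructor(nameBuffer) {
    this.name = nameBuffer; // raw bytes; decoded only when writing output
    this.nextInBucket = null; // next entry sharing the same 32-bit hash
    this.min = Infinity;
    this.max = -Infinity;
    this.sum = 0;
    this.count = 0;
  }

  update(tempTenths) {
    if (tempTenths < this.min) this.min = tempTenths;
    if (tempTenths > this.max) this.max = tempTenths;
    this.sum += tempTenths;
    this.count++;
  }
}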

This approach moved all string creation out of the hot parse loop, eliminating per-line string allocations and reducing GC pressure. After the stream ended, I’d walk all the chains once: materialize each Buffer into a string, aggregate the results, sort them, and write them to the output file.
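
That final pass could look roughly like this (a sketch built on the StationEntry fields assumed above; the exact output format is also an assumption):

function writeResults(outputFilePath) {
  const results = [];
  for (const head of stations.values()) {
    // Walk each collision chain; decode the station name only now
    for (let entry = head; entry !== null; entry = entry.nextInBucket) {
      results.push({
        name: entry.name.toString("utf8"),
        min: entry.min / 10,
        mean: entry.sum / entry.count / 10,
        max: entry.max / 10,
      });
    }
  }
  results.sort((a, b) => a.name.localeCompare(b.name));
  const lines = results.map(
    (r) =>
      `${r.name};${r.min.toFixed(1)};${r.mean.toFixed(1)};${r.max.toFixed(1)}`,
  );
  fs.writeFileSync(outputFilePath, lines.join("\n") + "\n");
}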

I held my breath and ran it.

821 milliseconds.

Down from 1,852.

I had to run it again to believe it. Another 2.25x speedup. Cumulatively, from my starting point, I was now at roughly 4.3x faster on the 10 million row dataset.

This felt like the limit for single-threaded optimization. Time to test on the full dataset.

The full 1 billion rows completed in 1 minute and 14 seconds—down from the original 5 minutes and 49 seconds. I experimented with increasing the read stream’s high water mark to see if larger chunks would help, but the gains were marginal. I settled on 256KB chunks as the sweet spot.
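
In Node.js, that chunk size is controlled through the read stream’s highWaterMark option; the 256KB setting looks like this:

const readableStream = fs.createReadStream(inputFilePath, {
  highWaterMark: 256 * 1024, // 256KB chunks
});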

Testing the Alibi: Could Different Runtimes Be Faster?

Before closing the case, I had to rule out one possibility: maybe Node.js itself was the bottleneck?

Given the performance accolades of Deno and Bun, I ran the exact same code across all three runtimes. Here’s what happened:

  • Bun (1.2.15) 🥇: 1 minute and 5 seconds
  • Node.js (24.6.0) 🥈: 1 minute and 14 seconds
  • Deno (2.4.5) 🥉: 1 minute and 21 seconds

Bun edged ahead by about 9 seconds, but the differences weren’t dramatic. I was honestly surprised by Deno’s performance given its reputation for speed. The real insight here? The bottleneck was never the runtime. It was the algorithm all along.

This is why I’m skeptical of generic benchmarks and marketing claims. In the real world, your specific problem determines the winner. Node.js remains my daily driver—it’s stable, mature, has a vast ecosystem, and its performance is competitive for most applications. While Bun and Deno offer nice features like built-in TypeScript support, I haven’t experienced game-changing performance benefits in practice. Some Node.js modules (especially native addons) still don’t work well in these alternative runtimes despite their compatibility efforts.

The lesson here is to evaluate based on your use case, not salesy benchmark suites.

The Final Clue: What the Profiler Revealed

But here’s the thing about good detective work: you need evidence that stands up in court. I couldn’t just trust my timer—I needed to see inside the machine, to prove beyond doubt that we’d eliminated every bottleneck worth eliminating.

I started Node.js with --inspect-brk, connected Chrome DevTools, and hit record. For the next 60 seconds, I watched the flame graph paint a story in real-time, each colored bar a thread of execution, each spike a moment of computational intensity.

When it finished, I stared at the results.

The smoking gun was right there.

processChunk() consumed 50,722 milliseconds of self time—that’s 87% of the total execution time. Everything else? Noise. The wrapper function, garbage collection, even file I/O—they were all bit players in this performance drama.

I dove into the heap snapshots next, looking for memory leaks or bloat that might explain any remaining slowness:

  • StationEntry objects: 8,872 instances. Perfect. One for each weather station.
  • Buffer allocations: Modest and short-lived. The streaming approach kept memory pressure low.
  • Garbage collection activity: Barely a whisper. The pauses were so brief they didn’t even register as bottlenecks.

The evidence was conclusive. This wasn’t a memory crime—no bloated objects, no runaway allocations, no GC thrashing. This was a pure CPU bottleneck.

We had done it! We’d eliminated the string conversions, the redundant parsing, the hidden allocations. The hot path was as lean as JavaScript would allow. Every microsecond had been accounted for, every inefficiency hunted down and eliminated.

But here’s the thing about detective work: there’s always another case or clue. The profiler confirmed what I already suspected—we were CPU-bound. One detective working alone, no matter how efficient, can only move so fast.

The next breakthrough wouldn’t come from algorithmic cleverness or byte-level optimization. It would come from parallelization—splitting the work across multiple cores, multiple detectives working the same case simultaneously.

Worker threads could potentially cut this time in half again. Maybe more.

But that’s a case for another day.

Case Closed (For Now)

Final Time: 1 minute, 14 seconds
Starting Time: 5 minutes, 49 seconds
Speedup: 78% (4.7x faster)

What started as weekend curiosity became a masterclass in hunting bottlenecks. The clues were there all along: unnecessary string conversions, redundant decoding, the hidden costs of abstractions we take for granted.

But here’s what this case really taught me: performance optimization is detective work. You form theories, test them, follow false leads, and sometimes discover the answer was hiding in plain sight. The most effective solutions often require the least amount of change—if you’re looking in the right place.

Each optimization built on the last:

  • Byte-level parsing eliminated string overhead
  • Integer arithmetic replaced floating-point math
  • Hashing deferred string creation to the last possible moment

Small, incremental improvements that compounded into massive gains.

The Case Remains Open

Worker threads could push this even further—potentially cutting the time in half again by utilizing all CPU cores. There are other mysteries in this data, other patterns waiting to be found.

Are you chasing bottlenecks in your Node.js applications or deployment pipeline? I love a good mystery. Feel free to reach out via email or any of the social handles in the footer—I’d be happy to help you track down those performance criminals. I’m also open to technical writing opportunities if you know anyone looking for a writer who loves deep-diving into performance optimization.

Evidence Summary: What We Learned

  • 78% faster processing: Reduced 14.8GB file processing from 5:49 to 1:14 using buffer optimization techniques
  • Byte-level parsing eliminates overhead: Working directly with UTF-8 buffers instead of converting to UTF-16 strings delivered a 50% speedup
  • Integer arithmetic beats floats: Storing temperatures as integer tenths (253 instead of 25.3) avoided floating-point operations in the hot path
  • Hash-first, decode-later strategy: DJB2 hashing with collision chains deferred string creation until final output, achieving 2.25x additional speedup
  • CPU-bound, not I/O-bound: Profiling revealed 87% of execution time in processChunk()—streaming and buffer management were already optimal
  • Runtime differences are marginal: Bun (1:05), Node.js (1:14), and Deno (1:21) showed algorithm matters more than runtime choice for this workload
  • Further gains require parallelization with worker threads to utilize multiple CPU cores

Evidence Locker: The complete code on GitHub with detailed commit messages
Next Case: Multi-threaded optimization (TBD -> Become a subscriber to find out!)

If you decide to explore the worker threads approach or have questions about any of these techniques, let me know. The investigation continues… 🕵🏼‍♂️
