
Retaining chunks while reading from fs stream appears to leak memory rapidly #21967

Closed
ChALkeR opened this issue Jul 25, 2018 · 12 comments
Labels: buffer, fs, memory

Comments

ChALkeR commented Jul 25, 2018

  • Version: v10.7.0, v8.11.3
  • Platform: Linux yoga 4.17.2-1-ARCH #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC 2018 x86_64 GNU/Linux
  • Subsystem: fs, buffer

While trying to resolve Gzemnid memory problems at nodejs/Gzemnid#18, I eventually reduced them to the following testcase. It seems to me that it's not node-lz4's fault; something is wrong on the Node.js side.

Testcase:

'use strict';

const fs = require('fs');

fs.writeFileSync('test-small.bin', Buffer.alloc(100));
fs.writeFileSync('test-large.bin', Buffer.alloc(100000));

const maxLength = 4 * 1024 * 1024; // 4 MiB
let built = 0, buf = [], length = 0;

function tick() {
  built++;
  if (built % 1000 === 0) {
    console.log(`RSS [${built}]: ${process.memoryUsage().rss / 1024 / 1024} MiB`);
  }
  const stream = fs.createReadStream('./test-small.bin');
  stream.on('data', function (data) {
    //data = Buffer.from(data); // WARNING: uncommenting this line fixes things somehow
    buf.push(data);
    length += data.length;
    if (length >= maxLength) {
      buf = [];
      length = 0;
    }
  });
  stream.once('end', tock);
}

function tock() {
  fs.readFileSync('./test-large.bin');
  setImmediate(tick);
  /* Alternatively:
  const stream = fs.createReadStream('./test-large.bin');
  stream.on('data', function() {});
  stream.once('end', tick);
  */
}

tick();

Adding data = Buffer.from(data); fixes things somehow; the problems start when the exact same chunks from the stream are retained for some time while some larger file reading is going on.

gc-ing manually does not help — this looks like a memory leak.
All that memory is allocated through node::ArrayBufferAllocator::Allocate.

/cc @addaleax @nodejs/buffer

ChALkeR added the buffer, fs, and memory labels on Jul 25, 2018
ChALkeR commented Jul 25, 2018

It might be caused by the fact that every n-th allocated pooled buffer chunk is retained in this scenario, effectively retaining the whole underlying pool buffers while only a small chunk of each is actually needed.
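
For illustration, here is a minimal sketch (not the fs internals) of the mechanism being hypothesized: pooled Buffers share a backing ArrayBuffer, so a tiny chunk can pin a much larger allocation.

'use strict';

// Pooled allocations are carved out of a shared slab (Buffer.poolSize,
// 8 KiB by default), so retaining a 100-byte chunk retains the slab:
const chunk = Buffer.allocUnsafe(100);
console.log(chunk.length);            // 100
console.log(chunk.buffer.byteLength); // 8192, the whole shared pool

// A copy also comes from the pool, but many copies pack into it densely,
// whereas the fs chunks in the testcase each pin a separate 64 KiB pool.
const copy = Buffer.from(chunk);
console.log(copy.buffer.byteLength);  // also 8192, shared with other copies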

addaleax commented

@ChALkeR Yes, looking at Buffer pooling would be my first guess, too. I’m not sure how they would be retained indefinitely, though, because you do reset buf = []?

ChALkeR commented Jul 25, 2018

@addaleax Shouldn't be caused by JS-side Buffer pooling, as the second file has chunks larger than the pool size. Also, I have so far been unable to remove the dependency on fs: allocating the buffers manually (or even passing the one obtained from fs through Buffer.from) makes the leak disappear.

Updated testcase which removes the second file:

'use strict';

const fs = require('fs');

fs.writeFileSync('test-small.bin', Buffer.alloc(100));

const maxLength = 4 * 1024 * 1024; // 4 MiB
let built = 0, buf = [], length = 0;

function tick() {
  built++;
  if (built % 1000 === 0) {
    console.log(`RSS [${built}]: ${process.memoryUsage().rss / 1024 / 1024} MiB`);
  }
  const stream = fs.createReadStream('./test-small.bin');
  stream.on('data', function (data) {
    //data = Buffer.from(data); // WARNING: uncommenting this line fixes things somehow
    buf.push(data);
    length += data.length;
    if (length >= maxLength) {
      buf = [];
      length = 0;
    }
  });
  stream.once('end', tock);
}

function tock() {
  Buffer.alloc(65536);
  setImmediate(tick);
}

tick();

addaleax commented

@ChALkeR I think it’s not about Buffer’s pooling but about the fs stream's buffer pooling – that would be consistent with what I can tell from inspecting a heap dump, where every single chunk retains its own backing pool.

In particular, this line does not seem ideal:

pool.used += toRead;

This adjusts the pool offset by what we planned to read, which could be the whole pool size, not what we actually read.
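
Spelled out with concrete numbers, a toy model of that accounting (a sketch, not the actual lib/internal/fs/streams.js code):

'use strict';

// Hypothetical fresh 64 KiB stream pool:
const pool = { length: 65536, used: 0 };

const toRead = pool.length - pool.used; // plan to read up to the rest
pool.used += toRead;                    // reserved: the pool now looks full

const bytesRead = 100;                  // ...but the file was tiny
// The 'data' chunk is a 100-byte slice of the 64 KiB pool, yet `used`
// is never rolled back, so the next read allocates a brand new pool
// while this one stays pinned for as long as the slice is alive.
console.log({ used: pool.used, wasted: toRead - bytesRead }); // wasted: 65436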

addaleax commented

I think it’s a good sign that this seems to “solve” the problem:

diff --git a/lib/internal/fs/streams.js b/lib/internal/fs/streams.js
index f527b1de4b84..f2cda91c6873 100644
--- a/lib/internal/fs/streams.js
+++ b/lib/internal/fs/streams.js
@@ -171,6 +171,8 @@ ReadStream.prototype._read = function(n) {
       this.emit('error', er);
     } else {
       let b = null;
+      if (start + toRead === thisPool.used)
+        thisPool.used += bytesRead - toRead;
       if (bytesRead > 0) {
         this.bytesRead += bytesRead;
         b = thisPool.slice(start, start + bytesRead);

It only works if different streams don’t concurrently use the same pool, though, so I’m not sure it’s something we’d want to go with.

ChALkeR commented Jul 25, 2018

More details about the usecase: I needed to concatenate a large number of files into a small number of files with compression. Something like this schematic code:

cat a*.txt | lz4 > a.txt.lz4
cat b*.txt | lz4 > b.txt.lz4

In Gzemnid, I created several lz4 encoder streams (using node-lz4), piped those to output files, and upon encountering a file to read, piped it into one of those lz4 streams (depending on the file).

Now, the thing is that node-lz4 uses block compression and retains everything thrown into it until a sufficient input size is reached (4 MiB by default), and in this scenario that data effectively comes from fs.createReadStream, which is exactly where this issue hits.
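
In code, the Node.js side looked roughly like this (a sketch with illustrative file names; node-lz4's createEncoderStream() is used as documented in its README):

'use strict';

const fs = require('fs');
const lz4 = require('lz4'); // node-lz4

// One encoder per output file. node-lz4 buffers its input internally
// until a full block (4 MiB by default) has accumulated, so every chunk
// written to it is retained in the meantime.
const encoder = lz4.createEncoderStream();
encoder.pipe(fs.createWriteStream('a.txt.lz4'));

// Concatenate the inputs sequentially; these fs.createReadStream()
// chunks are exactly what the encoder's block buffer ends up retaining.
const files = ['a1.txt', 'a2.txt'];
(function next() {
  const file = files.shift();
  if (!file) { encoder.end(); return; }
  const input = fs.createReadStream(file);
  input.pipe(encoder, { end: false });
  input.once('end', next);
})();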

ChALkeR commented Jul 25, 2018

@addaleax What is also interesting is how exactly the Buffer.alloc(65536) at the bottom affects the memory usage. This is the reduced testcase without fs:

'use strict';

const maxLength = 4 * 1024 * 1024; // 4 MiB
let built = 0, buf = [], length = 0;

function tick() {
  built++;
  if (built % 1000 === 0) {
    console.log(`RSS [${built}]: ${process.memoryUsage().rss / 1024 / 1024} MiB`);
  }
  let data = Buffer.allocUnsafe(65536).slice(0, 128).fill(0);
  //data = Buffer.from(data); // uncommenting this line fixes things
  buf.push(data);
  length += data.length;
  if (length >= maxLength) {
    buf = [];
    length = 0;
  }
  Buffer.alloc(65536); // commenting this line also significantly reduces memory usage
  setImmediate(tick);
}

tick();

Some high memory usage is expected here, though. The rest is probably related to the underlying memory management.

ChALkeR commented Jul 25, 2018

@addaleax I guess that this is what's going on (re: how Buffer.alloc() affects things):

  1. With just the first part, Buffer.allocUnsafe (for the pool) allocates memory but does not touch it yet. Only a small chunk is actually used, so the allocation does not hit RSS immediately.
  2. When simultaneous Buffer usage is going on, it allocates, fills, and uses a memory chunk, but quickly disposes of it. That chunk gets reused for the fs pool allocation (Buffer.allocUnsafe), which makes the underlying pool sit on memory that has actually been committed.

ChALkeR commented Jul 25, 2018

Testcase for that:

'use strict';

const bs = 1024 * 1024; // 1 MiB
const retained = [];
let i = 0, flag = false;

function tick() {
  i++;
  if (i % 1000 === 0) {
    console.log(`RSS [${i}]: ${process.memoryUsage().rss / 1024 / 1024} MiB`);
  }
  const buf = Buffer.allocUnsafe(bs);
  retained.push(buf);
  if (i === 20000) {
    console.log('Clearing retained and enabling alloc');
    retained.length = 0;
    flag = true;
  }
  if (flag) Buffer.alloc(bs); // Even Buffer.alloc(bs - 10) seems to be fine here
  if (i < 40000) setImmediate(tick);
}

tick();

ChALkeR commented Jul 25, 2018

The above (alloc affecting allocUnsafe memory usage) seems to come down to the underlying malloc behavior:

// Compile with -O0

#include <iostream>
#include <cstdlib> // malloc/calloc/free
#include <unistd.h>

using namespace std;

int main() {
  const int bs = 1024 * 1024; // 1 MiB
  const int count = 1000;
  void * x[count];
  for (int i = 0; i < count; i++) {
    free(calloc(bs * 2, 1)); // Commenting out this line reduces memory usage
    x[i] = malloc(bs);
  }
  cout << "Allocated" << endl;
  cout << "Sleeping..." << endl;
  sleep(20);
  cout << "Freeing" << endl;
  for (int i = 0; i < count; i++) {
    free(x[i]);
  }
  cout << "Hello" << endl;
  return 0;
}
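
To observe it: compile with g++ -O0, run the binary, and check the process RSS (for example with ps or top) during the sleep. With the calloc line commented out, RSS stays much lower.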

ChALkeR commented Jul 25, 2018

@addaleax Btw, testcases from #21967 (comment), #21967 (comment) and #21967 (comment) (but not the initial one) use significantly less memory when LD_PRELOAD=/usr/lib64/libjemalloc.so is set.

Why don't we use jemalloc, btw?

addaleax commented

@ChALkeR The question has come up a few times, but I don’t know that we’ve ever done a full investigation into the benefits/downsides. I don’t think the examples here are something that we should base a decision on; that the OS may lazily populate memory pages is not really something for Node to care about, imo.

targos pushed a commit that referenced this issue Jul 31, 2018
Fixes: #21967

PR-URL: #21968
Reviewed-By: Сковорода Никита Андреевич <chalkerx@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Trivikram Kamat <trivikr.dev@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
ChALkeR added a commit to nodejs/Gzemnid that referenced this issue Aug 1, 2018
This hack should be reverted once
nodejs/node#21967 gets included into a
Node.js v8.x release.

Fixes: #18