
[v1.7][v1.6] storage.total_limit_size is not working properly #2878

Closed
manojna10 opened this issue Dec 19, 2020 · 15 comments

@manojna10

manojna10 commented Dec 19, 2020

Bug Report

There are two issues when the output plugin is unable to forward data and therefore needs buffering and storage; together they effectively nullify the use of disk for storage.

Issue 1: With storage.type filesystem, mem_buf_limit set to x, and storage.total_limit_size set to y with y > x (the mainstream case): once total usage reaches y, with some chunks UP (up to x in memory) and the rest DOWN in the filesystem, over the next few seconds it slowly falls back to a total disk size of only x, holding only UP chunks. It essentially stores on disk only the same amount of data that is already in memory.

Issue 2: With storage.type filesystem, mem_buf_limit not set, storage.total_limit_size set to y, and max_chunks_up set to z: if y is reached before the number of UP chunks reaches z, it works fine, i.e. we store at most y. But if the number of UP chunks reaches z before total size reaches y, new chunks are created as DOWN chunks and it never stops at y; it continues to store data beyond the configured size of y.

To Reproduce

  • Steps to reproduce the problem:
  • Screenshots added for Issue 1
  • Reproducibility: 100%
  • 1 input plugin (lib) and 1 output plugin (syslog)
  • Configured mem_buf_limit 10M and storage.total_limit_size 30M
  • Disk usage reaches 30M with 49 UP chunks and 93 DOWN chunks
  • Then it slowly goes back to 10M with only 47 UP chunks and remains there
  • Happens on both v1.7 and v1.6
  • Basic API code used to reproduce consistently: https://pastebin.com/yD6kBx6T (a hedged sketch of a comparable setup follows this list)
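
For reference, here is a minimal sketch of a comparable setup using the fluent-bit library API. This is not the actual code from the pastebin link; the record contents, paths, and the unreachable syslog host/port are made-up placeholders used only to force the output to fail and buffer.

    /* Hedged sketch of a lib-input + syslog-output setup with filesystem
     * storage; not the actual reproduction code from the pastebin above. */
    #include <unistd.h>
    #include <string.h>
    #include <fluent-bit.h>

    int main(void)
    {
        flb_ctx_t *ctx = flb_create();

        /* filesystem storage must be enabled at the service level */
        flb_service_set(ctx, "flush", "1",
                        "storage.path", "/tmp/flb-storage/",  /* placeholder path */
                        NULL);

        /* lib input with a 10M memory buffer limit and filesystem chunks */
        int in_ffd = flb_input(ctx, "lib", NULL);
        flb_input_set(ctx, in_ffd, "tag", "test",
                      "mem_buf_limit", "10M",
                      "storage.type", "filesystem",
                      NULL);

        /* syslog output pointed at an unreachable host so delivery fails
         * and chunks accumulate; 30M is the limit that should be enforced */
        int out_ffd = flb_output(ctx, "syslog", NULL);
        flb_output_set(ctx, out_ffd, "match", "test",
                       "host", "10.255.255.1",   /* unreachable on purpose */
                       "port", "514",
                       "storage.total_limit_size", "30M",
                       NULL);

        flb_start(ctx);

        /* keep pushing records until disk usage passes the 30M limit,
         * then watch the storage metrics (or du on storage.path) */
        const char *rec = "[0, {\"log\": \"a reasonably long payload ............\"}]";
        for (int i = 0; i < 200000; i++) {
            flb_lib_push(ctx, in_ffd, rec, strlen(rec));
            if (i % 1000 == 0) {
                usleep(1000);
            }
        }
        sleep(60);

        flb_stop(ctx);
        flb_destroy(ctx);
        return 0;
    }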

Expected behavior
It should stay at 30M and not go back to 10M.

Screenshots
issue1_part1: just started FLB (screenshot)
issue1_part2: reached total_limit_size after crossing mem_buf_limit (screenshot)
issue1_part3: storage goes back to the mem_buf_limit size (screenshot)

Your Environment

  • Version used: v1.7, v1.6
  • Configuration:
  • Environment name and version (e.g. Kubernetes? What version?): Ubuntu
  • Server type and version:
  • Operating System and version:
  • Filters and plugins: 1 input plugin (lib), 1 output plugin (syslog)

Additional context

@manojna10 manojna10 changed the title [v1.7][v1.6] storage.total_limit_size is not working [v1.7][v1.6] storage.total_limit_size is not working properly Dec 19, 2020
@manojna10
Author

concerns:

  1. flb_input_chunk_find_space_new_data() does not make use of the overlimit_routes_mask argument; instead it loops through all the output instances again. Is the mask not needed, or is this by design? In any case I am testing with a single output plugin, so this should not be the cause of this issue.

  2. Valid chunks holding lots of data (sizes shown below via du) are getting deleted in flb_input_chunk_find_space_new_data(), as the gdb session below shows:
    16K 10323-1608342739.704776149.flb
    476K 10323-1608342935.913228244.flb
    504K 10323-1608342936.866337320.flb
    504K 10323-1608342937.865617453.flb
    504K 10323-1608342938.865773090.flb
    500K 10323-1608342939.866210468.flb
    500K 10323-1608342940.865507508.flb
    496K 10323-1608342941.865974941.flb

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$30 = 0
$31 = 0x7ffff001ab00 "10323-1608342739.704776149.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$30 = 0
$31 = 0x7ffff0019eb0 "10323-1608342935.913228244.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$32 = 0
$33 = 0x7ffff001a170 "10323-1608342936.866337320.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$34 = 0
$35 = 0x7ffff0019450 "10323-1608342937.865617453.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$36 = 0
$37 = 0x7ffff00197e0 "10323-1608342938.865773090.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$38 = 0
$39 = 0x7ffff001a4f0 "10323-1608342939.866210468.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$40 = 0
$41 = 0x7ffff001a8a0 "10323-1608342940.865507508.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$42 = 507464
$43 = 0x7ffff001b190 "10323-1608342941.865974941.flb"

@manojna10
Author

When we schedule a retry, if the chunk is UP and we are over the configured memory limit, we put the chunk down, which calls munmap and sets the chunk's data size to 0. Later, when we loop through all the chunks to find space because the disk limit has been reached, these DOWN chunks are treated as 0 bytes and removed as well.
With my limited knowledge of the code, I commented out the code that sets the data size to 0, and that seems to fix the first issue.
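
To make the suspected interaction easier to follow, here is a very simplified model in C with entirely hypothetical names (it does not mirror the real fluent-bit data structures): once a chunk is put down, its recorded data size becomes 0, so the space-freeing pass sees it as empty and deletes it even though the file on disk still holds the data.

    #include <stddef.h>

    /* hypothetical, simplified model of the behavior described above */
    struct chunk {
        size_t data_size;   /* zeroed when the chunk is put down */
        int    up;          /* 1 = mapped in memory, 0 = down on disk */
    };

    static void delete_chunk(struct chunk *c)
    {
        /* would unlink the backing file; a stub in this sketch */
        c->data_size = 0;
    }

    /* putting a chunk down unmaps it and resets the in-memory size,
     * although the file on disk still contains the data */
    static void put_down(struct chunk *c)
    {
        c->up = 0;
        c->data_size = 0;
    }

    /* space-freeing pass once storage.total_limit_size is reached: a DOWN
     * chunk reports 0 bytes, so deleting it "frees" nothing, yet it is
     * deleted anyway and the loop moves on, discarding valid data */
    static void find_space_for_new_data(struct chunk *chunks, int n, size_t needed)
    {
        size_t freed = 0;
        for (int i = 0; i < n && freed < needed; i++) {
            freed += chunks[i].data_size;
            delete_chunk(&chunks[i]);
        }
    }

    int main(void)
    {
        struct chunk chunks[3] = { {500000, 1}, {500000, 1}, {500000, 1} };
        put_down(&chunks[0]);
        put_down(&chunks[1]);
        /* needs only 555 bytes, but the two DOWN chunks report 0 bytes and
         * are both deleted before a chunk with a real size is reached */
        find_space_for_new_data(chunks, 3, 555);
        return 0;
    }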

@JeffLuoo
Contributor

Hi @manojna10
Regarding your two concerns (#2878 (comment)):

The first one might be a case I missed when I implemented this; I will double-check it later. For the second concern, could you please elaborate? It might simply be that the program needs to drop some chunks to make room for the new chunk.

@manojna10
Author

Issue 2 setup (an illustrative config sketch follows this list):
storage.type is filesystem
mem_buf_limit is not set
storage.total_limit_size set to y
max_chunks_up set to z
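
For clarity, an illustrative classic-mode config of this setup; the concrete values standing in for y and z, the dummy input (the original reproduction used the lib input via the API), and the unreachable syslog host are all placeholders:

    [SERVICE]
        flush                 1
        storage.path          /tmp/flb-storage/
        # "z" from the description; illustrative value
        storage.max_chunks_up 16

    [INPUT]
        # stand-in for the lib input used in the original reproduction
        Name          dummy
        storage.type  filesystem
        # mem_buf_limit intentionally not set

    [OUTPUT]
        # any output that cannot flush; host is unreachable on purpose
        Name                     syslog
        Match                    *
        Host                     10.255.255.1
        Port                     514
        # "y" from the description; illustrative value
        storage.total_limit_size 30M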

Case 1, working fine: if disk size reaches y before the number of UP chunks reaches z, all chunks are UP in memory, and the new chunks added by replacing older UP chunks also remain UP. It works as expected, i.e. we use at most y of disk space.

Case 2, having the issue: if the number of UP chunks reaches z before total size reaches y, new chunks are created as DOWN chunks, and new chunks keep being added even after disk size reaches y (I didn't debug to see whether old chunks get discarded or not; it would be good to add a metric for that as well). Disk usage never stops at the configured y; it continues to store data beyond it.

@sjentzsch

I can confirm storage.total_limit_size not being properly respected:

      [SERVICE]
          [...]
          storage.path /flb-storage/
          storage.sync normal
          storage.checksum off
          storage.max_chunks_up 2048
          storage.backlog.mem_limit 64M
          storage.metrics on
          [...]
      [OUTPUT]
          Name  es
          [...]
          storage.total_limit_size 30G

/flb-storage # du -sh tail.0/
48.6G    tail.0/

If we had multiple such OUTPUT sections, the capacities would add up, right? If so, is there also any way to limit it globally across all outputs (first-come, first-served)?

@JeffLuoo
Contributor

JeffLuoo commented Jan 13, 2021

Hi @sjentzsch, the capacity should not add up to a larger number. Each output plugin has its own limit set by storage.total_limit_size, which means storage.total_limit_size is not a global limit. Could you please let me know how you end up with a 48.6G tail.0/ when storage.total_limit_size is 30G? Is Elasticsearch your only output plugin?
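
A hedged illustration of that scoping (hosts and values are made up): each [OUTPUT] section carries its own storage.total_limit_size, and no single setting caps the filesystem buffer across all outputs together.

    [OUTPUT]
        Name                     es
        Match                    *
        Host                     es-a.example.internal
        storage.total_limit_size 30G

    [OUTPUT]
        Name                     es
        Match                    *
        Host                     es-b.example.internal
        storage.total_limit_size 30G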

@manojna10
Author

This looks similar to the issue 2 I mentioned.

@JeffLuoo
Contributor

@manojna10 Thank you. I will try to reproduce it.

@sjentzsch

@JeffLuoo We indeed had multiple output plugins pointing to Elasticsearch (two or three), each with a 30G limit, so I have to agree that they could have added up. However, I suspect we instead ran into the issue described here, as usually only one of our es outputs accumulates gigabytes of data (the others are rather quiet). No proof, unfortunately.

@JeffLuoo
Contributor

@sjentzsch Thank you. I will take a look.

@JeffLuoo
Contributor

@manojna10 Hi, could you please share the file (like https://pastebin.com/yD6kBx6T) that you used to reproduce issue 2, if possible? Appreciated!

@manojna10
Author

(screenshot attached)

flb_client.txt

Hi @JeffLuoo, I tried reproducing this issue, but regardless of how many times I tried, the max_chunks_up configuration was not honored and there were no DOWN chunks at all, even after the configured number of max_chunks_up was reached. Attaching the screenshot and the client file I used.

I even tried using #2804 to keep the number of busy chunks lower than the total number of UP chunks, but it didn't help with the above situation either.

I also tried the latest master code to check the behavior, and it is still the same. Not sure if I am missing something here, as this used to work before.

Because of this, I am unable to reproduce the original issue 2 reported here.

Thanks,
Manoj

@JeffLuoo
Contributor

@manojna10 Thank you... I just created a PR for issue one #3054

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 13, 2021
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.
