
[v1.7][v1.6] storage.total_limit_size is not working properly #2878

Closed
manojna10 opened this issue Dec 19, 2020 · 15 comments

@manojna10

manojna10 commented Dec 19, 2020

Bug Report

There are two issues when the output plugin is unable to forward data and therefore needs buffering and storage; together they effectively nullify the use of disk for storage.

Issue 1: With storage.type filesystem, mem_buf_limit set to x, and storage.total_limit_size set to y with y > x (the mainstream case): once total usage reaches y, with some chunks UP (up to x in memory) and the rest DOWN in the filesystem, over the next few seconds it slowly falls back to a total disk size of only x, holding only UP chunks. It essentially stores on disk only the same amount of data that is already in memory.

Issue 2: With storage.type filesystem, mem_buf_limit not set, storage.total_limit_size set to y, and max_chunks_up set to z: if y is reached before the number of UP chunks reaches z, it works fine, i.e. we store at most y. But if the number of UP chunks reaches z before total size reaches y, new chunks are created as DOWN chunks and it never stops at y; it continues to store data beyond the configured size of y.

To Reproduce

  • Steps to reproduce the problem:
  • Screenshots added for Issue 1
  • Reproducibility: 100%
  • 1 input plugin (lib) and 1 output plugin (syslog)
  • Configured mem_buf_limit 10M and storage.total_limit_size 30M
  • Disk usage reaches 30M with 49 UP chunks and 93 DOWN chunks
  • Then it slowly goes back to 10M with only 47 UP chunks and remains there
  • Happens on both v1.7 and v1.6
  • Basic API code used to reproduce consistently: https://pastebin.com/yD6kBx6T (a hedged sketch of a comparable setup follows this list)
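
For reference, here is a minimal sketch of a comparable setup using the fluent-bit library API. This is not the actual code from the pastebin link; the record contents, paths, and the unreachable syslog host/port are made-up placeholders used only to force the output to fail and buffer.

    /* Hedged sketch of a lib-input + syslog-output setup with filesystem
     * storage; not the actual reproduction code from the pastebin above. */
    #include <unistd.h>
    #include <string.h>
    #include <fluent-bit.h>

    int main(void)
    {
        flb_ctx_t *ctx = flb_create();

        /* filesystem storage must be enabled at the service level */
        flb_service_set(ctx, "flush", "1",
                        "storage.path", "/tmp/flb-storage/",  /* placeholder path */
                        NULL);

        /* lib input with a 10M memory buffer limit and filesystem chunks */
        int in_ffd = flb_input(ctx, "lib", NULL);
        flb_input_set(ctx, in_ffd, "tag", "test",
                      "mem_buf_limit", "10M",
                      "storage.type", "filesystem",
                      NULL);

        /* syslog output pointed at an unreachable host so delivery fails
         * and chunks accumulate; 30M is the limit that should be enforced */
        int out_ffd = flb_output(ctx, "syslog", NULL);
        flb_output_set(ctx, out_ffd, "match", "test",
                       "host", "10.255.255.1",   /* unreachable on purpose */
                       "port", "514",
                       "storage.total_limit_size", "30M",
                       NULL);

        flb_start(ctx);

        /* keep pushing records until disk usage passes the 30M limit,
         * then watch the storage metrics (or du on storage.path) */
        const char *rec = "[0, {\"log\": \"a reasonably long payload ............\"}]";
        for (int i = 0; i < 200000; i++) {
            flb_lib_push(ctx, in_ffd, rec, strlen(rec));
            if (i % 1000 == 0) {
                usleep(1000);
            }
        }
        sleep(60);

        flb_stop(ctx);
        flb_destroy(ctx);
        return 0;
    }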

Expected behavior
It should stay at 30M and not go back to 10M.

Screenshots
issue1_part1: just started FLB (screenshot)
issue1_part2: reached total_limit_size after crossing mem_buf_limit (screenshot)
issue1_part3: storage goes back to the mem_buf_limit size (screenshot)

Your Environment

  • Version used: v1.7, v1.6
  • Configuration:
  • Environment name and version (e.g. Kubernetes? What version?): Ubuntu
  • Server type and version:
  • Operating System and version:
  • Filters and plugins: 1 input plugin (lib), 1 output plugin (syslog)

Additional context

@manojna10 manojna10 changed the title [v1.7][v1.6] storage.total_limit_size is not working [v1.7][v1.6] storage.total_limit_size is not working properly Dec 19, 2020
@manojna10
Author

concerns:

  1. flb_input_chunk_find_space_new_data() does not make use of the overlimit_routes_mask argument; instead it loops through all the output instances again. Is the mask not needed, or is this by design? In any case I am testing with a single output plugin, so this should not be the cause of this issue.

  2. Valid chunks holding lots of data (sizes shown below via du) are getting deleted in flb_input_chunk_find_space_new_data(), as the gdb session below shows:
    16K 10323-1608342739.704776149.flb
    476K 10323-1608342935.913228244.flb
    504K 10323-1608342936.866337320.flb
    504K 10323-1608342937.865617453.flb
    504K 10323-1608342938.865773090.flb
    500K 10323-1608342939.866210468.flb
    500K 10323-1608342940.865507508.flb
    496K 10323-1608342941.865974941.flb

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$30 = 0
$31 = 0x7ffff001ab00 "10323-1608342739.704776149.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$30 = 0
$31 = 0x7ffff0019eb0 "10323-1608342935.913228244.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$32 = 0
$33 = 0x7ffff001a170 "10323-1608342936.866337320.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$34 = 0
$35 = 0x7ffff0019450 "10323-1608342937.865617453.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$36 = 0
$37 = 0x7ffff00197e0 "10323-1608342938.865773090.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$38 = 0
$39 = 0x7ffff001a4f0 "10323-1608342939.866210468.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$40 = 0
$41 = 0x7ffff001a8a0 "10323-1608342940.865507508.flb"
(gdb) c
Continuing.

Thread 2 "flb-pipeline" hit Breakpoint 3, flb_input_chunk_find_space_new_data (ic=0x7ffff00951c0, overlimit_routes_mask=1, chunk_size=555)
at /root/Documents/fluent-bit-167/fluent-bit/src/flb_input_chunk.c:228
228 flb_debug("[input chunk] remove route of chunk %s with size %ld bytes to output plugin %s "
$42 = 507464
$43 = 0x7ffff001b190 "10323-1608342941.865974941.flb"

@manojna10
Author

When we schedule a retry, if the chunk is UP and we are over the configured memory limit, we put the chunk down, which calls munmap and sets the chunk's data size to 0. Later, when we loop through all the chunks to find space because the disk limit has been reached, these DOWN chunks are treated as 0 bytes and removed as well.
With my limited knowledge of the code, I commented out the code that sets the data size to 0, and that seems to fix the first issue.
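
To make the suspected interaction easier to follow, here is a very simplified model in C with entirely hypothetical names (it does not mirror the real fluent-bit data structures): once a chunk is put down, its recorded data size becomes 0, so the space-freeing pass sees it as empty and deletes it even though the file on disk still holds the data.

    #include <stddef.h>

    /* hypothetical, simplified model of the behavior described above */
    struct chunk {
        size_t data_size;   /* zeroed when the chunk is put down */
        int    up;          /* 1 = mapped in memory, 0 = down on disk */
    };

    static void delete_chunk(struct chunk *c)
    {
        /* would unlink the backing file; a stub in this sketch */
        c->data_size = 0;
    }

    /* putting a chunk down unmaps it and resets the in-memory size,
     * although the file on disk still contains the data */
    static void put_down(struct chunk *c)
    {
        c->up = 0;
        c->data_size = 0;
    }

    /* space-freeing pass once storage.total_limit_size is reached: a DOWN
     * chunk reports 0 bytes, so deleting it "frees" nothing, yet it is
     * deleted anyway and the loop moves on, discarding valid data */
    static void find_space_for_new_data(struct chunk *chunks, int n, size_t needed)
    {
        size_t freed = 0;
        for (int i = 0; i < n && freed < needed; i++) {
            freed += chunks[i].data_size;
            delete_chunk(&chunks[i]);
        }
    }

    int main(void)
    {
        struct chunk chunks[3] = { {500000, 1}, {500000, 1}, {500000, 1} };
        put_down(&chunks[0]);
        put_down(&chunks[1]);
        /* needs only 555 bytes, but the two DOWN chunks report 0 bytes and
         * are both deleted before a chunk with a real size is reached */
        find_space_for_new_data(chunks, 3, 555);
        return 0;
    }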

@JeffLuoo
Contributor

Hi @manojna10
Regarding your two concerns (#2878 (comment)):

The first one might be a case I missed when I implemented this; I will double-check it later. For the second concern, could you please elaborate? It might simply be that the program needs to drop some chunks to make room for the new chunk.

@manojna10
Author

Issue 2 setup (an illustrative config sketch follows this list):
storage.type is filesystem
mem_buf_limit is not set
storage.total_limit_size set to y
max_chunks_up set to z
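
For clarity, an illustrative classic-mode config of this setup; the concrete values standing in for y and z, the dummy input (the original reproduction used the lib input via the API), and the unreachable syslog host are all placeholders:

    [SERVICE]
        flush                 1
        storage.path          /tmp/flb-storage/
        # "z" from the description; illustrative value
        storage.max_chunks_up 16

    [INPUT]
        # stand-in for the lib input used in the original reproduction
        Name          dummy
        storage.type  filesystem
        # mem_buf_limit intentionally not set

    [OUTPUT]
        # any output that cannot flush; host is unreachable on purpose
        Name                     syslog
        Match                    *
        Host                     10.255.255.1
        Port                     514
        # "y" from the description; illustrative value
        storage.total_limit_size 30M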

Case 1, working fine: if disk size reaches y before the number of UP chunks reaches z, all chunks are UP in memory, and the new chunks added by replacing older UP chunks also remain UP. It works as expected, i.e. we use at most y of disk space.

Case 2, having the issue: if the number of UP chunks reaches z before total size reaches y, new chunks are created as DOWN chunks, and new chunks keep being added even after disk size reaches y (I didn't debug to see whether old chunks get discarded or not; it would be good to add a metric for that as well). Disk usage never stops at the configured y; it continues to store data beyond it.

@sjentzsch

I can confirm storage.total_limit_size not being properly respected:

      [SERVICE]
          [...]
          storage.path /flb-storage/
          storage.sync normal
          storage.checksum off
          storage.max_chunks_up 2048
          storage.backlog.mem_limit 64M
          storage.metrics on
          [...]
      [OUTPUT]
          Name  es
          [...]
          storage.total_limit_size 30G

/flb-storage # du -sh tail.0/
48.6G    tail.0/

If we had multiple such OUTPUT sections, the capacities would add up, right? If so, is there also any way to limit it globally across all outputs (first-come, first-served)?

@JeffLuoo
Contributor

JeffLuoo commented Jan 13, 2021

Hi @sjentzsch, the capacity should not add up to a larger number. Each output plugin has its own limit set by storage.total_limit_size, which means storage.total_limit_size is not a global limit. Could you please let me know how you end up with a 48.6G tail.0/ when storage.total_limit_size is 30G? Is Elasticsearch your only output plugin?
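
A hedged illustration of that scoping (hosts and values are made up): each [OUTPUT] section carries its own storage.total_limit_size, and no single setting caps the filesystem buffer across all outputs together.

    [OUTPUT]
        Name                     es
        Match                    *
        Host                     es-a.example.internal
        storage.total_limit_size 30G

    [OUTPUT]
        Name                     es
        Match                    *
        Host                     es-b.example.internal
        storage.total_limit_size 30G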

@manojna10
Author

This looks similar to the issue 2 I mentioned.

@JeffLuoo
Contributor

@manojna10 Thank you. I will try to reproduce it.

@sjentzsch

@JeffLuoo We indeed had multiple output plugins pointing to Elasticsearch (two or three), each with a 30G limit, so I have to agree that they could have added up. However, I suspect we instead ran into the issue described here, as usually only one of our es outputs accumulates gigabytes of data (the others are rather quiet). No proof, unfortunately.

@JeffLuoo
Contributor

@sjentzsch Thank you. I will take a look.

@JeffLuoo
Contributor

@manojna10 Hi, could you please share the file (like https://pastebin.com/yD6kBx6T) that you used to reproduce issue 2, if possible? Appreciated!

@manojna10
Author

(screenshot attached)

flb_client.txt

Hi @JeffLuoo, I tried reproducing this issue, but regardless of how many times I tried, the max_chunks_up configuration was not honored and there were no DOWN chunks at all, even after the configured number of max_chunks_up was reached. Attaching the screenshot and the client file I used.

I even tried using #2804 to keep the number of busy chunks lower than the total number of UP chunks, but it didn't help with the above situation either.

I also tried the latest master code to check the behavior, and it is still the same. Not sure if I am missing something here, as this used to work before.

Because of this, I am unable to reproduce the original issue 2 reported here.

Thanks,
Manoj

@JeffLuoo
Contributor

@manojna10 Thank you... I just created a PR for issue one #3054

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 13, 2021
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.
