input_chunk: Get real chunk size when calculating fs_chunks_size #3054
Conversation
Signed-off-by: Jeff Luo <jeffluoo@google.com>
Steps to reproduce the issue mentioned above:
|
thanks! |
interesting.. while investigating the bad file descriptor issue, I found another bug..
It is weird that even if I used the
|
Update: I compared the chunk_size returned by
It looks like when removing chunks.. we need to first check whether the chunk is UP or DOWN.. my guess for the result above is that Fluent Bit is removing some UP chunks and some DOWN chunks.. WDYT about this? @edsiper |
FYI: I pushed a little change on input_chunk to fail any mapping of a previous chunk that cannot be read. I don't think it is associated, but just in case... so you say something around here is wrong:

    ssize_t flb_input_chunk_get_real_size(struct flb_input_chunk *ic)
    {
        /* report the size as seen by chunkio (the backing file size) */
        return cio_chunk_get_real_size(ic->chunk);
    }

do you have the chunk file to inspect it? |
That is just my guess... the chunk size (real_size vs content_size) I printed out above is the size of the chunk that we need to drop in order to place the new chunk (fluent-bit/src/flb_input_chunk.c, line 235 at 69b4f2f)
what do you mean by |
if you are using filesystem-based chunks, I am wondering if the chunk itself is corrupted in some way (path/tail.0/something.flb) |
@edsiper Any idea how I can inspect the chunk content while Fluent Bit is running? |
I think it is better if the chunks are in the filesystem, so if you hit the issue you can get a copy of that chunk and inspect it manually (I can help with that). But note also that a previous version of the code (recently fixed) might map a corrupted chunk; now, upon any error, the chunk won't be allowed to be processed. |
I saw that update. I will give it a try tmr! Thank you for the help. |
Update: I re-ran Fluent Bit on the cluster with the latest code on the master branch and below is the printout I got:
So I think it proves my guess that the UP chunk doesn't have a real_size and only has a content size. To my understanding: when filesystem buffering is enabled, the behavior of the engine is different; upon chunk creation, it stores the content in memory but also maps a copy on disk (through mmap(2)). A chunk that is active in memory and backed up on disk is said to be up, which means "the chunk content is up in memory". |
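As an illustration of the problem being discussed, below is a rough sketch (not the merged code) of how a size helper could account for an up chunk whose content has not been synced yet: when chunkio reports only the on-disk header, fall back to the in-memory content size plus an estimated header/metadata overhead. The helper name and the CHUNK_HEADER_OVERHEAD constant are assumptions for illustration; only the cio_* calls are chunkio API.

    #include <sys/types.h>
    #include <chunkio/chunkio.h>

    /* assumed approximation of the chunkio file header + CRC + padding +
     * metadata-length bytes; not an exact constant from the code base */
    #define CHUNK_HEADER_OVERHEAD 24

    static ssize_t chunk_effective_size(struct cio_chunk *chunk)
    {
        ssize_t size;

        /* for a chunk that has been synced (or is down), chunkio knows
         * the real backing file size */
        size = cio_chunk_get_real_size(chunk);
        if (size > CHUNK_HEADER_OVERHEAD) {
            return size;
        }

        /* up chunk not synced yet: estimate from the in-memory content */
        size = cio_chunk_get_content_size(chunk);
        if (size > 0) {
            size += CHUNK_HEADER_OVERHEAD;
        }
        return size;
    }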
@JeffLuoo I started testing the chunkio library independently and found the following. Note: clone it, compile it and run the unit test.
Analysis:
So I found a problem:

1. CHECK SIZE: real_size=0, content_size=0 - this is just after cio_chunk_open(), which creates a file with no content and no metadata, so the values look fine.
2. CHECK SIZE: real_size=24, content_size=0 - after calling cio_chunk_down(); in this case the chunk is synced to disk and the metadata structure is placed, so real_size=24 looks good.
3. CHECK SIZE: real_size=24, content_size=0 - call cio_chunk_up(); the values look fine, as expected.
4. CHECK SIZE: real_size=24, content_size=409600 - before calculating the values we call cio_chunk_write() and append content of 409600 bytes. The values look good: since no sync has been done to disk, real_size is still 24, which is fine.
5. CHECK SIZE: real_size=24, content_size=409600 - do cio_chunk_sync() and calculate the results. Here is the error: real_size is still reported as 24, but the real file on disk is 409624 bytes.

[work in progress] |
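For reference, the steps above correspond to a sequence roughly like the following. This is a sketch of the flow described in the analysis, not the actual chunkio unit test: it assumes ch is a freshly opened filesystem-backed chunk, buf is a 409600-byte payload, and check_size() is a made-up helper; only the cio_* calls are chunkio API.

    #include <stdio.h>
    #include <chunkio/chunkio.h>

    static void check_size(struct cio_chunk *ch, const char *step)
    {
        printf("%s: real_size=%zd, content_size=%zd\n", step,
               cio_chunk_get_real_size(ch),
               cio_chunk_get_content_size(ch));
    }

    static void reproduce(struct cio_chunk *ch, const char *buf)
    {
        check_size(ch, "1. CHECK SIZE");     /* empty file, no metadata yet  */

        cio_chunk_down(ch);                  /* sync: 24-byte header written */
        check_size(ch, "2. CHECK SIZE");

        cio_chunk_up(ch);                    /* map it back into memory      */
        check_size(ch, "3. CHECK SIZE");

        cio_chunk_write(ch, buf, 409600);    /* append content, no sync yet  */
        check_size(ch, "4. CHECK SIZE");

        cio_chunk_sync(ch);                  /* before the chunkio fix,      */
        check_size(ch, "5. CHECK SIZE");     /* real_size stayed at 24 here  */
    }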
relevant fix and test on chunkio repo: |
hmm, maybe it would be good too to sync a chunk to disk as soon as it is created.... (so at least it gets its metadata in place) |
FYI: I've synced the ChunkIO fixes on Fluent Bit GIT master. Can you re-try to reproduce the issue with the latest changes? If everything is good we might be able to release v1.7 |
@edsiper In the case you mentioned, "sync a chunk to disk as soon as it is created", will the size of the chunk on disk update when data is written into the UP chunk in memory? |
that would help, but I just realized that chunk sizes can be altered as soon as they are created, to optimize I/O. But I think recent changes on GIT master might address the root cause of the problem. |
ok let me try the latest change. |
@edsiper Tested on the latest build with the latest change but saw the following error....
|
I can share my Kubernetes YAML file used to reproduce it.. Debugging in a cluster environment is not efficient, but I can't reproduce a similar error locally.. Edit: From the error message it looks like, when dropping a chunk in order to place the new chunk, the disk file of the chunk to be deleted does not exist
and
|
there is something wrong with function |
potentially we are passing an invalid Tag because we are missing the check of the return value? https://github.com/fluent/fluent-bit/blob/master/src/flb_input_chunk.c#L503 |
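A sketch of that suggestion (the wrapper below is hypothetical and only illustrates checking the return value of flb_input_chunk_get_tag() before using the tag; it is not the merged patch):

    #include <fluent-bit/flb_input_chunk.h>
    #include <fluent-bit/flb_log.h>

    /* fetch the chunk tag and refuse to continue when the metadata cannot
     * be read (e.g. the chunk is down or corrupted) */
    static int chunk_tag_or_fail(struct flb_input_chunk *ic,
                                 const char **tag_buf, int *tag_len)
    {
        int ret;

        ret = flb_input_chunk_get_tag(ic, tag_buf, tag_len);
        if (ret == -1 || *tag_buf == NULL) {
            flb_error("[input chunk] could not retrieve tag for chunk %s",
                      flb_input_chunk_get_name(ic));
            return -1;
        }
        return 0;
    }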
Yes, the error happens when
|
is there a way that a Chunk is being deleted even when there is no metadata in it? |
I re-ran it with the tag printed out by adding:
and the error log:
and I saw the tag is (null) |
Why is there a missing Tag in the Chunk? Looks like a corner case.
…On Sun, Feb 14, 2021, 15:00 Jeff Luo ***@***.***> wrote:
I re-run it with tag print out by adding:
/* Retrieve Tag */
flb_input_chunk_get_tag(ic, &tag_buf, &tag_len);
flb_debug("[input chunk] flb_input_chunk_get_tag(3) returns tag value %s for chunk %s", tag_buf, flb_input_chunk_get_name(ic));
and the error log:
[2021/02/14 20:59:11] [debug] [input chunk] remove route of chunk 1-1613336185.281302277.flb with size 246817 bytes to output plugin forward.0 to place the incoming data with size 282 bytes,
[2021/02/14 20:59:11] [debug] [task] drop task_id 7 with no active route from input plugin tail.0
[2021/02/14 20:59:11] [debug] [task] destroy task=0x7f33928387c0 (task_id=7)
[2021/02/14 20:59:11] [debug] [input chunk] flb_input_chunk_get_tag(3) returns tag value (null) for chunk 1-1613336185.281302277.flb
#0 0x555d39fe2725 in _mm_loadu_si128() at gcc/x86_64-linux-gnu/8/include/emmintrin.h:703
#1 0x555d39fe2725 in XXH3_accumulate_512_sse2() at lib/xxHash-0.8.0/xxhash.h:3262
#2 0x555d39fe2ed5 in XXH3_accumulate() at lib/xxHash-0.8.0/xxhash.h:3675
#3 0x555d39fe2f71 in XXH3_hashLong_internal_loop() at lib/xxHash-0.8.0/xxhash.h:3697
#4 0x555d39fe31ae in XXH3_hashLong_64b_internal() at lib/xxHash-0.8.0/xxhash.h:3760
#5 0x555d39fe3267 in XXH3_hashLong_64b_default() at lib/xxHash-0.8.0/xxhash.h:3793
#6 0x555d39fe342d in XXH3_64bits_internal() at lib/xxHash-0.8.0/xxhash.h:3860
#7 0x555d39fe3468 in XXH3_64bits() at lib/xxHash-0.8.0/xxhash.h:3868
#8 0x555d39ecdcb5 in flb_hash_del_ptr() at src/flb_hash.c:90
#9 0x555d39f019dc in flb_input_chunk_destroy() at src/flb_input_chunk.c:513
#10 0x555d39ee7a6d in flb_task_destroy() at src/flb_task.c:435
#11 0x555d39f01056 in flb_input_chunk_find_space_new_data() at src/flb_input_chunk.c:250
#12 0x555d39f012c3 in flb_input_chunk_place_new_chunk() at src/flb_input_chunk.c:311
#13 0x555d39f01b27 in input_chunk_get() at src/flb_input_chunk.c:562
#14 0x555d39f02005 in flb_input_chunk_append_raw() at src/flb_input_chunk.c:772
#15 0x555d39f1cdf5 in process_content() at plugins/in_tail/tail_file.c:358
#16 0x555d39f1ea98 in flb_tail_file_chunk() at plugins/in_tail/tail_file.c:981
#17 0x555d39f19372 in in_tail_collect_event() at plugins/in_tail/tail.c:261
#18 0x555d39f23ba4 in tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#19 0x555d39ed612e in flb_input_collector_fd() at src/flb_input.c:1004
#20 0x555d39ee5d8c in flb_engine_handle_event() at src/flb_engine.c:352
#21 0x555d39ee5d8c in flb_engine_start() at src/flb_engine.c:613
#22 0x555d39ecbdb2 in flb_lib_worker() at src/flb_lib.c:493
#23 0x7f339548efa2 in ???() at ???:0
#24 0x7f3394ace4ce in ???() at ???:0
#25 0xffffffffffffffff in ???() at ???:0
and I saw tag is (null)
|
I am not sure about that, as the tag of a chunk should be assigned when the chunk is created, before adding data. And according to the logs there is indeed data in the chunk 1-1613336185.281302277.flb. |
Can you attach that chunk to the ticket? (As well, can you check whether the Chunk is up or down?)
|
what do you mean "attach that chunk to the ticket"? |
The Chunk must exist in the file system; I mean to attach the .flb file.
|
The chunk is down this time, according to the log:
I am testing on a GKE node and it looks like I am having some trouble downloading files from that node.... |
Adding tag checking will solve the issue, but the tag shouldn't be null, should it..?
|
yes, that solves the issue (crash). I think the root cause looks like this: the engine wants to drop a chunk because the queue is full, but if the chunk is down it's not possible to retrieve the metadata, so the Tag gets NULL; and if it is NULL, the chunk or task cannot be handled properly. I am testing forcing a chunk up once tag retrieval is required. |
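Roughly, the idea looks like the sketch below (a sketch of the approach, not commit 3a4879c itself; the wrapper name is hypothetical): bring a down chunk up before reading its Tag, then restore its previous state.

    #include <chunkio/chunkio.h>
    #include <fluent-bit/flb_input_chunk.h>

    /* hypothetical wrapper: make sure the chunk is up before reading the
     * Tag, since a down chunk has no accessible metadata */
    static int chunk_get_tag_safe(struct flb_input_chunk *ic,
                                  const char **tag_buf, int *tag_len)
    {
        int ret;
        int was_down = (cio_chunk_is_up(ic->chunk) == CIO_FALSE);

        if (was_down && cio_chunk_up(ic->chunk) != CIO_OK) {
            return -1;
        }

        ret = flb_input_chunk_get_tag(ic, tag_buf, tag_len);

        if (was_down) {
            cio_chunk_down(ic->chunk);   /* put it back down */
        }
        return ret;
    }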
I've placed fix 3a4879c:
|
@JeffLuoo thanks so much for your continuous help on this! |
@edsiper Thank you for the fix! |
@edsiper Hi Eduardo, I ran Fluent Bit on the master branch and added some printouts to the function
I want to compare the chunk content size before and after. Thanks. edit: Oh, never mind. It might be filtered out... Maybe I should update the debug message :D |
Signed-off-by: Jeff Luo jeffluoo@google.com
According to issue #2878 there is a case:
When storage.type is filesystem, mem_buf_limit is x and storage.total_limit_size is y, and y > x (the mainstream case), once it reaches y with some chunks being UP (up to x in memory) and the remaining DOWN in the filesystem, over the next few seconds it slowly falls back to a total disk size of only x and only has UP chunks. So it essentially stores on disk the same amount of data that is in memory.
And the cause is that if a chunk is up and we are over the configured memory limit, we put the chunk down from memory by calling munmap and then set the chunk's data_size to 0. When we loop through all the chunks to find space once the disk limit is reached, these down chunks are considered as 0 bytes and removed as well.
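Conceptually, that means the space-freeing loop has to ask chunkio for the on-disk size of down chunks instead of trusting data_size (which was reset to 0 when the chunk was put down). A simplified sketch of that accounting with a hypothetical helper name (in Fluent Bit the real loop lives in flb_input_chunk_find_space_new_data()):

    #include <sys/types.h>
    #include <chunkio/chunkio.h>

    /* hypothetical helper: how many bytes dropping this chunk would free.
     * For a down chunk the in-memory data_size is 0, so the backing file
     * size reported by chunkio is the value that matters. */
    static ssize_t chunk_bytes_freed(struct cio_chunk *chunk)
    {
        if (cio_chunk_is_up(chunk) == CIO_FALSE) {
            return cio_chunk_get_real_size(chunk);   /* file size on disk */
        }
        return cio_chunk_get_content_size(chunk);    /* bytes in memory   */
    }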
#2878
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.