Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

input_chunk: Get real chunk size when calculating fs_chunks_size #3054

Merged
merged 1 commit into from
Feb 10, 2021

Conversation

JeffLuoo
Copy link
Contributor

@JeffLuoo JeffLuoo commented Feb 10, 2021

Signed-off-by: Jeff Luo jeffluoo@google.com

According to issue #2878 there is a case:
When storage.type is filesystem, and mem_buf_limit is x and storage.total_limit_size is y, and if y > x(mainstream case), once it reaches y with some chunks being UP(upto x memory) and remaining DOWN in fs, over the next few seconds, it slowly falls back to total disk size of only x and only have UP chunks. So it essentially stores the same amount of data in disk which is in memory.

And the cause is that if the chunk is up and if we are over the configured memory limit, we will put the chunks down from memory by calls munmap and then set the chunks data_size to 0. When we loop through all the chunks to find some space when disk limit is reached, these down chunks were considered as 0 bytes and removed as well.

#2878

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

  • Documentation required for this feature

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Signed-off-by: Jeff Luo <jeffluoo@google.com>
@JeffLuoo
Copy link
Contributor Author

JeffLuoo commented Feb 10, 2021

Steps to reproduce the issue mentioned above:
Direct to the input_chunk test file and add the follow script to the test file:

#define JSON_1   "[1449505010, {\"message\": \"some value abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz %llu\"}]"

void flb_test_input_chunk_dropping_chunks()
{
    int ret;
    int in_ffd;
    int out_ffd;
    flb_ctx_t *ctx;
 
    /* Create library context */
    ctx = flb_create();
    if (!ctx) {
        return -1;
    }
 
   ret = flb_service_set(ctx,
                          FLB_CONF_STR_FLUSH, "1",
                          "storage.path", "/var/log/flbstorage",
                          "HTTP_Server", "on",
                          "HTTP_Listen", "0.0.0.0",
                          "HTTP_Port", "2020",
                          "storage.metrics", "on",
                          "Log_Level", "debug",
                          NULL);
    if (ret != 0) {
        flb_destroy(ctx);
        return -1;
    }
 
    /* Enable the input plugin for manual data ingestion */
    in_ffd = flb_input(ctx, "lib", NULL);
    if (in_ffd == -1) {
        flb_destroy(ctx);
        return -1;
    }
 
        ret = flb_input_set(ctx, in_ffd,
                        "storage.type", "filesystem",
                        "mem_buf_limit", "10M",
                        NULL);
 
        if (ret < 0) {
        flb_destroy(ctx);
        return -1;
    }
 
    out_ffd = flb_output(ctx, "syslog", NULL);
    if (out_ffd == -1) {
        flb_destroy(ctx);
        return -1;
    }
 
    ret = flb_output_set(ctx, out_ffd,
                             "match", "*",
                             "host", "0.0.4.4",
                             "port", "514",
                             "mode", "tcp",
                             "syslog_message_key", "message",
                             "Retry_Limit", "False",
                             "net.keepalive", "off",
                             "storage.total_limit_size", "30M",
                             NULL);
 
    if (ret < 0) {
        flb_destroy(ctx);
        return -1;
    }
 
 
    /* Start the engine */
    ret = flb_start(ctx);
    if (ret == -1) {
        flb_destroy(ctx);
        return -1;
    }
 
    unsigned long long count = 0;
    char buff[1000];
    int len;
    while(1) {
        count++;
        printf("INFO Total number of bytes sent %llu\n", count * 1000);
        len = snprintf(buff, 1000, JSON_1, count);
        /* Ingest data manually */
        flb_lib_push(ctx, in_ffd, buff, len);
        usleep(1000);
    }
 
    /* Stop the engine (5 seconds to flush remaining data) */
    flb_stop(ctx);
 
    /* Destroy library context, release all resources */
    flb_destroy(ctx);
}

@edsiper
Copy link
Member

edsiper commented Feb 10, 2021

thanks!

@JeffLuoo
Copy link
Contributor Author

interesting.. while I am investigating on the bad file descriptor issue.. I found another bugs..
I am running fluent bit with this patch with some custom print outs. I noticed some logs

jeffluoo@jeffluoo.c ~% k logs -n logging fluentbit-forwarder-rn9hb -f | grep "remove route"
[2021/02/13 03:14:40] [debug] [input chunk] remove route of chunk 1-1613186005.764098195.flb with size 0 bytes to output plugin forward.0 to place the incoming data with size 123332 bytes, the new fs_chunks_size is 59995771 bytes
[2021/02/13 03:14:40] [debug] [input chunk] remove route of chunk 1-1613186005.989650829.flb with size 2084095 bytes to output plugin forward.0 to place the incoming data with size 123332 bytes, the new fs_chunks_size is 57911676 bytes
[2021/02/13 03:14:40] [debug] [input chunk] remove route of chunk 1-1613186006.231761486.flb with size 2091172 bytes to output plugin forward.0 to place the incoming data with size 123285 bytes, the new fs_chunks_size is 57909534 bytes
[2021/02/13 03:14:42] [debug] [input chunk] remove route of chunk 1-1613186006.469965586.flb with size 2085625 bytes to output plugin forward.0 to place the incoming data with size 11804 bytes, the new fs_chunks_size is 57914570 bytes
[2021/02/13 03:14:46] [debug] [input chunk] remove route of chunk 1-1613186006.733319123.flb with size 0 bytes to output plugin forward.0 to place the incoming data with size 3870 bytes, the new fs_chunks_size is 60033276 bytes
[2021/02/13 03:14:46] [debug] [input chunk] remove route of chunk 1-1613186006.962696059.flb with size 695390 bytes to output plugin forward.0 to place the incoming data with size 4392 bytes, the new fs_chunks_size is 59348588 bytes
[2021/02/13 03:14:47] [debug] [input chunk] remove route of chunk 1-1613186007.587371.flb with size 2088044 bytes to output plugin forward.0 to place the incoming data with size 3793 bytes, the new fs_chunks_size is 57920286 bytes
[2021/02/13 03:14:51] [debug] [input chunk] remove route of chunk 1-1613186007.275307694.flb with size 2089122 bytes to output plugin forward.0 to place the incoming data with size 123319 bytes, the new fs_chunks_size is 57792739 bytes
[2021/02/13 03:14:54] [debug] [input chunk] remove route of chunk 1-1613186007.572437085.flb with size 2090919 bytes to output plugin forward.0 to place the incoming data with size 290 bytes, the new fs_chunks_size is 57909203 bytes
[2021/02/13 03:14:57] [debug] [input chunk] remove route of chunk 1-1613186007.771087397.flb with size 0 bytes to output plugin forward.0 to place the incoming data with size 274 bytes, the new fs_chunks_size is 60001456 bytes
[2021/02/13 03:14:57] [debug] [input chunk] remove route of chunk 1-1613186008.9751497.flb with size 1783239 bytes to output plugin forward.0 to place the incoming data with size 313 bytes, the new fs_chunks_size is 58218979 bytes
[2021/02/13 03:15:03] [debug] [input chunk] remove route of chunk 1-1613186008.966740421.flb with size 0 bytes to output plugin forward.0 to place the incoming data with size 292 bytes, the new fs_chunks_size is 60000288 bytes
[2021/02/13 03:15:03] [debug] [input chunk] remove route of chunk 1-1613186009.968701816.flb with size 0 bytes to output plugin forward.0 to place the incoming data with size 273 bytes, the new fs_chunks_size is 60001068 bytes
[2021/02/13 03:15:03] [debug] [input chunk] remove route of chunk 1-1613186010.969881425.flb with size 447916 bytes to output plugin forward.0 to place the incoming data with size 236 bytes, the new fs_chunks_size is 59553913 bytes
[2021/02/13 03:15:04] [debug] [input chunk] remove route of chunk 1-1613186011.993279355.flb with size 676128 bytes to output plugin forward.0 to place the incoming data with size 964 bytes, the new fs_chunks_size is 59325127 bytes
[2021/02/13 03:15:05] [debug] [input chunk] remove route of chunk 1-1613186012.969769096.flb with size 647534 bytes to output plugin forward.0 to place the incoming data with size 2421 bytes, the new fs_chunks_size is 59354101 bytes
[2021/02/13 03:15:06] [debug] [input chunk] remove route of chunk 1-1613186014.2233692.flb with size 696755 bytes to output plugin forward.0 to place the incoming data with size 3880 bytes, the new fs_chunks_size is 59302014 bytes

It is weird that even if I used the flb_input_chunk_get_real_size but the chunk size is still 0 as we can see from logs above

[2021/02/13 03:14:40] [debug] [input chunk] remove route of chunk 1-1613186005.764098195.flb with size 0 bytes to output plugin forward.0 to place the incoming data with size 123332 bytes, the new fs_chunks_size is 59995771 bytes

@JeffLuoo
Copy link
Contributor Author

Update:

I compared the chunk_size returned by flb_input_chunk_get_size(1) and flb_input_chunk_get_real_size(1):

jeffluoo@jeffluoo.c ~% k logs -n logging fluentbit-forwarder-wfx4c -f | grep "Old Chunk"
[2021/02/13 03:28:39] [debug] [input chunk] Old Chunk 1-1613186850.72176818.flb: the real_size is 2084708 bytes and the content size is 0 bytes
[2021/02/13 03:28:42] [debug] [input chunk] Old Chunk 1-1613186850.462084942.flb: the real_size is 0 bytes and the content size is 2086557 bytes
[2021/02/13 03:28:42] [debug] [input chunk] Old Chunk 1-1613186850.577884719.flb: the real_size is 2086611 bytes and the content size is 0 bytes
[2021/02/13 03:28:46] [debug] [input chunk] Old Chunk 1-1613186850.788861094.flb: the real_size is 0 bytes and the content size is 1043597 bytes
[2021/02/13 03:28:46] [debug] [input chunk] Old Chunk 1-1613186850.983952710.flb: the real_size is 2087061 bytes and the content size is 0 bytes
[2021/02/13 03:28:50] [debug] [input chunk] Old Chunk 1-1613186851.228092577.flb: the real_size is 0 bytes and the content size is 2088188 bytes
[2021/02/13 03:28:50] [debug] [input chunk] Old Chunk 1-1613186851.476461570.flb: the real_size is 2087974 bytes and the content size is 0 bytes
[2021/02/13 03:28:54] [debug] [input chunk] Old Chunk 1-1613186851.733023462.flb: the real_size is 2084615 bytes and the content size is 0 bytes
[2021/02/13 03:28:59] [debug] [input chunk] Old Chunk 1-1613186851.969074898.flb: the real_size is 347876 bytes and the content size is 0 bytes
[2021/02/13 03:29:00] [debug] [input chunk] Old Chunk 1-1613186851.988208880.flb: the real_size is 2082746 bytes and the content size is 2082576 bytes
[2021/02/13 03:29:04] [debug] [input chunk] Old Chunk 1-1613186852.268613851.flb: the real_size is 2086539 bytes and the content size is 0 bytes
[2021/02/13 03:29:08] [debug] [input chunk] Old Chunk 1-1613186852.558409858.flb: the real_size is 1091909 bytes and the content size is 0 bytes
[2021/02/13 03:29:10] [debug] [input chunk] Old Chunk 1-1613186852.969767372.flb: the real_size is 0 bytes and the content size is 665488 bytes
[2021/02/13 03:29:10] [debug] [input chunk] Old Chunk 1-1613186853.965823636.flb: the real_size is 595099 bytes and the content size is 594929 bytes
[2021/02/13 03:29:11] [debug] [input chunk] Old Chunk 1-1613186854.965263044.flb: the real_size is 499541 bytes and the content size is 0 bytes
[2021/02/13 03:29:12] [debug] [input chunk] Old Chunk 1-1613186855.965474345.flb: the real_size is 823478 bytes and the content size is 0 bytes
[2021/02/13 03:29:14] [debug] [input chunk] Old Chunk 1-1613186856.964349041.flb: the real_size is 840215 bytes and the content size is 0 bytes
[2021/02/13 03:29:15] [debug] [input chunk] Old Chunk 1-1613186857.968277122.flb: the real_size is 0 bytes and the content size is 693509 bytes
[2021/02/13 03:29:15] [debug] [input chunk] Old Chunk 1-1613186858.970721210.flb: the real_size is 408458 bytes and the content size is 0 bytes

It looks like while removing chunks.. we need to first check whether the chunk is UP or DOWN.. my guess to the result above is that the fluent bit is removing some UP chunks and some DOWN chunks.. WDYT about this? @edsiper

@edsiper
Copy link
Member

edsiper commented Feb 13, 2021

FYI: I pushed a little change on input_chunk to fail any mapping of the previous chunk that cannot be read. I don't think is associated but just in case...

so you say something around here is wrong:

ssize_t flb_input_chunk_get_real_size(struct flb_input_chunk *ic)
{
    return cio_chunk_get_real_size(ic->chunk);
}

do you have the chunk file to inspect it ?

@JeffLuoo
Copy link
Contributor Author

JeffLuoo commented Feb 13, 2021

That is just my guessing... the chunk size (real_size vs content size) I printed out above is the size of the chunk that we need to drop in order to place the new chunk

o_ins->fs_chunks_size -= old_ic_bytes;

what do you mean by do you have the chunk file to inspect it ?

@edsiper
Copy link
Member

edsiper commented Feb 13, 2021

if you are using file system based chunk, I am wondering if the chunk it self is corrupted in some way (path/tail.0/something.flb)

@JeffLuoo
Copy link
Contributor Author

@edsiper Any idea that I can inspect the chunk content while running? The real_size is 0 bytes looks weird to me.. I am not sure what will make chunks corrupt.

@edsiper
Copy link
Member

edsiper commented Feb 13, 2021

I think is better if the chunks are in the filesystem, so if you hit the issue you can get a copy of that chunk and inspect it manually (I can help with that). But note that also a previous version of the code (recently fixed) might map a corrupted chunk, now upon any error the chunk won't be allowed to be processed.

@JeffLuoo
Copy link
Contributor Author

I saw that update. I will give it a try tmr! Thank you for the help.

@JeffLuoo
Copy link
Contributor Author

JeffLuoo commented Feb 13, 2021

Update:

I re-run fluent bit on cluster with the latest code on master branch and below is the print out I got:

jeffluoo@jeffluoo.c ~% k logs -n logging fluentbit-forwarder-x4pqd -f | grep "Old Chunk" 
[2021/02/13 20:05:16] [debug] [input chunk] Old Chunk 1-1613246642.746347382.flb is up ? 1 and the real_size is 0 bytes and the content size is 1391083 bytes
[2021/02/13 20:05:16] [debug] [input chunk] Old Chunk 1-1613246643.216639988.flb is up ? 1 and the real_size is 0 bytes and the content size is 2088129 bytes
[2021/02/13 20:05:16] [debug] [input chunk] Old Chunk 1-1613246643.726888995.flb is up ? 0 and the real_size is 1740396 bytes and the content size is 0 bytes
[2021/02/13 20:05:20] [debug] [input chunk] Old Chunk 1-1613246644.62207371.flb is up ? 0 and the real_size is 2088549 bytes and the content size is 0 bytes
[2021/02/13 20:05:23] [debug] [input chunk] Old Chunk 1-1613246644.494620411.flb is up ? 0 and the real_size is 2088743 bytes and the content size is 0 bytes
[2021/02/13 20:05:27] [debug] [input chunk] Old Chunk 1-1613246644.978184931.flb is up ? 0 and the real_size is 347084 bytes and the content size is 0 bytes
[2021/02/13 20:05:28] [debug] [input chunk] Old Chunk 1-1613246645.66966551.flb is up ? 0 and the real_size is 2088484 bytes and the content size is 0 bytes
[2021/02/13 20:05:32] [debug] [input chunk] Old Chunk 1-1613246645.486606083.flb is up ? 1 and the real_size is 0 bytes and the content size is 2087834 bytes
[2021/02/13 20:05:32] [debug] [input chunk] Old Chunk 1-1613246645.992097928.flb is up ? 0 and the real_size is 348009 bytes and the content size is 0 bytes
[2021/02/13 20:05:33] [debug] [input chunk] Old Chunk 1-1613246646.210778454.flb is up ? 0 and the real_size is 2087776 bytes and the content size is 0 bytes
[2021/02/13 20:05:37] [debug] [input chunk] Old Chunk 1-1613246646.574571788.flb is up ? 0 and the real_size is 2089200 bytes and the content size is 0 bytes
[2021/02/13 20:05:40] [debug] [input chunk] Old Chunk 1-1613246647.65638327.flb is up ? 1 and the real_size is 0 bytes and the content size is 841936 bytes
[2021/02/13 20:05:40] [debug] [input chunk] Old Chunk 1-1613246647.964191892.flb is up ? 0 and the real_size is 605766 bytes and the content size is 0 bytes

So I think it proofs my guessing that the UP chunk doesn't have real_size and only has content size. To my understanding, if storage.type is set to filesystem, UP chunks will be mapped to a chunk on disk according to:

When the Filesystem buffering is enabled, the behavior of the engine is different, upon Chunk creation, it stores the content in memory but also it maps a copy on disk (through mmap(2)), this Chunk is active in memory and backed up in disk is called to be up which means "the chunk content is up in memory".

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021

@JeffLuoo I started testing chunkio library independently and found the following:

note: clone it, compile it and run the unit test fs_up_down, (I've added some debug messages) e.g:

Test fs_up_down...                              
[cio-test-fs] created root path /tmp/cio-fs-test/                          => src/chunkio.c:48
1. CHECK SIZE: real_size=0, content_size=0
2. CHECK SIZE: real_size=24, content_size=0
3. CHECK SIZE: real_size=24, content_size=0
4. CHECK SIZE: real_size=24, content_size=409600
5. CHECK SIZE: real_size=24, content_size=409600

Analysis:

  • real_size aims to be the real size of the file in the file system
  • content_size: the number of bytes as content, not necessary synced to disk yet

So I found a problem:

1. CHECK SIZE: real_size=0, content_size=0

this is just after do cio_chunk_open() which creates a file, no content no metadata, si values looks fine.

2. CHECK SIZE: real_size=24, content_size=0

after calling cio_chunk_down(), on this case the chunk is synced to disk and the metadata structure is placed, so real_size=24 looks good.

3. CHECK SIZE: real_size=24, content_size=0

call cio_chunk_up(), values looks fine as expected.

4. CHECK SIZE: real_size=24, content_size=409600

before calculate the values, we call cio_chunk_write() and append content of 409600 bytes length. So values looks good, since no sync has been done to disk ,real_size is 24 which is fine.

5. CHECK SIZE: real_size=24, content_size=409600

do cio_chunk_sync() and calculate results. Here is the error, real_size is still reported as 24 but the real file on disk reports 409624.

[work in process]

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021

@JeffLuoo

relevant fix and test on chunkio repo:

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021

hmm maybe would be good too to sync to disk a chunk as soon as is created.... (so at least get's metadata in place)

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021

@JeffLuoo

FYI: I've synced ChunkIO fixed on Fluent Bit GIT master. Can you re-try reproduce the issue with the latest changes ?, if everything is good we might be able to release v1.7

@JeffLuoo
Copy link
Contributor Author

@edsiper If it is the case you mentioned "sync to disk a chunk as soon as is created", will the size of chunk on disk update when data is written into the UP chunk on memory?

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021

that's a helper but just realized that Chunks Size can be altered as soon as they are created to optimize I/O. But I think recent changes on GIT master might help on the root cause of the problem.

@JeffLuoo
Copy link
Contributor Author

ok let me try the latest change.

@JeffLuoo
Copy link
Contributor Author

@edsiper Tested on latest build with the latest change but saw the following error....

[2021/02/14 20:27:59] [debug] [input chunk] update plugin forward.0 fs_chunk_size by 744 bytes, the current fs_chunk_size is 39998951 bytes
[2021/02/14 20:27:59] [debug] [input:tail:tail.0] inode=1178491 events: IN_MODIFY 
[2021/02/14 20:27:59] [debug] [input chunk] chunk 1-1613334479.413440716.flb required 227 bytes and 1049 bytes left in plugin forward.0
[2021/02/14 20:27:59] [debug] [input chunk] update plugin forward.0 fs_chunk_size by 715 bytes, the current fs_chunk_size is 39999666 bytes
[2021/02/14 20:27:59] [debug] [input:tail:tail.0] inode=1178491 events: IN_MODIFY 
[2021/02/14 20:27:59] [debug] [input chunk] chunk 1-1613334479.413440716.flb required 293 bytes and 334 bytes left in plugin forward.0
[2021/02/14 20:27:59] [debug] [input chunk] update plugin forward.0 fs_chunk_size by 781 bytes, the current fs_chunk_size is 40000447 bytes
[2021/02/14 20:27:59] [debug] [input:tail:tail.0] inode=1178491 events: IN_MODIFY 
[2021/02/14 20:27:59] [debug] [input chunk] chunk 1-1613334479.413440716.flb required 233 bytes and -447 bytes left in plugin forward.0
[2021/02/14 20:27:59] [debug] [input chunk] remove route of chunk 1-1613334377.982968393.flb with size 1740876 bytes to output plugin forward.0 to place the incoming data with size 233 bytes
[2021/02/14 20:27:59] [debug] [task] drop task_id 5 with no active route from input plugin tail.0
[2021/02/14 20:27:59] [debug] [task] destroy task=0x7f7d30838400 (task_id=5)
[lib/chunkio/src/cio_file.c:299 errno=9] Bad file descriptor
[2021/02/14 20:27:59] [engine] caught signal (SIGSEGV)
#0  0x555f881906de      in  _mm_loadu_si128() at gcc/x86_64-linux-gnu/8/include/emmintrin.h:703
#1  0x555f881906de      in  XXH3_accumulate_512_sse2() at lib/xxHash-0.8.0/xxhash.h:3262
#2  0x555f88190e8e      in  XXH3_accumulate() at lib/xxHash-0.8.0/xxhash.h:3675
#3  0x555f88190f2a      in  XXH3_hashLong_internal_loop() at lib/xxHash-0.8.0/xxhash.h:3697
#4  0x555f88191167      in  XXH3_hashLong_64b_internal() at lib/xxHash-0.8.0/xxhash.h:3760
#5  0x555f88191220      in  XXH3_hashLong_64b_default() at lib/xxHash-0.8.0/xxhash.h:3793
#6  0x555f881913e6      in  XXH3_64bits_internal() at lib/xxHash-0.8.0/xxhash.h:3860
#7  0x555f88191421      in  XXH3_64bits() at lib/xxHash-0.8.0/xxhash.h:3868
#8  0x555f8807bcb5      in  flb_hash_del_ptr() at src/flb_hash.c:90
#9  0x555f880af995      in  flb_input_chunk_destroy() at src/flb_input_chunk.c:512
#10 0x555f88095a6d      in  flb_task_destroy() at src/flb_task.c:435
#11 0x555f880af056      in  flb_input_chunk_find_space_new_data() at src/flb_input_chunk.c:250
#12 0x555f880af2c3      in  flb_input_chunk_place_new_chunk() at src/flb_input_chunk.c:311
#13 0x555f880afae0      in  input_chunk_get() at src/flb_input_chunk.c:561
#14 0x555f880affbe      in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:771
#15 0x555f880cadae      in  process_content() at plugins/in_tail/tail_file.c:358
#16 0x555f880cca51      in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:981
#17 0x555f880c732b      in  in_tail_collect_event() at plugins/in_tail/tail.c:261
#18 0x555f880d1b5d      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#19 0x555f8808412e      in  flb_input_collector_fd() at src/flb_input.c:1004
#20 0x555f88093d8c      in  flb_engine_handle_event() at src/flb_engine.c:352
#21 0x555f88093d8c      in  flb_engine_start() at src/flb_engine.c:613
#22 0x555f88079db2      in  flb_lib_worker() at src/flb_lib.c:493
#23 0x7f7d33557fa2      in  ???() at ???:0
#24 0x7f7d32b974ce      in  ???() at ???:0
#25 0xffffffffffffffff  in  ???() at ???:0

@JeffLuoo
Copy link
Contributor Author

JeffLuoo commented Feb 14, 2021

I can share my kubernetes yaml file used to reproduce it.. Debugging in a cluster environment is not efficient but I can't reproduce similar error locally..

Edit:

From error message it looks like when dropping the chunk in order to place the new chunk, the disk file of the chunk to be deleted does not exist

[2021/02/14 20:27:59] [debug] [task] drop task_id 5 with no active route from input plugin tail.0
[2021/02/14 20:27:59] [debug] [task] destroy task=0x7f7d30838400 (task_id=5)
[lib/chunkio/src/cio_file.c:299 errno=9] Bad file descriptor

and

#8  0x555f8807bcb5      in  flb_hash_del_ptr() at src/flb_hash.c:90

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021

there is something wrong with function int flb_input_chunk_find_space_new_data(). If you look at the stacktrace we are just trying to ingest data into the engine, note that a Task does not exist until flush time.

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021

potentially we are passing an invalid Tag because of missing the check of the return value ?

https://github.com/fluent/fluent-bit/blob/master/src/flb_input_chunk.c#L503

@JeffLuoo
Copy link
Contributor Author

Yes, the error happens when flb_task_destroy()is invoked

                    if (old_ic->task->users == 0) {
                        flb_debug("[task] drop task_id %d with no active route from input plugin %s",
                                  old_ic->task->id, ic->in->name);
                        flb_task_destroy(old_ic->task, FLB_TRUE);
                    }
                }

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021

is there a way that a Chunk is trying to be deleted even when there is no metadata on it ?

@JeffLuoo
Copy link
Contributor Author

I re-run it with tag print out by adding:

    /* Retrieve Tag */
    flb_input_chunk_get_tag(ic, &tag_buf, &tag_len);
    flb_debug("[input chunk] flb_input_chunk_get_tag(3) returns tag value %s for chunk %s", tag_buf, flb_input_chunk_get_name(ic));

and the error log:

[2021/02/14 20:59:11] [debug] [input chunk] remove route of chunk 1-1613336185.281302277.flb with size 246817 bytes to output plugin forward.0 to place the incoming data with size 282 bytes,
[2021/02/14 20:59:11] [debug] [task] drop task_id 7 with no active route from input plugin tail.0
[2021/02/14 20:59:11] [debug] [task] destroy task=0x7f33928387c0 (task_id=7)
[2021/02/14 20:59:11] [debug] [input chunk] flb_input_chunk_get_tag(3) returns tag value (null) for chunk 1-1613336185.281302277.flb
#0  0x555d39fe2725      in  _mm_loadu_si128() at gcc/x86_64-linux-gnu/8/include/emmintrin.h:703
#1  0x555d39fe2725      in  XXH3_accumulate_512_sse2() at lib/xxHash-0.8.0/xxhash.h:3262
#2  0x555d39fe2ed5      in  XXH3_accumulate() at lib/xxHash-0.8.0/xxhash.h:3675
#3  0x555d39fe2f71      in  XXH3_hashLong_internal_loop() at lib/xxHash-0.8.0/xxhash.h:3697
#4  0x555d39fe31ae      in  XXH3_hashLong_64b_internal() at lib/xxHash-0.8.0/xxhash.h:3760
#5  0x555d39fe3267      in  XXH3_hashLong_64b_default() at lib/xxHash-0.8.0/xxhash.h:3793
#6  0x555d39fe342d      in  XXH3_64bits_internal() at lib/xxHash-0.8.0/xxhash.h:3860
#7  0x555d39fe3468      in  XXH3_64bits() at lib/xxHash-0.8.0/xxhash.h:3868
#8  0x555d39ecdcb5      in  flb_hash_del_ptr() at src/flb_hash.c:90
#9  0x555d39f019dc      in  flb_input_chunk_destroy() at src/flb_input_chunk.c:513
#10 0x555d39ee7a6d      in  flb_task_destroy() at src/flb_task.c:435
#11 0x555d39f01056      in  flb_input_chunk_find_space_new_data() at src/flb_input_chunk.c:250
#12 0x555d39f012c3      in  flb_input_chunk_place_new_chunk() at src/flb_input_chunk.c:311
#13 0x555d39f01b27      in  input_chunk_get() at src/flb_input_chunk.c:562
#14 0x555d39f02005      in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:772
#15 0x555d39f1cdf5      in  process_content() at plugins/in_tail/tail_file.c:358
#16 0x555d39f1ea98      in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:981
#17 0x555d39f19372      in  in_tail_collect_event() at plugins/in_tail/tail.c:261
#18 0x555d39f23ba4      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#19 0x555d39ed612e      in  flb_input_collector_fd() at src/flb_input.c:1004
#20 0x555d39ee5d8c      in  flb_engine_handle_event() at src/flb_engine.c:352
#21 0x555d39ee5d8c      in  flb_engine_start() at src/flb_engine.c:613
#22 0x555d39ecbdb2      in  flb_lib_worker() at src/flb_lib.c:493
#23 0x7f339548efa2      in  ???() at ???:0
#24 0x7f3394ace4ce      in  ???() at ???:0
#25 0xffffffffffffffff  in  ???() at ???:0

and I saw tag is (null)

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021 via email

@JeffLuoo
Copy link
Contributor Author

I am not sure about that as the tag of chunk should be assigned when the chunk is created before adding data. And according to the logs there are indeed datas in the chunk 1-1613336185.281302277.flb.

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021 via email

@JeffLuoo
Copy link
Contributor Author

what do you mean "attach that chunk to the ticket"?

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021 via email

@JeffLuoo
Copy link
Contributor Author

JeffLuoo commented Feb 14, 2021

Chunk is down this time according to log:

[2021/02/14 21:49:49] [debug] [input chunk] chunk 1-1613339389.328942711.flb required 501 bytes and -292 bytes left in plugin forward.0
[2021/02/14 21:49:49] [debug] [input chunk] remove route of chunk 1-1613339252.343859558.flb with size 347872 bytes to output plugin forward.0 to place the incoming data with size 501 bytes,
[2021/02/14 21:49:49] [debug] [task] drop task_id 6 with no active route from input plugin tail.0
[2021/02/14 21:49:49] [debug] [task] destroy task=0x7f116ae385e0 (task_id=6)
[lib/chunkio/src/cio_file.c:299 errno=9] Bad file descriptor
[2021/02/14 21:49:49] [debug] [input chunk] flb_input_chunk_get_tag(3) returns tag value (null) for chunk 1-1613339252.343859558.flb. Is chunk up ? 0
[2021/02/14 21:49:49] [engine] caught signal (SIGSEGV)
#0  0x559b263f6788      in  _mm_loadu_si128() at gcc/x86_64-linux-gnu/8/include/emmintrin.h:703
#1  0x559b263f6788      in  XXH3_accumulate_512_sse2() at lib/xxHash-0.8.0/xxhash.h:3262
#2  0x559b263f6f38      in  XXH3_accumulate() at lib/xxHash-0.8.0/xxhash.h:3675
#3  0x559b263f6fd4      in  XXH3_hashLong_internal_loop() at lib/xxHash-0.8.0/xxhash.h:3697
#4  0x559b263f7211      in  XXH3_hashLong_64b_internal() at lib/xxHash-0.8.0/xxhash.h:3760
#5  0x559b263f72ca      in  XXH3_hashLong_64b_default() at lib/xxHash-0.8.0/xxhash.h:3793
#6  0x559b263f7490      in  XXH3_64bits_internal() at lib/xxHash-0.8.0/xxhash.h:3860
#7  0x559b263f74cb      in  XXH3_64bits() at lib/xxHash-0.8.0/xxhash.h:3868
#8  0x559b262e1cb5      in  flb_hash_del_ptr() at src/flb_hash.c:90
#9  0x559b26315a3b      in  flb_input_chunk_destroy() at src/flb_input_chunk.c:527
#10 0x559b262fba6d      in  flb_task_destroy() at src/flb_task.c:435
#11 0x559b2631509a      in  flb_input_chunk_find_space_new_data() at src/flb_input_chunk.c:253
#12 0x559b26315307      in  flb_input_chunk_place_new_chunk() at src/flb_input_chunk.c:314
#13 0x559b26315b8a      in  input_chunk_get() at src/flb_input_chunk.c:576
#14 0x559b26316068      in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:786
#15 0x559b26330e58      in  process_content() at plugins/in_tail/tail_file.c:358
#16 0x559b26332afb      in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:981
#17 0x559b2632d3d5      in  in_tail_collect_event() at plugins/in_tail/tail.c:261
#18 0x559b26337c07      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#19 0x559b262ea12e      in  flb_input_collector_fd() at src/flb_input.c:1004
#20 0x559b262f9d8c      in  flb_engine_handle_event() at src/flb_engine.c:352
#21 0x559b262f9d8c      in  flb_engine_start() at src/flb_engine.c:613
#22 0x559b262dfdb2      in  flb_lib_worker() at src/flb_lib.c:493
#23 0x7f116dabffa2      in  ???() at ???:0
#24 0x7f116d0ff4ce      in  ???() at ???:0
#25 0xffffffffffffffff  in  ???() at ???:0

I am testing on GKE node and looks like I am having some troubles to download files from that node....

@JeffLuoo
Copy link
Contributor Author

Adding tag checking will solve the issue but does the tag shouldn't be null..?

    ret = flb_input_chunk_get_tag(ic, &tag_buf, &tag_len);
    if (ret == -1) {
        flb_error("[input chunk] error retrieving tag of input chunk");
        cio_chunk_close(ic->chunk, del);
        mk_list_del(&ic->_head);
        flb_free(ic);
        return ret;
    }

@edsiper
Copy link
Member

edsiper commented Feb 14, 2021

yes, that solves the issue (crash).

I think the root cause looks like this: engine wants to drop a chunk because the queue is full, but if the chunk is down, it's not possible to retrieve the metadata so Tag gets NULL, but if it gets null, the chunk or task cannot be handled properly.

I am testing forcing a chunk up once tag retrieval is required.

@edsiper
Copy link
Member

edsiper commented Feb 15, 2021

I've placed fix 3a4879c:

  • check if the chunk is down, if so put the chunk up
  • validate return status of tag retrieval
  • do not drop cache entry if tag is NULL

@edsiper
Copy link
Member

edsiper commented Feb 15, 2021

@JeffLuoo thanks so much for your continuous help on this!

@JeffLuoo
Copy link
Contributor Author

@edsiper Thank you for the fix!

@JeffLuoo
Copy link
Contributor Author

JeffLuoo commented Mar 7, 2021

@edsiper Hi Eduardo, I ran Fluent Bit on master branch and added some print outs to the function flb_input_chunk_append_raw(5) and see below logs:

[2021/03/07 17:14:54] [debug] [storage] systemd.4:1-1615137294.876917297.flb adjusting size OK
[2021/03/07 17:14:54] [debug] [storage] systemd.4:1-1615137294.876917297.flb mapped OK
[2021/03/07 17:14:54] [debug] [input chunk] chunk 1-1615137294.876917297.flb required 5145 bytes and 49994414 bytes left in plugin stackdriver.1
[2021/03/07 17:14:54] [debug] [input chunk] Chunk 1-1615137294.876917297.flb before write: Content_size 0 vs Real_size (null)
[2021/03/07 17:14:54] [debug] [input chunk] Chunk 1-1615137294.876917297.flb after write (and filter): Content_size 0 vs Real_size (null)
[2021/03/07 17:14:54] [debug] [input chunk] Chunk 1-1615137294.876917297.flb: The diff is 36 bytes
[2021/03/07 17:14:54] [debug] [input chunk] Chunk 1-1615137294.876917297.flb: content_size 0 bytes vs real size 0 bytes
[2021/03/07 17:14:54] [debug] [input chunk] Remove chunk 1-1615137294.876917297.flb with 0 bytes, the fs_chunk_size is 5586 bytes
[2021/03/07 17:14:54] [debug] [storage] [cio file] synced at: systemd.4/1-1615137294.876917297.flb

I want to compare the chunk content size before and after flb_input_chunk_write(3). I expect the content size to be the size of bytes written into the chunk (e.g. 5145 bytes from logs above).. However the content size is 0 all the time from logs. I am just wondering that do I mis-understand the function flb_input_chunk_write(3) or there is a specific case that will cause this issue?

Thanks.

edit: Oh nvm. It might be filtered out... Maybe I should update the debug message : D

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants