Fluent Bit crashed with SIGSEGV error #6897

Closed
ezienecker opened this issue Feb 22, 2023 · 10 comments
Labels: Stale, waiting-for-release (this has been fixed/merged but it's waiting to be included in a release)

Comments

@ezienecker

We recently updated to fluent-bit 2.0.9 (Helm Chart version 0.24.0). Since then, the container regularly exits with error code 139 (SIGSEGV).
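For context, exit code 139 follows the common shell and container-runtime convention of reporting a signal death as 128 plus the signal number, and signal 11 is SIGSEGV. A minimal Python sketch of that decoding follows; the helper name `describe_exit_code` is made up for illustration and is not part of Fluent Bit or Helm.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a shell-style exit code (128 + N conventionally means 'killed by signal N')."""
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            return f"killed by unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited normally with status {code}"


print(describe_exit_code(139))
```

The signal name lookup depends on the platform the script runs on; on Linux, signal 11 resolves to SIGSEGV.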

The following error message appears in the logs:

[2023/02/22 07:27:34] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2023/02/22 07:29:06] [engine] caught signal (SIGSEGV)
#0  0x55c299f28772      in  edata_arena_ind_get() at lib/jemalloc-5.3.0/include/jemalloc/internal/edata.h:258
#1  0x55c299f28772      in  tcache_bin_flush_impl() at lib/jemalloc-5.3.0/src/tcache.c:350
#2  0x55c299f28772      in  tcache_bin_flush_bottom() at lib/jemalloc-5.3.0/src/tcache.c:519
#3  0x55c299f28772      in  je_tcache_bin_flush_small() at lib/jemalloc-5.3.0/src/tcache.c:529
#4  0x55c299f29cb9      in  tcache_gc_small() at lib/jemalloc-5.3.0/src/tcache.c:148
#5  0x55c299f2bd71      in  ???() at lib/jemalloc-5.3.0/src/tcache.c:414
#6  0x55c299f2e62f      in  je_te_event_trigger() at lib/jemalloc-5.3.0/src/thread_event.c:299
#7  0x55c299ebf6ac      in  te_event_advance() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:287
#8  0x55c299ebf6ac      in  thread_dalloc_event() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:293
#9  0x55c299ebf6ac      in  ifree() at lib/jemalloc-5.3.0/src/jemalloc.c:2896
#10 0x55c299ebf6ac      in  je_free_default() at lib/jemalloc-5.3.0/src/jemalloc.c:3021
#11 0x55c29a497053      in  map_metric_destroy() at lib/cmetrics/src/cmt_map.c:160
#12 0x55c29a4973f3      in  cmt_map_destroy() at lib/cmetrics/src/cmt_map.c:273
#13 0x55c29a480110      in  cmt_counter_destroy() at lib/cmetrics/src/cmt_counter.c:94
#14 0x55c29a4a57ff      in  cmt_destroy() at lib/cmetrics/src/cmetrics.c:101
#15 0x55c29a016e23      in  collect_metrics() at src/flb_metrics_exporter.c:201
#16 0x55c29a016f57      in  flb_me_fd_event() at src/flb_metrics_exporter.c:253
#17 0x55c299f9d7d0      in  flb_engine_handle_event() at src/flb_engine.c:497
#18 0x55c299f9d7d0      in  flb_engine_start() at src/flb_engine.c:853
#19 0x55c299f44b24      in  flb_lib_worker() at src/flb_lib.c:629
#20 0x7f181e43bea6      in  ???() at ???:0
#21 0x7f181dcefa2e      in  ???() at ???:0
#22 0xffffffffffffffff  in  ???() at ???:0

On other instances, a different stack trace appears:

[2023/02/21 20:33:54] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2023/02/21 20:43:54] [engine] caught signal (SIGSEGV)
#0  0x55c4622aae03      in  atomic_load_p() at lib/jemalloc-5.3.0/include/jemalloc/internal/atomic.h:83
#1  0x55c4622aae03      in  arena_get_from_edata() at lib/jemalloc-5.3.0/include/jemalloc/internal/arena_inlines_b.h:16
#2  0x55c4622aae03      in  je_large_dalloc() at lib/jemalloc-5.3.0/src/large.c:271
#3  0x55c46224e700      in  arena_dalloc_large() at lib/jemalloc-5.3.0/include/jemalloc/internal/arena_inlines_b.h:297
#4  0x55c46224e700      in  arena_dalloc() at lib/jemalloc-5.3.0/include/jemalloc/internal/arena_inlines_b.h:334
#5  0x55c46224e700      in  idalloctm() at lib/jemalloc-5.3.0/include/jemalloc/internal/jemalloc_internal_inlines_c.h:120
#6  0x55c46224e700      in  ifree() at lib/jemalloc-5.3.0/src/jemalloc.c:2887
#7  0x55c46224e700      in  je_free_default() at lib/jemalloc-5.3.0/src/jemalloc.c:3014
#8  0x55c4622e967a      in  flb_free() at include/fluent-bit/flb_mem.h:121
#9  0x55c4622ea913      in  flb_sds_destroy() at src/flb_sds.c:470
#10 0x55c46261b220      in  pack_record() at plugins/out_loki/loki.c:1233
#11 0x55c46261b6de      in  loki_compose_payload() at plugins/out_loki/loki.c:1381
#12 0x55c46261b7bd      in  cb_loki_flush() at plugins/out_loki/loki.c:1408
#13 0x55c4623079ae      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:528
#14 0x55c462d6e3a6      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
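Both traces end inside jemalloc's free path but are reached from different callers. To compare such traces quickly, a small script can pull out the first frame that is not inside the allocator. This is only a sketch: the function name `first_non_allocator_frame` and the "jemalloc" path marker are assumptions for illustration, and the result points at the caller of the failing free, not necessarily at the root cause.

```python
import re

# Matches frames in the format printed above, for example:
# "#11 0x55c29a497053      in  map_metric_destroy() at lib/cmetrics/src/cmt_map.c:160"
FRAME_RE = re.compile(
    r"#(?P<index>\d+)\s+0x[0-9a-f]+\s+in\s+(?P<func>\S+)\(\)\s+at\s+(?P<file>\S+):(?P<line>\d+)"
)


def first_non_allocator_frame(trace: str, allocator_marker: str = "jemalloc"):
    """Return (function, file, line) of the first frame whose path lacks the marker."""
    for raw in trace.splitlines():
        match = FRAME_RE.search(raw)
        if match and allocator_marker not in match.group("file"):
            return match.group("func"), match.group("file"), int(match.group("line"))
    return None  # no frame outside the allocator was found


# Excerpts copied from the two traces in this report.
trace_metrics = """\
#9  0x55c299ebf6ac      in  ifree() at lib/jemalloc-5.3.0/src/jemalloc.c:2896
#10 0x55c299ebf6ac      in  je_free_default() at lib/jemalloc-5.3.0/src/jemalloc.c:3021
#11 0x55c29a497053      in  map_metric_destroy() at lib/cmetrics/src/cmt_map.c:160
#12 0x55c29a4973f3      in  cmt_map_destroy() at lib/cmetrics/src/cmt_map.c:273
"""

trace_loki = """\
#6  0x55c46224e700      in  ifree() at lib/jemalloc-5.3.0/src/jemalloc.c:2887
#7  0x55c46224e700      in  je_free_default() at lib/jemalloc-5.3.0/src/jemalloc.c:3014
#8  0x55c4622e967a      in  flb_free() at include/fluent-bit/flb_mem.h:121
#9  0x55c4622ea913      in  flb_sds_destroy() at src/flb_sds.c:470
"""

print(first_non_allocator_frame(trace_metrics))
print(first_non_allocator_frame(trace_loki))
```

For the excerpts above, the first trace resolves to the cmetrics cleanup path and the second to the string cleanup called from the Loki output plugin.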

This error also occurs in the following versions:

  • Helm Chart version 0.23.0 (fluent-bit version 2.0.8)
  • Helm Chart version 0.22.0 (fluent-bit version 2.0.8)
  • Helm Chart version 0.21.0 (fluent-bit version not checked)

As a temporary workaround, I have downgraded to version 1.9.9 (Helm Chart version 0.20.11). Everything seems to work so far.

Maybe a bug was introduced with version 2.x?

@patrick-stephens
Contributor

Can you provide the full configuration you're using that triggers the error?
More of the Fluent Bit logs would help as well.

@ezienecker
Author

This is the current configuration:

image:
  pullPolicy: IfNotPresent

resources:
  limits:
    cpu: 100m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 128Mi

logLevel: info

config:
  inputs: |
    [INPUT]
        Name                    tail
        Path                    /var/log/containers/*.log
        Parser                  cri
        Tag                     kube.*
        Mem_Buf_Limit           5MB
        Buffer_Chunk_Size       64KB
        Buffer_Max_Size         128KB
        Skip_Long_Lines         On

  filters: |
    [FILTER]
        Name                    kubernetes
        Match                   kube.*
        K8S-Logging.Parser      On
        K8S-Logging.Exclude     On
        Buffer_Size             256KB

    # Only keep logs from namespaces containing 'test1' or 'test2'
    [FILTER]
        Name                    grep
        Match                   kube.*
        Regex                   $kubernetes['namespace_name'] ((?:.+-)?(test1|test2)(?:infra-.+)?)
    
    # Append environment to tag
    [FILTER]
        Name                    rewrite_tag
        Match                   kube.*
        Rule                    $kubernetes['namespace_name'] ((?:.+-)?(test1|test2)(?:infra-.+)?) $2.$TAG false

  outputs: |
    [OUTPUT]
        Name                    loki
        Match                   test1.kube.*
        Host                    loki
        port                    3100
        labels                  environment=test1
        remove_keys             stream,logtag,kubernetes
        drop_single_key         on
        auto_kubernetes_labels  on
    
    [OUTPUT]
        Name                    loki
        Match                   test2.kube.*
        Host                    loki
        port                    3100
        labels                  environment=test2
        line_format             key_value
        remove_keys             stream,logtag,kubernetes,kubernetes_namespace
        drop_single_key         on
        auto_kubernetes_labels  on


  customParsers: |
    [PARSER]
        Name                    cri
        Format                  regex
        Regex                   ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key                time
        Time_Format             %Y-%m-%dT%H:%M:%S.%L%z
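To make the routing in this configuration easier to follow, here is a rough Python sketch of what the grep and rewrite_tag filters do with the namespace regex. It assumes an unanchored search and uses Python's `re` module as a stand-in for Fluent Bit's own regex engine, so edge cases may differ; the function name `rewrite_tag` and the sample namespaces and tags are made up for illustration.

```python
import re

# Same pattern as in the grep and rewrite_tag filters above.
NAMESPACE_RE = re.compile(r"((?:.+-)?(test1|test2)(?:infra-.+)?)")


def rewrite_tag(namespace: str, tag: str):
    """Return the rewritten tag, or None if the record would not match the pattern."""
    match = NAMESPACE_RE.search(namespace)
    if match is None:
        # The grep filter keeps only matching records, so this record would be dropped.
        return None
    # "$2.$TAG" in the Rule: second capture group, a dot, then the original tag.
    return f"{match.group(2)}.{tag}"


# Hypothetical namespaces and tags, for illustration only.
print(rewrite_tag("team-test1", "kube.var.log.containers.app.log"))
print(rewrite_tag("test2infra-db", "kube.var.log.containers.db.log"))
print(rewrite_tag("default", "kube.var.log.containers.other.log"))
```

Under these assumptions, records from a namespace containing test1 end up with a tag starting with test1.kube., which is what the first Loki output matches, and likewise for test2 and the second output.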

@Valt25

Valt25 commented Feb 22, 2023

This seems to be a similar error, but with a different stack trace.
[2023/02/22 15:14:07] [engine] caught signal (SIGSEGV)
#0  0x7f2227844ad8      in  ???() at 4/multiarch/strlen-evex.S:77
#1  0x7f222773af75      in  __vfprintf_internal() at fprintf-internal.c:1688
#2  0x7f222774c9c5      in  __vsnprintf_internal() at f.c:114
#3  0x55dd178d36f2      in  flb_sds_printf() at src/flb_sds.c:429
#4  0x55dd17a7b452      in  debug_event_mask() at plugins/in_tail/tail_fs_inotify.c:69
#5  0x55dd17a7b924      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:199
#6  0x55dd178e2d4a      in  flb_input_collector_fd() at src/flb_input.c:1882
#7  0x55dd179157aa      in  flb_engine_handle_event() at src/flb_engine.c:490
#8  0x55dd179157aa      in  flb_engine_start() at src/flb_engine.c:853
#9  0x55dd178bcb24      in  flb_lib_worker() at src/flb_lib.c:629
#10 0x7f2227f1aea6      in  start_thread() at reate.c:477
#11 0x7f22277cea2e      in  ???() at sysv/linux/x86_64/clone.S:95
#12 0xffffffffffffffff  in  ???() at ???:0

That happens only when the FLB_LOG_LEVEL environment variable is set to debug.

Everything works fine with fluent-bit 2.0.8.

@nokute78
Collaborator

@Valt25 Your error seems to be different from the original issue and the same as #6797. Could you check it?

@ezienecker
Author

Is there any progress on this topic?

@leonardo-albertovich
Collaborator

@nokute78 fixed an issue that could be related to this. Yesterday I generated the test containers for master, which you should be able to pull from ghcr.io/fluent/fluent-bit/test/master, or you could build it yourself. It would be great if you could give it a try.

I'm also working on a PR to fix issue #6911, so if you use the dynamic tenant ID feature I would really appreciate your input. The branch name is leonardo-master-loki-tenant_id-race-fix, but I can build test containers for it once I'm done with the improvement I'm currently working on.

@payparain

That happens only when the FLB_LOG_LEVEL environment variable is set to debug.

Everything works fine with fluent-bit 2.0.8.

This was the situation for our Fluent-Bit installation. Setting the logLevel to info resolved the segfault.

@patrick-stephens
Contributor

See #6958 as well

@patrick-stephens added the waiting-for-release label and removed the status: waiting-for-triage label on Mar 15, 2023
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions bot added the Stale label on Jun 14, 2023
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions bot closed this as not planned on Jun 20, 2023

6 participants