Fluent-bit stuck when it lost connection with fluentd, and still did not respond after fluentd resumed. #1022

Closed
hoperays opened this issue Jan 11, 2019 · 14 comments

Comments

hoperays commented Jan 11, 2019

Bug Report

Describe the bug
Fluent Bit got stuck after it lost its connection to Fluentd, and it still did not respond after Fluentd resumed.

To Reproduce
There is currently no reliable way to reproduce it.

Expected behavior
Fluent Bit should not get stuck; it should resume forwarding once Fluentd becomes reachable again.

Screenshots
The Fluent Bit log stopped at 2019/01/05 21:57:41.

[root@node-1 ~]# kubectl get pod -n openstack -o wide | grep fluentbit | grep node-1
fluentbit-rvrm8                                  1/1       Running   1          4d        10.10.1.3       node-1
[root@node-1 ~]# kubectl logs -n openstack fluentbit-rvrm8 --tail=10 -f
[2019/01/05 21:57:11] [error] [io] TCP connection failed: fluentd-logging:24224 (No route to host)
[2019/01/05 21:57:11] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:11] [error] [io] TCP connection failed: fluentd-logging:24224 (No route to host)
[2019/01/05 21:57:11] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:11] [error] [io] TCP connection failed: fluentd-logging:24224 (No route to host)
[2019/01/05 21:57:11] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:21] [ warn] net_tcp_fd_connect: getaddrinfo(host='fluentd-logging'): Name or service not known
[2019/01/05 21:57:21] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:41] [ warn] net_tcp_fd_connect: getaddrinfo(host='fluentd-logging'): Temporary failure in name resolution
[2019/01/05 21:57:41] [error] [out_fw] no upstream connections available


The stack of the Fluent Bit process at that point was as follows:

[root@node-1 ~]# ps -ef | grep fluent
root     15799  9283  0 11:07 pts/8    00:00:00 grep --color=auto fluent
root     16211 16193  0 Jan05 ?       00:00:06 /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf
[root@node-1 ~]# cat /proc/16211/stack
[<ffffffff81209a40>] pipe_wait+0x70/0xc0
[<ffffffff81209ce9>] pipe_write+0x1f9/0x530
[<ffffffff811fffdd>] do_sync_write+0x8d/0xd0
[<ffffffff81200a9d>] vfs_write+0xbd/0x1e0
[<ffffffff812018af>] SyS_write+0x7f/0xe0
[<ffffffff816b50c9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Your Environment

  • Version used: Fluent Bit 0.13.4, Fluentd 1.2.2
  • Configuration:
[SERVICE]
    Daemon false
    Flush 5
    Log_Level info
    Parsers_File parsers.conf

[INPUT]
    Buffer_Max_Size 2MB
    DB /var/log/flb_kube.db
    DB.Sync OFF
    Mem_Buf_Limit 5MB
    Name tail
    Parser docker
    Path /var/log/containers/*.log
    Tag ${HOSTNAME}.kube.*

[FILTER]
    Match ${HOSTNAME}.kube.*
    Merge_JSON_Log true
    Name kubernetes

[FILTER]
    Match *
    Name record_modifier
    Record hostname ${HOSTNAME}

[OUTPUT]
    Host ${FLUENTD_HOST}
    Match *
    Name forward
    Port ${FLUENTD_PORT}
    Retry_Limit False

  • Environment name and version: Kubernetes v1.9.8
  • Operating System and version: CentOS 7.4.1708
  • Filters and plugins: Tail input, Forward output

Additional context
For more information on the status at that time, please refer to the attached core file.
core.16211.gz

I have restored the Fluent Bit service by restarting it, but I would like to know the root cause of the stuck process. Sorry to bother you.
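For reference, a minimal sketch of how diagnostics like the kernel stack and the attached core dump can be collected on the node. This is an assumption about the procedure (the exact commands used for the core are not stated); gcore ships with gdb, and <pid> is a placeholder for the Fluent Bit process ID:

ps -ef | grep fluent-bit        # find the Fluent Bit PID
cat /proc/<pid>/stack           # dump the kernel stack of the blocked process (requires root)
gcore -o core <pid>             # write a userspace core dump, e.g. core.16211, without stopping the process (needs gdb)
gzip core.<pid>                 # compress it for attaching, e.g. core.16211.gz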

hoperays changed the title from "Fluent-bit stuck when it lost connection with fluentd, and still did not respond when fluentd resumed." to "Fluent-bit stuck when it lost connection with fluentd, and still did not respond after fluentd resumed." Jan 11, 2019
@sergeyg-earnin

Same issue here after we redeployed Fluentd in our k8s cluster.

@zhulinwei

Same issue after we redeployed Fluentd in our k8s cluster...

@zhulinwei

I found an interesting situation.

If I use kubectl delete pod fluentd-pod, Fluent Bit will sometimes get stuck and lose its connection to Fluentd, even after Fluentd resumes.

But if I use kubectl rollout restart deploy fluentd, the problem does not happen.
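For clarity, the two ways of restarting Fluentd that this comment compares, with the observed outcomes as reported (fluentd-pod and the fluentd deployment name are the commenter's placeholders):

kubectl delete pod fluentd-pod           # delete the pod directly: Fluent Bit sometimes gets stuck afterwards
kubectl rollout restart deploy fluentd   # rolling restart of the deployment: the problem reportedly does not occur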

@abhishek-sehgal954

Hey, I am facing the same problem, and I am dealing with some critical data. Has anyone found a workaround for this situation?

@joezwlin

Hi,
Got the same problem when one of the load balancer's hosts was temporarily unavailable. I'm trying to adjust Retry_Limit to see if that resolves it.
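For illustration, a hedged sketch of what that adjustment could look like in the forward output section. In the original configuration, Retry_Limit is set to False, which tells the engine to retry failed chunks indefinitely; the finite value below is only an example, not a recommendation:

[OUTPUT]
    Name        forward
    Match       *
    Host        ${FLUENTD_HOST}
    Port        ${FLUENTD_PORT}
    # Give up on a chunk after 5 failed flush attempts instead of retrying forever
    Retry_Limit 5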

@tirelibirefe

Does your Fluentd have a working service listening on port 24224?

@asaushkin

I ran into the same problem. Fluent Bit and Fluentd are running on EC2 instances, and Fluent Bit could not recover after Fluentd became temporarily unavailable.

LinkMaq commented Apr 2, 2021

The same problem on Fluent Bit 1.7.2. Fluent Bit and Fluentd are deployed on Kubernetes, and Fluent Bit forwards logs to Fluentd through a headless service. This problem occurs very frequently.

[2021/04/02 08:50:29] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:50:29] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:50:29] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:50:29] [ warn] [engine] failed to flush chunk '1-1617353299.120165002.flb', retry in 8 seconds: task_id=0, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:51:04] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:51:04] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:51:04] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:51:04] [ warn] [engine] failed to flush chunk '1-1617353334.496987689.flb', retry in 6 seconds: task_id=1, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:51:44] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:51:44] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:51:44] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:51:44] [ warn] [engine] failed to flush chunk '1-1617353374.267709871.flb', retry in 8 seconds: task_id=2, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:52:44] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:52:44] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:52:44] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:52:44] [ warn] [engine] failed to flush chunk '1-1617353434.367870814.flb', retry in 10 seconds: task_id=3, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:53:15] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)

loguido commented Apr 12, 2021

The same happens to me; the only way to solve it for now is to restart Fluent Bit.

LynnTh commented Jun 23, 2021

Same issue.

@VincentQiu2018

Same issue

leonardo-albertovich (Collaborator) commented Nov 17, 2021

If anyone here has a reliable reproduction and is able to perform some tests with me, contact me in the Fluent Slack and we'll find out the root of the issue.

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions bot added and then removed the Stale label Feb 16, 2022
lecaros (Contributor) commented Mar 28, 2022

Hi everyone here,
We've released a couple of fixes that handle connection loss and timeout scenarios in 1.8.15 and 1.9.1.
I'm closing this issue now, but if you still see the problem, feel free to reopen it or open a new one. We'll gladly assist you further once you provide a repro scenario.
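For anyone upgrading to one of those releases, a minimal sketch of a forward output with explicit network settings, assuming a Fluent Bit version new enough to support the net.* upstream properties (they are not available in 0.13.x); the values are examples only:

[OUTPUT]
    Name                        forward
    Match                       *
    Host                        ${FLUENTD_HOST}
    Port                        ${FLUENTD_PORT}
    # Abort connection attempts after 10 seconds instead of waiting indefinitely
    net.connect_timeout         10
    # Keep TCP connections alive between flushes, but drop idle ones after 30 seconds
    net.keepalive               on
    net.keepalive_idle_timeout  30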

lecaros closed this as completed Mar 28, 2022