Fluent-bit stuck when it lost connection with fluentd, and still did not respond after fluentd resumed. #1022

Closed
hoperays opened this issue Jan 11, 2019 · 14 comments

Comments

hoperays commented Jan 11, 2019

Bug Report

Describe the bug
Fluent Bit got stuck after it lost its connection to Fluentd, and it still did not respond after Fluentd resumed.

To Reproduce
There is currently no reliable way to reproduce it.

Expected behavior
Fluent Bit should not get stuck; it should resume forwarding once Fluentd becomes reachable again.

Screenshots
The Fluent Bit log stopped at 2019/01/05 21:57:41.

[root@node-1 ~]# kubectl get pod -n openstack -o wide | grep fluentbit | grep node-1
fluentbit-rvrm8                                  1/1       Running   1          4d        10.10.1.3       node-1
[root@node-1 ~]# kubectl logs -n openstack fluentbit-rvrm8 --tail=10 -f
[2019/01/05 21:57:11] [error] [io] TCP connection failed: fluentd-logging:24224 (No route to host)
[2019/01/05 21:57:11] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:11] [error] [io] TCP connection failed: fluentd-logging:24224 (No route to host)
[2019/01/05 21:57:11] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:11] [error] [io] TCP connection failed: fluentd-logging:24224 (No route to host)
[2019/01/05 21:57:11] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:21] [ warn] net_tcp_fd_connect: getaddrinfo(host='fluentd-logging'): Name or service not known
[2019/01/05 21:57:21] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:41] [ warn] net_tcp_fd_connect: getaddrinfo(host='fluentd-logging'): Temporary failure in name resolution
[2019/01/05 21:57:41] [error] [out_fw] no upstream connections available


The stack of the Fluent Bit process at that point was as follows:

[root@node-1 ~]# ps -ef | grep fluent
root     15799  9283  0 11:07 pts/8    00:00:00 grep --color=auto fluent
root     16211 16193  0 Jan05 ?       00:00:06 /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf
[root@node-1 ~]# cat /proc/16211/stack
[<ffffffff81209a40>] pipe_wait+0x70/0xc0
[<ffffffff81209ce9>] pipe_write+0x1f9/0x530
[<ffffffff811fffdd>] do_sync_write+0x8d/0xd0
[<ffffffff81200a9d>] vfs_write+0xbd/0x1e0
[<ffffffff812018af>] SyS_write+0x7f/0xe0
[<ffffffff816b50c9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Your Environment

  • Version used: Fluent Bit 0.13.4, Fluentd 1.2.2
  • Configuration:
[SERVICE]
    Daemon false
    Flush 5
    Log_Level info
    Parsers_File parsers.conf

[INPUT]
    Buffer_Max_Size 2MB
    DB /var/log/flb_kube.db
    DB.Sync OFF
    Mem_Buf_Limit 5MB
    Name tail
    Parser docker
    Path /var/log/containers/*.log
    Tag ${HOSTNAME}.kube.*

[FILTER]
    Match ${HOSTNAME}.kube.*
    Merge_JSON_Log true
    Name kubernetes

[FILTER]
    Match *
    Name record_modifier
    Record hostname ${HOSTNAME}

[OUTPUT]
    Host ${FLUENTD_HOST}
    Match *
    Name forward
    Port ${FLUENTD_PORT}
    Retry_Limit False

  • Environment name and version: Kubernetes v1.9.8
  • Operating System and version: CentOS 7.4.1708
  • Filters and plugins: Tail input, Forward output

Additional context
For more information on the status at that time, please refer to the attached core file.
core.16211.gz

I have restored the Fluent Bit service by restarting it, but I would like to know the root cause of the stuck process. Sorry to bother you.
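For reference, a minimal sketch of how diagnostics like the kernel stack and the attached core dump can be collected on the node. This is an assumption about the procedure (the exact commands used for the core are not stated); gcore ships with gdb, and <pid> is a placeholder for the Fluent Bit process ID:

ps -ef | grep fluent-bit        # find the Fluent Bit PID
cat /proc/<pid>/stack           # dump the kernel stack of the blocked process (requires root)
gcore -o core <pid>             # write a userspace core dump, e.g. core.16211, without stopping the process (needs gdb)
gzip core.<pid>                 # compress it for attaching, e.g. core.16211.gz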

hoperays changed the title from "Fluent-bit stuck when it lost connection with fluentd, and still did not respond when fluentd resumed." to "Fluent-bit stuck when it lost connection with fluentd, and still did not respond after fluentd resumed." Jan 11, 2019
@sergeyg-earnin

Same issue here after we redeployed Fluentd in our k8s cluster.

@zhulinwei

Same issue after we redeployed Fluentd in our k8s cluster...

@zhulinwei

I found an interesting situation.

If I use kubectl delete pod fluentd-pod, Fluent Bit will sometimes get stuck and lose its connection to Fluentd, even after Fluentd resumes.

But if I use kubectl rollout restart deploy fluentd, the problem does not happen.
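For clarity, the two ways of restarting Fluentd that this comment compares, with the observed outcomes as reported (fluentd-pod and the fluentd deployment name are the commenter's placeholders):

kubectl delete pod fluentd-pod           # delete the pod directly: Fluent Bit sometimes gets stuck afterwards
kubectl rollout restart deploy fluentd   # rolling restart of the deployment: the problem reportedly does not occur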

@abhishek-sehgal954

Hey, I am facing the same problem, and I am dealing with some critical data. Has anyone found a workaround for this situation?

@joezwlin

Hi,
Got the same problem when one of the load balancer's hosts was temporarily unavailable. I'm trying to adjust Retry_Limit to see if that resolves it.
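For illustration, a hedged sketch of what that adjustment could look like in the forward output section. In the original configuration, Retry_Limit is set to False, which tells the engine to retry failed chunks indefinitely; the finite value below is only an example, not a recommendation:

[OUTPUT]
    Name        forward
    Match       *
    Host        ${FLUENTD_HOST}
    Port        ${FLUENTD_PORT}
    # Give up on a chunk after 5 failed flush attempts instead of retrying forever
    Retry_Limit 5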

@tirelibirefe

Does your Fluentd have a working service listening on port 24224?

@asaushkin

I ran into the same problem. Fluent Bit and Fluentd are running on EC2 instances, and Fluent Bit could not recover after Fluentd became temporarily unavailable.

LinkMaq commented Apr 2, 2021

The same problem on Fluent Bit 1.7.2. Fluent Bit and Fluentd are deployed on Kubernetes, and Fluent Bit forwards logs to Fluentd through a headless service. This problem occurs very frequently.

[2021/04/02 08:50:29] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:50:29] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:50:29] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:50:29] [ warn] [engine] failed to flush chunk '1-1617353299.120165002.flb', retry in 8 seconds: task_id=0, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:51:04] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:51:04] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:51:04] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:51:04] [ warn] [engine] failed to flush chunk '1-1617353334.496987689.flb', retry in 6 seconds: task_id=1, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:51:44] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:51:44] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:51:44] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:51:44] [ warn] [engine] failed to flush chunk '1-1617353374.267709871.flb', retry in 8 seconds: task_id=2, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:52:44] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:52:44] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:52:44] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:52:44] [ warn] [engine] failed to flush chunk '1-1617353434.367870814.flb', retry in 10 seconds: task_id=3, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:53:15] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)

loguido commented Apr 12, 2021

The same happens to me; the only way to solve it for now is to restart Fluent Bit.

LynnTh commented Jun 23, 2021

Same issue.

@VincentQiu2018

Same issue

leonardo-albertovich (Collaborator) commented Nov 17, 2021

If anyone here has a reliable reproduction and is able to perform some tests with me, contact me in the Fluent Slack and we'll find out the root of the issue.

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions bot added and then removed the Stale label Feb 16, 2022
lecaros (Contributor) commented Mar 28, 2022

Hi everyone here,
We've released a couple of fixes that handle connection loss and timeout scenarios in 1.8.15 and 1.9.1.
I'm closing this issue now, but if you still see the problem, feel free to reopen it or open a new one. We'll gladly assist you further once you provide a repro scenario.
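For anyone upgrading to one of those releases, a minimal sketch of a forward output with explicit network settings, assuming a Fluent Bit version new enough to support the net.* upstream properties (they are not available in 0.13.x); the values are examples only:

[OUTPUT]
    Name                        forward
    Match                       *
    Host                        ${FLUENTD_HOST}
    Port                        ${FLUENTD_PORT}
    # Abort connection attempts after 10 seconds instead of waiting indefinitely
    net.connect_timeout         10
    # Keep TCP connections alive between flushes, but drop idle ones after 30 seconds
    net.keepalive               on
    net.keepalive_idle_timeout  30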

lecaros closed this as completed Mar 28, 2022