
Fluent bit stops uploading logs when there are too many open chunks #3541


Closed
MPeli opened this issue May 24, 2021 · 9 comments
Labels
Stale · Windows (Bugs and requests about Windows platforms)

Comments


MPeli commented May 24, 2021

Bug Report

Describe the bug
Fluent bit stops uploading logs when there are too many open chunks.

To Reproduce
Not really sure how. I will add more details once I find more. It happens on a server where there are 1500 log files in a folder (not sure if it matters).

[error] [storage] [cio file] too many open chunks on 'tail.0/6768-1621880199.517401800.flb' (cio_file_up() line=791)

Expected behavior
Fluent Bit should not get stuck; it should try to reconnect and resume uploading.
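(Context: this error comes from Fluent Bit's chunkio layer; cio_file_up() refuses to map another chunk into memory once the number of in-memory chunks reaches the service-level storage.max_chunks_up limit, 128 by default. A minimal sketch of where that knob lives, with the default value shown:)

[SERVICE]
    storage.max_chunks_up    128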

Your Environment

Fluent Bit Enterprise - SOS Report
==================================
The following report aims to be used by Fluent Bit and Fluentd community users.


[Fluent Bit]
    Version             1.7.6
    Built Flags          FLB_HAVE_PARSER FLB_HAVE_RECORD_ACCESSOR FLB_HAVE_STREAM_PROCESSOR JSMN_PARENT_LINKS JSMN_STRICT FLB_HAVE_TLS FLB_HAVE_AWS FLB_HAVE_SIGNV4 FLB_HAVE_SQLDB FLB_HAVE_TRACE FLB_HAVE_TIMESPEC_GET FLB_HAVE_PROXY_GO FLB_HAVE_REGEX FLB_HAVE_UTF8_ENCODER FLB_HAVE_LUAJIT

[Operating System]
    Name                Windows
    Version             6.2
    Build               9200

[Hardware]
    Architecture        x64 (AMD or Intel)
    Processors          12

[Built Plugins]
    Inputs              emitter tail dummy statsd storage_backlog stream_processor winlog tcp lib forward random
    Filters             alter_size aws record_modifier throttle throttle_size kubernetes modify nest parser expect grep rewrite_tag lua stdout geoip2
    Outputs             azure azure_blob counter datadog es file forward http influxdb logdna loki nrlogs null slack splunk stackdriver stdout syslog tcp flowcounter gelf websocket cloudwatch_logs kinesis_streams s3

[SERVER] Runtime configuration
    Flush               5.000000
    Daemon              On
    Log_Level           Trace

[INPUT] Instance
    Name                tail.0 (tail, id=0)
    Flags
    Threaded            No
    Tag                 <appname>-<pid>
    Mem_Buf_Limit       95.4M
    Path                C:\path\*_*.Xlog.log,D:\path\*_*.Xlog.x64.log
    Multiline           off
    Parser_Firstline    xlog_parser
    Path_Key            path
    Offset_Key          1
    DB                  fluent.db
    Read_from_Head      On
    Tag_Regex           ^.*\\(?<appname>.+)_(?<pid>[0-9]+).Xlog.*.log$
    Buffer_Chunk_Size   128k
    Buffer_Max_Size     256k
    Ignore_Older        10d
    Routes              es.0

[INPUT] Instance
    Name                storage_backlog.1 (storage_backlog, id=1)
    Flags
    Threaded            No
    Tag                 storage_backlog.1
    Routes              es.0


[OUTPUT] Instance
    Name                es.0 (es, id=0)
    Match               *
    TLS Active          Yes
    TLS.Verify          Off
    TLS.Ca_File         (not set)
    TLS.Crt_File        (not set)
    TLS.Key_File        (not set)
    TLS.Key_Passwd      (not set)
    Retry Limit         no limit
    Host.TCP_Port       443
    Host.Name           abcd.eu-west-1.es.amazonaws.com
    Index               fluent-bit
    Logstash_Format     true
    HTTP_User           aaa
    HTTP_Passwd         bbb
    Trace_Output        On
    Trace_Error         On
    Buffer_Size         False

fluent-bit-too-many-open-chunks.log


MPeli commented May 25, 2021

To Reproduce

  • Prepare 120 files (~550 MB in total) to be monitored
  • Set storage.max_chunks_up to 32 (see the configuration sketch below)
  • Delete the DB files and the backlog
  • Go offline
  • Start Fluent Bit
  • Wait until too many open chunks errors appear and no new lines show up in the Fluent Bit log; Fluent Bit will seem to be stuck
  • Go online (this has no effect)

fluent-bit.log
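For reference, a minimal configuration matching these steps might look like the sketch below; the paths are placeholders and the ES output from the original report is omitted:

[SERVICE]
    Flush                    5
    Log_Level                trace
    # placeholder path; filesystem buffering lets chunks be put down when the limit is hit
    storage.path             C:\fluent-bit\storage
    storage.max_chunks_up    32

[INPUT]
    Name               tail
    # placeholder pattern standing in for the 120 monitored files
    Path               C:\logs\*_*.Xlog.log
    DB                 fluent.db
    Read_from_Head     On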

@agup006 added the Windows label May 25, 2021
@igorpeshansky (Contributor)

Seems to be the same as #3338


josephflynch commented Jun 8, 2021

This is affecting GCP customer(s).

@niedbalski (Collaborator)

Hello @MPeli

I am following the reproduction procedure outlined in your last comment, but on Linux using the latest master build.

So far the issue isn't reproducible. Can you please check whether the following procedure and configuration match yours?

  1. Created 120 files of ~600MiB (a hypothetical generation sketch follows the verification output below):
$ ls -lh /data/*.xlog.log | wc -l
120

$ du -ach /data/*.xlog.log | grep -i total
606M	total
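
(Not in the original comment: one hypothetical way to generate a corpus of this shape, where the file count, sizes, and names are chosen only to match the numbers above:)

$ for i in $(seq 1 120); do
>   # ~5MiB of random line-oriented text per file, ~600MiB total
>   base64 /dev/urandom | head -c 5M > /data/app_${i}.xlog.log
> done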

  2. Set storage.max_chunks_up to 32:

$ docker-compose logs fluentbit | grep -i chunks
fluentbit               | [2021/06/09 00:16:38] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=32
  3. Configuration file in use:
[SERVICE]
	Flush               5.000000
	Daemon              Off
	Log_Level           Trace
	storage.max_chunks_up     32

[INPUT]
	Name                tail
	Mem_Buf_limit       95.4M
	Path                /data/*.xlog.log
	Multiline	    off
	Parser_Firstline    xlog_parser
	Path_Key            path
	Offset_Key          1
	DB                  fluent.db
	Read_from_Head      On
	Buffer_Chunk_Size   128k
	Buffer_Max_Size     256k
	Ignore_Older        10d

[OUTPUT]
	Name        es
	Match *
	Host elasticsearch-master
	Port 9200
	Index fluentbit
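
(The docker-compose file itself is not shown in this comment; a minimal sketch consistent with the commands and the elasticsearch-master host above, where image tags and mount paths are assumptions:)

version: "3"
services:
  fluentbit:
    image: fluent/fluent-bit:latest
    volumes:
      - ./fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf
      - ./data:/data
    depends_on:
      - elasticsearch-master
  elasticsearch-master:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.10.2
    environment:
      - discovery.type=single-node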

@qingling128 (Collaborator)

We observed the same issue with 1.7.6, but it was fixed by 1.7.8. Please give that a try.

@github-actions (bot)

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions added the Stale label Jul 10, 2021
@github-actions (bot)

This issue was closed because it has been stalled for 5 days with no activity.


jkschulz commented Dec 7, 2021

This doesn't seem to be fixed in 1.7.8 or 1.8.4. I tested a custom build of 1.8.4 with chunkio 1.1.5 on Windows 2016, and when I forced it to lose its connection (by breaking the DNS resolver), CPU usage climbed sharply, the logs filled with errors, and Fluent Bit kept restarting. Some examples of the logged errors:

  • [error] [upstream] connection #-1 to logging.googleapis.com:443 timed out after 10 seconds
  • [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=11): Could not contact DNS servers
  • [error] [storage] [cio file] hash does not match (2329640969 != 1104745215)
  • [error] [storage] [cio file] cannot verify checksum on 'tail.1/3792-1638856019.33686500.flb' (cio_file_open() line=386)

After a bit, [error] [storage] [cio file] too many open chunks on 'tail.1/6392-1638856403.90869100.flb' (cio_file_up() line=791) started to appear in the logs. Once I saw those, I restored the connection. CPU dropped back down, but the too many open chunks errors (and others) continued in the logs. A few logs (202) from the tail input did get through right after reconnecting, but then they suddenly stopped and no more were sent. Another input, winlog (which sent logs through a different output plugin than the tail input), was able to send logs as expected after the connection was restored.


Steps to reproduce

  1. Create a Windows 2016 VM:
$ gcloud compute instances create win2016 --image-project windows-cloud --image-family windows-2016 --machine-type=n1-standard-16
# Write down username+password from output for use later
$ gcloud compute reset-windows-password --quiet win2016
# For connecting via Microsoft Remote Desktop
$ gcloud compute start-iap-tunnel --local-host-port=localhost:13000 win2016 3389
  2. Connect via Microsoft Remote Desktop using the username and password from setup.
  3. Open PowerShell as admin and install the custom Ops Agent build:
> (New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1")
> $env:REPO_SUFFIX="20211201-2"; & $env:UserProfile\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall
  4. Break DNS resolution (the change was made via the adapter's IPv4 properties; a hedged PowerShell alternative is sketched after these steps). Make sure to close each of the dialog windows after making the changes, otherwise they will not take effect.
  5. Observe the logs (command below) and watch Fluent Bit's CPU usage climb continuously in Task Manager. Wait until you see too many open chunks in the logs.
> notepad C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log
  6. Restore the connection by reverting the DNS changes from step 4: set the IPv4 properties back to "Obtain DNS server address automatically".

  7. Observe that Fluent Bit's CPU usage drops back down while the too many open chunks messages keep getting logged. Check the Logs UI: some logs were ingested right after the connection was restored, but then they suddenly stop and never resume. The logs that are not ingested at this point start at:

[2021/12/07 06:01:35] [ info] [input:storage_backlog:storage_backlog.2] queueing tail.1:5904-1638855941.703861300.flb


  8. Create a Windows Event Log entry and observe that it is sent successfully:
> New-EventLog -LogName Application -Source "oa-test-logger"
> Write-EventLog -LogName Application -Source "oa-test-logger" -EntryType Error -EventID 1 -Message "after reconnecting"

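For step 4, a hedged sketch of breaking DNS from an elevated PowerShell session instead of the adapter GUI; the interface alias "Ethernet" and the unreachable server 192.0.2.1 are assumptions:

> # Find the adapter's interface alias first
> Get-DnsClientServerAddress
> # Point DNS at an unreachable TEST-NET-1 address so lookups fail
> Set-DnsClientServerAddress -InterfaceAlias "Ethernet" -ServerAddresses "192.0.2.1"
> # Revert (step 6): go back to DHCP-provided DNS servers
> Set-DnsClientServerAddress -InterfaceAlias "Ethernet" -ResetServerAddresses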


jkschulz commented Dec 7, 2021

A while ago I also tested this on Fluent Bit 1.7.8 (via Ops Agent 2.0.3), and it failed in the same way.
The logs blew up when the network connection was disrupted. After disconnecting, there were errors like connection #1300 to logging.googleapis.com:443 timed out after 10 seconds in the logs, then after a few minutes of that, lots of errors like [error] [storage] [cio file] too many open chunks on. When I finally reconnected, the too many open chunks errors continued, and I don't think the connection was reestablished, because no new logs made it through to Cloud Logging.
