Known Issue: Data is not flowing under some conditions #4505
I'm seeing an issue with 1.8.11 and the S3 output where data stops flowing after the following errors:
The log then starts filling up with:
On high-traffic systems, the log file grows very quickly as it fills with these errors (about 2 GB per day). Restarting the Fluent Bit service resolves the issue.
Our S3 output configuration:
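A minimal sketch of the relevant output section is below; the bucket, region, path, and size values are placeholders rather than the settings from the affected systems:

```
[OUTPUT]
    Name             s3
    Match            *
    # Placeholder values; not the settings from the affected systems
    bucket           example-log-bucket
    region           us-east-1
    total_file_size  50M
    upload_timeout   10m
    store_dir        /var/log/fluent-bit/s3
    s3_key_format    /logs/%Y/%m/%d/$UUID.gz
```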
Let me know if any additional information would be helpful. I'm hoping this PR might fix the issue, but I thought I would report it anyway. It's difficult to capture debug logs when this occurs, since it happens seemingly at random, and the debug logs would grow in size too quickly on our high-traffic systems where the issue happens most often.
As part of the reproduction of #3014, I created a few tests that reproduce the issues. Chunks are also getting stuck under high load with the destination up all the time.
The value you are setting in your storage configuration is what causes that warning.
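For context, the relevant storage knobs live in the [SERVICE], [INPUT], and [OUTPUT] sections. A generic sketch of filesystem storage settings follows; the paths and limits are illustrative placeholders, not the reporter's actual values:

```
[SERVICE]
    # Illustrative placeholders only
    storage.path              /var/lib/fluent-bit/storage
    storage.sync              normal
    storage.backlog.mem_limit 5M

[INPUT]
    Name          tail
    Path          /var/log/app/*.log
    storage.type  filesystem

[OUTPUT]
    Name                     s3
    Match                    *
    # Caps the filesystem data this output may queue; when the cap is hit,
    # Fluent Bit logs a warning and discards the oldest chunks
    storage.total_limit_size 500M
```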
Thanks, @JeffLuoo, I should have mentioned that. When this error has occurred in the past, I've checked Fluent Bit's storage directory and it was only a few MB.
I have this issue as well. The output is Azure Log Analytics. syslog-ng is load balancing some Cisco ASA logs to 4 different Fluent Bit systems via the syslog listener. I'm only keeping chunks in RAM. Debian 11.2 with the official package from the Fluent Bit repo, running on Azure VMs with 4 cores and 16 GB RAM. I get errors like the following:
My config and inputs:
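Roughly, the setup is shaped like the sketch below; the port, workspace ID, shared key, and log type are placeholders rather than the real values:

```
[INPUT]
    Name    syslog
    Mode    udp
    Listen  0.0.0.0
    Port    5140
    Parser  syslog-rfc3164
    # No storage.type filesystem here, so chunks stay in memory

[OUTPUT]
    Name         azure
    Match        *
    # Placeholder Log Analytics workspace credentials
    Customer_ID  00000000-0000-0000-0000-000000000000
    Shared_Key   <redacted>
    Log_Type     cisco_asa
```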
Each of the servers has a different number of "stuck" chunks. They'll process the new "up" chunks in about 30-60 seconds, but the "stuck" ones stay "busy" until a restart.
The "When endpoint (splunk, loki) is unavailable" issue seems to be fixed by 1.8.12. Can't re-produce the issue anymore. |
We upgraded a few high-traffic systems yesterday and so far haven't run into any issues.
Hello, everyone! First, thanks for your reports and the repro cases you've provided. We've submitted some PRs that deal with most of the issues gathered on this ticket. If you were affected by any of these, please test the aforementioned PRs.
Hi! |
We have received some reports that data is not flowing under certain circumstances. Below you'll see some of them:
If you're experiencing any of these or similar issues, please provide us with steps to reproduce so we can troubleshoot and validate the proposed fixes.
Update 03/22: with PRs #5109 and #5111 we're providing fixes for the network-related issues listed here. If you were affected by any of the items on the list (including the non-network-related ones), please test them.