Known Issue: Data is not flowing under some conditions #4505

Closed

lecaros opened this issue Dec 21, 2021 · 9 comments

@lecaros
Contributor

lecaros commented Dec 21, 2021

We have received some reports that data is not flowing under certain circumstances. Below you'll see some of them:

| Condition | Issue # | Workaround | Version |
| --- | --- | --- | --- |
| After network timeout | #4503 | #3541 | 1.8.11 |
| When endpoint (splunk, loki) is unavailable | #4497 #3328 #3014 | | 1.8.11 |
| After failed to flush chunk (and others) | #3581 | Restart fluent-bit | 1.7.x |
| When connection to fluentd is lost | #1022 #4313 | Restart fluent-bit | |
| After error "Bad file descriptor" is thrown | #3540 | | Should be solved in 1.7 |
| When tailing from files with accumulated data (i.e. fluent-bit started after data starts to pile up) | | Data starts to flow after a while | 1.8.11 |

If you're experiencing any of these or similar issues, please provide steps to reproduce so we can troubleshoot and validate the proposed fixes.

Update 03/22: PRs #5109 and #5111 provide fixes for the network-related issues listed here. If you were affected by any of the items on the list (including the non-network-related ones), please test them.
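
When you report one of these conditions, a snapshot from the built-in monitoring endpoint helps confirm that data has actually stopped flowing. A minimal sketch, assuming `HTTP_Server On` (and the default port 2020) in your `[SERVICE]` section:

```sh
# Chunk/storage state; "busy" chunks that never drain are a red flag
curl -s http://127.0.0.1:2020/api/v1/storage | jq .

# Per-plugin counters; if an output's "proc_records" stops increasing
# while inputs keep ingesting, data is not flowing
curl -s http://127.0.0.1:2020/api/v1/metrics | jq .
```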

@briandefiant

briandefiant commented Dec 23, 2021

I'm seeing an issue with 1.8.11 and the S3 output where data stops flowing after the following errors:

[2021/12/22 23:16:43] [error] [tls] SSL error: SSL - The peer notified us that the connection is going to be closed
[2021/12/22 23:16:43] [error] [http_client] broken connection to s3.redacted.amazonaws.com:443 ?
[2021/12/22 23:16:43] [error] [http_client] broken connection to s3.redacted.amazonaws.com:443 ?
[2021/12/22 23:16:43] [error] [output:s3:s3.0] PutObject request failed
[2021/12/22 23:16:43] [error] [output:s3:s3.0] Could not send chunk with tag auditd

The log then starts filling up with:

[2021/12/22 23:17:23] [error] [input chunk] no enough space in filesystem to buffer chunk 2406707-1640215043.826250542.flb in plugin s3.0
[2021/12/22 23:17:23] [error] [input chunk] no available chunk
[2021/12/22 23:17:28] [error] [input chunk] no enough space in filesystem to buffer chunk 2406707-1640215048.897513660.flb in plugin s3.0
[2021/12/22 23:17:28] [error] [input chunk] no available chunk
[2021/12/22 23:17:29] [error] [input chunk] no enough space in filesystem to buffer chunk 2406707-1640215049.823568646.flb in plugin s3.0
[2021/12/22 23:17:29] [error] [input chunk] no available chunk

On high-traffic systems the log file grows very quickly as it fills with these errors (about 2 GB per day). Restarting the Fluent Bit service resolves the issue.
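
Until the underlying bug is fixed, a logrotate policy could at least keep the error spam from filling the disk. A minimal sketch, assuming the agent logs to a file like /var/log/td-agent-bit.log (the path is an assumption and varies by install):

```
# /etc/logrotate.d/td-agent-bit -- hypothetical; adjust the log path for your install
/var/log/td-agent-bit.log {
    # "hourly" only takes effect if logrotate itself is invoked hourly
    # (e.g. from cron.hourly), which matters at ~2 GB/day rates
    hourly
    rotate 4
    maxsize 256M
    compress
    missingok
    notifempty
    copytruncate
}
```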

There are also PutObject errors interspersed with the no available chunk errors. For example:

[2021/12/22 23:18:13] [error] [output:s3:s3.0] PutObject request failed
[2021/12/22 23:18:13] [error] [output:s3:s3.0] Could not send chunk with tag O  ^RЅ<8E><8B><D7>
[2021/12/22 23:18:23] [error] [output:s3:s3.0] PutObject API responded with error='HttpVersionNotSupported', message='The HTTP version specified is not supported.'
[2021/12/22 23:18:23] [error] [output:s3:s3.0] Raw PutObject response: HTTP/1.1 505 HTTP Version Not Supported
x-amz-request-id: redacted
x-amz-id-2: redacted
Date: Wed, 22 Dec 2021 23:18:23 GMT
Server: AmazonS3
Content-Type: application/xml
Transfer-Encoding: chunked
Connection: close

<?xml version="1.0" encoding="UTF-8"?><Error><Code>HttpVersionNotSupported</Code><Message>The HTTP version specified is not supported.</Message><RequestId>redacted</RequestId><HostId>redacted/</HostId></Error>
[2021/12/22 23:18:23] [error] [output:s3:s3.0] PutObject request failed
[2021/12/22 23:18:23] [error] [output:s3:s3.0] Could not send chunk with tag O  ^RЅ<8E><8B><D7>

It's not always HttpVersionNotSupported errors. On another system with similar symptoms we see this message:

[2021/12/23 07:29:19] [error] [output:s3:s3.0] PutObject API responded with error='SignatureDoesNotMatch', message='The request signature we calculated does not match the signature you provided. Check your key and signing method.'
[2021/12/23 07:29:19] [error] [output:s3:s3.0] Raw PutObject response: HTTP/1.1 403 Forbidden
x-amz-request-id: redacted
x-amz-id-2: redacted
Content-Type: application/xml
Transfer-Encoding: chunked
Date: Thu, 23 Dec 2021 07:29:18 GMT
Server: AmazonS3
Connection: close

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>SignatureDoesNotMatch</Code><Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message><AWSAccessKeyId>redacted</AWSAccessKeyId><StringToSign>AWS4-HMAC-SHA256
20211223T072919Z
20211223/redacted/s3/aws4_request
redacted</StringToSign><SignatureProvided>redacted</SignatureProvided><StringToSignBytes>redacted</StringToSignBytes><CanonicalRequest>PUT
/redacted/fluent-bit-logs/%1A%C3%82%C2%BA%C3%82%C2%80%10%C3%83%C2%8E%C3%82%C2%9E%05ccess%C3%82%C2%92%C3%83%C2%97/2021/12/23/07/28/WETNO7FN

content-length:1710593
host:s3.redacted.amazonaws.com
user-agent:aws-fluent-bit-plugin
x-amz-acl:bucket-owner-full-control
x-amz-content-sha256:redacted
x-amz-date:20211223T072919Z
x-amz-security-token:redacted

content-length;host;user-agent;x-amz-acl;x-amz-content-sha256;x-amz-date;x-amz-security-token
redacted</CanonicalRequest><CanonicalRequestBytes>redacted
[2021/12/23 07:29:19] [error] [output:s3:s3.0] PutObject request failed
[2021/12/23 07:29:19] [error] [output:s3:s3.0] Could not send chunk with tag ^Z<BA><80>^PΞ^Eccess<92><D7>

Running `kill -CONT $(pidof td-agent-bit)` while this error is occurring doesn't appear to show any issues:

Dec 23 14:56:43 redacted td-agent-bit[2406707]: [2021/12/23 14:56:43] [engine] caught signal (SIGCONT)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: [2021/12/23 14:56:43] Fluent Bit Dump
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ===== Input =====
Dec 23 14:56:43 redacted td-agent-bit[2406707]: tail.0 (tail)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ status
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ overlimit     : no
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     ├─ mem size   : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     └─ mem limit  : 30.5M (32000000 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ tasks
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ total tasks   : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ new           : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ running       : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ size          : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: └─ chunks
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    └─ total chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ up chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ down chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       └─ busy chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          ├─ size    : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          └─ size err: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: tail.1 (tail)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ status
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ overlimit     : no
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     ├─ mem size   : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     └─ mem limit  : 30.5M (32000000 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ tasks
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ total tasks   : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ new           : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ running       : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ size          : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: └─ chunks
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    └─ total chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ up chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ down chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       └─ busy chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          ├─ size    : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          └─ size err: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: tail.2 (tail)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ status
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ overlimit     : no
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     ├─ mem size   : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     └─ mem limit  : 30.5M (32000000 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ tasks
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ total tasks   : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ new           : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ running       : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ size          : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: └─ chunks
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    └─ total chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ up chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ down chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       └─ busy chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          ├─ size    : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          └─ size err: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: storage_backlog.3 (storage_backlog)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ status
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ overlimit     : no
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     ├─ mem size   : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     └─ mem limit  : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ tasks
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ total tasks   : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ new           : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ running       : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ size          : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: └─ chunks
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    └─ total chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ up chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ down chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       └─ busy chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          ├─ size    : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          └─ size err: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ===== Storage Layer =====
Dec 23 14:56:43 redacted td-agent-bit[2406707]: total chunks     : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ mem chunks    : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: └─ fs chunks     : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    ├─ up         : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    └─ down       : 0

Our S3 output configuration:

[OUTPUT]
    Name s3
    Match *
    bucket redacted
    region redacted
    s3_key_format /fluent-bit-logs/$TAG/%Y/%m/%d/%H/%M/$UUID
    canned_acl bucket-owner-full-control
    total_file_size 128M
    upload_timeout 1m
    auto_retry_requests true
    storage.total_limit_size 512M

Let me know if any additional information would be helpful. I'm hoping this PR might fix the issue, but I thought I would report it anyway.

It's difficult to capture debug logs when this occurs, since it happens seemingly at random, and debug logs would grow too quickly on the high-traffic systems where the issue happens most often.
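
One option for narrowing the capture, assuming the build honors the per-plugin `log_level` property (it's a documented plugin option, but I haven't verified it on 1.8.11), would be raising verbosity on the S3 output only:

```
[OUTPUT]
    Name      s3
    Match     *
    # assumption: per-plugin log_level overrides the global Log_Level,
    # so only this output logs at debug
    log_level debug
    # ...rest of the S3 settings as posted above...
```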

@elruwen

elruwen commented Jan 5, 2022

As part of reproducing #3014, I created a few tests that reproduce these issues. Chunks also get stuck under high load even with the destination up the whole time.

@JeffLuoo
Contributor

Hi @briandefiant

You are setting

    storage.total_limit_size 512M

in your storage configuration, and this will cause the `no enough space in filesystem ...` warning if the size of the files stored on disk exceeds the limit you set. Could you please check the size of the data under the filesystem path you configured in the service section, to see if it is bigger than the limit?

@briandefiant

Thanks, @JeffLuoo, I should have mentioned that. When this error has occurred in the past, I've checked Fluent Bit's storage directory and it was only a few MB. For example:

# du -h /run/fluent-bit/storage/ | tail -1
3.0M    /run/fluent-bit/storage/

@rhavenn

rhavenn commented Jan 14, 2022

I have this issue as well. The output is Azure Log Analytics. syslog-ng load-balances Cisco ASA logs across 4 Fluent Bit (FB) systems via the syslog listener input. I'm only keeping chunks in RAM. The hosts are Azure VMs running Debian 11.2 with the official package from the FB repo, 4 cores and 16 GB RAM each.

I get errors like the following:

2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds

My config: `storage.max_chunks_up 1024` and an otherwise pretty stock td-agent-bit.conf, with a 120-second flush interval.
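
For reference, that corresponds to a service section along these lines (a sketch reconstructed from the description, not the verbatim file):

```
[SERVICE]
    Flush                 120
    storage.max_chunks_up 1024
    # everything else left at the stock td-agent-bit.conf defaults
```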

My inputs:

[INPUT]
    Name    syslog
    Alias ciscoasa_all
    Tag ciscoasa_all
    Parser  syslog-rfc3164-cisco_asa
    Listen  0.0.0.0
    Port    17230
    Mode    tcp
    Buffer_Chunk_Size 28M
    Buffer_Max_Size 1792M

Various filters act on the `message` field from the syslog parser. My output:

[OUTPUT]
    Name azure
    Alias ciscoasa_all
    Log_Type FB_CL
    Match ciscoasa_all
    Customer_ID custID
    Shared_Key mykey
    Workers 64

`curl http://127.0.0.1:2020/api/v1/storage | jq` shows:

{
  "storage_layer": {
    "chunks": {
      "total_chunks": 250,
      "mem_chunks": 250,
      "fs_chunks": 0,
      "fs_chunks_up": 0,
      "fs_chunks_down": 0
    }
  },
  "input_chunks": {
    "ciscoasa_all": {
      "status": {
        "overlimit": false,
        "mem_size": "487.8M",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 250,
        "up": 250,
        "down": 0,
        "busy": 137,
        "busy_size": "267.6M"
      }
    },
    "storage_backlog.1": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    }
  }
}

Each of the servers has a different number of "stuck" chunks. They'll process the new "up" chunks in about 30-60 seconds, but the "stuck" ones stay "busy" until a restart. `syslog-ng-ctl stats` isn't showing anything queued or dropped, so I know it's not an overload issue. We're pushing 5-10 million records every 5 minutes.
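
To watch for the stuck state without eyeballing the full JSON, something like this against the same endpoint should work (a sketch; the jq paths match the output above):

```sh
# "busy" staying flat while "up" keeps cycling => chunks are stuck
watch -n 30 \
  'curl -s http://127.0.0.1:2020/api/v1/storage | jq ".input_chunks[].chunks | {total, up, busy}"'
```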

@zhming0

zhming0 commented Jan 29, 2022

The "When endpoint (splunk, loki) is unavailable" issue seems to be fixed by 1.8.12. Can't re-produce the issue anymore.

@briandefiant

We upgraded a few high-traffic systems yesterday, and after about 24 hours we haven't run into any `no available chunk` errors. I'm hoping #4318 fixed things for us 🤞.

@lecaros
Contributor Author

lecaros commented Mar 22, 2022

Hello, everyone!

First, thanks for your reports and the repro cases you've provided.

We've submitted some PRs that deal with most of the issues gathered in this ticket.
The latest ones, #5109 and #5111, not yet merged, address the network-related issues.
"When tailing from files…" also received some love through PRs already merged into 1.8.14 and 1.9.0.

If you were affected by any of these, please test the aforementioned PRs.
We appreciate your reports and welcome any feedback from running these versions.
The pending PRs should be available in 1.8.15 and 1.9.1, respectively.
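
If you want to try the fixes before your distro packages update, the official container images are the quickest route. A sketch, with a hypothetical host config path (image tags are assumed to match the release numbers):

```sh
# Mount your existing config into the image's default config location
docker run --rm \
  -v /etc/td-agent-bit/td-agent-bit.conf:/fluent-bit/etc/fluent-bit.conf \
  fluent/fluent-bit:1.8.15
```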

@lecaros
Contributor Author

lecaros commented Mar 28, 2022

Hi!
We've released the fixes in 1.8.15 and 1.9.1. With this, we're closing this issue.
Thank you, everyone, for your repro scenarios and collaboration during troubleshooting and fixing. We have a fantastic community here!
