Known Issue: Data is not flowing under some conditions #4505

Closed

lecaros opened this issue Dec 21, 2021 · 9 comments

@lecaros
Contributor

lecaros commented Dec 21, 2021

We have received some reports that data is not flowing under certain circumstances. Below you'll see some of them:

| Condition | Issue # | Workaround | Version |
| --- | --- | --- | --- |
| After network timeout | #4503 | #3541 | 1.8.11 |
| When endpoint (splunk, loki) is unavailable | #4497 #3328 #3014 | | 1.8.11 |
| After failed to flush chunk (and others) | #3581 | Restart fluent-bit | 1.7.x |
| When connection to fluentd is lost | #1022 #4313 | Restart fluent-bit | |
| After error "Bad file descriptor" is thrown | #3540 | | Should be solved in 1.7 |
| When tailing from files with accumulated data (i.e. fluent-bit started after data starts to pile up) | | Data starts to flow after a while | 1.8.11 |

If you're experiencing any of these or similar issues, please provide steps to reproduce so we can troubleshoot and validate the proposed fixes.

Update 03/22: PRs #5109 and #5111 provide fixes for the network-related issues listed here. If you were affected by any of the items on the list (including the non-network-related ones), please test them.
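
When you report one of these conditions, a snapshot from the built-in monitoring endpoint helps confirm that data has actually stopped flowing. A minimal sketch, assuming `HTTP_Server On` (and the default port 2020) in your `[SERVICE]` section:

```sh
# Chunk/storage state; "busy" chunks that never drain are a red flag
curl -s http://127.0.0.1:2020/api/v1/storage | jq .

# Per-plugin counters; if an output's "proc_records" stops increasing
# while inputs keep ingesting, data is not flowing
curl -s http://127.0.0.1:2020/api/v1/metrics | jq .
```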

@briandefiant

briandefiant commented Dec 23, 2021

I'm seeing an issue with 1.8.11 and the S3 output where data stops flowing after the following errors:

[2021/12/22 23:16:43] [error] [tls] SSL error: SSL - The peer notified us that the connection is going to be closed
[2021/12/22 23:16:43] [error] [http_client] broken connection to s3.redacted.amazonaws.com:443 ?
[2021/12/22 23:16:43] [error] [http_client] broken connection to s3.redacted.amazonaws.com:443 ?
[2021/12/22 23:16:43] [error] [output:s3:s3.0] PutObject request failed
[2021/12/22 23:16:43] [error] [output:s3:s3.0] Could not send chunk with tag auditd

The log then starts filling up with:

[2021/12/22 23:17:23] [error] [input chunk] no enough space in filesystem to buffer chunk 2406707-1640215043.826250542.flb in plugin s3.0
[2021/12/22 23:17:23] [error] [input chunk] no available chunk
[2021/12/22 23:17:28] [error] [input chunk] no enough space in filesystem to buffer chunk 2406707-1640215048.897513660.flb in plugin s3.0
[2021/12/22 23:17:28] [error] [input chunk] no available chunk
[2021/12/22 23:17:29] [error] [input chunk] no enough space in filesystem to buffer chunk 2406707-1640215049.823568646.flb in plugin s3.0
[2021/12/22 23:17:29] [error] [input chunk] no available chunk

On high-traffic systems the log file grows very quickly as it fills with these errors (about 2 GB per day). Restarting the Fluent Bit service resolves the issue.
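
Until the underlying bug is fixed, a logrotate policy could at least keep the error spam from filling the disk. A minimal sketch, assuming the agent logs to a file like /var/log/td-agent-bit.log (the path is an assumption and varies by install):

```
# /etc/logrotate.d/td-agent-bit -- hypothetical; adjust the log path for your install
/var/log/td-agent-bit.log {
    # "hourly" only takes effect if logrotate itself is invoked hourly
    # (e.g. from cron.hourly), which matters at ~2 GB/day rates
    hourly
    rotate 4
    maxsize 256M
    compress
    missingok
    notifempty
    copytruncate
}
```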

There are also PutObject errors interspersed with the no available chunk errors. For example:

[2021/12/22 23:18:13] [error] [output:s3:s3.0] PutObject request failed
[2021/12/22 23:18:13] [error] [output:s3:s3.0] Could not send chunk with tag O  ^RЅ<8E><8B><D7>
[2021/12/22 23:18:23] [error] [output:s3:s3.0] PutObject API responded with error='HttpVersionNotSupported', message='The HTTP version specified is not supported.'
[2021/12/22 23:18:23] [error] [output:s3:s3.0] Raw PutObject response: HTTP/1.1 505 HTTP Version Not Supported
x-amz-request-id: redacted
x-amz-id-2: redacted
Date: Wed, 22 Dec 2021 23:18:23 GMT
Server: AmazonS3
Content-Type: application/xml
Transfer-Encoding: chunked
Connection: close

<?xml version="1.0" encoding="UTF-8"?><Error><Code>HttpVersionNotSupported</Code><Message>The HTTP version specified is not supported.</Message><RequestId>redacted</RequestId><HostId>redacted/</HostId></Error>
[2021/12/22 23:18:23] [error] [output:s3:s3.0] PutObject request failed
[2021/12/22 23:18:23] [error] [output:s3:s3.0] Could not send chunk with tag O  ^RЅ<8E><8B><D7>

It's not always HttpVersionNotSupported errors. On another system with similar symptoms we see this message:

[2021/12/23 07:29:19] [error] [output:s3:s3.0] PutObject API responded with error='SignatureDoesNotMatch', message='The request signature we calculated does not match the signature you provided. Check your key and signing method.'
[2021/12/23 07:29:19] [error] [output:s3:s3.0] Raw PutObject response: HTTP/1.1 403 Forbidden
x-amz-request-id: redacted
x-amz-id-2: redacted
Content-Type: application/xml
Transfer-Encoding: chunked
Date: Thu, 23 Dec 2021 07:29:18 GMT
Server: AmazonS3
Connection: close

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>SignatureDoesNotMatch</Code><Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message><AWSAccessKeyId>redacted</AWSAccessKeyId><StringToSign>AWS4-HMAC-SHA256
20211223T072919Z
20211223/redacted/s3/aws4_request
redacted</StringToSign><SignatureProvided>redacted</SignatureProvided><StringToSignBytes>redacted</StringToSignBytes><CanonicalRequest>PUT
/redacted/fluent-bit-logs/%1A%C3%82%C2%BA%C3%82%C2%80%10%C3%83%C2%8E%C3%82%C2%9E%05ccess%C3%82%C2%92%C3%83%C2%97/2021/12/23/07/28/WETNO7FN

content-length:1710593
host:s3.redacted.amazonaws.com
user-agent:aws-fluent-bit-plugin
x-amz-acl:bucket-owner-full-control
x-amz-content-sha256:redacted
x-amz-date:20211223T072919Z
x-amz-security-token:redacted

content-length;host;user-agent;x-amz-acl;x-amz-content-sha256;x-amz-date;x-amz-security-token
redacted</CanonicalRequest><CanonicalRequestBytes>redacted
[2021/12/23 07:29:19] [error] [output:s3:s3.0] PutObject request failed
[2021/12/23 07:29:19] [error] [output:s3:s3.0] Could not send chunk with tag ^Z<BA><80>^PΞ^Eccess<92><D7>

Running `kill -CONT $(pidof td-agent-bit)` while this error is occurring doesn't appear to show any issues:

Dec 23 14:56:43 redacted td-agent-bit[2406707]: [2021/12/23 14:56:43] [engine] caught signal (SIGCONT)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: [2021/12/23 14:56:43] Fluent Bit Dump
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ===== Input =====
Dec 23 14:56:43 redacted td-agent-bit[2406707]: tail.0 (tail)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ status
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ overlimit     : no
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     ├─ mem size   : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     └─ mem limit  : 30.5M (32000000 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ tasks
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ total tasks   : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ new           : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ running       : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ size          : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: └─ chunks
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    └─ total chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ up chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ down chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       └─ busy chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          ├─ size    : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          └─ size err: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: tail.1 (tail)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ status
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ overlimit     : no
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     ├─ mem size   : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     └─ mem limit  : 30.5M (32000000 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ tasks
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ total tasks   : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ new           : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ running       : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ size          : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: └─ chunks
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    └─ total chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ up chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ down chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       └─ busy chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          ├─ size    : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          └─ size err: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: tail.2 (tail)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ status
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ overlimit     : no
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     ├─ mem size   : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     └─ mem limit  : 30.5M (32000000 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ tasks
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ total tasks   : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ new           : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ running       : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ size          : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: └─ chunks
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    └─ total chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ up chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ down chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       └─ busy chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          ├─ size    : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          └─ size err: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: storage_backlog.3 (storage_backlog)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ status
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ overlimit     : no
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     ├─ mem size   : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │     └─ mem limit  : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ tasks
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ total tasks   : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ new           : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  ├─ running       : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │  └─ size          : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]: │
Dec 23 14:56:43 redacted td-agent-bit[2406707]: └─ chunks
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    └─ total chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ up chunks  : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       ├─ down chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:       └─ busy chunks: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          ├─ size    : 0b (0 bytes)
Dec 23 14:56:43 redacted td-agent-bit[2406707]:          └─ size err: 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ===== Storage Layer =====
Dec 23 14:56:43 redacted td-agent-bit[2406707]: total chunks     : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: ├─ mem chunks    : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]: └─ fs chunks     : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    ├─ up         : 0
Dec 23 14:56:43 redacted td-agent-bit[2406707]:    └─ down       : 0

Our S3 output configuration:

[OUTPUT]
    Name s3
    Match *
    bucket redacted
    region redacted
    s3_key_format /fluent-bit-logs/$TAG/%Y/%m/%d/%H/%M/$UUID
    canned_acl bucket-owner-full-control
    total_file_size 128M
    upload_timeout 1m
    auto_retry_requests true
    storage.total_limit_size 512M

Let me know if any additional information would be helpful. I'm hoping this PR might fix the issue, but I thought I would report it anyway.

It's difficult to capture debug logs when this occurs, since it happens seemingly at random, and debug logs would grow too quickly on the high-traffic systems where the issue happens most often.
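
One option for narrowing the capture, assuming the build honors the per-plugin `log_level` property (it's a documented plugin option, but I haven't verified it on 1.8.11), would be raising verbosity on the S3 output only:

```
[OUTPUT]
    Name      s3
    Match     *
    # assumption: per-plugin log_level overrides the global Log_Level,
    # so only this output logs at debug
    log_level debug
    # ...rest of the S3 settings as posted above...
```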

@elruwen

elruwen commented Jan 5, 2022

As part of reproducing #3014, I created a few tests that reproduce these issues. Chunks also get stuck under high load even with the destination up the whole time.

@JeffLuoo
Contributor

Hi @briandefiant

You are setting

    storage.total_limit_size 512M

in your storage configuration, and this will cause the `no enough space in filesystem ...` warning if the size of the files stored on disk exceeds the limit you set. Could you please check the size of the data under the filesystem path you configured in the service section, to see if it is bigger than the limit?

@briandefiant

Thanks, @JeffLuoo, I should have mentioned that. When this error has occurred in the past, I've checked Fluent Bit's storage directory and it was only a few MB. For example:

# du -h /run/fluent-bit/storage/ | tail -1
3.0M    /run/fluent-bit/storage/

@rhavenn

rhavenn commented Jan 14, 2022

I have this issue as well. The output is Azure Log Analytics. syslog-ng load-balances Cisco ASA logs across 4 Fluent Bit (FB) systems via the syslog listener input. I'm only keeping chunks in RAM. The hosts are Azure VMs running Debian 11.2 with the official package from the FB repo, 4 cores and 16 GB RAM each.

I get errors like the following:

2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds
2022-01-14T05:19:44-0900 oitssomclogfbp2 td-agent-bit[54917]: [2022/01/14 05:19:44] [error] [upstream] connection #-1 to siteidhere.ods.opinsights.azure.com:443 timed out after 10 seconds

My config: `storage.max_chunks_up 1024` and an otherwise pretty stock td-agent-bit.conf, with a 120-second flush interval.
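
For reference, that corresponds to a service section along these lines (a sketch reconstructed from the description, not the verbatim file):

```
[SERVICE]
    Flush                 120
    storage.max_chunks_up 1024
    # everything else left at the stock td-agent-bit.conf defaults
```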

My inputs:

[INPUT]
    Name    syslog
    Alias ciscoasa_all
    Tag ciscoasa_all
    Parser  syslog-rfc3164-cisco_asa
    Listen  0.0.0.0
    Port    17230
    Mode    tcp
    Buffer_Chunk_Size 28M
    Buffer_Max_Size 1792M

Various filters act on the `message` field from the syslog parser. My output:

[OUTPUT]
    Name azure
    Alias ciscoasa_all
    Log_Type FB_CL
    Match ciscoasa_all
    Customer_ID custID
    Shared_Key mykey
    Workers 64

`curl http://127.0.0.1:2020/api/v1/storage | jq` shows:

{
  "storage_layer": {
    "chunks": {
      "total_chunks": 250,
      "mem_chunks": 250,
      "fs_chunks": 0,
      "fs_chunks_up": 0,
      "fs_chunks_down": 0
    }
  },
  "input_chunks": {
    "ciscoasa_all": {
      "status": {
        "overlimit": false,
        "mem_size": "487.8M",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 250,
        "up": 250,
        "down": 0,
        "busy": 137,
        "busy_size": "267.6M"
      }
    },
    "storage_backlog.1": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    }
  }
}

Each of the servers has a different number of "stuck" chunks. They'll process the new "up" chunks in about 30-60 seconds, but the "stuck" ones stay "busy" until a restart. `syslog-ng-ctl stats` isn't showing anything queued or dropped, so I know it's not an overload issue. We're pushing 5-10 million records every 5 minutes.
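
To watch for the stuck state without eyeballing the full JSON, something like this against the same endpoint should work (a sketch; the jq paths match the output above):

```sh
# "busy" staying flat while "up" keeps cycling => chunks are stuck
watch -n 30 \
  'curl -s http://127.0.0.1:2020/api/v1/storage | jq ".input_chunks[].chunks | {total, up, busy}"'
```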

@zhming0

zhming0 commented Jan 29, 2022

The "When endpoint (splunk, loki) is unavailable" issue seems to be fixed by 1.8.12. Can't re-produce the issue anymore.

@briandefiant

We upgraded a few high-traffic systems yesterday, and after about 24 hours we haven't run into any `no available chunk` errors. I'm hoping #4318 fixed things for us 🤞.

@lecaros
Contributor Author

lecaros commented Mar 22, 2022

Hello, everyone!

First, thanks for your reports and the repro cases you've provided.

We've submitted some PRs that deal with most of the issues gathered in this ticket.
The latest ones, #5109 and #5111, not yet merged, address the network-related issues.
"When tailing from files…" also received some love through PRs already merged into 1.8.14 and 1.9.0.

If you were affected by any of these, please test the aforementioned PRs.
We appreciate your reports and welcome any feedback from running these versions.
The pending PRs should be available in 1.8.15 and 1.9.1, respectively.
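
If you want to try the fixes before your distro packages update, the official container images are the quickest route. A sketch, with a hypothetical host config path (image tags are assumed to match the release numbers):

```sh
# Mount your existing config into the image's default config location
docker run --rm \
  -v /etc/td-agent-bit/td-agent-bit.conf:/fluent-bit/etc/fluent-bit.conf \
  fluent/fluent-bit:1.8.15
```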

@lecaros
Contributor Author

lecaros commented Mar 28, 2022

Hi!
We've released the fixes in 1.8.15 and 1.9.1. With this, we're closing this issue.
Thank you, everyone, for your repro scenarios and collaboration during troubleshooting and fixing. We have a fantastic community here!
