[http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 #354

Open

LucasHantz opened this issue May 24, 2022 · 10 comments
Describe the question/issue

I'm seeing broken connection errors to Firehose and CloudWatch on containers with low traffic, as they are in a staging environment.
Once they log this error, RAM usage keeps growing until it reaches the maximum threshold and the task is killed.
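
For context, unbounded memory growth while an output is failing usually means buffered chunks piling up in RAM; a minimal sketch of the Fluent Bit buffering options that bound this, with illustrative values rather than settings taken from this setup, could look like:

[SERVICE]
    # Keep buffered chunks on disk so retried data does not accumulate in RAM
    storage.path           /var/log/flb-storage/
    storage.max_chunks_up  64

[INPUT]
    Name          tcp
    Listen        127.0.0.1
    Port          5170
    # Opt this input into filesystem buffering; storage.max_chunks_up
    # (above) then bounds how many chunks stay in memory at once
    storage.type  filesystem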

Configuration

[SERVICE]
    Parsers_File /parser.conf
    Streams_File /stream_processing.conf
    Flush 1
    Grace 30

    ## FB Metrics
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

[INPUT]
    Name tcp
    Alias tcp.atom
    Listen 127.0.0.1
    Port 5170
    Chunk_Size 32
    Buffer_Size 64
    Format json
    Tag application

[FILTER]
    Name parser
    Match platform*
    Key_Name log
    Parser json
    Reserve_Data True

[FILTER]
    Name modify
    Match application*
    Rename ecs_task_arn task_id

[OUTPUT]
    Name kinesis_firehose
    Alias kinesis.atom-logs
    Match application.logs*
    region ${AWS_REGION}
    delivery_stream atom-logs
    workers 1

[OUTPUT]
    Name kinesis_firehose
    Alias kinesis.atom-metrics
    Match application.metrics*
    region ${AWS_REGION}
    delivery_stream atom-metrics
    workers 1

### METRICS ###

# Configure FB to scrape its own prom metrics
[INPUT]
    Name exec
    Alias exec.metric
    Command curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
    Interval_Sec 30
    Tag fb_metrics

# Filter out everything except output metrics
# Customize this to change which metrics are sent
[FILTER]
    Name grep
    Match fb_metrics
    Regex exec (input|output)

# Filter out the HELP and TYPE fields which aren't parseable by the cw metric filter
[FILTER]
    Name grep
    Match fb_metrics
    Exclude exec HELP

[FILTER]
    Name grep
    Match fb_metrics
    Exclude exec TYPE

# Parse the metrics to json for easy parsing in CW Log Group Metrics filter
[FILTER]
    Name parser
    Match fb_metrics
    Key_Name exec
    Parser fluentbit_prom_metrics_to_json
    Reserve_Data True

# Send the metrics as CW Logs
# The CW Metrics filter on the log group will turn them into metrics
# Use hostname in logs to differentiate log streams per task in Fargate
# Alternative is to use: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#templating-log-group-and-stream-names
[OUTPUT]
    Name cloudwatch_logs
    Alias cloudwatch.fb_metrics
    Match fb_metrics
    region ${AWS_REGION}
    log_group_name ${FLUENT_BIT_METRICS_LOG_GROUP}
    log_stream_name metrics
    retry_limit 2
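
The Streams_File referenced above (/stream_processing.conf) is not included in the report. For readers unfamiliar with the stream processor, a purely hypothetical version that would produce the application.logs* and application.metrics* tags matched by the outputs might look roughly like this (the stream names and WHERE conditions are invented for illustration):

# Hypothetical reconstruction -- the real stream_processing.conf was not shared
[STREAM_TASK]
    Name stream.logs
    Exec CREATE STREAM logs WITH (tag='application.logs') AS SELECT * FROM TAG:'application' WHERE record_type = 'log';

[STREAM_TASK]
    Name stream.metrics
    Exec CREATE STREAM metrics WITH (tag='application.metrics') AS SELECT * FROM TAG:'application' WHERE record_type = 'metric';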

Fluent Bit Log Output

Fluent Bit v1.9.3
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/05/23 19:34:30] [ info] [fluent bit] version=1.9.3, commit=a313296229, pid=1
[2022/05/23 19:34:30] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/23 19:34:30] [ info] [cmetrics] version=0.3.1
[2022/05/23 19:34:30] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877
[2022/05/23 19:34:30] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock
[2022/05/23 19:34:30] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224
[2022/05/23 19:34:30] [ info] [input:tcp:tcp.atom] listening on 127.0.0.1:5170
[2022/05/23 19:34:30] [ info] [output:kinesis_firehose:kinesis.atom-logs] worker #0 started
[2022/05/23 19:34:30] [ info] [output:kinesis_firehose:kinesis.atom-metrics] worker #0 started
[2022/05/23 19:34:30] [ info] [output:null:null.3] worker #0 started
[2022/05/23 19:34:31] [ info] [output:cloudwatch_logs:cloudwatch.atom] worker #0 started
[2022/05/23 19:34:31] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/05/23 19:34:31] [ info] [sp] stream processor started
[2022/05/23 19:34:31] [ info] [sp] registered task: stream.logs
[2022/05/23 19:34:31] [ info] [sp] registered task: stream.metrics
[2022/05/23 19:34:32] [ info] [output:cloudwatch_logs:cloudwatch.atom] Creating log stream web/platform-firelens-0a4943272e2341abbb02344e7ee3b47d in log group /ecs/atom-platform
[2022/05/23 19:34:32] [ info] [output:cloudwatch_logs:cloudwatch.atom] Created log stream web/platform-firelens-0a4943272e2341abbb02344e7ee3b47d
[2022/05/23 19:35:01] [ info] [output:cloudwatch_logs:cloudwatch.fb_metrics] Creating log stream metrics in log group /firelens/atom-platform
[2022/05/23 19:35:01] [ info] [output:cloudwatch_logs:cloudwatch.fb_metrics] Log Stream metrics already exists
[2022/05/23 19:35:41] [error] [net] connection #43 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-logs] Failed to send log records to atom-logs
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-logs] Failed to send log records
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records to atom-metrics
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send records
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-logs] Failed to send records
[2022/05/23 20:40:56] [ warn] [engine] failed to flush chunk '1-1653338455.964299341.flb', retry in 8 seconds: task_id=1, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)
[2022/05/23 20:40:56] [ warn] [engine] failed to flush chunk '1-1653338455.964432859.flb', retry in 10 seconds: task_id=0, input=stream.logs > output=kinesis.atom-logs (out_id=0)
[2022/05/23 20:41:04] [ info] [engine] flush chunk '1-1653338455.964299341.flb' succeeded at retry 1: task_id=1, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)
[2022/05/23 20:41:06] [ info] [engine] flush chunk '1-1653338455.964432859.flb' succeeded at retry 1: task_id=0, input=stream.logs > output=kinesis.atom-logs (out_id=0)
[2022/05/23 21:20:41] [error] [net] connection #164 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/23 22:00:11] [error] [net] connection #174 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/23 22:38:41] [error] [net] connection #189 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/24 04:50:11] [error] [net] connection #211 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/24 05:21:30] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/24 05:21:30] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/24 05:21:30] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records to atom-metrics
[2022/05/24 05:21:30] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records
[2022/05/24 05:21:30] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send records
[2022/05/24 05:21:30] [ warn] [engine] failed to flush chunk '1-1653369689.464347392.flb', retry in 8 seconds: task_id=0, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)
[2022/05/24 05:21:38] [ info] [engine] flush chunk '1-1653369689.464347392.flb' succeeded at retry 1: task_id=0, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)

Fluent Bit Version Info

I can reproduce the same error with both the stable and latest versions of the image.

Cluster Details

ECS Fargate with awsvpc networking.
The Firehose and CloudWatch VPC endpoints are enabled.

Application Details

NTR

Steps to reproduce issue

We ran load testing on the container with the same configuration without seeing this error, so it seems the error happens when throughput is low.

Related Issues

This is the new configuration I've come up with based on the recommendation given here:
#351

Let me know if I did something wrong.

@DrewZhang13 (Contributor) commented May 26, 2022

This is the current recommendation for the CloudWatch plugin config. Could you try these settings?

Also, I wonder how your load testing is running. It doesn't seem to make sense to me that these errors show up only at low throughput and not at high throughput.

@LucasHantz (Author)

[image: load-test graphs and FireLens metrics]

The above graphs show a load test we ran on our application and the metrics generated by FireLens during that time.
We had no "[http_client] broken connection" errors during the test, but we saw new errors later that day when the cluster was idle.

From what I see in the guidance, that config helps with high-throughput cases, which is not the problem here. Should I try it anyway?

@LucasHantz (Author)

Tried with the new config, and still seeing:
[error] [net] connection #51 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
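
Note: the 10-second figure in that message matches Fluent Bit's default net.connect_timeout, which can be raised per output if connection establishment to the endpoint is genuinely slow. A sketch with an illustrative value, not a confirmed fix for this issue:

[OUTPUT]
    Name                 cloudwatch_logs
    Match                fb_metrics
    region               ${AWS_REGION}
    log_group_name       ${FLUENT_BIT_METRICS_LOG_GROUP}
    log_stream_name      metrics
    # Allow more time for the TCP/TLS handshake (default is 10 seconds)
    net.connect_timeout  30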

@LucasHantz (Author)

Any news on this, please?

@zhonghui12 (Contributor)

@LucasHantz, may I confirm that the issue only occurs at low throughput? That is, you run Fluent Bit the same way with the same config and only see problems at a lower ingestion rate, correct? May I also know what the throughput is?

@LucasHantz (Author)

In fact, the issue happens at both low and high throughput. The following graph shows the number of records per minute over the last 8 hours.
[image: records per minute over the last 8 hours]

As you can see, twice in the last 8 hours Fluent Bit stalled and stopped reporting any new logs.
At that point, this is the error log we have:
[2022/06/21 14:15:31] [error] [upstream] connection #609 to firehose.eu-west-1.amazonaws.com:443 timed out after 10 seconds
[2022/06/21 14:15:31] [error] [aws_client] connection initialization error

This continues until the Fluent Bit container's memory explodes and forces the whole task to shut down.

@LucasHantz (Author)

Any thoughts on this? What more can I provide to help figure out this problem?

@LucasHantz (Author)

I just saw the issue raised in fluent/fluent-bit#5705; I'm getting that error in our traces as well.

@LucasHantz (Author)

@PettitWesley maybe? Is there any way to get this bumped up in priority? It's impacting our prod, and I don't see how to revert to a stable solution for this.

@PettitWesley (Contributor)

@LucasHantz Unfortunately right now I don't have any good ideas beyond using the settings here: #340

And checking this: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#network-connection-issues
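
For reference on the kind of tuning that guide discusses: the broken connection message can show up when a kept-alive connection is reused after the remote side has already closed it, and the core network options usually adjusted in that case look roughly like this (illustrative values, not the exact settings from #340):

[OUTPUT]
    Name                        kinesis_firehose
    Match                       application.logs*
    region                      ${AWS_REGION}
    delivery_stream             atom-logs
    # Recycle idle keepalive connections before the remote end drops them
    net.keepalive               On
    net.keepalive_idle_timeout  10
    net.keepalive_max_recycle   100
    # Retry a failed request once at the plugin level before the engine-level retry
    auto_retry_requests         true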
