[http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 #354

Open

LucasHantz opened this issue May 24, 2022 · 10 comments
Describe the question/issue

I'm seeing broken connection errors to Firehose and CloudWatch on containers with low traffic, as they are in a staging environment.
Once they log this error, RAM usage keeps growing until it reaches the maximum threshold and the task is killed.
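
For context, unbounded memory growth while an output is failing usually means buffered chunks piling up in RAM; a minimal sketch of the Fluent Bit buffering options that bound this, with illustrative values rather than settings taken from this setup, could look like:

[SERVICE]
    # Keep buffered chunks on disk so retried data does not accumulate in RAM
    storage.path           /var/log/flb-storage/
    storage.max_chunks_up  64

[INPUT]
    Name          tcp
    Listen        127.0.0.1
    Port          5170
    # Opt this input into filesystem buffering; storage.max_chunks_up
    # (above) then bounds how many chunks stay in memory at once
    storage.type  filesystem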

Configuration

[SERVICE]
    Parsers_File /parser.conf
    Streams_File /stream_processing.conf
    Flush 1
    Grace 30

    ## FB Metrics
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

[INPUT]
    Name tcp
    Alias tcp.atom
    Listen 127.0.0.1
    Port 5170
    Chunk_Size 32
    Buffer_Size 64
    Format json
    Tag application

[FILTER]
    Name parser
    Match platform*
    Key_Name log
    Parser json
    Reserve_Data True

[FILTER]
    Name modify
    Match application*
    Rename ecs_task_arn task_id

[OUTPUT]
    Name kinesis_firehose
    Alias kinesis.atom-logs
    Match application.logs*
    region ${AWS_REGION}
    delivery_stream atom-logs
    workers 1

[OUTPUT]
    Name kinesis_firehose
    Alias kinesis.atom-metrics
    Match application.metrics*
    region ${AWS_REGION}
    delivery_stream atom-metrics
    workers 1

### METRICS ###

# Configure FB to scrape its own prom metrics
[INPUT]
    Name exec
    Alias exec.metric
    Command curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
    Interval_Sec 30
    Tag fb_metrics

# Filter out everything except output metrics
# Customize this to change which metrics are sent
[FILTER]
    Name grep
    Match fb_metrics
    Regex exec (input|output)

# Filter out the HELP and TYPE fields which aren't parseable by the cw metric filter
[FILTER]
    Name grep
    Match fb_metrics
    Exclude exec HELP

[FILTER]
    Name grep
    Match fb_metrics
    Exclude exec TYPE

# Parse the metrics to json for easy parsing in CW Log Group Metrics filter
[FILTER]
    Name parser
    Match fb_metrics
    Key_Name exec
    Parser fluentbit_prom_metrics_to_json
    Reserve_Data True

# Send the metrics as CW Logs
# The CW Metrics filter on the log group will turn them into metrics
# Use hostname in logs to differentiate log streams per task in Fargate
# Alternative is to use: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#templating-log-group-and-stream-names
[OUTPUT]
    Name cloudwatch_logs
    Alias cloudwatch.fb_metrics
    Match fb_metrics
    region ${AWS_REGION}
    log_group_name ${FLUENT_BIT_METRICS_LOG_GROUP}
    log_stream_name metrics
    retry_limit 2
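
The Streams_File referenced above (/stream_processing.conf) is not included in the report. For readers unfamiliar with the stream processor, a purely hypothetical version that would produce the application.logs* and application.metrics* tags matched by the outputs might look roughly like this (the stream names and WHERE conditions are invented for illustration):

# Hypothetical reconstruction -- the real stream_processing.conf was not shared
[STREAM_TASK]
    Name stream.logs
    Exec CREATE STREAM logs WITH (tag='application.logs') AS SELECT * FROM TAG:'application' WHERE record_type = 'log';

[STREAM_TASK]
    Name stream.metrics
    Exec CREATE STREAM metrics WITH (tag='application.metrics') AS SELECT * FROM TAG:'application' WHERE record_type = 'metric';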

Fluent Bit Log Output

Fluent Bit v1.9.3
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/05/23 19:34:30] [ info] [fluent bit] version=1.9.3, commit=a313296229, pid=1
[2022/05/23 19:34:30] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/23 19:34:30] [ info] [cmetrics] version=0.3.1
[2022/05/23 19:34:30] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877
[2022/05/23 19:34:30] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock
[2022/05/23 19:34:30] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224
[2022/05/23 19:34:30] [ info] [input:tcp:tcp.atom] listening on 127.0.0.1:5170
[2022/05/23 19:34:30] [ info] [output:kinesis_firehose:kinesis.atom-logs] worker #0 started
[2022/05/23 19:34:30] [ info] [output:kinesis_firehose:kinesis.atom-metrics] worker #0 started
[2022/05/23 19:34:30] [ info] [output:null:null.3] worker #0 started
[2022/05/23 19:34:31] [ info] [output:cloudwatch_logs:cloudwatch.atom] worker #0 started
[2022/05/23 19:34:31] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/05/23 19:34:31] [ info] [sp] stream processor started
[2022/05/23 19:34:31] [ info] [sp] registered task: stream.logs
[2022/05/23 19:34:31] [ info] [sp] registered task: stream.metrics
[2022/05/23 19:34:32] [ info] [output:cloudwatch_logs:cloudwatch.atom] Creating log stream web/platform-firelens-0a4943272e2341abbb02344e7ee3b47d in log group /ecs/atom-platform
[2022/05/23 19:34:32] [ info] [output:cloudwatch_logs:cloudwatch.atom] Created log stream web/platform-firelens-0a4943272e2341abbb02344e7ee3b47d
[2022/05/23 19:35:01] [ info] [output:cloudwatch_logs:cloudwatch.fb_metrics] Creating log stream metrics in log group /firelens/atom-platform
[2022/05/23 19:35:01] [ info] [output:cloudwatch_logs:cloudwatch.fb_metrics] Log Stream metrics already exists
[2022/05/23 19:35:41] [error] [net] connection #43 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-logs] Failed to send log records to atom-logs
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-logs] Failed to send log records
[2022/05/23 20:40:56] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records to atom-metrics
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send records
[2022/05/23 20:40:56] [error] [output:kinesis_firehose:kinesis.atom-logs] Failed to send records
[2022/05/23 20:40:56] [ warn] [engine] failed to flush chunk '1-1653338455.964299341.flb', retry in 8 seconds: task_id=1, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)
[2022/05/23 20:40:56] [ warn] [engine] failed to flush chunk '1-1653338455.964432859.flb', retry in 10 seconds: task_id=0, input=stream.logs > output=kinesis.atom-logs (out_id=0)
[2022/05/23 20:41:04] [ info] [engine] flush chunk '1-1653338455.964299341.flb' succeeded at retry 1: task_id=1, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)
[2022/05/23 20:41:06] [ info] [engine] flush chunk '1-1653338455.964432859.flb' succeeded at retry 1: task_id=0, input=stream.logs > output=kinesis.atom-logs (out_id=0)
[2022/05/23 21:20:41] [error] [net] connection #164 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/23 22:00:11] [error] [net] connection #174 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/23 22:38:41] [error] [net] connection #189 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/24 04:50:11] [error] [net] connection #211 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
[2022/05/24 05:21:30] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/24 05:21:30] [error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443 ?
[2022/05/24 05:21:30] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records to atom-metrics
[2022/05/24 05:21:30] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send log records
[2022/05/24 05:21:30] [error] [output:kinesis_firehose:kinesis.atom-metrics] Failed to send records
[2022/05/24 05:21:30] [ warn] [engine] failed to flush chunk '1-1653369689.464347392.flb', retry in 8 seconds: task_id=0, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)
[2022/05/24 05:21:38] [ info] [engine] flush chunk '1-1653369689.464347392.flb' succeeded at retry 1: task_id=0, input=stream.metrics > output=kinesis.atom-metrics (out_id=1)

Fluent Bit Version Info

I can reproduce the same error with both the stable and latest versions of the image.

Cluster Details

ECS Fargate with awsvpc networking.
The Firehose and CloudWatch VPC endpoints are enabled.

Application Details

NTR

Steps to reproduce issue

We ran load testing on the container with the same configuration without seeing this error, so it seems the error happens when throughput is low.

Related Issues

This is the new configuration I've come up with based on the recommendation given here:
#351

Let me know if I did something wrong.

@DrewZhang13 (Contributor) commented May 26, 2022

This is the current recommendation for the CloudWatch plugin config. Could you try these settings?

Also, I wonder how your load testing is running. It doesn't seem to make sense to me that these errors show up only at low throughput and not at high throughput.

@LucasHantz (Author)

[image: load-test graphs and FireLens metrics]

The above graphs show a load test we ran on our application and the metrics generated by FireLens during that time.
We had no "[http_client] broken connection" errors during the test, but we saw new errors later that day when the cluster was idle.

From what I see in the guidance, that config helps with high-throughput cases, which is not the problem here. Should I try it anyway?

@LucasHantz (Author)

Tried with the new config, and still seeing:
[error] [net] connection #51 timeout after 10 seconds to: logs.eu-west-1.amazonaws.com:443
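
Note: the 10-second figure in that message matches Fluent Bit's default net.connect_timeout, which can be raised per output if connection establishment to the endpoint is genuinely slow. A sketch with an illustrative value, not a confirmed fix for this issue:

[OUTPUT]
    Name                 cloudwatch_logs
    Match                fb_metrics
    region               ${AWS_REGION}
    log_group_name       ${FLUENT_BIT_METRICS_LOG_GROUP}
    log_stream_name      metrics
    # Allow more time for the TCP/TLS handshake (default is 10 seconds)
    net.connect_timeout  30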

@LucasHantz (Author)

Any news on this, please?

@zhonghui12 (Contributor)

@LucasHantz, may I confirm that the issue only occurs at low throughput? That is, you run Fluent Bit the same way with the same config and only see problems at a lower ingestion rate, correct? May I also know what the throughput is?

@LucasHantz (Author)

In fact, the issue happens at both low and high throughput. The following graph shows the number of records per minute over the last 8 hours.
[image: records per minute over the last 8 hours]

As you can see, twice in the last 8 hours Fluent Bit stalled and stopped reporting any new logs.
At that point, this is the error log we have:
[2022/06/21 14:15:31] [error] [upstream] connection #609 to firehose.eu-west-1.amazonaws.com:443 timed out after 10 seconds
[2022/06/21 14:15:31] [error] [aws_client] connection initialization error

This continues until the Fluent Bit container's memory explodes and forces the whole task to shut down.

@LucasHantz (Author)

Any thoughts on this? What more can I provide to help figure out this problem?

@LucasHantz (Author)

I just saw the issue raised in fluent/fluent-bit#5705; I'm getting that error in our traces as well.

@LucasHantz (Author)

@PettitWesley maybe? Is there any way to get this bumped up in priority? It's impacting our prod, and I don't see how to revert to a stable solution for this.

@PettitWesley (Contributor)

@LucasHantz Unfortunately right now I don't have any good ideas beyond using the settings here: #340

And checking this: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#network-connection-issues
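
For reference on the kind of tuning that guide discusses: the broken connection message can show up when a kept-alive connection is reused after the remote side has already closed it, and the core network options usually adjusted in that case look roughly like this (illustrative values, not the exact settings from #340):

[OUTPUT]
    Name                        kinesis_firehose
    Match                       application.logs*
    region                      ${AWS_REGION}
    delivery_stream             atom-logs
    # Recycle idle keepalive connections before the remote end drops them
    net.keepalive               On
    net.keepalive_idle_timeout  10
    net.keepalive_max_recycle   100
    # Retry a failed request once at the plugin level before the engine-level retry
    auto_retry_requests         true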
