Recommended Cloudwatch_Logs Configuration #340

Open
matthewfala opened this issue May 3, 2022 · 2 comments
Labels
guidance Customer is seeking guidance from us/the community

Comments

@matthewfala
Contributor

matthewfala commented May 3, 2022

NOTICE: please see the main tracking ticket for multiple recently reported high impact issues in AWS for Fluent Bit: #542

Recommended Cloudwatch_Logs Configuration

Recently our team has received many inquiries about tuning the cloudwatch_logs output plugin via its configuration.

Customers using poorly tuned cloudwatch_logs configurations may experience log loss due to:

  • Broken connection / network errors
  • Lack of retries on batch failures
  • Lack of immediate network retries on network failure

These issues can be resolved via appropriate configuration.

If you are configuring FireLens via a Fluent Bit config file, use the following cloudwatch_logs configuration:

[OUTPUT]
    # general cloudwatch_logs configuration (nothing special here, customize to fit your use case)
    Name                cloudwatch_logs
    Match               ApplicationLogs
    region              ${LOG_REGION}
    log_group_name      ${SERVICE_NAME}-ApplicationLogs
    log_stream_prefix   ApplicationLogs--${HOSTNAME}
    auto_create_group   On

    # if you want to only write the log string without container metadata fields
    log_key             log

    # from aws-for-fluent-bit v2.32.0 and on, to support higher throughput logging,
    # set workers to a high value such as 5 or the number of cores on your host
    workers             1

    # optimized cloudwatch_logs output configuration
    # delayed retries on error 
    retry_limit         5    
    # on is default
    net.keepalive       On
    # CloudWatch uses a 6s idle timeout; Fluent Bit checks connections on a 1.5s timer.
    # 4s ensures Fluent Bit always closes the connection itself, which we found
    # significantly reduces the rate of network error messages it outputs
    net.keepalive_idle_timeout 4s
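
The choice of 4s can be checked against the two timings quoted in the config comments. The following is a minimal sketch of that arithmetic, not something taken from the Fluent Bit source. It assumes that an idle connection is closed at the first 1.5s timer check after the idle timeout has elapsed, so the worst-case close time is the timeout plus one check interval. The function and constant names are hypothetical.

```python
# Timings taken from the comments in the config above.
CW_IDLE_TIMEOUT_S = 6.0      # CloudWatch closes idle connections after 6s
FLB_CHECK_INTERVAL_S = 1.5   # Fluent Bit checks connections on a 1.5s timer


def worst_case_close_s(keepalive_idle_timeout_s: float) -> float:
    """Latest idle time (seconds) at which Fluent Bit closes a connection.

    Assumption: the connection is closed at the first timer check after the
    idle timeout has elapsed, so it can be up to one check interval late.
    """
    return keepalive_idle_timeout_s + FLB_CHECK_INTERVAL_S


def closes_before_cloudwatch(keepalive_idle_timeout_s: float) -> bool:
    """True if Fluent Bit always closes the connection before CloudWatch does."""
    return worst_case_close_s(keepalive_idle_timeout_s) < CW_IDLE_TIMEOUT_S


print(worst_case_close_s(4.0))        # 5.5
print(closes_before_cloudwatch(4.0))  # True
print(closes_before_cloudwatch(5.0))  # False
```

Under this assumption, 4s leaves a 0.5s margin before the CloudWatch side would close the connection, while a value of 4.5s or higher would not guarantee that Fluent Bit closes it first.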

If you are configuring FireLens via task definition logDriver configuration options:

"logConfiguration": {
	"logDriver":"awsfirelens",
	"options": {

		// general cloudwatch_logs configuration (nothing special here, customize to fit your use case)
		"Name": "cloudwatch_logs",
		"region": "${LOG_REGION}",
		"log_group_name": "${SERVICE_NAME}-ApplicationLogs",
		"log_stream_prefix": "ApplicationLogs--${HOSTNAME}",
		"auto_create_group": "On",
		"log_key": "log",

		// optimized cloudwatch_logs output configuration
		"workers": "1",
		"auto_retry_requests": "On",
		"retry_limit": "5"
	}
}

We may update the above configuration from time to time to reflect the cloudwatch_logs configuration that provides the best performance.

@PettitWesley
Contributor

These settings used to be in the example but were removed because they match the defaults as of the upstream Fluent Bit 1.9.x series:

    # create a separate thread for each cloudwatch_logs output
    # (does not work with more than one worker per log stream due to cloudwatch_logs API concurrency limitations)
    # as of Fluent Bit 1.9, 1 worker is the default
    workers             1   
    # retry network requests immediately on failure
    # this setting also defaults to "On" in the 1.9 series. 
    auto_retry_requests On  

@Duplo-Yashwant

Are dual outputs supported under logConfiguration? For example, sending logs to CloudWatch as well as OpenSearch.

github-merge-queue bot pushed a commit to linz/topo-workflows that referenced this issue Apr 3, 2024
#### Motivation

Fluent Bit is experiencing a lot of network errors connecting to
`logs.ap-southeast-2.amazonaws.com`. This amount of errors does increase
the log storage cost, see
#374.
This is a known issue for which Fluent Bit team made [some
recommendations to reduce
it](aws/aws-for-fluent-bit#340). This PR is
applying one of these recommendations and has been tested with success
on non prod.

#### Modification

- Remove [the patch](#374)
that stops sending Fluent Bit application logs to CloudWatch
- Set the Fluent Bit `keepalive idle timeout` to 4s (default is 1.5s)
following [the recommendations made
here](aws/aws-for-fluent-bit#340).

#### Checklist

- [ ] Tests updated - N/A
- [x] Docs updated
- [x] Issue linked in Title

---------

Co-authored-by: Victor Engmark <vengmark@linz.govt.nz>