Robustness and resiliency on Azure #3109
Hi again,

based on the findings outlined above, I compiled a list of suggestions for our team the other day, which I also wanted to share here. With kind regards.

Proposals

Upgrade to librdkafka 1.5.0

Based on findings from others when running Kafka on Azure, and especially when using Azure Event Hubs with librdkafka, Magnus Edenhill, the author of librdkafka, implemented some fixes between version 1.3.0 and 1.5.0.

Magnus Edenhill:
-- via: #2845

Apply recommended settings for Kafka on Azure

Charles Culver:
Here are the recommended configurations for using Azure Event Hubs from Apache Kafka client applications:
Kernel TCP settings for Azure VMs

In the same spirit, I would also like to outline some recommended settings for mitigating the "Azure LB closing idle network connections" problem on the host level. These settings can probably be applied to the nodes (VMs) as well as to the Kubernetes pods/containers.
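The concrete host-level values were not preserved in this scrape. As a sketch, a commonly recommended Linux tuning for Azure's ~4-minute idle timeout looks like the following sysctl fragment; the numbers below are illustrative assumptions, not the figures originally quoted in this thread:

```ini
# /etc/sysctl.d/99-azure-keepalive.conf (illustrative values)
# Start sending TCP keepalive probes well before Azure's ~4-minute
# idle timeout silently drops the connection.
net.ipv4.tcp_keepalive_time = 120
# Once keepalives start, probe every 30 seconds.
net.ipv4.tcp_keepalive_intvl = 30
# Declare the peer dead after 8 failed probes.
net.ipv4.tcp_keepalive_probes = 8
# Cap data retransmission attempts so dead connections are detected
# within minutes rather than after 15+ minutes of backoff.
net.ipv4.tcp_retries2 = 8
```

On Kubernetes, the per-socket keepalive knobs are namespaced and can also be set per pod via securityContext.sysctls, so applying them at the container level is an option alongside node-level configuration.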
Resources
Following all of these observations, we planned to upgrade to librdkafka 1.5.0 back then. However, there has been a regression in 1.5.0 regarding roundrobin crashes, reported through #3024, confluentinc/confluent-kafka-dotnet#1366 and confluentinc/confluent-kafka-dotnet#1369, and designated as a critical issue. @mhowlett reported on behalf of @wwarby (confluentinc/confluent-kafka-dotnet#1366 (comment)):
As we haven't been sure whether we might run into this on our production systems, we figured we would be better off including the fix from #3049. However, back then, the next official release was said to be 1.6.0, happening on October 15, 2020, so we decided to go for librdkafka-1.5.0-e320a2d, which has been keeping our systems happy so far. Thank you so much!

Now, we are happy to see that the original milestone 1.6.0 was repurposed into milestone 1.5.2, with a corresponding pre-release of v1.5.2-RC1, and will consider going for that within the next iteration.

P.S.: While going through this story, we also discovered other open issues about stalled/stuck consumers observed with 1.4.2 at #2933, #2944 and #3082. As far as we have been able to see, these might be related to changing IP addresses while performing rolling updates on Kafka instances within a Kubernetes cluster. Our thoughts on this: Just don't do that ;]!
JFYI: I also just shared this at Azure/azure-functions-kafka-extension#185.
Wow, that's an amazing write-up and RCA, @amotl!
Dear Magnus, thank you for your comment, I appreciate it. Thanks also for just releasing librdkafka 1.5.2 GA, incorporating all the recent improvements in this regard and beyond. Keep up the spirit, and with kind regards,
Hi again, as I recognize it might not have become entirely obvious from my elaborations above, I wanted to add some more details here regarding the recommended configuration settings when running on Azure, as also outlined at Azure/azure-functions-kafka-extension#187. Microsoft published them to [1] in general, but I wanted to specifically address the configuration properties "socket.keepalive.enable" and "metadata.max.age.ms" here.
While these settings [1] are primarily dedicated to Azure Event Hubs, I believe they will also apply to all communication with vanilla Kafka server components, as the underlying networking infrastructure problem is the same. We are now running with these settings and are happy so far:
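The exact values quoted in this comment were lost in the scrape. As a sketch, under the assumption of the values Microsoft's Event Hubs guidance recommends, a librdkafka-style client configuration (shown here as the plain dict that e.g. confluent-kafka-python accepts) might look like this; the namespace name is hypothetical:

```python
# Sketch of an Azure-friendly librdkafka configuration, expressed as the
# dict that confluent-kafka-python's Producer/Consumer constructors take.
# The concrete values are assumptions based on Microsoft's Event Hubs
# guidance, not the exact figures from this thread.
azure_kafka_config = {
    "bootstrap.servers": "mynamespace.servicebus.windows.net:9093",  # hypothetical
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    # Send TCP keepalive probes so the Azure LB does not silently
    # drop the connection after ~4 minutes of idle time.
    "socket.keepalive.enable": True,
    # Refresh metadata well within the LB idle timeout, which also
    # exercises the broker connections periodically.
    "metadata.max.age.ms": 180000,
}

print(sorted(azure_kafka_config))
```

Passing such a dict to a client is enough; librdkafka applies the keepalive socket option internally, so no host-level changes are strictly required for this part.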
Thanks for listening, and with kind regards,
Hi again, I quickly wanted to share some more observations on this topic. On our Azure environment, we still saw some partitions occasionally stalling on the consumer side, even after applying all of the mitigations outlined above. However, after just moving on to …

Kudos to @edenhill, @mhowlett and all the people involved who added the recent improvements to librdkafka!

With kind regards,
@edenhill When is the probable date for the 1.8 release? Waiting for the "connections.max.idle.ms" change.
It is soak testing now; if all goes well, we'll release within a week.
@edenhill Is v1.8.0-RC2 the release candidate that is in soak testing?
@koushikchitta Yep, we'll be releasing it this week.
Just to be clear, I am interpreting these notes to be the following client configuration. I would love to know if there are any other recommended settings to help connect to Azure Event Hubs.

var config = new ClientConfig
{
    SaslMechanism = SaslMechanism.Plain,
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslUsername = credentials.Username,
    SaslPassword = credentials.Password,
    BootstrapServers = $"{broker}:9093",
    // According to this documentation, these two settings are also critical
    // for Azure environments:
    // https://github.com/confluentinc/librdkafka/issues/3109#issuecomment-714471123
    SocketKeepaliveEnable = true,
    MetadataMaxAgeMs = 30000,
    // Note: this needs to time out sooner than the Azure Event Hubs load balancer
    // timeout, to avoid unexpected disconnects from the server after long idle periods.
    // https://stackoverflow.com/questions/58010247/azure-eventhub-kafka-org-apache-kafka-common-errors-timeoutexception-for-some-of/58385324#58385324
    ConnectionsMaxIdleMs = ((60 * 4) - 30) * 1000 // 3m30s
};
Hi there,
first things first: Thanks for the tremendous amount of work you are putting into librdkafka, @edenhill. You know who you are.
Introduction
This is not meant to be a specific bug report, as we believe the issues we have been experiencing when using librdkafka to connect to Azure Event Hubs have already been mitigated in librdkafka 1.5.2 and newer. In fact, they might not have been specific to Azure Event Hubs anyway, but might also have tripped up others just running Apache Kafka or the Confluent stack on Azure in general.
Instead, we wanted to share our findings as a wrap-up and future reference for others running into similar issues. In that spirit, apologies for not completing the checklist. The issue can well be closed right away.
The topics span the area of Azure networking (problems) in general, as well as things related to Kafka and Kubernetes.
So, here we go.
General research
Azure LB closing idle network connections
The problem here is that Azure network load-balancing components silently drop idle network connections after 4 minutes.
The LB does not even bother to send RST packets to either communication partner, so client and server sockets will most probably try to reuse these dead connections.
In turn, services will hit individual socket timeouts, or the kernel will keep retransmitting with backoff for another 15+ minutes until it considers the connection dead.
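To illustrate the client-side mitigation for this behaviour, the following Python sketch enables per-socket TCP keepalives so that probes flow before the ~4-minute idle timeout; the interval values are illustrative assumptions, and librdkafka does the equivalent internally when socket.keepalive.enable is set:

```python
import socket

def make_keepalive_socket() -> socket.socket:
    """Create a TCP socket that sends keepalive probes before Azure's
    ~4-minute idle timeout can silently drop the connection.
    The concrete intervals below are illustrative, not from this thread."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs; guarded because they are not available everywhere.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Start probing after 2 minutes of idleness (< 4-minute LB timeout).
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Probe every 30 seconds once keepalives have started.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Give up and surface an error after 8 failed probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 8)
    return sock

sock = make_keepalive_socket()
```

With keepalives flowing, the LB sees periodic traffic and keeps the flow alive; without them, the first write after the silent drop is what triggers the long retransmission backoff described above.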
Quotes
Resources
Using Kafka and Event Hubs on Azure
Quotes
Resources
With kind regards,
Andreas.