Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[fix][broker] timeout when broker registry hangs and monitor broker registry (ExtensibleLoadManagerImpl only) #23382

Merged
merged 6 commits into from
Oct 3, 2024

Conversation

heesung-sn
Copy link
Contributor

@heesung-sn heesung-sn commented Oct 1, 2024

Motivation

  • metadata store could fail the broker lock registration. In this case, we should retry and also monitor the registry periodically and try to register it if not registered yet.
  • also we need to verify if the broker is healthy or not before starting the orphan bundle cleanups, as the broker might receive "out-dated" notifications.

Modifications

  • add monitor logic to periodically check if the broker is registered and to try to register it.
  • add timeout logic when the broker re-registration hangs.
  • add retry logic when the broker re-registration fails.
  • add health check logic before starting the orphan bundle cleanup jobs and canceling the cleanup jobs.
  • add brokerId in broker health admin api

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: heesung-sn#82

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Oct 1, 2024
@heesung-sn heesung-sn changed the title [fix][broker] timeout when broker registry hangs and monitor broker registry [fix][broker] timeout when broker registry hangs and monitor broker registry (ExtensibleLoadManagerImpl only) Oct 1, 2024
@lhotari lhotari added this to the 4.0.0 milestone Oct 1, 2024
@lhotari
Copy link
Member

lhotari commented Oct 1, 2024

@heesung-sn please add the release labels that this applies to.

@heesung-sn heesung-sn marked this pull request as draft October 1, 2024 22:05
@heesung-sn
Copy link
Contributor Author

I realized we need to add more validation upon this broker delete/create notifications as these notification could be "out-dated".

@lhotari
Copy link
Member

lhotari commented Oct 2, 2024

@heesung-sn I ran into a hanging test recently. Is this something that is related?

ZkSessionExpireTest.cleanup hangs in ModularLoadManagerImpl.disableBroker

"main" #1 [2383] prio=5 os_prio=0 cpu=101566.56ms elapsed=3520.41s tid=0x00007f3618030a30 nid=2383 waiting on condition  [0x00007f3621f54000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)
	- parking to wait for  <0x0000100041604dc8> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@21.0.4/LockSupport.java:221)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@21.0.4/CompletableFuture.java:1864)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@21.0.4/ForkJoinPool.java:3780)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@21.0.4/ForkJoinPool.java:3725)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@21.0.4/CompletableFuture.java:1898)
	at java.util.concurrent.CompletableFuture.join(java.base@21.0.4/CompletableFuture.java:2117)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.disableBroker(ModularLoadManagerImpl.java:603)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerWrapper.disableBroker(ModularLoadManagerWrapper.java:47)
	at org.apache.pulsar.broker.service.BrokerService.unloadNamespaceBundlesGracefully(BrokerService.java:949)
	at org.apache.pulsar.broker.service.BrokerService.unloadNamespaceBundlesGracefully(BrokerService.java:936)
	at org.apache.pulsar.broker.PulsarService.closeAsync(PulsarService.java:520)
	at org.apache.pulsar.broker.PulsarService.close(PulsarService.java:479)
	at org.apache.pulsar.broker.service.NetworkErrorTestBase.cleanup(NetworkErrorTestBase.java:215)
	at org.apache.pulsar.broker.service.ZkSessionExpireTest.cleanup(ZkSessionExpireTest.java:50)

Thread dump: https://gist.github.com/lhotari/68578ba1354d4833af489b9ac3658c8c
Analysed in jstack.review: https://jstack.review/?https://gist.github.com/lhotari/68578ba1354d4833af489b9ac3658c8c#tda_1_dump

I created #23388 and #23389 for ZkSessionExpireTest flakiness issues that appeared in subsequent builds.

@heesung-sn
Copy link
Contributor Author

@heesung-sn I ran into a hanging test recently. Is this something that is related?

Potentially, I think this could be related, regarding why the broker registry lock creation didn't go through. Unfortunately, we couldn't confirm this because we haven't put error log when creating the lock.

I see a potential deadlock report in NativeReferenceQueue$Lock. I wonder if java 21 has any bugs there.
Screen Shot 2024-10-02 at 4 19 30 AM

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 89.89899% with 10 lines in your changes missing coverage. Please review.

Project coverage is 74.54%. Comparing base (bbc6224) to head (051295d).
Report is 625 commits behind head on master.

Files with missing lines Patch % Lines
...xtensions/channel/ServiceUnitStateChannelImpl.java 84.78% 5 Missing and 2 partials ⚠️
...ker/loadbalance/extensions/BrokerRegistryImpl.java 94.11% 1 Missing and 1 partial ⚠️
...ache/pulsar/client/admin/internal/BrokersImpl.java 85.71% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #23382      +/-   ##
============================================
+ Coverage     73.57%   74.54%   +0.97%     
- Complexity    32624    34015    +1391     
============================================
  Files          1877     1936      +59     
  Lines        139502   145365    +5863     
  Branches      15299    15893     +594     
============================================
+ Hits         102638   108366    +5728     
+ Misses        28908    28681     -227     
- Partials       7956     8318     +362     
Flag Coverage Δ
inttests 27.49% <24.24%> (+2.91%) ⬆️
systests 24.48% <0.00%> (+0.16%) ⬆️
unittests 73.88% <89.89%> (+1.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...g/apache/pulsar/broker/admin/impl/BrokersBase.java 73.02% <100.00%> (+0.57%) ⬆️
...dbalance/extensions/ExtensibleLoadManagerImpl.java 79.25% <100.00%> (-0.83%) ⬇️
...ache/pulsar/client/admin/internal/BrokersImpl.java 88.23% <85.71%> (+0.73%) ⬆️
...ker/loadbalance/extensions/BrokerRegistryImpl.java 84.21% <94.11%> (-0.56%) ⬇️
...xtensions/channel/ServiceUnitStateChannelImpl.java 85.35% <84.78%> (+0.04%) ⬆️

... and 601 files with indirect coverage changes

@BewareMyPower BewareMyPower dismissed their stale review October 3, 2024 06:41

Comments resolved

@heesung-sn heesung-sn merged commit eee9283 into apache:master Oct 3, 2024
51 checks passed
@lhotari
Copy link
Member

lhotari commented Oct 4, 2024

@heesung-sn Cherry-picking this to branch-3.0 doesn't get applied cleanly. Would you be able to handle cherry-picking the required changes to branch-3.0?

heesung-sn added a commit to heesung-sn/pulsar that referenced this pull request Oct 23, 2024
…egistry (ExtensibleLoadManagerImpl only) (apache#23382)

(cherry picked from commit eee9283)
heesung-sn added a commit to heesung-sn/pulsar that referenced this pull request Oct 23, 2024
…egistry (ExtensibleLoadManagerImpl only) (apache#23382)

(cherry picked from commit eee9283)
heesung-sn added a commit that referenced this pull request Oct 25, 2024
hanmz pushed a commit to hanmz/pulsar that referenced this pull request Feb 12, 2025
# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants