Flaky-test: BrokerRegistryIntegrationTest.testRecoverFromNodeDeletion #23365

Closed
1 of 2 tasks
lhotari opened this issue Sep 28, 2024 · 1 comment · Fixed by #23371

@lhotari
Member

lhotari commented Sep 28, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Example failure

https://github.com/apache/pulsar/actions/runs/11074101345/job/30796634094?pr=23362#step:11:1686

Exception stacktrace

  Error:  org.apache.pulsar.broker.loadbalance.extensions.BrokerRegistryIntegrationTest.testRecoverFromNodeDeletion  Time elapsed: 5.037 s  <<< FAILURE!
  org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a org.apache.pulsar.broker.loadbalance.extensions.BrokerRegistryIntegrationTest lists don't have the same size expected [1] but found [0] within 3 seconds.
  	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
  	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
  	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
  	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:985)
  	at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:769)
  	at org.apache.pulsar.broker.loadbalance.extensions.BrokerRegistryIntegrationTest.testRecoverFromNodeDeletion(BrokerRegistryIntegrationTest.java:78)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
  	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
  	at org.testng.internal.invokers.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:47)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:76)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  	at java.base/java.lang.Thread.run(Thread.java:840)
  Caused by: java.lang.AssertionError: lists don't have the same size expected [1] but found [0]
  	at org.testng.Assert.fail(Assert.java:110)
  	at org.testng.Assert.failNotEquals(Assert.java:1577)
  	at org.testng.Assert.assertEqualsImpl(Assert.java:149)
  	at org.testng.Assert.assertEquals(Assert.java:131)
  	at org.testng.Assert.assertEquals(Assert.java:1418)
  	at org.testng.Assert.assertEquals(Assert.java:1382)
  	at org.testng.Assert.assertEquals(Assert.java:1629)
  	at org.testng.Assert.assertEquals(Assert.java:1605)
  	at org.apache.pulsar.broker.loadbalance.extensions.BrokerRegistryIntegrationTest.lambda$testRecoverFromNodeDeletion$1(BrokerRegistryIntegrationTest.java:78)
  	at org.awaitility.core.AssertionCondition.lambda$new$0(AssertionCondition.java:53)
  	at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:248)
  	at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:235)
  	... 4 more

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@BewareMyPower
Contributor

The direct cause is that, since #23349, ServiceUnitStateTableViewImpl#flush calls TableView#refresh to refresh the internal cache.

However, TableViewImpl has a bug: a refresh on an empty topic gets stuck. The bug can be reproduced with the following patch:

diff --git a/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/TableViewTest.java b/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/TableViewTest.java
index 61ab4de8a3..5448751160 100644
--- a/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/TableViewTest.java
+++ b/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/TableViewTest.java
@@ -173,6 +173,9 @@ public class TableViewTest extends MockedPulsarServiceBaseTest {
         TableView<byte[]> tv = pulsarClient.newTableView(Schema.BYTES)
                 .topic(topic)
                 .create();
+        // Verify refresh can handle the case when the topic is empty
+        tv.refreshAsync().get(3, TimeUnit.SECONDS);
+
         // 2. Add a listen action to provide the test environment.
         // The listen action will be triggered when there are incoming messages every time.
         // This is a sync operation, so sleep in the listen action can slow down the reading rate of messages.

Another possible cause is that BrokerRegistryImpl#registerAsync is called on the load manager thread when it detects that the node has been deleted. That thread could itself be blocked by blocking calls such as the flush method above.
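
To illustrate the second point, here is a minimal sketch of how one blocking call can starve a task queued behind it on the same thread. It is not Pulsar code: the single-threaded executor standing in for the load manager thread, the `stuckRefresh` future, and the class and method names are all hypothetical, chosen only to mirror the scenario described above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockedThreadSketch {

    /**
     * Returns true if a task queued behind a blocking call on a
     * single-threaded executor does not run within the given timeout.
     */
    public static boolean isStarved(long timeoutMillis) throws Exception {
        // Hypothetical stand-in for the load manager thread.
        ExecutorService loadManagerThread = Executors.newSingleThreadExecutor();
        // Hypothetical stand-in for a refresh that never completes.
        CompletableFuture<Void> stuckRefresh = new CompletableFuture<>();
        try {
            // First task: a blocking "flush" that waits on the stuck refresh.
            loadManagerThread.submit(() -> {
                stuckRefresh.join();
            });
            // Second task: the "re-registration", queued behind the flush.
            Future<String> register = loadManagerThread.submit(() -> "registered");
            try {
                register.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return false; // the registration ran in time
            } catch (TimeoutException e) {
                return true; // the registration never got the thread
            }
        } finally {
            // Release the blocked task so the executor can shut down cleanly.
            stuckRefresh.complete(null);
            loadManagerThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("registration starved: " + isStarved(200));
    }
}
```

If this is what happens in the test, the broker would not be re-registered within Awaitility's 3 second window, which would match the `expected [1] but found [0]` assertion failure in the stack trace.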
