From 54750887d8ba46abb379b576c539d25a17429f24 Mon Sep 17 00:00:00 2001
From: wuyi
Date: Thu, 11 Nov 2021 16:18:38 -0800
Subject: [PATCH] [SPARK-35011][CORE] Fix false active executor in UI that caused by BlockManager reregistration

### What changes were proposed in this pull request?

Also post the event `SparkListenerExecutorRemoved` when removing an executor that is known to `BlockManagerMaster` but unknown to `SchedulerBackend`.

### Why are the changes needed?

https://github.com/apache/spark/pull/32114 reports that `BlockManagerMaster` could register a `BlockManager` from a dead executor because of the reregistration mechanism. The side effect is that the executor is shown on the UI as active even though it is already dead.

In https://github.com/apache/spark/pull/32114, we tried to avoid such reregistration for a to-be-dead executor. However, I just realized that we can actually leave such reregistration alone, since `HeartbeatReceiver.expireDeadHosts` should clean up those `BlockManager`s in the end. The problem is that the corresponding executors in the UI are not cleaned up along with the `BlockManager`s: executors in the UI can only be removed by `SparkListenerExecutorRemoved`, while `BlockManager` cleanup only posts `SparkListenerBlockManagerRemoved` (which is ignored by `AppStatusListener`).

### Does this PR introduce _any_ user-facing change?

Yes, users will see the false active executor removed from the UI in the end.

### How was this patch tested?

Pass existing tests.

Closes #34536 from Ngone51/SPARK-35011.

Lead-authored-by: wuyi
Co-authored-by: yi.wu
Signed-off-by: Dongjoon Hyun
---
 .../scheduler/cluster/CoarseGrainedSchedulerBackend.scala | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
index b40eee3695673..326ea833eeaf4 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
@@ -438,6 +438,14 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
           // about the executor, but the scheduler will not. Therefore, we should remove the
           // executor from the block manager when we hit this case.
           scheduler.sc.env.blockManager.master.removeExecutorAsync(executorId)
+          // SPARK-35011: If we reach this code path, which means the executor has been
+          // already removed from the scheduler backend but the block manager master may
+          // still know it. In this case, removing the executor from block manager master
+          // would only post the event `SparkListenerBlockManagerRemoved`, which is unfortunately
+          // ignored by `AppStatusListener`. As a result, the executor would be shown on the UI
+          // forever. Therefore, we should also post `SparkListenerExecutorRemoved` here.
+          listenerBus.post(SparkListenerExecutorRemoved(
+            System.currentTimeMillis(), executorId, reason.toString))
           logInfo(s"Asked to remove non-existent executor $executorId")
       }
     }
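
The reasoning in the commit message can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `ListenerSketch`, `UiExecutorTracker`, `replay`, and the simplified event case classes are hypothetical names invented for illustration and are not Spark's real classes. The sketch only mirrors the behaviour the commit message describes, namely that the UI-side listener drops an executor on an executor-removed event and ignores a block-manager-removed event.

```scala
// Hypothetical, simplified model; none of these types are Spark's real classes.
object ListenerSketch {
  sealed trait Event
  final case class ExecutorAdded(executorId: String) extends Event
  final case class ExecutorRemoved(time: Long, executorId: String, reason: String) extends Event
  final case class BlockManagerRemoved(time: Long, executorId: String) extends Event

  // Stand-in for the UI-side listener: tracks which executors look "active".
  final class UiExecutorTracker {
    private var active = Set.empty[String]

    def onEvent(event: Event): Unit = event match {
      case ExecutorAdded(id)         => active += id
      case ExecutorRemoved(_, id, _) => active -= id
      // Assumption taken from the commit message: block manager removal
      // does not change the executor list shown in the UI.
      case _: BlockManagerRemoved    => ()
    }

    def activeExecutors: Set[String] = active
  }

  // Feeds the events to a fresh tracker and returns the executors it still shows.
  def replay(events: Seq[Event]): Set[String] = {
    val tracker = new UiExecutorTracker
    events.foreach(tracker.onEvent)
    tracker.activeExecutors
  }
}
```

In this model, replaying `ExecutorAdded("1")` followed only by `BlockManagerRemoved(0L, "1")` leaves executor `1` in the active set, which corresponds to the false active executor. Appending `ExecutorRemoved(0L, "1", "lost")`, as the patch does through `listenerBus.post`, empties the set.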