[SPARK-35011][CORE] Fix false active executor in UI that caused by BlockManager reregistration

### What changes were proposed in this pull request?

Also post the event `SparkListenerExecutorRemoved` when removing an executor that is known to `BlockManagerMaster` but unknown to `SchedulerBackend`.

### Why are the changes needed?

#32114 reported that `BlockManagerMaster` could register a `BlockManager` from a dead executor because of the reregistration mechanism. As a side effect, the executor is shown on the UI as active even though it is already dead.

In #32114, we tried to avoid such reregistration for an executor that is about to die. However, I realized that we can leave the reregistration alone, since `HeartbeatReceiver.expireDeadHosts` should clean up those `BlockManager`s in the end. The problem is that the corresponding executors in the UI are not cleaned up along with the `BlockManager`s: executors in the UI can only be removed by `SparkListenerExecutorRemoved`, while `BlockManager` cleanup only posts `SparkListenerBlockManagerRemoved` (which is ignored by `AppStatusListener`).
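
The mismatch described above can be illustrated with a small toy model. This is only a sketch: `Event`, `BlockManagerRemoved`, `ExecutorRemoved`, `ToyStatusListener`, and `ToyDemo` are hypothetical stand-ins invented for illustration, not Spark's real classes, and the listener only mimics the behaviour attributed to `AppStatusListener` in this description.

```scala
// Toy model only: these types mimic the event names used in the description.
// They are NOT Spark's real classes.
sealed trait Event
final case class BlockManagerRemoved(executorId: String) extends Event
final case class ExecutorRemoved(executorId: String, reason: String) extends Event

// Hypothetical stand-in for the UI-side listener. It tracks "active" executors
// and, as described above, ignores block manager removal events.
final class ToyStatusListener(initial: Set[String]) {
  private var active: Set[String] = initial

  def onEvent(event: Event): Unit = event match {
    case ExecutorRemoved(id, _) => active -= id // only this event clears the executor
    case BlockManagerRemoved(_) => ()           // ignored, so the executor stays listed
  }

  def activeExecutors: Set[String] = active
}

object ToyDemo {
  // Simulates cleaning up executor "2"; optionally also posts ExecutorRemoved,
  // which is the extra event this change adds.
  def run(postExecutorRemoved: Boolean): Set[String] = {
    val listener = new ToyStatusListener(Set("1", "2"))
    listener.onEvent(BlockManagerRemoved("2"))
    if (postExecutorRemoved) listener.onEvent(ExecutorRemoved("2", "lost"))
    listener.activeExecutors
  }
}
```

In this model, `ToyDemo.run(postExecutorRemoved = false)` still lists executor "2" as active, while `ToyDemo.run(postExecutorRemoved = true)` removes it, which is the effect this change aims for in the real UI.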

### Does this PR introduce _any_ user-facing change?

Yes, users will see the false active executor removed from the UI eventually.

### How was this patch tested?

Pass existing tests.

Closes #34536 from Ngone51/SPARK-35011.

Lead-authored-by: wuyi <yi.wu@databricks.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Ngone51 authored and dongjoon-hyun committed Nov 12, 2021
1 parent 538eb96 commit 5475088
Showing 1 changed file with 8 additions and 0 deletions.
@@ -438,6 +438,14 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
// about the executor, but the scheduler will not. Therefore, we should remove the
// executor from the block manager when we hit this case.
scheduler.sc.env.blockManager.master.removeExecutorAsync(executorId)
// SPARK-35011: If we reach this code path, the executor has already been
// removed from the scheduler backend, but the block manager master may
// still know about it. In this case, removing the executor from the block manager
// master only posts the event `SparkListenerBlockManagerRemoved`, which is unfortunately
// ignored by `AppStatusListener`. As a result, the executor would be shown on the UI
// forever. Therefore, we should also post `SparkListenerExecutorRemoved` here.
listenerBus.post(SparkListenerExecutorRemoved(
System.currentTimeMillis(), executorId, reason.toString))
logInfo(s"Asked to remove non-existent executor $executorId")
}
}