[WIP][Scheduling] Add worker node failover with lineage #3307
+811
−182
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What do these changes do?
Currently, the supervisor will cancel the execution of the entire stage after receiving an error report that the execution of the subtask fails when the MainPool process exits in Mars. For traditional batch processing, just rerun is good. However, this is very unfriendly to the scenario of a large job, because it has many subtasks and takes a long time to run. Once it fails, it will be expensive to rerun. At the same time, large jobs generally require much more nodes, and the probability of corresponding node failures will also increase. Once a node failure causes the MainPool to exit, the data on the corresponding node will be lost, and subsequent dependencies execution will fail. For example, a job has been running for more than 20 hours, and the execution is 90% complete, but because a certain MainPool exits, the entire job will fail. Large jobs are relatively common in modern data processing. A job will take up 1200 nodes or more, and it will take about 40 hours.
In order to solve the node failure problem and ensure the stable and normal operation of jobs, a complete node failover solution is required.
Related issue number
Issue #3308
Check code requirements