Improve performance of Sort for the common single batch use case #10572

Merged
revans2 merged 2 commits into NVIDIA:branch-24.04 from sort_better_expr
Mar 13, 2024

@revans2 revans2 commented Mar 12, 2024

This fixes #10570

The speedup is modest, but it is measurable.

I ran the following query both with and without this patch:

spark.time(spark.range(0, 100000000L, 1, 12).selectExpr("id as oc2", "id DIV 10 as oc3", "CAST(id * 10 AS STRING) as oc", "CAST(id % 4 AS STRING) as pc", "id").selectExpr("*", "row_number() over (PARTITION BY pc ORDER BY oc, oc2, oc3) as rn", "max(id) OVER (PARTITION BY pc ORDER BY oc, oc2, oc3 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as m").orderBy(desc("oc")).show())

For each run I captured the sort time metric and the total run time of the query.

With this patch, on my desktop, the median run time was 6968 ms, and from the Spark UI the median sort time was 5.5 seconds.
Without this patch the median run time was 7051 ms and the sort time was 5.9 seconds. That is a saving of 83 ms in total run time (about 1%, which is not much), but the sort time metric dropped by about 0.4 seconds, or roughly 6 to 7%, which is a lot better.
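
As a quick check of the arithmetic above, here is a small sketch that recomputes the relative savings from the quoted medians. The object and method names are made up for illustration, and the sort times are the already-rounded values from the Spark UI, so the sort percentage is only approximate.

```scala
object SortSavings {
  // Relative saving, in percent, when going from `before` to `after`.
  def savedPct(before: Double, after: Double): Double =
    (before - after) / before * 100.0

  def main(args: Array[String]): Unit = {
    // Median total run time in ms, as quoted in the description.
    val totalPct = savedPct(before = 7051.0, after = 6968.0)
    // Median sort time metric in ms (5.9 s and 5.5 s, already rounded).
    val sortPct = savedPct(before = 5900.0, after = 5500.0)
    println(f"total run time saved: $totalPct%.1f%%")
    println(f"sort time saved:      $sortPct%.1f%%")
  }
}
```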

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
revans2 commented Mar 12, 2024

build

@sameerz sameerz added the performance A performance related task/issue label Mar 12, 2024
Comment on lines +192 to +200
val spillableIter = iter.flatMap { cb =>
  // Filter out empty batches and make them spillable
  if (cb.numRows() > 0) {
    Some(SpillableColumnarBatch(cb, SpillPriorities.ACTIVE_ON_DECK_PRIORITY))
  } else {
    cb.close()
    None
  }
}
Collaborator

If a single batch is the common case, this does not seem to matter, but if we needed to avoid allocating the intermediate Option we could use an explicit PartialFunction instance:

Suggested change
val spillableIter = iter.flatMap { cb =>
  // Filter out empty batches and make them spillable
  if (cb.numRows() > 0) {
    Some(SpillableColumnarBatch(cb, SpillPriorities.ACTIVE_ON_DECK_PRIORITY))
  } else {
    cb.close()
    None
  }
}

val spillableIter = iter.collect {
  // Filter out empty batches and make them spillable
  new PartialFunction[ColumnarBatch, SpillableColumnarBatch] {
    override def isDefinedAt(cb: ColumnarBatch): Boolean = if (cb.numRows() > 0) {
      true
    } else {
      cb.close()
      false
    }
    override def apply(cb: ColumnarBatch): SpillableColumnarBatch =
      SpillableColumnarBatch(cb, SpillPriorities.ACTIVE_ON_DECK_PRIORITY)
  }
}
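
To make the comparison concrete, here is a minimal, self-contained sketch of the two filtering styles using only the Scala standard library. `DemoBatch` and `EmptyBatchFilter` are hypothetical stand-ins, not plugin classes, and the wrapping into `SpillableColumnarBatch` is left out. The `collect` variant closes the batch inside `isDefinedAt`, so it assumes `Iterator.collect` evaluates `isDefinedAt` once per element.

```scala
// Hypothetical stand-in for ColumnarBatch; not part of the plugin.
final class DemoBatch(val numRows: Int) extends AutoCloseable {
  var closed = false
  override def close(): Unit = { closed = true }
}

object EmptyBatchFilter {
  // Style 1: flatMap, allocating an Option per element.
  def viaFlatMap(iter: Iterator[DemoBatch]): Iterator[DemoBatch] =
    iter.flatMap { b =>
      if (b.numRows > 0) {
        Some(b)
      } else {
        b.close()
        None
      }
    }

  // Style 2: collect with an explicit PartialFunction, no Option.
  // Closing in isDefinedAt is a side effect; see the assumption above.
  def viaCollect(iter: Iterator[DemoBatch]): Iterator[DemoBatch] =
    iter.collect(new PartialFunction[DemoBatch, DemoBatch] {
      override def isDefinedAt(b: DemoBatch): Boolean =
        if (b.numRows > 0) {
          true
        } else {
          b.close()
          false
        }
      override def apply(b: DemoBatch): DemoBatch = b
    })

  def main(args: Array[String]): Unit = {
    val sizes = List(3, 0, 5, 0)
    val a = viaFlatMap(sizes.iterator.map(new DemoBatch(_))).map(_.numRows).toList
    val b = viaCollect(sizes.iterator.map(new DemoBatch(_))).map(_.numRows).toList
    println(s"flatMap kept $a, collect kept $b")
  }
}
```

Both variants are lazy, so empty batches are only closed as the resulting iterator is consumed.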

Collaborator Author

I'll keep it in mind, but I'm not sure it matters that much here.

if (cb.numRows() > 0) {
  Some(SpillableColumnarBatch(cb, SpillPriorities.ACTIVE_ON_DECK_PRIORITY))
} else {
  cb.close()
Collaborator

What if this throws?
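
As a generic illustration of the concern (not necessarily how the plugin handles it), here is a minimal sketch of what could go wrong if `close()` throws while batches are being filtered: anything already kept or still pending in the iterator would leak unless it is closed on the failure path. `FlakyBatch` and `SafeDrain` are hypothetical names, and this version drains the iterator eagerly, unlike the lazy `flatMap` in the diff.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for a batch whose close() may fail.
final class FlakyBatch(val numRows: Int, failOnClose: Boolean = false)
    extends AutoCloseable {
  var closed = false
  override def close(): Unit = {
    if (failOnClose) throw new IllegalStateException("close failed")
    closed = true
  }
}

object SafeDrain {
  // Keeps non-empty batches. If closing an empty batch throws, closes
  // everything already kept and everything still pending in the iterator,
  // records secondary failures as suppressed, then rethrows the original.
  def keepNonEmpty(iter: Iterator[FlakyBatch]): List[FlakyBatch] = {
    val kept = ListBuffer.empty[FlakyBatch]
    try {
      iter.foreach { b =>
        if (b.numRows > 0) kept += b else b.close()
      }
      kept.toList
    } catch {
      case t: Throwable =>
        (kept.iterator ++ iter).foreach { b =>
          try b.close() catch { case s: Throwable => t.addSuppressed(s) }
        }
        throw t
    }
  }
}
```

With this shape, a failing `close()` on one empty batch still propagates to the caller, but the surviving batches are released first.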

 */
private final def firstPassReadBatches(scb: SpillableColumnarBatch): Unit = {
  splitOneSortedBatch(scb)
  while(alreadySortedIter.hasNext) {
Collaborator

nit: space

Suggested change
while(alreadySortedIter.hasNext) {
while (alreadySortedIter.hasNext) {

revans2 commented Mar 13, 2024

build

@revans2 revans2 merged commit 9105fd7 into NVIDIA:branch-24.04 Mar 13, 2024
43 checks passed
@revans2 revans2 deleted the sort_better_expr branch March 13, 2024 19:36
Successfully merging this pull request may close these issues.

[FEA] See if we can optimize sort for a single batch
4 participants