-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Improve median performance. #6837
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
let state = | ||
ScalarValue::new_list(Some(self.all_values.clone()), self.data_type.clone()); | ||
let all_values = to_scalar_values(&self.batches)?; | ||
let state = ScalarValue::new_list(Some(all_values), self.data_type.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can change ScalarValue::List
to use ArrayRef
instead of Vec<ScalarValue>
internally.
This would avoid quite some expensive conversions from/to ScalarValue
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that would help a lot, maybe we can add a ScalarValue::Array
variant to ScalarValue
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe @Dandandan was suggesting
impl ScalarValue {
...
List(ArrayRef)
...
}
In general this would work well with the approach @tustvold is working on upstream in arrow-rs with Datum
-- apache/arrow-rs#4393
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's what I was trying to suggest - this would avoid the need to convert / allocate to individual ScalarValue
s and convert to Array
later on again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would work well, the reason I was suggesting adding an Array
variant instead of changing List
was to avoid changing all the code that depends on List(Option<Vec<ScalarValue>>, FieldRef)
, but yes long term that would probably be best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this some more, I think the most performant thing to do will be to implement a native GroupsAccumulator
(aka #6800 ) for median. With sufficient effort we could make median be very fast -- so I think this is a good improvement for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @vincev -- this looks great -- I will try and review this tomorrow
let state = | ||
ScalarValue::new_list(Some(self.all_values.clone()), self.data_type.clone()); | ||
let all_values = to_scalar_values(&self.batches)?; | ||
let state = ScalarValue::new_list(Some(all_values), self.data_type.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe @Dandandan was suggesting
impl ScalarValue {
...
List(ArrayRef)
...
}
In general this would work well with the approach @tustvold is working on upstream in arrow-rs with Datum
-- apache/arrow-rs#4393
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the code carefully -- thank you @vincev
I think the code could be merged in as is, but I also left some comments which I think would help.
let state = | ||
ScalarValue::new_list(Some(self.all_values.clone()), self.data_type.clone()); | ||
let all_values = to_scalar_values(&self.batches)?; | ||
let state = ScalarValue::new_list(Some(all_values), self.data_type.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this some more, I think the most performant thing to do will be to implement a native GroupsAccumulator
(aka #6800 ) for median. With sufficient effort we could make median be very fast -- so I think this is a good improvement for now
all_values: Vec<ScalarValue>, | ||
} | ||
|
||
fn to_scalar_values(arrays: &[ArrayRef]) -> Result<Vec<ScalarValue>> { | ||
let num_values: usize = arrays.iter().map(|a| a.len()).sum(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for computing the capacity up front
Thanks @vincev |
Thank you @alamb, @Dandandan for your feedback and review. |
* Improve median performance. * Fix formatting. * Review feedback * Renamed arrays size.
* Improve median performance. * Fix formatting. * Review feedback * Renamed arrays size.
Which issue does this PR close?
Related to discussion in #4973.
Rationale for this change
This PR improves the median aggregator by reducing the number of allocations.
For this example I am using one year of data from the nyctaxi dataset, running a group by
payment_type
and computing thetotal_amount
median, with the current version it takes ~22secs:with the changes introduced in this PR we get the same result in ~2secs:
What changes are included in this PR?
Reduce number of allocations.
Are these changes tested?
Run tests locally they all passed.
Are there any user-facing changes?