-
Notifications
You must be signed in to change notification settings - Fork 3.7k
ARROW-10136: [Rust]: Fix null handling in StringArray and BinaryArray filtering, add BinaryArray::from_opt_vec #8303
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
@@ -353,15 +353,19 @@ impl FilterContext { | |||
// foreach bit in batch: | |||
if (filter_batch & self.filter_mask[j]) != 0 { | |||
let data_index = (i * 64) + j; | |||
values.push(input_array.value(data_index)); | |||
if input_array.is_null(data_index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same pattern as in the handling for primative array: https://github.com/apache/arrow/pull/8303/files#diff-d7b0b7cde1850e8744ceda458c6dea81R294-L298
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this is the code for BinaryArray that @nevi-me referred to in #8303 (comment))
@@ -373,7 +377,11 @@ impl FilterContext { | |||
// foreach bit in batch: | |||
if (filter_batch & self.filter_mask[j]) != 0 { | |||
let data_index = (i * 64) + j; | |||
values.push(input_array.value(data_index)); | |||
if input_array.is_null(data_index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, this special case appears to miss the null check too
} | ||
} | ||
} | ||
Ok(Arc::new(BinaryArray::from(values))) | ||
} | ||
DataType::Utf8 => { | ||
let input_array = array.as_any().downcast_ref::<StringArray>().unwrap(); | ||
let mut values: Vec<&str> = Vec::with_capacity(self.filtered_count); | ||
let mut values: Vec<Option<&str>> = Vec::with_capacity(self.filtered_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note using an Option
is likely to increase the temporary storage requirements a bit.
It would likely be possible to avoid this allocation entirely if we used the lower level ArrayBuilder::with_bit_buffer
.
I chose to follow the style of the rest of this module, though I would love opinions on trying to perf check this / optimize it (maybe a follow on JIRA ticket is enough)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't an Option
of a reference leverages null pointer optimization (on which None
is represented by the null pointer, e.g. rust-lang/rust#9378)?
IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't an Option of a reference leverages null pointer optimization (on which None is represented by the null pointer, e.g. rust-lang/rust#9378)?
Yes, I believe you are correct.
This program:
fn main() {
println!("The size of a &str is {}", std::mem::size_of::<&str>());
println!("The size of an Option<&str> is {}", std::mem::size_of::<Option<&str>>());
}
Produces the following on my machine:
The size of a &str is 16
The size of an Option<&str> is 16
Thanks @alamb, this existed as https://issues.apache.org/jira/browse/ARROW-5352; so I'll close that out. The same behaviour would occur with a And yes, the issue didn't affect primitive arrays, I think it was because we were pushing an empty string where we should have pushed a null slot on the null bitmap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot, @alamb . Looks good so far. I left some minor comments.
} | ||
} | ||
} | ||
Ok(Arc::new(BinaryArray::from(values))) | ||
} | ||
DataType::Utf8 => { | ||
let input_array = array.as_any().downcast_ref::<StringArray>().unwrap(); | ||
let mut values: Vec<&str> = Vec::with_capacity(self.filtered_count); | ||
let mut values: Vec<Option<&str>> = Vec::with_capacity(self.filtered_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't an Option
of a reference leverages null pointer optimization (on which None
is represented by the null pointer, e.g. rust-lang/rust#9378)?
IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.
@@ -1267,6 +1267,36 @@ impl<OffsetSize: OffsetSizeTrait> GenericBinaryArray<OffsetSize> { | |||
GenericBinaryArray::<OffsetSize>::from(array_data) | |||
} | |||
|
|||
fn from_opt_vec(v: Vec<Option<&[u8]>>, data_type: DataType) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this code tested somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is tested (indirectly) in https://github.com/apache/arrow/pull/8303/files#diff-d7b0b7cde1850e8744ceda458c6dea81R700 -- but I think a more specific test would be valuable. I will add one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This turned out to be a great call @jorgecarleitao -- I found a bug in this implementation while writing a test. Thank you for the suggestion. 💯
let c = filter(&a, &b).unwrap(); | ||
let d = c.as_ref().as_any().downcast_ref::<StringArray>().unwrap(); | ||
assert_eq!(2, d.len()); | ||
assert_eq!("hello", d.value(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest that we test the 3 quantities: d.is_null(0)
, d.value(0)
, d.is_null(1)
. Alternatively,
let expected = StringArray::from(vec![Some("hello"), None]);
assert_eq!(d, expected);
let c = filter(&a, &b).unwrap(); | ||
let d = c.as_ref().as_any().downcast_ref::<BinaryArray>().unwrap(); | ||
assert_eq!(2, d.len()); | ||
assert_eq!(b"hello", d.value(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
Thanks @nevi-me -- yes I updated the code tried to handle both types in the same way (and including tests) |
@jorgecarleitao I believe you are correct: This program:
Produces the following on my machine:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorgecarleitao you said:
IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.
Can you perhaps let me know what you mean by this? Perhaps there is an example in the code you are thinking of?
@@ -1267,6 +1267,36 @@ impl<OffsetSize: OffsetSizeTrait> GenericBinaryArray<OffsetSize> { | |||
GenericBinaryArray::<OffsetSize>::from(array_data) | |||
} | |||
|
|||
fn from_opt_vec(v: Vec<Option<&[u8]>>, data_type: DataType) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is tested (indirectly) in https://github.com/apache/arrow/pull/8303/files#diff-d7b0b7cde1850e8744ceda458c6dea81R700 -- but I think a more specific test would be valuable. I will add one.
} | ||
} | ||
} | ||
Ok(Arc::new(BinaryArray::from(values))) | ||
} | ||
DataType::Utf8 => { | ||
let input_array = array.as_any().downcast_ref::<StringArray>().unwrap(); | ||
let mut values: Vec<&str> = Vec::with_capacity(self.filtered_count); | ||
let mut values: Vec<Option<&str>> = Vec::with_capacity(self.filtered_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't an Option of a reference leverages null pointer optimization (on which None is represented by the null pointer, e.g. rust-lang/rust#9378)?
Yes, I believe you are correct.
This program:
fn main() {
println!("The size of a &str is {}", std::mem::size_of::<&str>());
println!("The size of an Option<&str> is {}", std::mem::size_of::<Option<&str>>());
}
Produces the following on my machine:
The size of a &str is 16
The size of an Option<&str> is 16
I think I have addressed all your (very helpful) comments @jorgecarleitao . |
I am really, sorry, @alamb , I should have offered more context in the first place. :/ This in no way blocks this PR: IMO it is ready to merge if the relevant tests pass. What I meant is that this code currently:
it may be more efficient to create the buffers during the iteration, so that we avoid the copy (Vec -> buffers). In other words, the code in (as a side note, this is why I am proposing #8211 : IMO there is some boiler-plate copy-pasting to
which will continue to grow as we add more kernels, and whose pattern seems to be a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, really nice additions.
@jorgecarleitao -- yes thank you that makes a lot of sense. I have filed https://issues.apache.org/jira/browse/ARROW-10141 to track that |
@andygrove / @jorgecarleitao / @nevi-me I wonder if this PR might be merged anytime soon (I have a downstream project relying on this change) The integration test failure https://github.com/apache/arrow/pull/8303/checks?check_run_id=1187275161 seems due to a network failure (not anything with this PR):
|
I'll rebase to try and get a clean test run |
53f04f0
to
b83d867
Compare
All tests are passing now. 🎉 |
… filtering, add BinaryArray::from_opt_vec When I use the `filter` kernel with Null strings, any input column that was Null turns into an empty string after filtering. ``` "foo" "bar" NULL ``` And the filter ``` true true true ``` Will result in ``` "foo" "bar" "" ``` Rather than ``` "foo" "bar" NULL ``` It appears to work fine for primitive arrays (I'll comment inline). I also added `BinaryArray::from_opt_vec` following the model of `PrimativeArray` and `StringArray` mostly so I could write a test. Closes apache#8303 from alamb/alamb/ARROW-10136-null-filter Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Andy Grove <andygrove73@gmail.com>
… filtering, add BinaryArray::from_opt_vec When I use the `filter` kernel with Null strings, any input column that was Null turns into an empty string after filtering. ``` "foo" "bar" NULL ``` And the filter ``` true true true ``` Will result in ``` "foo" "bar" "" ``` Rather than ``` "foo" "bar" NULL ``` It appears to work fine for primitive arrays (I'll comment inline). I also added `BinaryArray::from_opt_vec` following the model of `PrimativeArray` and `StringArray` mostly so I could write a test. Closes apache#8303 from alamb/alamb/ARROW-10136-null-filter Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Andy Grove <andygrove73@gmail.com>
When I use the
filter
kernel with Null strings, any input column that was Null turns into an empty string after filtering.And the filter
Will result in
Rather than
It appears to work fine for primitive arrays (I'll comment inline). I also added
BinaryArray::from_opt_vec
following the model ofPrimativeArray
andStringArray
mostly so I could write a test.