ARROW-10136: [Rust]: Fix null handling in StringArray and BinaryArray filtering, add BinaryArray::from_opt_vec #8303

Closed
wants to merge 3 commits into from

@alamb alamb commented Sep 29, 2020

When I use the filter kernel on a string array that contains nulls, any input value that was NULL turns into an empty string after filtering. For example, the input

"foo"
"bar"
NULL

And the filter

true
true
true

Will result in

"foo"
"bar"
""

Rather than

"foo"
"bar"
NULL

It appears to work fine for primitive arrays (I'll comment inline). I also added BinaryArray::from_opt_vec, following the model of PrimitiveArray and StringArray, mostly so I could write a test.
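
A minimal sketch of the reported behaviour, using a plain `Vec<Option<&str>>` as a stand-in for a nullable string column rather than the real Arrow `StringArray` API. The function names `filter_lossy` and `filter_preserving_nulls` are hypothetical; they only illustrate the difference between pushing the raw value for every selected slot and carrying the null through.

```rust
/// Stand-in for the buggy path (hypothetical helper, not Arrow API):
/// every selected slot is pushed as a plain string, so a null slot
/// degrades to the empty string.
fn filter_lossy<'a>(column: &[Option<&'a str>], mask: &[bool]) -> Vec<Option<&'a str>> {
    let mut out = Vec::new();
    for (value, keep) in column.iter().zip(mask.iter()) {
        if *keep {
            out.push(Some((*value).unwrap_or("")));
        }
    }
    out
}

/// Stand-in for the fixed path: the null check is made for each selected
/// slot, so a null stays a null.
fn filter_preserving_nulls<'a>(column: &[Option<&'a str>], mask: &[bool]) -> Vec<Option<&'a str>> {
    let mut out = Vec::new();
    for (value, keep) in column.iter().zip(mask.iter()) {
        if *keep {
            out.push(*value);
        }
    }
    out
}
```

With the column `["foo", "bar", NULL]` and an all-true filter, the first function yields an empty string in the last slot, while the second keeps the null.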

@@ -353,15 +353,19 @@ impl FilterContext {
// foreach bit in batch:
if (filter_batch & self.filter_mask[j]) != 0 {
let data_index = (i * 64) + j;
values.push(input_array.value(data_index));
if input_array.is_null(data_index) {
Contributor Author

(this is the code for BinaryArray that @nevi-me referred to in #8303 (comment))

@@ -373,7 +377,11 @@ impl FilterContext {
// foreach bit in batch:
if (filter_batch & self.filter_mask[j]) != 0 {
let data_index = (i * 64) + j;
values.push(input_array.value(data_index));
if input_array.is_null(data_index) {
Contributor Author

Likewise, this special case appears to miss the null check too

}
}
}
Ok(Arc::new(BinaryArray::from(values)))
}
DataType::Utf8 => {
let input_array = array.as_any().downcast_ref::<StringArray>().unwrap();
- let mut values: Vec<&str> = Vec::with_capacity(self.filtered_count);
+ let mut values: Vec<Option<&str>> = Vec::with_capacity(self.filtered_count);
Contributor Author

Note using an Option is likely to increase the temporary storage requirements a bit.

It would likely be possible to avoid this allocation entirely if we used the lower level ArrayBuilder::with_bit_buffer.

I chose to follow the style of the rest of this module, though I would love opinions on trying to perf check this / optimize it (maybe a follow on JIRA ticket is enough)?

Member

Doesn't an Option of a reference leverage the null pointer optimization (in which None is represented by the null pointer, e.g. rust-lang/rust#9378)?

IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.

Contributor Author

Doesn't an Option of a reference leverage the null pointer optimization (in which None is represented by the null pointer, e.g. rust-lang/rust#9378)?

Yes, I believe you are correct.

This program:

fn main() {
    println!("The size of a &str is {}", std::mem::size_of::<&str>());
    println!("The size of an Option<&str> is {}", std::mem::size_of::<Option<&str>>());
}

Produces the following on my machine:

The size of a &str is 16
The size of an Option<&str> is 16

nevi-me commented Sep 29, 2020

Thanks @alamb, this existed as https://issues.apache.org/jira/browse/ARROW-5352; so I'll close that out.

The same behaviour would occur with a BinaryArray. I haven't looked at this PR yet, but it's worthwhile to ensure that StringArray and BinaryArray are treated consistently.

And yes, the issue didn't affect primitive arrays. I think it was because we were pushing an empty string where we should have marked a null slot in the null bitmap.
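
To illustrate why an empty string and a null are easy to confuse, here is a simplified sketch of a variable-length string column: a contiguous values buffer, an offsets buffer, and a validity flag per slot. The struct `StringColumn` and its methods are hypothetical and much simpler than Arrow's real `StringArray` (which, among other things, packs validity into bits); this is an illustration under those assumptions, not the library's implementation.

```rust
/// Simplified, hypothetical model of a nullable string column
/// (not the real Arrow `StringArray`).
struct StringColumn {
    values: Vec<u8>,     // all string bytes, back to back
    offsets: Vec<i32>,   // offsets[i]..offsets[i + 1] delimits slot i
    validity: Vec<bool>, // one flag per slot; Arrow packs these into bits
}

impl StringColumn {
    fn from_opt_vec(items: &[Option<&str>]) -> Self {
        let mut values = Vec::new();
        let mut offsets = vec![0i32];
        let mut validity = Vec::with_capacity(items.len());
        for item in items {
            if let Some(s) = item {
                values.extend_from_slice(s.as_bytes());
            }
            // A null slot contributes no bytes, exactly like an empty string.
            offsets.push(values.len() as i32);
            // Only the validity flag tells the two apart.
            validity.push(item.is_some());
        }
        StringColumn { values, offsets, validity }
    }

    fn is_null(&self, i: usize) -> bool {
        !self.validity[i]
    }

    fn value(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        std::str::from_utf8(&self.values[start..end]).expect("bytes came from a valid &str")
    }
}
```

In this model `value(i)` returns `""` for both an empty string and a null, so a filter that only copies `value(i)` and never consults `is_null(i)` loses the distinction.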

@nevi-me nevi-me self-requested a review September 29, 2020 23:19
@nevi-me nevi-me changed the title ARROW-10136: [Rust][Arrow]: Fix null handling in StringArray and BinaryArray filtering, add BinaryArray::from_opt_vec ARROW-10136: [Rust]: Fix null handling in StringArray and BinaryArray filtering, add BinaryArray::from_opt_vec Sep 29, 2020
Member

@jorgecarleitao jorgecarleitao left a comment

Thanks a lot, @alamb . Looks good so far. I left some minor comments.

@@ -1267,6 +1267,36 @@ impl<OffsetSize: OffsetSizeTrait> GenericBinaryArray<OffsetSize> {
GenericBinaryArray::<OffsetSize>::from(array_data)
}

fn from_opt_vec(v: Vec<Option<&[u8]>>, data_type: DataType) -> Self {
Member

is this code tested somewhere?

Contributor Author

It is tested (indirectly) in https://github.com/apache/arrow/pull/8303/files#diff-d7b0b7cde1850e8744ceda458c6dea81R700 -- but I think a more specific test would be valuable. I will add one.

Contributor Author

This turned out to be a great call @jorgecarleitao -- I found a bug in this implementation while writing a test. Thank you for the suggestion. 💯

let c = filter(&a, &b).unwrap();
let d = c.as_ref().as_any().downcast_ref::<StringArray>().unwrap();
assert_eq!(2, d.len());
assert_eq!("hello", d.value(0));
Member

I suggest that we test the 3 quantities: d.is_null(0), d.value(0), d.is_null(1). Alternatively,

let expected = StringArray::from(vec![Some("hello"), None]);
// `d` is a `&StringArray`, so compare it against a reference.
assert_eq!(d, &expected);

let c = filter(&a, &b).unwrap();
let d = c.as_ref().as_any().downcast_ref::<BinaryArray>().unwrap();
assert_eq!(2, d.len());
assert_eq!(b"hello", d.value(0));
Member

same here.

alamb commented Sep 30, 2020

The same behaviour would occur with a BinaryArray, I haven't looked at this PR, but it's worthwhile to ensure that StringArray and BinaryArray are treated consistently.

Thanks @nevi-me -- yes, I updated the code to handle both types in the same way (including tests).

Contributor Author

@alamb alamb left a comment

@jorgecarleitao you said:

IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.

Can you perhaps let me know what you mean by this? Perhaps there is an example in the code you are thinking of?

alamb commented Sep 30, 2020

I think I have addressed all your (very helpful) comments @jorgecarleitao .

@jorgecarleitao
Member

@jorgecarleitao you said:

IMO we should follow up on this: for kernels we have been using a mutable buffer with null masks as much as possible.

Can you perhaps let me know what you mean by this? Perhaps there is an example in the code you are thinking of?

I am really sorry, @alamb. I should have offered more context in the first place. :/

This in no way blocks this PR: IMO it is ready to merge if the relevant tests pass.

What I meant is that this code currently:

  • creates Vec<Option<T>> through an iteration
  • copies Vec<Option<T>> to the two buffers (when from_opt_vec is called)

It may be more efficient to create the buffers during the iteration, so that we avoid the copy (Vec -> buffers). In other words, the code in from_opt_vec could have been "injected" into the filter execution, where the MutableBuffer and the offsets and values buffers are created before the loop, and new elements are written directly to them. Does this make any sense?

(as a side note, this is why I am proposing #8211 : IMO there is some boiler-plate copy-pasting to

  1. initialize buffers
  2. iterate
  3. create ArrayData from buffers

which will continue to grow as we add more kernels, and whose pattern seems to be a FromIter of fixed size)
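
The suggestion above can be sketched as follows: instead of collecting a `Vec<Option<&str>>` and then copying it into buffers, the loop appends to the output buffers directly. This is only an illustration using plain `Vec`s; the names `FilteredBuffers` and `filter_into_buffers` are hypothetical, and a real kernel would presumably use Arrow's own buffer types and a bit-packed null bitmap.

```rust
/// Output buffers of a filtered string column (hypothetical; plain `Vec`s
/// stand in for Arrow's buffer types).
#[derive(Debug, PartialEq)]
struct FilteredBuffers {
    values: Vec<u8>,     // concatenated bytes of the selected strings
    offsets: Vec<i32>,   // offsets[i]..offsets[i + 1] delimits output slot i
    validity: Vec<bool>, // false marks a null output slot
}

fn filter_into_buffers(input: &[Option<&str>], mask: &[bool]) -> FilteredBuffers {
    // 1. Initialize the buffers before the loop.
    let mut out = FilteredBuffers {
        values: Vec::new(),
        offsets: vec![0],
        validity: Vec::new(),
    };
    // 2. Iterate, writing each selected element straight into the buffers.
    for (item, keep) in input.iter().zip(mask.iter()) {
        if !*keep {
            continue;
        }
        if let Some(s) = item {
            out.values.extend_from_slice(s.as_bytes());
        }
        out.offsets.push(out.values.len() as i32);
        out.validity.push(item.is_some());
    }
    // 3. The caller would create the array data from these buffers.
    out
}
```

Compared with the `Vec<Option<&str>>` approach, this avoids the intermediate vector and the second pass that copies it into buffers, at the cost of repeating the buffer bookkeeping in each kernel.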

Member

@jorgecarleitao jorgecarleitao left a comment

LGTM, really nice additions.

alamb commented Sep 30, 2020

@jorgecarleitao -- yes thank you that makes a lot of sense. I have filed https://issues.apache.org/jira/browse/ARROW-10141 to track that

alamb commented Oct 1, 2020

@andygrove / @jorgecarleitao / @nevi-me I wonder if this PR might be merged anytime soon (I have a downstream project relying on this change)

The integration test failure https://github.com/apache/arrow/pull/8303/checks?check_run_id=1187275161 seems due to a network failure (not anything with this PR):

Error:  Failed to execute goal org.apache.maven.plugins:maven-site-plugin:3.5.1:attach-descriptor (attach-descriptor) on project arrow-java-root: Execution attach-descriptor of goal org.apache.maven.plugins:maven-site-plugin:3.5.1:attach-descriptor failed: Plugin org.apache.maven.plugins:maven-site-plugin:3.5.1 or one of its dependencies could not be resolved: Could not transfer artifact org.apache.maven:maven-archiver:jar:2.5 from/to central (https://repo.maven.apache.org/maven2): Connection reset -> [Help 1]

alamb commented Oct 1, 2020

I'll rebase to try and get a clean test run

@alamb alamb force-pushed the alamb/ARROW-10136-null-filter branch from 53f04f0 to b83d867 Compare October 1, 2020 12:52
alamb commented Oct 1, 2020

All tests are passing now. 🎉

@andygrove andygrove closed this in a1157b7 Oct 1, 2020
@alamb alamb deleted the alamb/ARROW-10136-null-filter branch October 1, 2020 18:56
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
… filtering, add BinaryArray::from_opt_vec

Closes apache#8303 from alamb/alamb/ARROW-10136-null-filter

Authored-by: alamb <andrew@nerdnetworks.org>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021