Skip to content

Implement Utf8View for lpad scalar function #11941

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 1 commit into from
Aug 13, 2024

Conversation

Omega359
Copy link
Contributor

Which issue does this PR close?

Closes #11857

Rationale for this change

Support utf8view for lpad udf.

What changes are included in this PR?

Code, tests.

Are these changes tested?

Yes.

Are there any user-facing changes?

No

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 12, 2024
@Omega359 Omega359 changed the title Update LPAD scalar function to support Utf8View Implement Utf8View for lpad scalar function Aug 12, 2024
@Omega359 Omega359 marked this pull request as ready for review August 12, 2024 02:13
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very impressive @Omega359 👏 I especially like the testing improvements of this PR

I mentioned some places this function could be made faster, but that can totally be done as a follow on (maybe waiting for someone for whom the performance of lpad is important). I can file a follow on ticket

s.push_str(string);
Ok(Some(s))
}
let mut s: String = " ".repeat(length - graphemes.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR just follows the existing pattern so I don't think any changes are needed, but I think using a GenericStringBuilder would be much faster here (as it wouldn't allocate a string for each output array element)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a test locally with using GenericStringBuilder vs what is in this PR and the differences are within +/- 10% for 1024 and 2048 element arrays for each of the 3 string types. Worth noting though is that timing between runs on my laptop is not really stable at all (+/- 10% or so between runs of the exact same code).

Interestingly the utf8view seems to be consistently slightly slower than just plain utf8 at least in this function.

lpad function/utf8 type/2048
                        time:   [715.78 µs 718.43 µs 721.89 µs]
                        
lpad function/largeutf8 type/2048
                        time:   [745.44 µs 759.62 µs 779.84 µs]
                       
lpad function/stringview type/2048
                        time:   [734.25 µs 751.13 µs 773.67 µs]
                       

let length = if length < 0 { 0 } else { length as usize };
if length == 0 {
Ok(Some("".to_string()))
if length < graphemes.len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise here I think GenericStringBuilder would likely be faster

@Omega359
Copy link
Contributor Author

I was looking at @Lordworms implementation in #11942 and I think it would make sense to align the two implementations. In some aspects I like his approach much more than the one I took in this PR.

@alamb
Copy link
Contributor

alamb commented Aug 12, 2024

I was looking at @Lordworms implementation in #11942 and I think it would make sense to align the two implementations. In some aspects I like his approach much more than the one I took in this PR.

Do you think we should merge this PR and work on improvements in a follow on PR?

@Omega359
Copy link
Contributor Author

Do you think we should merge this PR and work on improvements in a follow on PR?

I have no objections to that.

@alamb alamb merged commit 508da80 into apache:main Aug 13, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Aug 13, 2024

Thanks @Omega359 -- let's merge this to get the initial implementation and tests in and then we can work on improvements as a follow on PR.

Let me know if you would like me to file a ticket for

I was looking at @Lordworms implementation in #11942 and I think it would make sense to align the two implementations. In some aspects I like his approach much more than the one I took in this PR.

@Omega359
Copy link
Contributor Author

Let me know if you would like me to file a ticket for

I was looking at @Lordworms implementation in #11942 and I think it would make sense to align the two implementations. In some aspects I like his approach much more than the one I took in this PR.

I've got the work done locally in a branch along with the GenericStringBuilder work and the benchmark to go with it. I'll create a ticket tomorrow for it thanks @alamb

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update LPAD scalar function to support Utf8View
2 participants