Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

feat: support using general compression for numeric array #3020

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

niyue
Copy link
Contributor

@niyue niyue commented Oct 18, 2024

This PR aims to address issue #3019.

In this PR, a general compression encoder can now be selected for numeric arrays. I’ve reused the existing CompressedBufferEncoder struct from the codebase, which currently does not appear to be in use elsewhere.

@niyue niyue changed the title Support using general compression for numeric array feat: support using general compression for numeric array Oct 18, 2024
@github-actions github-actions bot added the enhancement New feature or request label Oct 18, 2024
compressed_bit_width as usize,
data_type.clone(),
)));
} else if let Some(compression) = field_meta.and_then(Self::get_field_compression) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When bitpacking is not applicable, check if field itself specifies some compression

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 93.33333% with 8 lines in your changes missing coverage. Please review.

Project coverage is 78.24%. Comparing base (f9024ce) to head (bb88df3).

Files with missing lines Patch % Lines
...st/lance-encoding/src/encodings/physical/binary.rs 61.53% 5 Missing ⚠️
...-encoding/src/encodings/physical/block_compress.rs 75.00% 2 Missing ⚠️
rust/lance-encoding/src/encoder.rs 98.98% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3020      +/-   ##
==========================================
+ Coverage   78.19%   78.24%   +0.04%     
==========================================
  Files         239      239              
  Lines       76782    76870      +88     
  Branches    76782    76870      +88     
==========================================
+ Hits        60043    60148     +105     
+ Misses      13669    13633      -36     
- Partials     3070     3089      +19     
Flag Coverage Δ
unittests 78.24% <93.33%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

let mut encoder: Option<Box<dyn ArrayEncoder>> = None;

if version >= LanceFileVersion::V2_1 {
if arrays[0].data_type() == data_type {
Copy link
Contributor Author

@niyue niyue Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace

I definitely think we should bitpack string offsets

If this is desirable, maybe the simplest change is to remove the condition if arrays[0].data_type() == data_type here so that bit packing would be applied. But this condition seems to be added prevously intentionally and I am not sure if this should be simply removed or not.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants