arrow2 does _not_ refcount schema metadata #1805

Closed
Tracked by #1898
teh-cmc opened this issue Apr 10, 2023 · 3 comments
Labels: 🏹 arrow (Apache Arrow), 🚀 performance (Optimization, memory use, etc), ⛃ re_datastore (affects the datastore itself)

Comments

teh-cmc (Member) commented Apr 10, 2023

All arrow2 arrays are defined roughly as the following:

pub struct Array<T> {
    data_type: DataType,
    values: Buffer<T>,
    validity: Option<Bitmap>,
}

When you clone/slice/index an Array, you get another Array in roughly O(1), thanks to both the values buffer and the validity bitmap being refcounted behind the scenes:

pub struct Buffer<T> {
    data: Arc<Bytes<T>>,
    offset: usize,
    length: usize,
}

pub struct Bitmap {
    bytes: Arc<Bytes<u8>>,
    offset: usize,
    length: usize,
    unset_bits: usize,
}

Well... not really: it turns out the DataType is not refcounted, and it can get huge. It's a massive, heap-allocated recursive enum potentially filled with strings and such.

Say you have a ListArray that contains a bunch of StructArrays (i.e. a column of component data) and you want to extract references to the individual StructArrays in that list (i.e. the individual DataCells): each of these arrays is now going to carry a full copy of the StructArray's schema.

For tiny DataCells (which are very common in Rerun), the overhead is enormous.
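
To make the cost concrete, here is a minimal sketch using only the standard library. Everything in it is a simplified stand-in and an assumption for illustration: this `DataType`/`Field` pair is not arrow2's real definition, and `InlineCell`, `SharedCell`, `point3d_type`, `heap_allocations`, `split_inline` and `split_shared` are hypothetical names. It compares cells that each own a datatype by value with cells that share one behind an `Arc`, which is one possible way to refcount it rather than a description of what arrow2 does.

```rust
use std::sync::Arc;

/// Simplified stand-in for a recursive, heap-allocated datatype enum
/// (hypothetical; not arrow2's actual `DataType`).
#[derive(Clone, Debug, PartialEq)]
pub enum DataType {
    Float32,
    Struct(Vec<Field>),
    List(Box<Field>),
}

#[derive(Clone, Debug, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

/// Rough count of the heap allocations a deep clone of `dt` performs:
/// one per `Vec`/`Box`, plus one per field name (assumes non-empty names
/// and non-empty field lists).
pub fn heap_allocations(dt: &DataType) -> usize {
    match dt {
        DataType::Float32 => 0,
        DataType::Struct(fields) => {
            1 + fields
                .iter()
                .map(|f| 1 + heap_allocations(&f.data_type))
                .sum::<usize>()
        }
        DataType::List(field) => 2 + heap_allocations(&field.data_type),
    }
}

/// A struct datatype with three `Float32` fields: x, y, z.
pub fn point3d_type() -> DataType {
    let fields = ["x", "y", "z"]
        .iter()
        .map(|name| Field {
            name: name.to_string(),
            data_type: DataType::Float32,
        })
        .collect();
    DataType::Struct(fields)
}

/// A cell that owns its datatype by value, so creating it deep-clones the schema.
pub struct InlineCell {
    pub data_type: DataType,
    pub row: usize,
}

/// A cell that shares its datatype, so creating it only bumps a refcount.
pub struct SharedCell {
    pub data_type: Arc<DataType>,
    pub row: usize,
}

/// One cell per row, each carrying a full copy of the schema.
pub fn split_inline(dt: &DataType, rows: usize) -> Vec<InlineCell> {
    (0..rows)
        .map(|row| InlineCell { data_type: dt.clone(), row })
        .collect()
}

/// One cell per row, all pointing at the same schema allocation.
pub fn split_shared(dt: &Arc<DataType>, rows: usize) -> Vec<SharedCell> {
    (0..rows)
        .map(|row| SharedCell { data_type: Arc::clone(dt), row })
        .collect()
}
```

Under these assumptions, cloning the three-field struct type costs 4 allocations (one `Vec` plus three names), so `split_inline(&dt, 1000)` performs roughly 4000 schema allocations, while `split_shared` performs none and only increments the `Arc` strong count.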

teh-cmc added the 🏹 arrow, ⛃ re_datastore, and 🚀 performance labels Apr 10, 2023

jleibs (Member) commented Apr 10, 2023

Looks like we might want to pull on this thread:

teh-cmc (Member, Author) commented Apr 17, 2023

teh-cmc (Member, Author) commented Oct 9, 2023

teh-cmc closed this as not planned Oct 9, 2023