feat: append, compact, cleanup with stable row id #2412

Closed
wants to merge 9 commits into from
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[workspace]
members = [
-"java/core/lance-jni",
+# "java/core/lance-jni",
"rust/lance",
"rust/lance-arrow",
"rust/lance-core",
11 changes: 7 additions & 4 deletions docs/format.rst
@@ -44,8 +44,8 @@ deleted from the fragment.

.. image:: _static/fragment_structure.png

-Every row has a unique id, which is an u64 that is composed of two u32s: the
-fragment id and the local row id. The local row id is just the index of the
+Every row has a unique row address, which is an u64 that is composed of two u32s: the
+fragment id and the local row offset. The local row offset is just the index of the
row in the data files.

File Structure
@@ -201,12 +201,12 @@ Deletion
--------

Rows can be marked deleted by adding a deletion file next to the data in the
-``_deletions`` folder. These files contain the indices of rows that have between
+``_deletions`` folder. These files contain the local offsets of rows that have between
deleted for some fragment. For a given version of the dataset, each fragment can
have up to one deletion file. Fragments that have no deleted rows have no deletion
file.

-Readers should filter out row ids contained in these deletion files during a
+Readers should filter out row offsets contained in these deletion files during a
scan or ANN search.

Deletion files come in two flavors:
@@ -424,6 +424,9 @@ row address
example, if the row address is (42, 9), then the row is in the 42rd fragment
and is the 10th row in that fragment.

+row offset
+The index of a row in a fragment. This is a u32 that is unique within a fragment.
+
row id sequence
The sequence of row ids in a fragment.
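
Reviewer-style note on the renamed terms: the glossary describes a row address as a u64 built from a u32 fragment id and a u32 row offset. The sketch below illustrates that packing. The function names are hypothetical and not part of this PR, and placing the fragment id in the upper 32 bits is an assumption, since the text above does not state the bit order.

```rust
/// Hypothetical helper (not from this PR): pack a fragment id and a local
/// row offset into a single u64 row address.
/// Assumption: the fragment id occupies the upper 32 bits.
pub fn pack_row_address(fragment_id: u32, row_offset: u32) -> u64 {
    ((fragment_id as u64) << 32) | (row_offset as u64)
}

/// Hypothetical inverse: split a row address into (fragment id, row offset).
pub fn unpack_row_address(addr: u64) -> (u32, u32) {
    // The `as u32` cast keeps only the low 32 bits, i.e. the row offset.
    ((addr >> 32) as u32, addr as u32)
}
```

Under that assumption, the glossary example (42, 9) would round-trip through `pack_row_address(42, 9)` and `unpack_row_address`, with offset 9 denoting the 10th row because offsets are zero-based.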

4 changes: 3 additions & 1 deletion docs/read_and_write.rst
@@ -469,10 +469,12 @@ During compaction, Lance can also remove deleted rows. Rewritten fragments will
not have deletion files. This can improve scan performance since the soft deleted
rows don't have to be skipped during the scan.

-When files are rewritten, the original row ids are invalidated. This means the
+When files are rewritten, the original row addresses are invalidated. This means the
affected files are no longer part of any ANN index if they were before. Because
of this, it's recommended to rewrite files before re-building indices.

+.. TODO: remove this last comment once move-stable row ids are default.
+

Object Store Configuration
--------------------------
6 changes: 3 additions & 3 deletions protos/table.proto
@@ -272,14 +272,14 @@ message DataFile {
// where {extension} is `.arrow` or `.bin` depending on the type of deletion.
message DeletionFile {
// Type of deletion file, which varies depending on what is the most efficient
-// way to store the deleted row ids. If none, then will be unspecified. If there are
+// way to store the deleted row offsets. If none, then will be unspecified. If there are
// sparsely deleted rows, then ARROW_ARRAY is the most efficient. If there are
// densely deleted rows, then BIT_MAP is the most efficient.
enum DeletionFileType {
-// Deletion file is a single Int32Array of deleted row ids. This is stored as
+// Deletion file is a single Int32Array of deleted row offsets. This is stored as
// an Arrow IPC file with one batch and one column. Has a .arrow extension.
ARROW_ARRAY = 0;
-// Deletion file is a Roaring Bitmap of deleted row ids. Has a .bin extension.
+// Deletion file is a Roaring Bitmap of deleted row offsets. Has a .bin extension.
BITMAP = 1;
}

4 changes: 2 additions & 2 deletions python/python/lance/dataset.py
@@ -399,7 +399,7 @@ def to_table(
prefilter: bool, default False
Run filter before the vector search.
with_row_id: bool, default False
-Return physical row ID.
+Return row ID.
use_stats: bool, default True
Use stats pushdown during filters.

@@ -2046,7 +2046,7 @@ def prefilter(self, prefilter: bool) -> ScannerBuilder:
return self

def with_row_id(self, with_row_id: bool = True) -> ScannerBuilder:
-"""Enable returns with physical row IDs."""
+"""Enable returns with row IDs."""
self._with_row_id = with_row_id
return self

5 changes: 5 additions & 0 deletions rust/lance-core/src/lib.rs
@@ -12,9 +12,14 @@ pub use error::{Error, Result};

/// Column name for the meta row ID.
pub const ROW_ID: &str = "_rowid";
+/// Column name for the meta row address.
+pub const ROW_ADDR: &str = "_rowaddr";

lazy_static::lazy_static! {
/// Row ID field. This is nullable because its validity bitmap is sometimes used
/// as a selection vector.
pub static ref ROW_ID_FIELD: ArrowField = ArrowField::new(ROW_ID, DataType::UInt64, true);
+/// Row address field. This is nullable because its validity bitmap is sometimes used
+/// as a selection vector.
+pub static ref ROW_ADDR_FIELD: ArrowField = ArrowField::new(ROW_ADDR, DataType::UInt64, true);
}
2 changes: 1 addition & 1 deletion rust/lance-core/src/utils/deletion.rs
@@ -11,7 +11,7 @@ use roaring::RoaringBitmap;
const BITMAP_THRESDHOLD: usize = 5_000;
// TODO: Benchmark to find a better value.

-/// Represents a set of deleted row ids in a single fragment.
+/// Represents a set of deleted row offsets in a single fragment.
#[derive(Debug, Clone)]
pub enum DeletionVector {
NoDeletions,
10 changes: 5 additions & 5 deletions rust/lance-core/src/utils/mask.rs
@@ -14,19 +14,19 @@ use crate::Result;

use super::address::RowAddress;

-/// A row id mask to select or deselect particular row ids
+/// A row id mask to select or deselect particular row addresses
///
/// If both the allow_list and the block_list are Some then the only selected
-/// row ids are those that are in the allow_list but not in the block_list
+/// row addresses are those that are in the allow_list but not in the block_list
/// (the block_list takes precedence)
///
/// If both the allow_list and the block_list are None (the default) then
-/// all row ids are selected
+/// all row addresses are selected
#[derive(Clone, Debug, Default)]
pub struct RowIdMask {
-/// If Some then only these row ids are selected
+/// If Some then only these row addresses are selected
pub allow_list: Option<RowIdTreeMap>,
-/// If Some then these row ids are not selected.
+/// If Some then these row addresses are not selected.
pub block_list: Option<RowIdTreeMap>,
}
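
Reviewer-style note: the doc comments on `RowIdMask` describe its selection rule, which can be sketched on its own as follows. This is a simplified illustration only. `SketchMask` and `is_selected` are hypothetical names, and a plain `HashSet<u64>` stands in for `RowIdTreeMap`, whose API is not shown in this diff.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for RowIdMask, using HashSet<u64> in place of
/// RowIdTreeMap purely for illustration.
#[derive(Default)]
pub struct SketchMask {
    /// If Some, only these row addresses are selected.
    pub allow_list: Option<HashSet<u64>>,
    /// If Some, these row addresses are not selected.
    pub block_list: Option<HashSet<u64>>,
}

impl SketchMask {
    /// Returns true if `addr` passes the mask. The block list takes
    /// precedence; with both lists None, every address is selected.
    pub fn is_selected(&self, addr: u64) -> bool {
        let allowed = self.allow_list.as_ref().map_or(true, |a| a.contains(&addr));
        let blocked = self.block_list.as_ref().map_or(false, |b| b.contains(&addr));
        allowed && !blocked
    }
}
```

With an allow list of {1, 2} and a block list of {2}, only address 1 is selected: 2 is blocked despite being allowed, and 3 is never allowed.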
