Skip to content

Conversation

dor-forer
Copy link
Collaborator

@dor-forer dor-forer commented May 29, 2025

Describe the changes in the pull request

  1. Change the preprocess call to use separate variables for storage_blob_size and query_blob_size, allowing the option of different vector sizes for each (like in the quantization preprocessor).
  2. Added the QuantPreprocessor, which receives a dim in the constructor and calculates the storage_bytes_count.
    The QuantPreprocessor takes a vector of float32 with size dim, and converts it to a uint8 vector with the following structure:
    [dim x uint8, min_value, detla]
    With a storage_bytes_count of: dim * sizeof(uint8_t) + 2 * sizeof(float).
    In the preprocess function, there are 3 main cases:
    a. Allocate storage_blob if it was never allocated.
    b. If storage_blob and query_blob share the same memory, allocate new memory for storage_blob.
    c. If storage_blob was allocated and they don't share memory, check if enough memory was allocated for the new vector, if not, reallocate.
  3. Added tests to test the following scenarios:
    QuantizationTest:
    Tests only one preprocessor of type QuantPreprocessor, checks that the preprocessor works as expected.
    QuantizationTestWithCosine:
    Tests QuantPreprocessor after normalization using the CosinePreprocessor.
    ReallocateVectorQuantizationTest:
    Tests the scenario of QuantPreprocessor handling memory that was already allocated, and needs to be reallocated.
    (using DummyStoragePreprocessor to allocate the storage_blob memory withouth allocating the query_blob.
    ReallocateVectorCosineQuantizationTest:
    Tests the relocating scenario in the case where query_blob and storage_blob share memory.
    QuantizationInPlaceTest:
    Tests the case where enough memory was already allocated, and needs to use the quantization in place.

Which issues this PR fixes
1. MOD-9238

Main objects this PR modified
1.Preprocessors
2. ...

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

meiravgri and others added 17 commits May 8, 2025 04:52
…er file

add logic to determine the cosine processed_bytes_count

add processed_bytes_count to the cosine component

change blob_size param to the original blob size instead of processed_bytes_count in all preprocessing functions.
add DummyChangeAllocSizePreprocessor class to test pp that changes the size

add tests
one will fail because the input blob size is not updated
…tes_count

add valgrind macro

value_to_add of DummyQueryPreprocessor and DummyMixedPreprocessor is now DataType instead of int
Copy link

codecov bot commented May 29, 2025

Codecov Report

Attention: Patch coverage is 74.35897% with 20 lines in your changes missing coverage. Please review.

Project coverage is 96.00%. Comparing base (5ba2efa) to head (a8aee99).

Files with missing lines Patch % Lines
src/VecSim/spaces/computer/preprocessors.h 72.97% 20 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #688      +/-   ##
==========================================
- Coverage   96.29%   96.00%   -0.29%     
==========================================
  Files         111      111              
  Lines        6288     6365      +77     
==========================================
+ Hits         6055     6111      +56     
- Misses        233      254      +21     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dor-forer dor-forer requested a review from Copilot June 3, 2025 14:15
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds support for dual-size preprocessing of storage and query blobs across existing preprocessors and updates tests to cover quantization and dummy preprocessor scenarios.
Key changes:

  • Introduce a new two-size preprocess overload in PreprocessorInterface and implement it in all preprocessors.
  • Update MultiPreprocessorsContainer to invoke the two-size overload and track separate storage/query sizes.
  • Add extensive unit tests for quantization and dummy preprocessors covering both single- and multi-preprocessor chains.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
tests/unit/test_hnsw_tiered.cpp Add two-size preprocess override for PreprocessorDoubleValue.
tests/unit/test_components.cpp Implement two-size overloads for dummy preprocessors and add tests.
src/VecSim/spaces/computer/preprocessors.h Declare and implement two-size preprocess in interface and classes
src/VecSim/spaces/computer/preprocessor_container.h Update container to call two-size preprocess and track sizes.
Comments suppressed due to low confidence (1)

src/VecSim/spaces/computer/preprocessor_container.h:175

  • Typo in comment: Sepreated should be Separate.
// Sepreated variables for the storage blob size and query_blob_size,

// i.e., both blobs are of the same size.
// In order to use different sizes, the preprocessor should be modified.
assert(storage_blob_size == query_blob_size);
preprocess(original_blob, storage_blob, query_blob, storage_blob_size, alignment);
Copy link
Preview

Copilot AI Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After calling the single-size overload, query_blob_size is never updated. You should assign query_blob_size = storage_blob_size to keep sizes in sync.

Suggested change
preprocess(original_blob, storage_blob, query_blob, storage_blob_size, alignment);
preprocess(original_blob, storage_blob, query_blob, storage_blob_size, alignment);
query_blob_size = storage_blob_size;

Copilot uses AI. Check for mistakes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that storage_blob is never updated. So does it affect it?

@dor-forer dor-forer requested a review from meiravgri June 3, 2025 15:01
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants