Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Feature: support all data types listed in scalar_kind_t in index_dense #469

Open
2 of 3 tasks
mbautin opened this issue Aug 22, 2024 · 1 comment
Open
2 of 3 tasks
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@mbautin
Copy link
Contributor

mbautin commented Aug 22, 2024

Describe what you are looking for

index_dense seems to only support these data types:

    add_result_t add(vector_key_t key, b1x8_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_b1x8); }
    add_result_t add(vector_key_t key, i8_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_i8); }
    add_result_t add(vector_key_t key, f16_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_f16); }
    add_result_t add(vector_key_t key, f32_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_f32); }
    add_result_t add(vector_key_t key, f64_t const* vector, std::size_t thread = any_thread(), bool force_vector_copy = true) { return add_(key, vector, thread, force_vector_copy, casts_.from_f64); }

The casts are instantiated only for these 5 types as well.

scalar_kind_t has much more than that:

enum class scalar_kind_t : std::uint8_t {
    unknown_k = 0,
    // Custom:
    b1x8_k = 1,
    u40_k = 2,
    uuid_k = 3,
    // Common:
    f64_k = 10,
    f32_k = 11,
    f16_k = 12,
    f8_k = 13,
    // Common Integral:
    u64_k = 14,
    u32_k = 15,
    u16_k = 16,
    u8_k = 17,
    i64_k = 20,
    i32_k = 21,
    i16_k = 22,
    i8_k = 23,
};

In particular, the SIFT1B dataset from http://corpus-texmex.irisa.fr/ seems to require unsigned char, and building a float-based index for that dataset seems wasteful.

If there is a good reason to only support those 5 data types, some documentation of the rationale would be great.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

C++ implementation

Contact Details

mbautin@gmail.com

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@mbautin mbautin added the enhancement New feature or request label Aug 22, 2024
ashvardanian added a commit that referenced this issue Aug 29, 2024
Separating vector-casting logic will make it easier
to extend the type system, potentially adding `bf16`
and `u8` high-level `add`, `get`,  and`search` APIs
down the road.

Related to #469

Co-authored-by: Mikhail Bautin <mbautin@gmail.com>
@ashvardanian
Copy link
Contributor

Some of those, indeed, make sense, @mbautin! Implementing all of them sounds like a code-bloat. How about adding only u8 and bf16 to casts_punned_t?

I've just pushed a commit that refactors the code logic making it easier to extend an API. Feel free to open PRs 🤗

@ashvardanian ashvardanian added good first issue Good for newcomers help wanted Extra attention is needed labels Sep 2, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants