[Enhancement]: get doc ids by batch #40607

Open
1 task done
SpadeA-Tang opened this issue Mar 12, 2025 · 3 comments
Labels
kind/enhancement Issues or changes related to enhancement

Comments

@SpadeA-Tang
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

What would you like to be added?

Milvus row IDs do not match tantivy doc IDs, so we first collect the tantivy doc IDs produced by the query,
then look up the corresponding Milvus ID for each one, which is stored in the "doc_id" field (a fast field).

This one-by-one lookup is inefficient: 1. dynamic dispatch is involved on every read of the "doc_id" field, 2. we cannot benefit from SIMD optimization, 3. memory locality suffers, and so on.
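
To make the contrast concrete, here is a minimal sketch of the two access patterns. It uses only the standard library; `DocIdColumn`, `PlainColumn`, `resolve_one_by_one`, and `resolve_batched` are hypothetical names invented for illustration and are not tantivy's or Milvus's actual API. It assumes the "doc_id" column is reached through a trait object, which is where the per-hit dynamic dispatch comes from.

```rust
/// Hypothetical stand-in for a columnar "doc_id" fast field; not tantivy's real API.
trait DocIdColumn {
    /// Per-document lookup: one virtual call per hit when used through `dyn`.
    fn get_val(&self, doc: u32) -> i64;

    /// Batch lookup: one virtual call for the whole slice of hits.
    /// Inside this method the calls to `get_val` are statically dispatched.
    fn get_vals(&self, docs: &[u32], out: &mut [i64]) {
        assert_eq!(docs.len(), out.len());
        for (slot, &doc) in out.iter_mut().zip(docs) {
            *slot = self.get_val(doc);
        }
    }
}

/// Simplest possible backing store (an assumption): row ids in a vector indexed by doc id.
struct PlainColumn {
    row_ids: Vec<i64>,
}

impl DocIdColumn for PlainColumn {
    fn get_val(&self, doc: u32) -> i64 {
        self.row_ids[doc as usize]
    }
}

/// One-by-one path described in the issue: a dynamic dispatch per hit.
fn resolve_one_by_one(col: &dyn DocIdColumn, docs: &[u32]) -> Vec<i64> {
    docs.iter().map(|&d| col.get_val(d)).collect()
}

/// Batched path: a single dynamic dispatch, then a tight loop inside the column.
fn resolve_batched(col: &dyn DocIdColumn, docs: &[u32]) -> Vec<i64> {
    let mut out = vec![0i64; docs.len()];
    col.get_vals(docs, &mut out);
    out
}
```

Both functions return the same IDs. The difference is that the batched path crosses the `dyn` boundary once, so the inner loop is a candidate for inlining and vectorization by the compiler.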

Why is this needed?

No response

Anything else?

No response

SpadeA-Tang added the kind/enhancement label Mar 12, 2025
@xiaofan-luan
Collaborator

Would it be a good idea to have a specialized map implementation for doc IDs to accelerate this lookup?

@xiaofan-luan
Collaborator

Instead of optimizing the filter, we need a data structure that is really good at retrieval, like a hash map.
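
As a rough illustration of the hash-map idea, the sketch below keeps an in-memory map from tantivy doc ID to Milvus ID and resolves a batch of hits against it. `RowIdMap`, `from_pairs`, and `resolve` are hypothetical names, and the sketch assumes the (doc ID, Milvus ID) pairs are available when the map is built. It is not how Milvus currently stores the mapping.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory map from tantivy doc id to Milvus id.
struct RowIdMap {
    by_doc: HashMap<u32, i64>,
}

impl RowIdMap {
    /// Build the map from (doc id, Milvus id) pairs, assumed to be known up front.
    fn from_pairs(pairs: impl IntoIterator<Item = (u32, i64)>) -> Self {
        RowIdMap {
            by_doc: pairs.into_iter().collect(),
        }
    }

    /// Resolve a batch of hits; doc ids missing from the map come back as `None`.
    fn resolve(&self, docs: &[u32]) -> Vec<Option<i64>> {
        docs.iter().map(|d| self.by_doc.get(d).copied()).collect()
    }
}
```

The trade-off is extra memory for the map on top of the index.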

@SpadeA-Tang
Contributor Author

#40608 optimizes it by utilizing tantivy's internal batch API and can achieve better performance. @xiaofan-luan
