Not able to load the existing index #78

Open
llm-finetune opened this issue Jan 16, 2025 · 3 comments

@llm-finetune

Hi, I am working on a RAG application and trying to implement document indexing using the pylate library. Below is the code snippet for creating the index:

model = models.ColBERT(
    model_name_or_path="lightonai/colbertv2.0",
)

index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="test",
)

After the above code runs, the index is initialized.

documents_embeddings = model.encode(
    documents,
    batch_size=1,
    is_query=False,
    show_progress_bar=True,
)

After the above code runs, the embeddings get stored in the index.

However, when I try to load the index using the code below, I get an error. I have tried multiple things but couldn't find a solution.

index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="test",
)

Note: I am working on Windows.

Any solution or guidance would be appreciated.

Thanks,

Error:
RuntimeError: Tried to read 18648 bytes from stream, but only received 974 bytes!

Error Trace
Traceback (most recent call last):
  File "C:\Users\khand\AppData\Local\Programs\Python\Python311\Lib\runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\khand\AppData\Local\Programs\Python\Python311\Lib\runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "c:\Users\khand\.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy\__main__.py", line 39, in <module>
    cli.main()
  File "c:\Users\khand\.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy/..\debugpy\server\cli.py", line 430, in main
    run()
  File "c:\Users\khand\.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy/..\debugpy\server\cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "c:\Users\khand\.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\khand\.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "c:\Users\khand\.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "D:\Ankit\MyWork\TestColbert\Test.py", line 11, in <module>
    index = indexes.Voyager("pylate-index",
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Ankit\MyWork\TestColbert\.venv\Lib\site-packages\pylate\indexes\voyager.py", line 122, in __init__
    self.index = self._create_collection(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Ankit\MyWork\TestColbert\.venv\Lib\site-packages\pylate\indexes\voyager.py", line 163, in _create_collection
    return Index.load(index_path)
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Tried to read 18648 bytes from stream, but only received 974 bytes!
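
The error message says the loader expected more bytes than the file on disk provided, which would be consistent with a truncated or partially written index file. A minimal sketch for inspecting what is actually in the index folder, using only the standard library, might look like the following. The helper name `list_index_files` is hypothetical and not part of pylate or voyager, and no particular file names inside the folder are assumed.

```python
from pathlib import Path


def list_index_files(index_folder: str) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under index_folder.

    Hypothetical diagnostic helper; it only reads file metadata and does not
    try to parse or load the index itself.
    """
    root = Path(index_folder)
    if not root.is_dir():
        # Nothing on disk yet: report an empty listing instead of raising.
        return []
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


# Example usage (folder name taken from the snippet in this issue):
# for name, size in list_index_files("pylate-index"):
#     print(f"{size:>12} bytes  {name}")
```

A file that is unexpectedly small, or that changes size between two runs of the script, would be a hint that the index was not written completely.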

@NohTow
Collaborator

NohTow commented Jan 17, 2025

Hello,

I tried with this snippet:
from pylate import indexes, models

model = models.ColBERT(
    model_name_or_path="lightonai/colbertv2.0",
)

index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="test",
)

documents = ["document 1", " document 2"]
documents_embeddings = model.encode(
    documents,
    batch_size=1,
    is_query=False,
    show_progress_bar=True,
)
index.add_documents(documents_embeddings=documents_embeddings)

And then

from pylate import indexes, models, retrieve

model = models.ColBERT(
    model_name_or_path="lightonai/colbertv2.0",
)

index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="test",
)

queries = ["hello", "how are you"]
queries_embeddings = model.encode(
    queries,
    batch_size=1,
    is_query=True,
    show_progress_bar=True,
)
retriever = retrieve.ColBERT(index=index)
print(retriever.retrieve(queries_embeddings))

And this works fine.
I believe this kind of error message arises when the index is corrupted somehow. Could you try removing the files from the pylate-index folder (or initializing the index with override=True before adding documents the first time, which should be equivalent) and try again?
If you are able to reproduce the index corruption, I may be able to add some guardrails to prevent it.
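
The cleanup suggested above can be sketched with the standard library alone. The helper name `reset_index_folder` is hypothetical (it is not a pylate function); it simply deletes the folder so that the next run starts from an empty index.

```python
import shutil
from pathlib import Path


def reset_index_folder(index_folder: str) -> bool:
    """Delete index_folder and everything in it.

    Returns True if the folder existed and was removed, False otherwise.
    Hypothetical helper: this permanently deletes the stored index files.
    """
    path = Path(index_folder)
    if not path.exists():
        return False
    shutil.rmtree(path)
    return True


# Example usage (folder name taken from the snippets in this issue):
# reset_index_folder("pylate-index")
# Then recreate the index and add the documents again, for instance with
# indexes.Voyager(index_folder="pylate-index", index_name="test", override=True)
```

Deleting the folder and passing override=True are described above as interchangeable, so only one of the two should be needed.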

@llm-finetune
Author

llm-finetune commented Jan 18, 2025

Thanks @NohTow, for looking into this.

I tried the code snippet you provided above. I am getting the error below at the index.add_documents() statement:

TypeError: Voyager.add_documents() missing 1 required positional argument: 'documents_ids'

My versions are voyager 2.1.0, pylate 1.1.4, and Python 3.11.

I need to maintain the document IDs along with the document embeddings, but when I pass the document IDs the index somehow gets corrupted.

@NohTow
Collaborator

NohTow commented Jan 21, 2025

Yeah, I messed up when copying the boilerplate; you need to pass documents_ids when adding to the index:
index.add_documents(documents_embeddings=documents_embeddings, documents_ids=["1", "2"])

Refer to the documentation for more examples. Apart from the corruption that might have happened at first, if you clean everything and run the boilerplate, it should work fine.
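
Since each embedding has to be paired with an ID, one cheap safeguard is to validate the ID list before calling add_documents. The sketch below is an assumption-laden illustration, not pylate behavior: the helper name `check_documents_ids` is hypothetical, and converting IDs to strings simply mirrors the `["1", "2"]` example above.

```python
def check_documents_ids(documents_ids, documents_embeddings):
    """Return the IDs as a list of strings, or raise ValueError.

    Hypothetical pre-flight check: verifies that there is exactly one ID per
    embedding and that no ID is repeated.
    """
    ids = [str(doc_id) for doc_id in documents_ids]
    if len(ids) != len(documents_embeddings):
        raise ValueError(
            f"got {len(ids)} ids for {len(documents_embeddings)} embeddings"
        )
    if len(set(ids)) != len(ids):
        raise ValueError("documents_ids contains duplicates")
    return ids


# Example usage with the call shown in this thread:
# ids = check_documents_ids(["1", "2"], documents_embeddings)
# index.add_documents(documents_ids=ids, documents_embeddings=documents_embeddings)
```

This does not explain the corruption reported earlier; it only rules out a mismatch between IDs and embeddings as a cause.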

@NohTow NohTow self-assigned this Jan 27, 2025
@NohTow NohTow added the bug Something isn't working label Jan 27, 2025