Issue when ingesting a file and adding it to a collection right after #1435

Open
jeremi opened this issue Oct 20, 2024 · 2 comments

jeremi commented Oct 20, 2024

Describe the bug

My code is like this:

ingest_response = self.r2r.ingest_files(
    file_paths=[document.file_path],
    metadatas=[metadata]
)
document_id = ingest_response['results'][0]['document_id']
try:
    self.r2r.assign_document_to_collection(document_id, self.r2r.collection_id)
except Exception as e:
    logger.error(f"Error assigning document to collection: {str(e)}")

Oftentimes, I'll get an error when calling assign_document_to_collection because it cannot find the document in the database.

Looking at the code, the row in document_info is not created before the result of ingest_files is returned. It is created later, in the ingest-files workflow:

raw_message: dict[str, Union[str, None]] = await self.orchestration_provider.run_workflow(  # type: ignore
    "ingest-files",
    {"request": workflow_input},
    options={
        "additional_metadata": {
            "document_id": str(document_id),
        }
    },
)
raw_message["document_id"] = str(document_id)
messages.append(raw_message)

So when assign_document_to_collection is called, the document_info record may not exist yet, and this check fails:

document_check_query = f"""
    SELECT 1 FROM {self._get_table_name('document_info')}
    WHERE document_id = $1
"""
document_exists = await self.fetchrow_query(
    document_check_query, [document_id]
)
if not document_exists:
    raise R2RException(
        status_code=404, message="Document not found"
    )

How can this be solved? Would it be possible to also pass the collection_id when calling ingest_files? I noticed that this workflow already adds the file to the default collection.

emrgnt-cmplxty (Contributor) commented

We are planning on extending the ingest_files endpoint to support exactly the behavior you outline above. Several other developers have requested this exact same functionality.

As for your other question around document info creation, there is a specific reason behind this implementation. In order to properly assign a document to a collection we must update the collection ids of the underlying chunks. It would have required non-trivial engineering work to have the implementation align with what you describe, so instead we add the document to the collection after ingestion is complete.
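Until ingest_files accepts a collection directly, one client-side workaround is to retry the assignment while the server still reports the document as missing. The following is a minimal sketch, not part of R2R: `retry_until_found` and its parameters are hypothetical names, and it assumes the transient failure can be recognized by the "Document not found" message shown in the check above. It only waits for the document_info row to appear; it does not confirm that chunk ingestion has finished.

```python
import time


def retry_until_found(action, is_not_found, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    is_not_found(exc) decides whether an exception is the transient
    "document not found yet" case; any other exception is re-raised at once.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_not_found(exc):
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))


# Hypothetical usage with the client calls from the original report:
#
# retry_until_found(
#     lambda: self.r2r.assign_document_to_collection(document_id, self.r2r.collection_id),
#     is_not_found=lambda exc: "Document not found" in str(exc),
# )
```

Matching on the error message is brittle; if the client exposes the 404 status code on the exception, checking that instead would be more robust.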

jeremi commented Oct 22, 2024

But since the default collection is already added to the chunks during ingestion, couldn't the other collection_id be passed along in the workflow context, like the metadata, and added at the same time?
