
Fix QdrantClient Document import issue and improve text processing for STORM #311

Open

Tim-nocode wants to merge 1 commit into main

Conversation

Tim-nocode


Summary:

This commit updates the STORM repository to work with the latest versions of qdrant_client by:

Replacing the deprecated Document import from qdrant_client with PointStruct.
Ensuring compatibility with RecursiveCharacterTextSplitter from LangChain by converting PointStruct into LangChain Document.
Fixing potential issues with CSV parsing and content chunking before vectorization.

Key Changes:

1. Fixed incompatibility with newer qdrant_client versions

Removed:
from qdrant_client import Document
Reason: newer versions of qdrant_client no longer provide Document, so this import fails.

Added instead:
from qdrant_client.models import PointStruct

Why? PointStruct is the correct way to structure documents before inserting them into Qdrant.
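
For illustration, here is a minimal, self-contained sketch of PointStruct usage against an in-memory Qdrant instance. The collection name, vector size, and payload values are assumptions made up for this example, not part of the PR:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# In-memory client, convenient for local testing (no server required).
client = QdrantClient(":memory:")

# Hypothetical collection with tiny 3-dimensional vectors.
client.create_collection(
    collection_name="demo",
    vectors_config=VectorParams(size=3, distance=Distance.COSINE),
)

# A PointStruct bundles an id, a vector, and an arbitrary payload dict.
point = PointStruct(
    id=0,
    vector=[0.1, 0.2, 0.3],
    payload={"content": "example text", "url": "https://example.com"},
)
client.upsert(collection_name="demo", points=[point])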

2. Updated document processing to avoid conflicts with LangChain

Old version:

documents = [
    Document(
        page_content=row[content_column],
        metadata={
            "title": row.get(title_column, ""),
            "url": row[url_column],
            "description": row.get(desc_column, ""),
        },
    )
    for row in df.to_dict(orient="records")
]

New version:

documents = [
    PointStruct(
        id=index,  # Unique identifier
        vector=[],  # Empty vector (will be generated later)
        payload={
            "content": row[content_column],
            "title": row.get(title_column, ""),
            "url": row[url_column],
            "description": row.get(desc_column, ""),
        },
    )
    for index, row in enumerate(df.to_dict(orient="records"))
]

Why? This ensures compatibility with qdrant_client and allows storing metadata separately.
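
As a hedged sketch of the follow-up step implied by the empty vector field: the vectors could be filled in before upsert. Here embed is a hypothetical helper standing in for whatever embedding model the pipeline uses; it is not a function defined in this PR:

# Sketch only: embed(text) -> list[float] is a hypothetical helper.
for point in documents:
    point.vector = embed(point.payload["content"])

# Once vectors are populated, the points can be upserted directly, e.g.:
# client.upsert(collection_name=collection_name, points=documents)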

3. Fixed compatibility with LangChain's RecursiveCharacterTextSplitter

Old version:

split_documents = text_splitter.split_documents(documents)

Issue: PointStruct does not have a page_content attribute, which text_splitter requires.

Fixed version:

from langchain.schema import Document as LangchainDocument

documents_langchain = [
    LangchainDocument(
        page_content=doc.payload["content"],
        metadata=doc.payload
    )
    for doc in documents
]

split_documents = text_splitter.split_documents(documents_langchain)

Why? RecursiveCharacterTextSplitter requires page_content, which PointStruct does not have. Converting PointStruct to LangChain Document resolves this issue.
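
For completeness, an illustrative splitter configuration; the chunk sizes below are assumed values, not taken from this PR:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative parameters: ~1000-character chunks with 100 characters
# of overlap between consecutive chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
)
split_documents = text_splitter.split_documents(documents_langchain)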

4. Ensured correct CSV parsing and encoding

Added sep="|" and encoding="utf-8" in pd.read_csv():

df = pd.read_csv(file_path, sep="|", encoding="utf-8")

Why?

Prevents the failure mode where pandas, reading a pipe-delimited file with the default comma separator, treats the entire header row as a single column.

Ensures compatibility with datasets that use | as a separator.
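
A small self-contained example of the difference the separator makes; the data is made up for illustration:

import io

import pandas as pd

# Tiny pipe-delimited dataset matching the expected columns.
csv_data = (
    "content|title|url|description\n"
    "Some text|A Title|https://example.com|A short description\n"
)

# With the default sep="," this would parse as a single column;
# sep="|" recovers the four intended columns.
df = pd.read_csv(io.StringIO(csv_data), sep="|", encoding="utf-8")
print(df.columns.tolist())  # ['content', 'title', 'url', 'description']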

5. Batch processing optimization

Ensured that data is properly batched before sending to Qdrant:

num_batches = (len(split_documents) + batch_size - 1) // batch_size
for i in tqdm(range(num_batches)):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(split_documents))
    qdrant.add_documents(
        documents=split_documents[start_idx:end_idx],
        batch_size=batch_size,
    )

Why? Batching prevents timeout errors when uploading large document sets, and it keeps memory usage and API load predictable.
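
A worked example of the ceiling-division batch count, with illustrative numbers:

batch_size = 200
num_docs = 1050  # illustrative document count

# Ceiling division: (1050 + 199) // 200 == 6
num_batches = (num_docs + batch_size - 1) // batch_size
assert num_batches == 6  # five full batches of 200, plus one batch of 50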

Impact & Benefits:

✅ Fixes compatibility issues with the latest qdrant_client versions.
✅ Ensures correct document chunking for LangChain's text splitter.
✅ Prevents "Content column not found" errors in CSV parsing.
✅ Improves stability when inserting large documents into Qdrant.

This commit ensures that STORM continues to work seamlessly with Qdrant and LangChain while providing better document processing support.

Next Steps:
Review and test with additional datasets.
Consider additional optimizations for embedding model selection.
