Multi-pdf Capabilities #13

Merged
merged 2 commits into from
Nov 22, 2024


dandonarahul2002
Contributor

Enhanced Multi-PDF RAG Capabilities and Optimized Reranking

Overview

This pull request significantly improves our RAG (Retrieval-Augmented Generation) system by extending single-PDF capabilities to support multiple PDFs and implementing an optimized reranking algorithm.

Key Changes

1. Multi-PDF RAG Support

  • Modified rag-utils.ts to handle multiple PDF documents simultaneously
  • Enhanced similarity search to work across multiple vector databases
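As a rough illustration of searching across several vector databases, the sketch below pools the top chunks from one store per PDF and tags each chunk with its source file so a reranker can order the combined set. It is not the code in rag-utils.ts: `Doc`, `VectorStoreLike`, `searchAcrossStores`, and the `source` metadata key are hypothetical names, and the store interface is a simplified stand-in for whatever vector store the project uses.

```typescript
// Hypothetical stand-ins for the project's document and vector store types.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface VectorStoreLike {
  // Assumed to return the k chunks most similar to the query.
  similaritySearch(query: string, k: number): Promise<Doc[]>;
}

// Query every per-PDF store in parallel and pool the candidates.
// Each chunk is tagged with the PDF it came from; ordering across
// PDFs is left to a later reranking step.
async function searchAcrossStores(
  stores: Map<string, VectorStoreLike>,
  query: string,
  kPerStore = 4,
): Promise<Doc[]> {
  const perStore = await Promise.all(
    Array.from(stores.entries()).map(async ([pdfName, store]) => {
      const hits = await store.similaritySearch(query, kPerStore);
      return hits.map((doc) => ({
        ...doc,
        metadata: { ...doc.metadata, source: pdfName },
      }));
    }),
  );
  // Flatten the per-store arrays into one candidate list.
  return ([] as Doc[]).concat(...perStore);
}
```

Pooling first and reranking afterwards avoids comparing raw similarity scores across separate stores, which may not be on the same scale.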

2. Optimized Reranking Algorithm

Implemented a new bm25Rerank function with the following optimizations:

  • Preprocessed query terms to filter out single-character words
  • Precomputed IDF scores for improved efficiency
  • Utilized a single regex for term matching, reducing string operations
  • Implemented more efficient term frequency counting using a Map
  • Improved BM25 score calculation for better result ranking
  • Reset the RecursiveCharacterTextSplitter parameters to their default values, which gave better results in manual testing
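The optimizations listed above can be sketched roughly as follows. This is a minimal illustration, not the merged bm25Rerank: the name `bm25RerankSketch`, the `Doc` and `ScoredDoc` types, the tokenization, and the parameter values `k1 = 1.2` and `b = 0.75` (common BM25 defaults) are all assumptions.

```typescript
// Hypothetical stand-ins for the project's Document and ScoredDocument types.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}
interface ScoredDoc extends Doc {
  score: number;
}

function bm25RerankSketch(query: string, docs: Doc[], k1 = 1.2, b = 0.75): ScoredDoc[] {
  // Preprocess the query: lowercase, split on non-word characters,
  // drop single-character terms, and deduplicate.
  const terms = Array.from(
    new Set(query.toLowerCase().split(/\W+/).filter((t) => t.length > 1)),
  );
  if (terms.length === 0 || docs.length === 0) {
    return docs.map((d) => ({ ...d, score: 0 }));
  }

  // One regex matches every query term. Terms contain only word
  // characters after the split above, so no escaping is needed.
  const pattern = new RegExp(`\\b(${terms.join("|")})\\b`, "g");

  // Count term frequencies per document with a Map, in one pass over the text.
  const stats = docs.map((d) => {
    const text = d.pageContent.toLowerCase();
    const tf = new Map<string, number>();
    for (const match of text.match(pattern) ?? []) {
      tf.set(match, (tf.get(match) ?? 0) + 1);
    }
    const length = text.split(/\s+/).filter(Boolean).length;
    return { tf, length };
  });
  const avgLen = stats.reduce((sum, s) => sum + s.length, 0) / docs.length || 1;

  // Precompute IDF once per term rather than once per (term, document) pair.
  const idf = new Map<string, number>();
  for (const t of terms) {
    const df = stats.filter((s) => s.tf.has(t)).length;
    idf.set(t, Math.log(1 + (docs.length - df + 0.5) / (df + 0.5)));
  }

  // Standard BM25 scoring, then sort by descending score.
  return docs
    .map((d, i) => {
      const { tf, length } = stats[i];
      let score = 0;
      for (const t of terms) {
        const f = tf.get(t) ?? 0;
        if (f === 0) continue;
        const norm = f + k1 * (1 - b + (b * length) / avgLen);
        score += ((idf.get(t) ?? 0) * (f * (k1 + 1))) / norm;
      }
      return { ...d, score };
    })
    .sort((x, y) => y.score - x.score);
}
```

In this sketch IDF is computed over the candidate set being reranked rather than the whole corpus, which is a simplification that keeps the function self-contained.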

3. Type Safety Improvements

  • Added a new ScoredDocument interface extending Document to include a score property
  • Updated similaritySearch function to use the new bm25Rerank function, returning ScoredDocument[]
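A minimal sketch of how such an interface might look is shown below. `BaseDocument` is a hypothetical stand-in for the Document type the project imports, and `attachScores` and `topK` are illustrative helpers that do not come from the PR.

```typescript
// Hypothetical stand-in for the Document type used in rag-utils.ts.
interface BaseDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// A document plus the relevance score assigned by the reranker.
interface ScoredDocument extends BaseDocument {
  score: number;
}

// Pair each document with its score; the two arrays must be the same length.
function attachScores(docs: BaseDocument[], scores: number[]): ScoredDocument[] {
  if (docs.length !== scores.length) {
    throw new Error("docs and scores must have the same length");
  }
  return docs.map((doc, i) => ({ ...doc, score: scores[i] }));
}

// Return the k highest-scoring documents without mutating the input.
function topK(docs: ScoredDocument[], k: number): ScoredDocument[] {
  return [...docs].sort((a, b) => b.score - a.score).slice(0, k);
}
```

Because ScoredDocument extends the base type, callers that only read `pageContent` and `metadata` can keep working unchanged, while callers that care about ranking get a typed `score`.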

4. Text Splitting Adjustment

  • Reset parameters of RecursiveCharacterTextSplitter to default values based on improved results from manual testing

Performance Impact

These changes are expected to improve the accuracy of our RAG system, particularly for queries that span multiple PDFs or large document sets.

Next Steps

  • Potentially improve reranking with cross-encoders (I couldn't yet find a way to run SBERT models in JS via ONNX)
  • Explore potential for further optimizations in vector search and embedding processes

Please review these changes, paying particular attention to the reranking algorithm and multi-PDF handling logic.

@kartikm7 kartikm7 merged commit bc87ab8 into kartikm7:master Nov 22, 2024
@kartikm7
Owner

Thank you so much!

@dandonarahul2002
Contributor Author

dandonarahul2002 commented Nov 23, 2024 via email
