Skip to content

Add natural language filtering and sorting #85

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Adityav369
Copy link
Collaborator

@Adityav369 Adityav369 commented Apr 15, 2025

Summary by mrge

Added natural language filtering and sorting capabilities for document queries. This feature enables users to search and organize documents using simple English expressions instead of complex query syntax.

  • New Features

    • Built a smart query engine that interprets natural language filters like "show me only PDF documents" or "documents created after January 2023"
    • Added sorting capabilities using expressions like "sort by creation date" or "sort by relevance"
    • Created a fluent query builder API for both sync and async Python clients
    • Implemented two-stage filtering that first checks metadata before examining content for efficiency
  • Refactors

    • Enhanced document model to better support content-based filtering
    • Added configuration options for the smart query LLM model

Co-authored-by: Arnav Agrawal <aa779@cornell.edu>
Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to 045cb8e in 2 minutes and 17 seconds

More details
  • Looked at 1769 lines of code in 11 files
  • Skipped 0 files when reviewing.
  • Skipped posting 13 drafted comments based on config settings.
1. sdks/python/morphik/sync.py:1460
  • Draft comment:
    Ensure proper exception handling when parsing responses from _request. Currently response.json() is used without catching JSONDecodeError, which could lead to uninformative exceptions if the server returns unexpected content.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
2. sdks/python/morphik/sync.py:1510
  • Draft comment:
    Similar file handling code appears repeatedly (e.g. ingest_file, ingest_files). Consider refactoring file opening/closing into a shared utility function or using a context manager (e.g. contextlib.closing) to ensure files are always properly closed.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. sdks/python/morphik/sync.py:2190
  • Draft comment:
    When updating document metadata via update_document_by_filename_metadata, the code makes a subsequent call to update the filename by calling the update_text endpoint. Ensure that this two-step update consistently preserves the metadata, or consider offering a dedicated endpoint for metadata+filename update.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. sdks/python/morphik/sync.py:1290
  • Draft comment:
    HTTP request header management: the code manually sets Content-Type header for JSON requests. Consider leveraging httpx's built-in JSON support (passing json=data) to avoid manual header management and potential mismatches.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
5. sdks/python/morphik/sync.py:1388
  • Draft comment:
    For file ingestion methods, consider adding more robust error handling around file I/O (e.g. file not found or permission errors) and ensuring that opened file objects are always closed, possibly using contextlib.ExitStack for multiple files.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
6. sdks/python/morphik/sync.py:1720
  • Draft comment:
    Using POST for listing documents and batch retrieval may not be RESTful. If this is intentional for performance reasons, ensure API documentation clearly explains these non-standard choices.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
7. sdks/python/morphik/sync.py:1385
  • Draft comment:
    Avoid duplicating folder and end-user scoping across multiple methods. Consider extracting a helper method that adds these parameters to every request payload, reducing repeated conditionals.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
8. sdks/python/morphik/sync.py:1440
  • Draft comment:
    Repeated assignment of doc._client = self after parsing document responses could be centralized in the parsing helper. This would ensure consistency and reduce boilerplate.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
9. sdks/python/morphik/sync.py:2502
  • Draft comment:
    In delete_document_by_filename, consider handling the case where the document isn’t found. Currently it proceeds to delete without checking if the parsed document is valid.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
10. sdks/python/morphik/sync.py:1290
  • Draft comment:
    Consider wrapping requests in try/except blocks or providing a custom exception for HTTP errors to offer friendlier error messages for client‐users.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
11. sdks/python/morphik/sync.py:521
  • Draft comment:
    In list_documents, the folder is not passed as a parameter (end_user_id remains None). Verify whether this behavior is intentional or if folder scoping should be enforced.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
12. sdks/python/morphik/sync.py:1869
  • Draft comment:
    Several update methods use 'update_strategy' with only the 'add' option supported. Consider documenting limitations or planning extension for other strategies in a TODO.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
13. sdks/python/morphik/sync.py:1270
  • Draft comment:
    Overall, consider consolidating duplicate logic between async and sync clients to a common base (e.g. shared _request logic and helpers) to ease maintenance and avoid subtle discrepancies between implementations.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_kQhmXAWObclR3qqK


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link

@mrge-io mrge-io bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mrge found 28 issues across 11 files. View them in mrge.io

log_prefix="Metadata filter evaluation response",
log_suffix=f"for {document.external_id}",
)
return response.result
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No validation of LLM result against expected range of values before returning

Returns:
Self for method chaining
"""
self._filter_predicate = predicate
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filter() method doesn't validate that the predicate parameter is not empty or None before assigning it

end_user_id=self._end_user_id
)

response = await self._client._request("POST", "smart_query", data=request)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The execute() method directly returns the API response without parsing or validating it matches the expected List[str] return type

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants