
Use RecursiveJsonSplitter when learning JSON files #1036

Draft
wants to merge 2 commits into
base: main
from

Conversation

dlqqq
Member

@dlqqq dlqqq commented Oct 16, 2024

As stated in the title. Follow-up to #1024.

@dlqqq dlqqq added the enhancement New feature or request label Oct 16, 2024
Collaborator

@srdas srdas left a comment

The recursive splitter throws the following error:
[image: screenshot of the error]
This is presumably because RecursiveJsonSplitter walks the JSON structure recursively and does not take a chunk size argument. If so, the chunk overlap argument is not needed either.

Same error with different LLMs.

@dlqqq
Member Author

dlqqq commented Oct 16, 2024

Even after dropping the arguments, the JSON splitter still raises an exception:

2024-10-16 15:38:09,520 - distributed.worker - ERROR - Compute Failed
Key:       split_document-081f610f-c434-4159-b621-79c44a8909bb
State:     executing
Function:  split_document
args:      (Document(metadata={'path': '/Volumes/workplace/jupyter-ai/package.json', 'sha256': b']\xdb\xa9Y(\x15`\xd5\x89t\xd6\xae"+&\xe1\xfe\xe0\x11\xa3G\x934\n\\y\xc3\x85U\x01\xb65', 'extension': '.json'}, page_content='{\n  "name": "@jupyter-ai/monorepo",\n  "version": "2.25.0",\n  "description": "A generative AI extension for JupyterLab",\n  "private": true,\n  "keywords": [\n    "jupyter",\n    "jupyterlab",\n    "jupyterlab-extension"\n  ],\n  "homepage": "https://github.com/jupyterlab/jupyter-ai",\n  "bugs": {\n    "url": "https://github.com/jupyterlab/jupyter-ai/issues",\n    "email": "jupyter@googlegroups.com"\n  },\n  "license": "BSD-3-Clause",\n  "author": {\n    "name": "Project Jupyter",\n    "email": "jupyter@googlegroups.com"\n  },\n  "workspaces": [\n    ".",\n    "packages/*"\n  ],\n  "scripts": {\n    "build": "lerna run build --stream",\n    "build:core": "lerna run build --stream --scope \\"@jupyter-ai/core\\"",\n    "build:prod": "lerna run build:prod --stream",\n    "clean":
kwargs:    {}
Exception: "IndexError('list index out of range')"
Traceback: '  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/directory.py", line 107, in split_document\n    return splitter.split_documents([document])\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/base.py", line 96, in split_documents\n    return self.create_documents(texts, metadatas=metadatas)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/splitter.py", line 31, in create_documents\n    for chunk in self.split_text(text, metadata):\n                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/splitter.py", line 22, in split_text\n    return splitter.split_text(text)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 106, in split_text\n    chunks = self.split_json(json_data=json_data, convert_lists=convert_lists)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 91, in split_json\n    chunks = self._json_split(json_data)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 78, in _json_split\n    self._set_nested_dict(chunks[-1], current_path, data)\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 32, in _set_nested_dict\n    d[path[-1]] = value\n      ~~~~^^^^\n'
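One plausible reading of the traceback, not confirmed in this thread: `page_content` is a JSON string rather than a parsed dict, so the recursive walk treats the whole document as a single leaf value while the key path is still empty, and `path[-1]` then fails. The sketch below is a simplified stand-in written for illustration, not the library's code; `walk` and `set_nested` are hypothetical names loosely modelled on the `_json_split` and `_set_nested_dict` frames above.

```python
import json
from typing import Any


def set_nested(d: dict, path: list, value: Any) -> None:
    """Assign value at the nested key path, creating intermediate dicts."""
    for key in path[:-1]:
        d = d.setdefault(key, {})
    # With an empty path this line raises IndexError, as in the traceback.
    d[path[-1]] = value


def walk(data: Any, path: list, out: dict) -> None:
    """Recursively copy leaf values of data into out, tracking the key path."""
    if isinstance(data, dict):
        for key, value in data.items():
            walk(value, path + [key], out)
    else:
        # A non-dict value is a leaf; at the top level the path is still [].
        set_nested(out, path, data)


raw = '{"name": "@jupyter-ai/monorepo", "bugs": {"email": "jupyter@googlegroups.com"}}'

# Passing the raw string: the top-level value is a leaf with an empty path.
try:
    walk(raw, [], {})
except IndexError as exc:
    print("raw string fails:", repr(exc))

# Parsing first: the walk sees a dict and every leaf has a non-empty path.
parsed_out: dict = {}
walk(json.loads(raw), [], parsed_out)
print(parsed_out)
```

If this reading is right, parsing the text with `json.loads` before handing it to the splitter would avoid the exception for object-shaped JSON files.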

RecursiveJsonSplitter doesn't seem to be well-supported: it appears to have a different interface from all the other splitters we use from LangChain. I'm putting this in draft status as there doesn't seem to be a clear path forward; I may close this next week, or mark it as ready if I figure something out.
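One possible way to bridge the interface mismatch, sketched under assumptions: wrap a dict-based JSON splitter in a small adapter that accepts text and returns text chunks, like the other splitters do. `JsonTextSplitterAdapter` and `split_by_top_level_key` are hypothetical names introduced here; the latter is only a stand-in for a real splitter so the example runs with the standard library alone.

```python
import json
from typing import Callable, List


class JsonTextSplitterAdapter:
    """Hypothetical adapter: takes JSON text, returns JSON text chunks."""

    def __init__(self, split_json: Callable[[dict], List[dict]]):
        # split_json is any callable that splits a parsed dict into dicts.
        self._split_json = split_json

    def split_text(self, text: str) -> List[str]:
        data = json.loads(text)
        if not isinstance(data, dict):
            # Top-level arrays or scalars are returned unsplit in this sketch.
            return [text]
        return [json.dumps(chunk) for chunk in self._split_json(data)]


def split_by_top_level_key(data: dict) -> List[dict]:
    """Stand-in splitter: one chunk per top-level key."""
    return [{key: value} for key, value in data.items()]


adapter = JsonTextSplitterAdapter(split_by_top_level_key)
for chunk in adapter.split_text('{"name": "x", "version": "1"}'):
    print(chunk)
```

The traceback shows the library exposing a `split_json(json_data=..., convert_lists=...)` method, so a bound `split_json` from a RecursiveJsonSplitter instance could presumably be passed in place of the stand-in; that has not been tested here.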

@dlqqq dlqqq marked this pull request as draft October 16, 2024 22:49
Labels
enhancement New feature or request
2 participants