Use `RecursiveJsonSplitter` when learning JSON files #1036

dlqqq · 2024-10-16T16:43:45Z

As stated in the title. Follow-up to #1024.

srdas

The recursive splitter throws the following error:

This is presumably because the recursive splitter can parse the JSON file without a chunk size requirement. If so, chunk overlap would not be needed either.

Same error with different LLMs.
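The error text itself is not shown above, so the following is only a sketch under the assumption that the failure came from passing keyword arguments (such as chunk size and chunk overlap) that the splitter's constructor does not accept. The classes `TextSplitterStub` and `JsonSplitterStub` and the helper `build_splitter` are hypothetical stand-ins for illustration, not names from jupyter-ai or LangChain.

```python
import inspect


class TextSplitterStub:
    """Hypothetical stand-in for a splitter that takes chunk_size/chunk_overlap."""

    def __init__(self, chunk_size=2000, chunk_overlap=100):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap


class JsonSplitterStub:
    """Hypothetical stand-in for a splitter with a different constructor."""

    def __init__(self, max_chunk_size=2000):
        self.max_chunk_size = max_chunk_size


def build_splitter(splitter_cls, **kwargs):
    """Instantiate splitter_cls, dropping keyword arguments its __init__ rejects."""
    params = inspect.signature(splitter_cls.__init__).parameters
    # If the constructor takes **kwargs, everything can be forwarded as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return splitter_cls(**kwargs)
    supported = {name: value for name, value in kwargs.items() if name in params}
    return splitter_cls(**supported)


# The JSON stand-in silently ignores the arguments it does not support.
json_splitter = build_splitter(JsonSplitterStub, chunk_size=500, chunk_overlap=50)
text_splitter = build_splitter(TextSplitterStub, chunk_size=500, chunk_overlap=50)
```

Filtering by signature keeps a single call site for all splitters, at the cost of hiding misspelled argument names; an explicit per-splitter argument table would be the stricter alternative.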

dlqqq · 2024-10-16T22:48:59Z

Even after dropping the arguments, the JSON splitter still raises an exception:

2024-10-16 15:38:09,520 - distributed.worker - ERROR - Compute Failed
Key:       split_document-081f610f-c434-4159-b621-79c44a8909bb
State:     executing
Function:  split_document
args:      (Document(metadata={'path': '/Volumes/workplace/jupyter-ai/package.json', 'sha256': b']\xdb\xa9Y(\x15`\xd5\x89t\xd6\xae"+&\xe1\xfe\xe0\x11\xa3G\x934\n\\y\xc3\x85U\x01\xb65', 'extension': '.json'}, page_content='{\n  "name": "@jupyter-ai/monorepo",\n  "version": "2.25.0",\n  "description": "A generative AI extension for JupyterLab",\n  "private": true,\n  "keywords": [\n    "jupyter",\n    "jupyterlab",\n    "jupyterlab-extension"\n  ],\n  "homepage": "https://github.com/jupyterlab/jupyter-ai",\n  "bugs": {\n    "url": "https://github.com/jupyterlab/jupyter-ai/issues",\n    "email": "jupyter@googlegroups.com"\n  },\n  "license": "BSD-3-Clause",\n  "author": {\n    "name": "Project Jupyter",\n    "email": "jupyter@googlegroups.com"\n  },\n  "workspaces": [\n    ".",\n    "packages/*"\n  ],\n  "scripts": {\n    "build": "lerna run build --stream",\n    "build:core": "lerna run build --stream --scope \\"@jupyter-ai/core\\"",\n    "build:prod": "lerna run build:prod --stream",\n    "clean":
kwargs:    {}
Exception: "IndexError('list index out of range')"
Traceback: '  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/directory.py", line 107, in split_document\n    return splitter.split_documents([document])\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/base.py", line 96, in split_documents\n    return self.create_documents(texts, metadatas=metadatas)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/splitter.py", line 31, in create_documents\n    for chunk in self.split_text(text, metadata):\n                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Volumes/workplace/jupyter-ai/packages/jupyter-ai/jupyter_ai/document_loaders/splitter.py", line 22, in split_text\n    return splitter.split_text(text)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 106, in split_text\n    chunks = self.split_json(json_data=json_data, convert_lists=convert_lists)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 91, in split_json\n    chunks = self._json_split(json_data)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 78, in _json_split\n    self._set_nested_dict(chunks[-1], current_path, data)\n  File "/Users/dlq/micromamba/envs/jai-test/lib/python3.11/site-packages/langchain_text_splitters/json.py", line 32, in _set_nested_dict\n    d[path[-1]] = value\n      ~~~~^^^^\n'

`RecursiveJsonSplitter` doesn't seem to be well supported: its interface differs from that of all the other splitters we use from LangChain. I'm putting this PR in draft status since there is no clear path forward. I may close it next week, or mark it as ready if I figure something out.
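One possible reading of the traceback above, offered as an assumption rather than a confirmed diagnosis: `split_text` is called with the raw `page_content` string, so the recursive walk reaches its non-dict branch with an empty path, and `d[path[-1]] = value` fails. The sketch below uses simplified stand-ins (`set_nested`, `naive_split`, both hypothetical and without any chunk-size logic) rather than the library's code, and shows that parsing the text with `json.loads` first avoids the empty-path case.

```python
import json


def set_nested(d, path, value):
    """Simplified stand-in: walk `path` into `d` and assign `value` at the last key."""
    for key in path[:-1]:
        d = d.setdefault(key, {})
    d[path[-1]] = value  # raises IndexError when `path` is empty


def naive_split(data, path=None, chunk=None):
    """Simplified recursive walk with no size limits: copies every leaf into one chunk."""
    path = path or []
    chunk = {} if chunk is None else chunk
    if isinstance(data, dict):
        for key, value in data.items():
            naive_split(value, path + [key], chunk)
    else:
        # A top-level non-dict (such as an unparsed string) arrives here with path == [].
        set_nested(chunk, path, data)
    return chunk


raw = '{"name": "@jupyter-ai/monorepo", "bugs": {"url": "https://github.com/jupyterlab/jupyter-ai/issues"}}'

try:
    naive_split(raw)  # passing the unparsed string
except IndexError as exc:
    error = repr(exc)

parsed = naive_split(json.loads(raw))  # passing the parsed dict works
```

If this reading is right, the wrapper in `splitter.py` would need to parse the document text before handing it to the JSON splitter, which is one concrete way its interface differs from the text-based splitters.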

use RecursiveJsonSplitter when learning JSON files

027a7ec

dlqqq added the enhancement New feature or request label Oct 16, 2024

srdas reviewed Oct 16, 2024


do not pass unsupported arguments to RecursiveJsonSplitter

7be57ef

dlqqq marked this pull request as draft October 16, 2024 22:49
