Customized text chunking #3251

Closed
sid8491 opened this issue May 11, 2023 · 4 comments

sid8491 commented May 11, 2023

How do I write my own logic for text chunking?
Which classes do I need to extend, and how do I return the final chunking output?

Any examples/documentation would be appreciated.

@logan-markewich
Collaborator

You can create your own chunking logic by extending the text splitter.

The text splitter is then passed into the node parser, which in turn is passed into the service context.

Text Splitter:
https://github.com/jerryjliu/llama_index/blob/main/llama_index/langchain_helpers/text_splitter.py

Node Parser:
https://github.com/jerryjliu/llama_index/blob/main/llama_index/node_parser/simple.py
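A minimal sketch of what custom chunking logic could look like, written as a standalone class so it runs without the library installed. The class name `ParagraphSplitter` and the `max_chars` parameter are hypothetical. The `split_text` method name is chosen on the assumption that it mirrors the interface of the text splitter linked above; check the linked source for the exact base class and signature before subclassing.

```python
from typing import List


class ParagraphSplitter:
    """Hypothetical splitter: packs whole paragraphs into chunks of at most max_chars."""

    def __init__(self, max_chars: int = 200) -> None:
        self.max_chars = max_chars

    def split_text(self, text: str) -> List[str]:
        # Paragraphs are assumed to be separated by blank lines.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks: List[str] = []
        current = ""
        for para in paragraphs:
            candidate = f"{current}\n\n{para}" if current else para
            if len(candidate) <= self.max_chars or not current:
                # Either the paragraph fits, or it starts a new chunk.
                # An oversized single paragraph is kept whole rather than cut.
                current = candidate
            else:
                # Adding this paragraph would overflow: flush and start a new chunk.
                chunks.append(current)
                current = para
        if current:
            chunks.append(current)
        return chunks


if __name__ == "__main__":
    splitter = ParagraphSplitter(max_chars=20)
    for chunk in splitter.split_text("aaaa\n\nbbbb\n\ncccc cccc cccc cccc cccc\n\ndd"):
        print(repr(chunk))
```

Following the description above, an instance of such a splitter would then be handed to the node parser, and the node parser to the service context. The constructor arguments for those two steps depend on the installed version and are not shown here.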

Author

sid8491 commented May 14, 2023

Can you give an example of how to do it?

@logan-markewich
Collaborator

The best example is definitely the source code.

I'm not sure what you have in mind, though. It might be possible to achieve what you want with other methods.

dreamshit commented May 16, 2023

I don't know if this is what you want, but this is how I achieved it:

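from pathlib import Path
from typing import Dict

# Import paths below are assumed for the llama_index release current in May 2023.
from llama_index import SimpleDirectoryReader
from llama_index.readers.file.base_parser import BaseParser
from llama_index.readers.file.tabular_parser import CSVParser


class TxtParser(BaseParser):

    def _init_parser(self) -> Dict:
        # No parser state to set up.
        return {}

    def parse_file(self, file: Path, errors: str = "ignore") -> str:
        # Custom parsing/chunking logic goes here (left as a stub in the original).
        pass


# Map file extensions to the parser that should handle them.
METIS_FILE_EXTRACTOR: Dict[str, BaseParser] = {
    ".csv": CSVParser(concat_rows=False),
    ".txt": TxtParser(),
}

# An input_dir or input_files argument is also required; it is omitted in the original.
documents = SimpleDirectoryReader(file_extractor=METIS_FILE_EXTRACTOR).load_data()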
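The `parse_file` stub above leaves the actual chunking open. As one possible way to fill it in, here is a sketch of a standalone helper that reads a text file and splits it on delimiter lines. The function name `chunk_text_file` and the `"---"` separator are hypothetical. Whether the reader turns a returned list into one document per chunk is an assumption about the installed version, so verify that before calling this from `parse_file`.

```python
from pathlib import Path
from typing import List


def chunk_text_file(file: Path, separator: str = "---", errors: str = "ignore") -> List[str]:
    """Split a text file into chunks at lines consisting only of `separator`."""
    text = file.read_text(encoding="utf-8", errors=errors)
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == separator:
            # Delimiter line: close the chunk collected so far.
            chunks.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    # Close the trailing chunk after the last delimiter.
    chunks.append("\n".join(current).strip())
    # Drop chunks that were empty or whitespace only.
    return [c for c in chunks if c]
```

A custom parser could delegate to this helper from its `parse_file` method, for example `return chunk_text_file(file, errors=errors)`, with the return annotation adjusted to match.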
