
[Question]: how to control the chunking of documents according to chapter number etc? #12605

Closed
1 task done
justinzyw opened this issue Apr 5, 2024 · 6 comments
Labels
question Further information is requested

Comments

@justinzyw

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

By analysing the docstore, I realized that the end of one chapter and the start of the next chapter could be chunked into a single node. As a result, the LLM misunderstood the content of the node, tried to mix the two parts together, and generated incorrect responses.

Is there a way to resolve this problem please?

@justinzyw justinzyw added the question Further information is requested label Apr 5, 2024

dosubot bot commented Apr 5, 2024

Hey @justinzyw, good to see you back! Looks like we've got another puzzle on our hands. I'm on it and will get back to you with some insights shortly.

@justinzyw
Author

I wonder whether there is a way to chunk according to chapter numbers? Say, chunk at the end of chapter 1, chapter 2, etc.


dosubot bot commented Apr 5, 2024

To address the issue of chunking documents by chapter, you can leverage the hierarchical structure of the document in the DocugamiReader class within the llama_index/readers/docugami/base.py file. By default, the parent_hierarchy_levels attribute is set to 0, meaning it does not include parent chunk information in the chunk metadata. Adjusting this attribute to include parent information allows for the creation of chunks that are aware of their hierarchical relationship within the document, thus keeping headings and their content together.

Here's a conceptual adjustment to the DocugamiReader class:

class DocugamiReader(BaseReader):
    parent_hierarchy_levels: int = 1  # adjusted to include parent chunk information

    def _build_framework_chunk(self, dg_chunk: Chunk) -> Document:
        # ... (in the real method, the Document and its metadata dict are built first)
        if dg_chunk.parent:
            framework_parent_chunk = self._build_framework_chunk(dg_chunk.parent)
            parent_id = framework_parent_chunk.metadata.get(ID_KEY)
            if parent_id:
                metadata[self.parent_id_key] = parent_id

This ensures each chunk is aware of its parent, allowing for the grouping of headings with their corresponding content.

Additionally, consider implementing a custom sentence splitter that identifies chapter breaks, to be used with the SemanticSplitterNodeParser. Here's an example:

import re
from typing import List

from llama_index.core.node_parser.text.utils import split_by_sentence_tokenizer


def custom_sentence_and_chapter_splitter(text: str) -> List[str]:
    # Split at chapter headings first, so that no sentence spans a chapter
    # boundary, then split each section into sentences.
    chapter_pattern = re.compile(r'(?=\bChapter \d+\b)', re.IGNORECASE)
    sentence_splitter = split_by_sentence_tokenizer()  # factory returning a callable
    sentences: List[str] = []
    for section in chapter_pattern.split(text):
        if section.strip():
            sentences.extend(sentence_splitter(section))
    return sentences

semantic_splitter = SemanticSplitterNodeParser.from_defaults(
    sentence_splitter=custom_sentence_and_chapter_splitter,
)

This custom splitter ensures that the end of one chapter and the start of the next are not combined into a single node by respecting chapter boundaries when chunking the document.

By adjusting the DocugamiReader class and implementing a custom sentence splitter, you can effectively chunk documents according to chapter numbers, ensuring that headings and their content are kept together and chapter boundaries are respected.


@justinzyw
Author

justinzyw commented Apr 5, 2024

@dosu-bot. Thanks for the suggestion. Unfortunately, the semantic splitter says it only works with English, which does not fulfil my need to support multiple languages. I also noticed that there are a MarkdownNodeParser and an HTMLNodeParser. What if I convert my documents into HTML or Markdown format? Would that give me control over the chunking by explicitly setting heading levels?

@logan-markewich
Collaborator

@justinzyw you'd have to have a way of identifying where chapter 1, chapter 2, etc. starts and ends, which is non-trivial

@justinzyw
Author

> @justinzyw you'd have to have a way of identifying where chapter 1, chapter 2, etc. starts and ends, which is non-trivial

True. I found that I could use Markdown format to ensure the breaks between chapters. Trivial, but it works. Thanks for your suggestion.
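The Markdown route above can be sketched without any library: split the text at top-level headings so each chapter (heading plus body) becomes its own chunk. This is a minimal, standalone illustration with a made-up function name; in llama_index itself, MarkdownNodeParser performs the same heading-aware splitting and also records the headings as node metadata.

```python
import re
from typing import List


def split_markdown_by_heading(text: str, level: int = 1) -> List[str]:
    """Split Markdown text into chunks at headings of the given level."""
    # The zero-width lookahead keeps each heading line attached to the
    # chunk that follows it, so a chunk never straddles two chapters.
    pattern = re.compile(rf'(?=^{"#" * level} )', re.MULTILINE)
    return [chunk.strip() for chunk in pattern.split(text) if chunk.strip()]


doc = """# Chapter 1
Intro text.
# Chapter 2
More text."""

chunks = split_markdown_by_heading(doc)
# chunks[0] contains only Chapter 1; chunks[1] contains only Chapter 2.
```

Each chunk can then be wrapped in a Document and indexed as usual, with the guarantee that no node mixes the end of one chapter with the start of the next.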
