
[Question]: how to control the chunking of documents according to chapter number etc? #12605

Closed
1 task done
justinzyw opened this issue Apr 5, 2024 · 6 comments
Labels
question Further information is requested

Comments

@justinzyw

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

By analysing the docstore, I realized that the end of one chapter and the start of the next chapter could be chunked into a single node. As a result, the LLM misunderstood the content of the node, tried to mix the two parts together, and generated incorrect responses.

Is there a way to resolve this problem please?

@justinzyw justinzyw added the question Further information is requested label Apr 5, 2024

dosubot bot commented Apr 5, 2024

Hey @justinzyw, good to see you back! Looks like we've got another puzzle on our hands. I'm on it and will get back to you with some insights shortly.

@justinzyw
Author

I wonder whether there is a way to chunk according to chapter numbers? Say, chunk at the end of chapter 1, chapter 2, etc.


dosubot bot commented Apr 5, 2024

To address the issue of chunking documents by chapter, you can leverage the hierarchical structure of the document in the DocugamiReader class within the llama_index/readers/docugami/base.py file. By default, the parent_hierarchy_levels attribute is set to 0, meaning it does not include parent chunk information in the chunk metadata. Adjusting this attribute to include parent information allows for the creation of chunks that are aware of their hierarchical relationship within the document, thus keeping headings and their content together.

Here's a conceptual adjustment to the DocugamiReader class:

class DocugamiReader(BaseReader):
    parent_hierarchy_levels: int = 1  # adjusted to include parent chunk information

    def _build_framework_chunk(self, dg_chunk: Chunk) -> Document:
        # ... (in the real method, the Document and its metadata dict are built first)
        if dg_chunk.parent:
            framework_parent_chunk = self._build_framework_chunk(dg_chunk.parent)
            parent_id = framework_parent_chunk.metadata.get(ID_KEY)
            if parent_id:
                metadata[self.parent_id_key] = parent_id

This ensures each chunk is aware of its parent, allowing for the grouping of headings with their corresponding content.

Additionally, consider implementing a custom sentence splitter that identifies chapter breaks, to be used with the SemanticSplitterNodeParser. Here's an example:

import re
from typing import List

from llama_index.core.node_parser.text.utils import split_by_sentence_tokenizer


def custom_sentence_and_chapter_splitter(text: str) -> List[str]:
    # Split at chapter headings first, so that no sentence spans a chapter
    # boundary, then split each section into sentences.
    chapter_pattern = re.compile(r'(?=\bChapter \d+\b)', re.IGNORECASE)
    sentence_splitter = split_by_sentence_tokenizer()  # factory returning a callable
    sentences: List[str] = []
    for section in chapter_pattern.split(text):
        if section.strip():
            sentences.extend(sentence_splitter(section))
    return sentences

semantic_splitter = SemanticSplitterNodeParser.from_defaults(
    sentence_splitter=custom_sentence_and_chapter_splitter,
)

This custom splitter ensures that the end of one chapter and the start of the next are not combined into a single node by respecting chapter boundaries when chunking the document.

By adjusting the DocugamiReader class and implementing a custom sentence splitter, you can effectively chunk documents according to chapter numbers, ensuring that headings and their content are kept together and chapter boundaries are respected.


@justinzyw
Author

justinzyw commented Apr 5, 2024

@dosu-bot. Thanks for the suggestion. Unfortunately, the semantic splitter says it only works with English, which does not fulfil my need to support multiple languages. I also noticed that there are a MarkdownNodeParser and an HTMLNodeParser. What if I convert my documents into HTML or Markdown format? Would that give me control over the chunking by explicitly setting heading levels?

@logan-markewich
Collaborator

@justinzyw you'd have to have a way of identifying where chapter 1, chapter 2, etc. starts and ends, which is non-trivial

@justinzyw
Author

> @justinzyw you'd have to have a way of identifying where chapter 1, chapter 2, etc. starts and ends, which is non-trivial

True. I found that I could use Markdown format to ensure the breaks between chapters. Trivial, but it works. Thanks for your suggestion.
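The Markdown route above can be sketched without any library: split the text at top-level headings so each chapter (heading plus body) becomes its own chunk. This is a minimal, standalone illustration with a made-up function name; in llama_index itself, MarkdownNodeParser performs the same heading-aware splitting and also records the headings as node metadata.

```python
import re
from typing import List


def split_markdown_by_heading(text: str, level: int = 1) -> List[str]:
    """Split Markdown text into chunks at headings of the given level."""
    # The zero-width lookahead keeps each heading line attached to the
    # chunk that follows it, so a chunk never straddles two chapters.
    pattern = re.compile(rf'(?=^{"#" * level} )', re.MULTILINE)
    return [chunk.strip() for chunk in pattern.split(text) if chunk.strip()]


doc = """# Chapter 1
Intro text.
# Chapter 2
More text."""

chunks = split_markdown_by_heading(doc)
# chunks[0] contains only Chapter 1; chunks[1] contains only Chapter 2.
```

Each chunk can then be wrapped in a Document and indexed as usual, with the guarantee that no node mixes the end of one chapter with the start of the next.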
