[Question]: how to control the chunking of documents according to chapter number etc? #12605
Comments
Hey @justinzyw, good to see you back! Looks like we've got another puzzle on our hands. I'm on it and will get back to you with some insights shortly.

wonder whether there is a way to chunk according to chapter numbers? say .... chunk at the end of chapter 1, chapter 2, etc...
To address the issue of chunking documents by chapter, you can leverage the hierarchical structure of the document in the `DocugamiReader`. Here's a conceptual adjustment to the class:

```python
class DocugamiReader(BaseReader):
    parent_hierarchy_levels: int = 1  # adjusted to include parent chunk information

    def _build_framework_chunk(self, dg_chunk: Chunk) -> Document:
        ...
        if dg_chunk.parent:
            framework_parent_chunk = self._build_framework_chunk(dg_chunk.parent)
            parent_id = framework_parent_chunk.metadata.get(ID_KEY)
            if parent_id:
                metadata[self.parent_id_key] = parent_id
```

This ensures each chunk is aware of its parent, allowing headings to be grouped with their corresponding content. Additionally, consider implementing a custom sentence splitter that identifies chapter breaks, to be used with the `SemanticSplitterNodeParser`:

```python
def custom_sentence_and_chapter_splitter(text: str) -> List[str]:
    chapter_pattern = re.compile(r'\bChapter \d+\b', re.IGNORECASE)
    sentences = split_by_sentence_tokenizer(text)
    chapters_and_sentences = []
    for sentence in sentences:
        if chapter_pattern.search(sentence):
            # A chapter heading starts a new segment, so it is not merged
            # with the tail of the previous chapter.
            chapters_and_sentences.append("\n\n" + sentence)
        else:
            chapters_and_sentences.append(sentence)
    return chapters_and_sentences


semantic_splitter = SemanticSplitterNodeParser.from_defaults(
    sentence_splitter=custom_sentence_and_chapter_splitter,
)
```

This custom splitter respects chapter boundaries when chunking the document, so the end of one chapter and the start of the next are not combined into a single node. By adjusting the `parent_hierarchy_levels` and supplying a custom sentence splitter, you gain control over where chunks break.
@dosu-bot thanks for the suggestion. Unfortunately, the semantic splitter says it only works with the English language, which does not fulfill my need to support multiple languages. I also noticed that there are a MarkdownNodeParser and an HTMLNodeParser. What if I convert my documents into HTML or MD format? Can that give me control over chunking by explicitly setting heading levels?
@justinzyw you'd have to have a way of identifying where chapter 1, chapter 2, etc. start and end, which is non-trivial
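For documents that do mark chapters with a literal "Chapter N" heading, the boundary-finding step can be sketched with the standard library alone. This is a minimal, hypothetical helper (the function name and the heading convention are assumptions, not part of any llama-index API) that splits raw text into per-chapter segments:

```python
import re
from typing import List


def split_into_chapters(text: str) -> List[str]:
    """Split raw text into per-chapter segments at 'Chapter N' headings.

    Assumes chapters are introduced by a literal 'Chapter <number>'
    marker -- identifying that marker reliably is exactly the
    non-trivial part for real documents.
    """
    # Zero-width lookahead keeps the heading attached to its own chapter.
    pattern = re.compile(r'(?=\bChapter \d+\b)', re.IGNORECASE)
    return [seg.strip() for seg in pattern.split(text) if seg.strip()]


segments = split_into_chapters(
    "Intro text. Chapter 1 begins here. More text. Chapter 2 starts now."
)
print(segments)
```

Each segment could then be fed to a sentence splitter of your choice, sidestepping the English-only limitation of the semantic splitter for the boundary-detection step.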
true. I found that I could use MD format to ensure the breaks between chapters. Trivial, but it works. Thanks for your suggestion.
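The reason the MD route works can be sketched in a few lines: markdown headings give explicit, language-independent break points. This is a stdlib illustration of the idea (the helper is hypothetical, not the actual MarkdownNodeParser implementation), splitting a markdown document at headings of a chosen level:

```python
import re
from typing import List


def split_markdown_by_heading(md: str, level: int = 1) -> List[str]:
    """Split markdown into sections at headings of the given level.

    Each '#'-heading of the chosen level starts a new section, so
    chapter boundaries in the source become hard chunk boundaries;
    deeper headings stay inside their parent section.
    """
    marker = re.escape("#" * level)
    # Match line starts followed by exactly this heading level.
    pattern = re.compile(rf'(?m)^(?={marker}\s)')
    return [s.strip() for s in pattern.split(md) if s.strip()]
```

With `level=1`, a document written as `# Chapter 1 ... # Chapter 2 ...` yields one section per chapter, with any `##` subsections kept inside their chapter, which is the behavior the thread was after.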
Question Validation
Question
By analysing the docstore, I noticed a problem: the end of a previous chapter and the start of the next chapter can be chunked into one node. As a result, the LLM misunderstands the content of the node, tries to mix the two parts together, and generates incorrect responses.
Is there a way to resolve this problem please?