Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

fix: MarkdownChunker, retain subsection headers #323

Merged

Conversation

alexchao
Copy link
Contributor

@alexchao alexchao commented Nov 8, 2024

Update MarkdownChunker to retain level 2 and level 3 Markdown headers in the chunk content for better retrieval. Previous logic was only grabbing the top-level header.

Closes #322


The LangChain MarkdownHeaderTextSplitter produces Document objects that carry around the containing header content as metadata values. There are possibly a number of ways to use this header metadata to improve the retrieval of relevant chunks, but I'm not changing the overall approach; this change just ensures that level 2 and level 3 headers are retained when this header text is added to the chunk content.

⚠️ As others have noted: I tried to follow the contribution guidelines, but the dev branch seems to pull in some other commits, so this pull request is directed to main. Also, I see there are no automated tests for the project yet. I'm happy to write tests, but didn't want to add test code for which there's no automated pipeline, so I'll skip for now.

I've tested my change manually:

doc = Document(content="...")
await MarkdownChunker().chunk(config={}, documents=[doc])
print([chunk.content for chunk in doc.chunks])

... and through the Verba UI on a local deployment by importing the Markdown doc mentioned in the issue.

Update MarkdownChunker to retain level 2 and level 3 Markdown headers in
the chunk content for better retrieval.

Closes weaviate#322
@weaviate-git-bot
Copy link

To avoid any confusion in the future about your contribution to Weaviate, we work with a Contributor License Agreement. If you agree, you can simply add a comment to this PR that you agree with the CLA so that we can merge.

beep boop - the Weaviate bot 👋🤖

PS:
Are you already a member of the Weaviate Slack channel?

@alexchao
Copy link
Contributor Author

alexchao commented Nov 9, 2024

Agree with the CLA.

@thomashacker thomashacker changed the base branch from main to dev December 6, 2024 15:09
@thomashacker
Copy link
Collaborator

Hey @alexchao, sorry for the late reply! Thanks a lot for this great PR 🚀 I'll look into it
Yeah, automated testing is still in development 🥲

Copy link
Collaborator

@thomashacker thomashacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me thanks a lot!

@thomashacker thomashacker merged commit 2547c09 into weaviate:dev Dec 6, 2024
@alexchao
Copy link
Contributor Author

alexchao commented Dec 6, 2024

@thomashacker great, thank you!

@alexchao alexchao deleted the fix/322-add-subheaders-to-md-chunks branch December 6, 2024 15:50
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MarkdownChunker drops subsection headers
3 participants