Extracts HTML content from a JSON file to produce a Markdown file. Uses a similarity threshold to remove redundant content.
Building retrieval-augmented generation (RAG) applications can be a lengthy process. While there are web crawlers to collect content, post-processing that content is equally important for accurate and helpful generation.
This library was built specifically to augment the context-curator project by further automating the document creation process.
- Installation
- To have access to the package in your local environment (your working directory), clone the repository using git:
    git clone https://github.com/daethyra/context-converter.git
- To install via pip, run:
    pip install context-converter
Optional: Run jina_embeddings.py to download the embeddings model ahead of time.
- Navigate into the context-converter folder:
    cd context-converter
- Place a JSON file of HTML content into the same folder.
- Run:
    python3 main.py
Your output file will be created in the same folder.
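The overall idea (HTML stored in JSON goes in, Markdown comes out) can be sketched with the standard library alone. This is only an illustration, not the code in main.py: the JSON schema is not described here, so the `url` and `html` keys, the `TextExtractor` class, and the `records_to_markdown` helper are all hypothetical.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.parts.append(text)


def html_to_text(html):
    """Return the visible text of an HTML fragment, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def records_to_markdown(records):
    """Assumed schema: a list of {"url": ..., "html": ...} objects."""
    sections = [
        f"## {record['url']}\n\n{html_to_text(record['html'])}"
        for record in records
    ]
    return "\n\n".join(sections) + "\n"


# Hypothetical input, inlined here instead of being read from a file.
sample = json.loads(
    '[{"url": "https://example.com", '
    '"html": "<h1>Title</h1><p>Hello <b>world</b></p><script>x=1</script>"}]'
)
markdown = records_to_markdown(sample)
print(markdown)
```

Each record becomes one Markdown section headed by its URL; the real tool may structure its output differently.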
You can tune the similarity threshold and other parameters to control what ends up in the output.
i. In main.py, you can set the following parameters to optimize your results:
- chunk_size: The size of the chunk to be processed. The default value is 256.
- You can find speed tests here.
ii. In converter.py, you can set the following parameters to optimize your results:
- similarity.item(): The similarity threshold. The default value is 0.868899. Lines whose similarity is above the threshold are removed, so a higher threshold removes less content and a lower threshold removes more.
- batch_size: Processes embeddings for the given lines in batches. The default value is 16, which has proved to be faster than higher values, up to 256. Speed test results.
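To make the effect of the threshold and the batch size concrete, here is a minimal sketch of threshold-based deduplication. It is not the code in converter.py: the `embed` function is a toy bag-of-words stand-in for the real embeddings model, and `cosine`, `batches`, and `drop_redundant` are hypothetical helper names. Only the two default values come from the text above.

```python
import math
from collections import Counter

SIMILARITY_THRESHOLD = 0.868899  # default quoted above
BATCH_SIZE = 16                  # default quoted above


def embed(line):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(line.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def drop_redundant(lines, threshold=SIMILARITY_THRESHOLD, batch_size=BATCH_SIZE):
    """Keep a line only if it is not too similar to any line already kept."""
    kept = []  # list of (line, vector) pairs
    for batch in batches(lines, batch_size):
        # A real model would embed the whole batch in a single call here.
        vectors = [embed(line) for line in batch]
        for line, vec in zip(batch, vectors):
            if all(cosine(vec, kept_vec) <= threshold for _, kept_vec in kept):
                kept.append((line, vec))
    return [line for line, _ in kept]


lines = [
    "Install the package with pip",
    "Install the package with pip",
    "Place a JSON file in the folder",
]
print(drop_redundant(lines))                 # the exact duplicate is dropped
print(drop_redundant(lines, threshold=1.5))  # nothing can exceed 1.5, so all lines stay
```

Raising the threshold past the highest possible cosine similarity keeps every line, which matches the rule that a higher threshold removes less content.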