Language data chef (recipes for acquiring, cleaning & prepping data for use with LLMs)

When working with LLMs, at any stage, there may be no greater engineering necessity than acquiring and preparing high-quality data. Many complex projects try to abstract this process away, but we believe the work is too case-specific for that. Instead, we offer a book of quick recipes you can build on in your own LLM integration code.

HTML from the Web to Markdown

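!pip install httpx html2text
import html2text
import httpx

# Fetch the page; TLS certificate verification stays on (the httpx default)
with httpx.Client(follow_redirects=True) as client:
    resp = client.get('https://en.wikipedia.org/wiki/Igbo_people')
    resp.raise_for_status()  # Fail early on HTTP 4xx/5xx responses
    html = resp.text  # Body decoded by httpx using the response's charset

# Convert the HTML to Markdown text
text = html2text.html2text(html)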

For conversion options and other details, see the html2text project: https://github.com/Alir3z4/html2text
