Language data chef (recipes for acquiring, cleaning & prepping data for use with LLMs)

Uche Ogbuji edited this page Jul 23, 2023 · 1 revision

Getting and preparing high-quality data may be the most important engineering task when working with LLMs, at any stage. Many complex projects try to abstract this process away, but we believe the work is too case-specific for that. Instead, we offer a book of quick recipes you can build on in your own LLM integration code.

HTML from the Web to Markdown

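# Notebook syntax; from a shell, drop the leading "!"
!pip install httpx html2text
import html2text
import httpx

# TLS certificate verification is left on (the httpx default)
with httpx.Client() as client:
    resp = client.get('https://en.wikipedia.org/wiki/Igbo_people')
    resp.raise_for_status()  # Fail early on HTTP error responses
    html = resp.text  # Body decoded using the response's charset

# Convert the HTML string to Markdown text
text = html2text.html2text(html)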

For configuration options, see the html2text project: https://github.com/Alir3z4/html2text