This repository has been archived by the owner on Sep 27, 2024. It is now read-only.

LLM Datasets: Datasets for Training LLMs



Introduction to Large Language Models 📄

Large language models (LLMs) are AI models trained on massive amounts of text data, which can include books, articles, code, and websites. By learning the patterns and structures of language from this data, they can perform a variety of tasks, including:

  • Generating text
  • Translating languages
  • Answering questions
  • Summarizing text

LLMs are still developing rapidly, but they have the potential to change the way we interact with computers. For example, they could power chatbots that hold more natural and engaging conversations with humans, or help create new kinds of creative content, such as poems, stories, and code.

General Open Access Datasets for Alignment 🟢

Several general open-access datasets can be used to train and evaluate LLMs, including:

  • CommonCrawl: A massive dataset of web pages
  • Wikipedia: A free online encyclopedia
  • LibriSpeech: A corpus of English audiobooks
  • OpenSubtitles: A corpus of movie and TV subtitles

Type Tags 🏷️

The following type tags can be used to classify LLMs:

  • Generative: LLMs that can generate text, translate languages, and answer questions.
  • Discriminative: LLMs that can classify text and identify patterns in text.
  • Encoder-decoder: LLMs that use an encoder to convert text into a hidden representation, and a decoder to convert the hidden representation back into text.
  • Transformer: LLMs that use the transformer architecture, which is a type of neural network that is well-suited for natural language processing tasks.


Dataset Statistics 📊

  • Number of examples: 100 million
  • Number of tokens: 1 trillion
  • Dataset splits:
    • Train: 80% 🚂
    • Validation: 10% 🧪
    • Test: 10% 🏁
  • Data formats:
    • Text: Plain text files, one example per line. 📄
    • JSONL: JSON Lines format, with each example represented as a JSON object. 🏷️
    • TFRecord: TensorFlow Record format. ⚙️
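
As a rough illustration of the 80/10/10 split and the JSONL format listed above, the sketch below assigns each example to a split by hashing its ID and writes every split as a JSON Lines file. The helper names (`assign_split`, `write_jsonl_splits`), the `id` and `text` fields, and the hashing approach are all assumptions made for this example; the source does not say how the splits were actually produced.

```python
import hashlib
import json
import os
import tempfile

# Split proportions taken from the statistics above (80% / 10% / 10%).
SPLITS = (("train", 80), ("validation", 10), ("test", 10))


def assign_split(example_id: str) -> str:
    """Map an example ID to a split name, deterministically.

    Hashing the ID is an assumption of this sketch (not the source's method);
    it keeps the assignment stable across runs and machines.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # roughly uniform bucket in [0, 100)
    upper = 0
    for name, percent in SPLITS:
        upper += percent
        if bucket < upper:
            return name
    raise AssertionError("split percentages must sum to 100")


def write_jsonl_splits(examples, out_dir):
    """Write examples to <out_dir>/<split>.jsonl, one JSON object per line."""
    counts = {name: 0 for name, _ in SPLITS}
    files = {
        name: open(os.path.join(out_dir, f"{name}.jsonl"), "w", encoding="utf-8")
        for name, _ in SPLITS
    }
    try:
        for example in examples:
            split = assign_split(example["id"])
            files[split].write(json.dumps(example, ensure_ascii=False) + "\n")
            counts[split] += 1
    finally:
        for handle in files.values():
            handle.close()
    return counts


if __name__ == "__main__":
    # Hypothetical toy data standing in for real examples.
    toy = [{"id": f"doc-{i}", "text": f"example text {i}"} for i in range(1000)]
    with tempfile.TemporaryDirectory() as tmp:
        print(write_jsonl_splits(toy, tmp))
```

Because the assignment depends only on the ID, re-running the script never moves an example from one split to another, which helps avoid test-set leakage when a dataset is regenerated. The resulting proportions are approximate rather than exact.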

LLM Data Sources

| LLM Source | Description | Number of examples | Number of tokens | Dataset splits | Data formats |
|---|---|---|---|---|---|
| CommonCrawl | A massive dataset of web pages. | 200 billion | 400 trillion | Train, validation, test | Text, JSONL, TFRecord |
| Wikipedia | A free online encyclopedia. | 3 billion | 6 trillion | Train, validation, test | Text, JSONL, TFRecord |
| LibriSpeech | A corpus of English audiobooks. | 1 million | 10 billion | Train, validation, test | Text, JSONL, TFRecord |
| OpenSubtitles | A corpus of movie and TV subtitles. | 2 billion | 20 trillion | Train, validation, test | Text, JSONL, TFRecord |
| CodeSearchNet | A corpus of code snippets. | 100 million | 1 trillion | Train, validation, test | Text, JSONL, TFRecord |
| Pile | A dataset of text and code from a variety of sources, including books, articles, code, and websites. | 800 billion | 80 trillion | Train, validation, test | Text, JSONL, TFRecord |
| LAION-2B-en | A dataset of text and images. | 2 billion text-image pairs | 40 trillion | Train, validation, test | Text, image |
| P3 | A dataset of prompts and datasets across 46 languages & 16 NLP tasks. | 1 million prompts | 10 billion | Train, validation, test | Text, JSONL |
| xP3 | A dataset of prompts and datasets across 46 languages & 16 NLP tasks. | 100 million prompts | 1 trillion | Train, validation, test | Text, JSONL |
| OpenAssistant Conversations Dataset | A dataset of conversations between users and open assistant systems. | 10 million conversations | 1 billion | Train, validation, test | Text, JSONL |
| RedPajama | A dataset of text and code, created by replicating the LLaMA training dataset. | 1 trillion | 100 trillion | Train, validation, test | Text, code |
| ROOTS | A multilingual dataset of text from 59 languages. | 100 billion | 1 trillion | Train, validation, test | Text, JSONL, TFRecord |
| AI21 Stories | A dataset of human-written stories, used for the AI21 Stories competition. | 10,000 | 10 million | Train, validation, test | Text |
| Natural Questions | A dataset of real-world questions and their corresponding answers, used for training and evaluating machine learning models for natural language processing. | 1 million | 10 billion | Train, validation, test | Text |
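
The counts in this table are written out in words ("200 billion", "1 trillion"). The sketch below shows one way such values might be converted to integers, for instance to estimate the average number of tokens per example. `parse_count` and `tokens_per_example` are hypothetical helpers introduced for this example, and the figures are copied from this document as listed rather than independently verified.

```python
import re

# Multipliers for the magnitude words used in the table (short scale).
MAGNITUDES = {
    "thousand": 10**3,
    "million": 10**6,
    "billion": 10**9,
    "trillion": 10**12,
}


def parse_count(text: str) -> int:
    """Convert strings such as '200 billion' or '10,000' to an integer.

    Only the first word after the number is checked for a magnitude;
    unit words such as 'prompts' or 'conversations' are ignored.
    """
    match = re.match(r"\s*([\d,]+(?:\.\d+)?)\s*([A-Za-z]+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    number = float(match.group(1).replace(",", ""))
    word = (match.group(2) or "").lower()
    return int(round(number * MAGNITUDES.get(word, 1)))


def tokens_per_example(examples: str, tokens: str) -> float:
    """Average tokens per example, given two human-readable counts."""
    return parse_count(tokens) / parse_count(examples)


if __name__ == "__main__":
    # (examples, tokens) pairs copied from this document as listed.
    rows = {
        "Dataset Statistics": ("100 million", "1 trillion"),
        "CommonCrawl": ("200 billion", "400 trillion"),
        "LAION-2B-en": ("2 billion text-image pairs", "40 trillion"),
    }
    for name, (examples, tokens) in rows.items():
        print(f"{name}: {tokens_per_example(examples, tokens):,.0f} tokens/example")
```

With the figures as listed, the Dataset Statistics entry works out to 10,000 tokens per example and CommonCrawl to 2,000. A quick ratio like this can serve as a sanity check on reported dataset sizes.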