Datasets

This page documents the official and community datasets featuring tldr-pages.

Official Datasets

We generate datasets in CSV, XML, JSON, and TMX (Translation Memory eXchange) formats using the https://github.com/tldr-pages/tldr-translation-pairs-gen tool. They can be found under its latest release. These artifacts are also available from the sources below:

OPUS tldr-pages Dataset (TMX format)
- OPUS is a public collection of translated texts from the web. All translations are derived from freely available, openly licensed sources, so they can be used with minimal restrictions.
- These datasets are helpful for a variety of applications such as research and machine learning.
- A notable project that uses the OPUS corpora is LibreTranslate (which is powered by argos-translate).
Kaggle Translation Pairs Dataset (CSV format)
- Kaggle is a data science competition platform and an online community of data scientists and machine learning practitioners, owned by Google LLC.
- It is popular among students and data scientists.
- This multilingual text dataset contains paired strings that map each tldr-pages entry to its localized versions.
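
As a rough illustration of how the TMX files mentioned above might be consumed, the sketch below reads translation pairs using only Python's standard library. It relies on the generic TMX layout (`tu` units containing `tuv` variants, each with an `xml:lang` attribute and a `seg` child). The sample document, the `read_pairs` helper, and the language codes are hypothetical and chosen for illustration; the real files may use different language codes (for example `pt_BR`), so check a downloaded file before relying on this.

```python
import io
import xml.etree.ElementTree as ET

# xml:lang sits in the reserved XML namespace; ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample, invented for illustration; not taken from the real dataset.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>List all files</seg></tuv>
      <tuv xml:lang="fr"><seg>Lister tous les fichiers</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Show the version</seg></tuv>
      <tuv xml:lang="de"><seg>Zeige die Version</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for units that contain both languages.

    `source` is a file path or a binary file-like object holding TMX data.
    """
    # iterparse streams the file, so large corpora need not fit in memory.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")  # older TMX uses plain "lang"
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # release the processed unit


pairs = list(read_pairs(io.BytesIO(SAMPLE_TMX.encode("utf-8")), "en", "fr"))
for source_text, target_text in pairs:
    print(f"{source_text} -> {target_text}")
```

To read a downloaded file instead of the inline sample, pass its path to `read_pairs` in place of the `BytesIO` object. Units missing either language are skipped, which is why the second sample unit (English and German only) does not appear when English and French are requested.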