The Open Parallel Corpus
- website: http://opus.nlpl.eu
- github: https://github.com/Helsinki-NLP/OPUS
- contact: opus-project AT helsinki DOT fi
This repository contains information about the parallel corpora and derived data sets
released in OPUS, the open collection of parallel corpora. Each sub-directory in corpus/
corresponds to one specific resource, with released versions and data sets organized
according to the following format: corpus/name/version.
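As a minimal sketch of that layout, assuming a local clone of this repository in which every directory under corpus/ is a resource and every directory one level below it is a version, the following Python function collects the versions available for each resource. The function name list_releases is hypothetical and not part of any OPUS tool.

```python
from pathlib import Path


def list_releases(repo_root):
    """Map each resource name under corpus/ to its sorted version directory names.

    Assumes the corpus/name/version layout; files that are not directories
    are ignored at both levels.
    """
    corpus_dir = Path(repo_root) / "corpus"
    releases = {}
    if not corpus_dir.is_dir():
        # No corpus/ directory found: return an empty mapping.
        return releases
    for name_dir in sorted(corpus_dir.iterdir()):
        if not name_dir.is_dir():
            continue
        versions = sorted(v.name for v in name_dir.iterdir() if v.is_dir())
        releases[name_dir.name] = versions
    return releases
```

Calling list_releases(".") from the root of a clone returns a dictionary keyed by resource name. Versions are sorted as plain strings, so a label such as v10 sorts before v2.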
Tools for finding and processing OPUS data sets:
- OpusTools - Python library and tools for accessing and processing OPUS data [pip]
- OpusTools-perl - Perl scripts for processing OPUS data
- OPUS-API - API for searching OPUS resources [live API]
- OpusFilter - a toolbox for filtering and compiling parallel corpora [doc] [pip]
- OPUS-search - online search in OPUS data [Europarl v7] [Europarl v3] [OpenSubtitles v1] [OpenSubtitles v2018] [EUconst]
- OPUS-dic - online dictionary based on word alignments
Managing OPUS:
- OPUS-ingest - recipes for ingesting/importing data to OPUS
- OPUS-website - OPUS website and corpus sample files
- OPUS-admin - scripts and recipes for admin tasks (restricted access)
- OPUS-repository - parallel data management system [frontend] [backend] [live demo]
- OPUS-ISA - experimental sentence alignment interface [live demo]
Machine translation with OPUS-MT:
- Opus-MT - OPUS-MT web service setup
- OPUS-MT-train - scripts and recipes for training OPUS-MT models
- OPUS-translator - OPUS-MT web interface [live demo]
- OPUS-MT-testsets - a collection of MT benchmarks
- OPUS-MT-leaderboard - OPUS-MT evaluation scores and leaderboards [live demo]
- OPUS-MT-map - interactive map of OPUS-MT language coverage [live demo]
- OPUS-MT-app - desktop app for local translation with OPUS-MT (fork of translateLocally)
- OPUS-CAT - OPUS-MT integration in CAT tools
Please cite the following LREC 2012 paper when using OPUS, and also acknowledge the corpus-specific references given in the information and documentation of each resource:
@InProceedings{TIEDEMANN12.463,
author = {Jörg Tiedemann},
title = {Parallel Data, Tools and Interfaces in {OPUS}},
booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
year = {2012},
month = {may},
date = {23-25},
address = {Istanbul, Turkey},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
}
Related tools and resources:
- mtdata - a library for retrieving MT datasets
- LanguageCodes - Perl modules for managing language codes
- eflomal - a tool for efficient word alignment with pre-trained priors from OPUS
- the Tatoeba translation challenge - a comprehensive MT dataset compiled from OPUS and Tatoeba
- wiki back-translations - over a billion automatically translated sentences
- OPUS-SPM - pre-trained sentence piece models from OPUS data
OPUS and related resources and tools have been partially supported by various projects, such as:
- LetsMT! - A Platform for Online Sharing of Training Data and Building User Tailored Machine Translation (EU ICT PSP)
- MeMAD - Methods for Managing Audiovisual Data (EU Horizon 2020)
- NLPL - the Nordic Language Processing Laboratory (NeIC)
- EOSC-nordic - the European Open Science Cloud within the Nordic and Baltic countries (EU Horizon 2020)
- ELG - the European Language Grid (EU Horizon 2020)
- FoTran - Found in Translation (EU ERC)
- HPLT - High-Performance Language Technologies (EU Horizon)
OPUS is hosted by CSC, the IT Center for Science in Finland, and draws heavily on the HPC resources provided by CSC. OPUS is also part of NLPL, the Nordic Language Processing Laboratory. Last but not least, OPUS would not be possible without the many contributions from the community, including aligned data sets and tools for creating and processing parallel corpora.