corpus
: corpus data, corpus build scripts/makefiles

doc
: (rudimentary) documentation

eflomal
: recipes for creating eflomal priors

incoming
: notes about incoming data sets

templates
: template recipes for importing additional data sets

tools
: some additional scripts and tools (mostly obsolete)

releases
: released data files (submodule OPUS)

public_html
: websites and data sample files (submodule OPUS-website)

admin
: administration stuff (non-public git repository OPUS-admin)

cwb
: Corpus Workbench index files and registers (generated)
- Python packages: opustools, polyglot, fast-mosestokenizer
- Perl modules: OpusTools, Uplug and dependencies
- subalign (for subtitle conversion and alignment)
- pdftotext, recode, tidy, pigz, GNU parallel and other common GNU/Unix tools
- Moses and eflomal (optional, for word alignment and phrase table extraction)
- the Corpus Workbench (CWB) and CWB Perl modules (optional, for CWB index generation)
- optional: yasa (our fork from https://github.com/Helsinki-NLP/yasa)
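
Since the installation depends on many external tools, it may help to check which of them are available before building. The following is a minimal sketch and not part of the repository: the function name `missing_tools` is hypothetical, and the binary names are assumed from the list above (plus `git` and `make`, which the installation commands use). It only checks command-line tools on `PATH`, not Python packages or Perl modules.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the given commands
# that cannot be found on PATH, separated by single spaces.
missing_tools() {
    missing=""
    for tool in "$@"; do
        # command -v is the POSIX way to test whether a command is available
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Strip the leading space; prints an empty line if nothing is missing
    echo "${missing# }"
}

# Assumed binary names for the tools listed above; adjust to your system.
missing=$(missing_tools git make pdftotext recode tidy pigz parallel)
if [ -n "$missing" ]; then
    echo "Missing tools: $missing"
else
    echo "All listed tools found."
fi
```

The check only reports missing tools and does not abort, so it can be run safely in any environment.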
git clone git@github.com:Helsinki-NLP/OPUS-ingest.git
cd OPUS-ingest
git submodule update --init --recursive --remote
make install
The last step will most likely fail. Check error messages and the Makefile for details.
NOTE: The documentation below requires serious updates!
- make build scripts more readable
- consistent language codes
- get rid of hard-coded paths to tools and make the repo more general and less dependent on specific environments (like the one on puhti/CSC)
- better documentation (as always)
- more efficient pre-processing
- consistent pre-processing (UD-based?)
- more frequent corpus updates (Tatoeba, wikimedia and other frequently changing corpora)
- streamline corpus creation, processing and maintenance procedures
- improve integration/updates of OPUS-API and website updates
- …