elrc-scrape

Scrape ELRC-SHARE for corpora.

As a machine translation person, you just want the all the parallel corpora with the least effort. Something like OPUS.

There's also parallel data available for download at ELRC-SHARE under public domain, public sector information, creative commons, and a few other licenses. Unfortunately, their site appears to require clicking for each corpus. And honestly, it's probably not worth your time to download 44 parallel sentences.

While there is an official client, it appears to require a login which in turn requires an affiliation with ELRC or a CEF-funded project. So I made a scraper of sorts.

Here's how to make a TSV of public parallel corpora in ELRC-SHARE:

# Download JSON files.
for ((i=0;i<5000;++i)); do
  if [ ! -s $i.json ]; then
    echo wget -O $i.json https://www.elrc-share.eu/repository/export_json/$i/
  fi
done |parallel
# Download zip files
./parse.py |parallel
# Generate TSV with l1, l2, num, short_name, name, info, download, post (string for HTTP POST, empty if not required), licenses (space separated), in_paths (tab separated if multiple files)
./parse.py >elrc_share.tsv

ELRC uses sequence numbers. Many of these will yield error 500. That's expected. If you don't get a series of 500s at the end, ELRC has more than 5000 records. Increase the number and edit NUM_MAX in parse.py

The plan is for all the corpora to be listed in the mtdata tool for automatic downloading.

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
README.md		README.md
download_json.sh		download_json.sh
parse.py		parse.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

elrc-scrape

About

Releases

Packages

Contributors 2

Languages

kpu/elrc-scrape

Folders and files

Latest commit

History

Repository files navigation

elrc-scrape

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages