kpu/elrc-scrape

elrc-scrape

Scrape ELRC-SHARE for corpora.

As a machine translation person, you just want all the parallel corpora with the least effort. Something like OPUS.

There's also parallel data available for download at ELRC-SHARE under public domain, public sector information, Creative Commons, and a few other licenses. Unfortunately, their site appears to require clicking through each corpus individually. And honestly, it's probably not worth your time to download 44 parallel sentences by hand.

While there is an official client, it appears to require a login, which in turn requires an affiliation with ELRC or a CEF-funded project. So I made a scraper of sorts.

Here's how to make a TSV of public parallel corpora in ELRC-SHARE:

# Download JSON files.
for ((i=0;i<5000;++i)); do
  if [ ! -s $i.json ]; then
    echo wget -O $i.json https://www.elrc-share.eu/repository/export_json/$i/
  fi
done |parallel
# Download zip files.
./parse.py | parallel
# Generate a TSV with these columns:
#   l1, l2, num, short_name, name, info, download,
#   post (string for HTTP POST, empty if not required),
#   licenses (space separated),
#   in_paths (tab separated if multiple files)
./parse.py > elrc_share.tsv
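
If you want to consume the TSV from Python, here is a minimal sketch of a reader based only on the column list above. It assumes the file has no header row and no quoting, that the first nine columns are always present, and that everything after them is an in_paths entry. None of that is confirmed by parse.py, and the `Corpus` type and `parse_line` function are names made up for this example.

```python
from typing import List, NamedTuple


class Corpus(NamedTuple):
    """One row of elrc_share.tsv (hypothetical type, not part of parse.py)."""
    l1: str
    l2: str
    num: str          # kept as a string; its exact meaning is not assumed here
    short_name: str
    name: str
    info: str
    download: str
    post: str         # string for HTTP POST, empty if not required
    licenses: List[str]
    in_paths: List[str]


def parse_line(line: str) -> Corpus:
    """Split one TSV line into nine fixed fields plus trailing in_paths."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 columns, got {len(fields)}")
    l1, l2, num, short_name, name, info, download, post, licenses = fields[:9]
    return Corpus(
        l1, l2, num, short_name, name, info, download, post,
        licenses.split(),                 # licenses are space separated
        [p for p in fields[9:] if p],     # in_paths are tab separated
    )


def read_tsv(path: str) -> List[Corpus]:
    """Read every non-empty line of the TSV (assumes no header row)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]
```

Something like `[c for c in read_tsv("elrc_share.tsv") if c.l1 == "en"]` would then pick out rows by language, assuming l1 holds a language code.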

ELRC uses sequence numbers. Many of these will yield error 500. That's expected. If you don't get a series of 500s at the end, ELRC has more than 5000 records. In that case, increase the loop bound in the download step and update NUM_MAX in parse.py to match.
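
One rough way to check for that trailing series of failures is to count how many of the highest-numbered JSON files are missing or empty. The sketch below assumes a failed request leaves either no file or a zero-length file, which is the same condition the `-s` test in the download loop treats as "not downloaded". The function name `trailing_failures` is made up for this example and is not part of the repository.

```python
import os


def trailing_failures(directory: str, limit: int) -> int:
    """Count consecutive missing or empty N.json files, from limit-1 downwards."""
    count = 0
    for i in range(limit - 1, -1, -1):
        path = os.path.join(directory, f"{i}.json")
        # A non-empty file means this sequence number was downloaded.
        if os.path.exists(path) and os.path.getsize(path) > 0:
            break
        count += 1
    return count


if __name__ == "__main__":
    gap = trailing_failures(".", 5000)
    print(f"{gap} consecutive failures at the end of the range")
    if gap == 0:
        print("No trailing failures: consider raising the limit and NUM_MAX.")
```

A small or zero count suggests the range may be too short. How long a run is convincing is a judgment call, since gaps also occur in the middle of the sequence.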

The plan is for all the corpora to be listed in the mtdata tool for automatic downloading.
