Bitextor generates the final parallel corpora in multiple formats. These files will be placed in permanentDir
folder and will have the following naming convention: {lang1}-{lang2}.{prefix}.gz
, where {prefix}
corresponds to a descriptor of the corresponding format.
The files that will be always generated (regardless of configuration) are {lang1}-{lang2}.raw.gz
and {lang1}-{lang2}.sent.gz
. These files come in Moses format, i.e. tab-separated columns.
-
{lang1}-{lang2}.raw.gz
: parallel corpus that contains every aligned sentences, has no deduplication and the sentences are not filtered.This file contains columns added by different optional modules/features: paragraph identification, deferred, Bifixer and Bicleaner. In case some of these are not enabled, the corresponding columns will be omitted. The possible fields that may appear in this file are (in this order):
src_url trg_url src_text trg_text aligner_score
- default columnssrc_url
andtrg_url
are source documents of the sentencessrc_text
andtrg_text
form a sentence pair inlang1
andlang2
aligner_score
is the score given by the sentence aligner (bleualign or hunalign)
src_paragraph_id trg_paragraph_id
- paragraph identification data- initial position of the sentence in the paragraph, and initial position of the paragraph in the document
src_deferred_hash trg_deferred_hash
- deferred sentence hashes- may be used to reconstruct the original corpus using Deferred crawling reconstructor
bifixer_hash bifixer_score
- Bifixer outputbifixer_hash
tags duplicate or near-duplicate sentencesbifixer_score
rates quality of duplicate or near-duplicate sentences
bicleaner_score
- Bicleaner classifer output
This file comes accompanied by the corresponding statistics file
{lang1}-{lang2}.stats.raw
, which provides information the size of the corpus in MB and in number of tokens. -
{lang1}-{lang2}.sent.gz
: parallel corpus after running all of the steps described above, plus filtering according to the specified Bicleaner threshold, adding ELRC metrics, and sorted to have the duplicate or (near-duplicate) sentences together. This file will have all of the columns fromraw
files, plus 3 new ones corresponding to ELRC metrics.url1 url2 sent1 sent2 aligner_score
- default columnspara1 para2
- paragraph identification datachecksum1 checksum2
- deferred sentence checksumsbifixer_hash bifixer_score
- Bifixer outputbicleaner_score
- Bicleaner classifier outputlength_ratio num_tokens_src num_tokens_trg
- ELRC fieldsnum_tokens_src
is the number of tokens in source sentencenum_tokens_trg
is the number of tokens in target sentencelength_ratio
is the ratio between the two (source divided by target)
Bitextor may also generate a TMX version of the corpus, if it is enabled in config via tmx: true
. The generated file is:
{lang1}-{lang2}.not-deduped.tmx.gz
will have the same content as{lang1}-{lang2}.sent.gz
, but in TMX format.
Bitextor may also perform the deduplication of the generated corpus (enabled via deduped: true
), and generate the following files that will contain only unique sentence pairs:
-
{lang1}-{lang2}.deduped.tmx.gz
: deduplicated TMX corpus, which will contain a list of URLs for the sentence pairs that were found in multiple websites. -
{lang1}-{lang2}.deduped.txt.gz
: deduplicated tab-separated corpus, with the same columns as{lang1}-{lang2}.sent.gz
; in this case each sentence will only contain one source documentCorresponding statistics file
{lang1}-{lang2}.stats.deduped
will be generetaed and it will have information about the size of the corpus in MB, the number of unique sentence pairs and the number of tokens on each size.
If Biroamer is enabled biroamer: true
, Bitextor will also produce a ROAMed version of the corpus, according to Biroamer config. Biroamer will produce either of these files, depending on tmx
and deduped
parameters:
{lang1}-{lang2}.not-deduped.roamed.tmx.gz
: ROAMed version of the corpus deduplicated in TMX format, that will be generated iftmx: true
{lang1}-{lang2}.deduped.roamed.tmx.gz
: ROAMed version of the deduplicated corpus in TMX format, that will be generated ifdeduped: true