
# Naïve-EAMT: Naïve Entity Aware Machine Translation Framework

This project provides a configurable machine translation architecture. Through this architecture, one can combine various implementations of Named Entity Recognition (NER) and Disambiguation tools with Neural Machine Translation (NMT) to form an end-to-end entity-aware machine translation RESTful service.

It comes already integrated with the following tools:

| Type | Component | ID | Supported Lang. | Link | Maximum input sequence length |
| --- | --- | --- | --- | --- | --- |
| NER | Babelscape NER | `babelscape_ner` | de, en, es, fr, it, nl, pl, pt, ru | https://huggingface.co/Babelscape/wikineural-multilingual-ner | 512 tokens (based on BERT) |
| NER | Flair NER | `flair_ner` | de, en, es, nl | https://github.com/flairNLP/flair | 512 tokens (based on BERT) |
| NER | Davlan NER | `davlan_ner` | ar, de, en, es, fr, it, lv, nl, pt, zh | https://huggingface.co/Davlan/bert-base-multilingual-cased-ner-hrl | 512 tokens (based on mBERT) |
| NER | Spacy NER | `spacy_ner` | de, en, es, fr, it, nl, pl, pt, ru | https://github.com/explosion/spacy-models/releases/tag/xx_ent_wiki_sm-3.4.0 | 1 million characters (spaCy's default `max_length`) |
| EL | MAG | `mag_el` | en, de, fr, es, it, ja, nl | https://github.com/dice-group/AGDISTIS/wiki/5---New-Capabilities---MAG | |
| EL | mGenre | `mgenre_el` | 105 languages (Table 10: https://arxiv.org/pdf/2103.12528.pdf) | https://github.com/facebookresearch/GENRE | 512 tokens (based on mBART) |
| MT | Libre Translate | `libre_mt` | ar, az, zh, cs, da, nl, en, eo, fi, fr, de, el, he, hi, hu, id, ga, it, ja, ko, fa, pl, pt, ru, sk, es, sv, tr, uk | https://github.com/LibreTranslate/LibreTranslate | 512 tokens |
| MT | Opus MT\* | `opus_mt` | 203 languages for translation to English (https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/models) | https://github.com/Helsinki-NLP/Opus-MT | 512 tokens (https://huggingface.co/Helsinki-NLP) |
| MT | NLLB MT\*\* | `nllb_mt` | 196 languages | https://github.com/facebookresearch/fairseq/tree/nllb/#multilingual-translation-models | 1024 tokens (per `tokenizer.model_max_length` for facebook/nllb-200-distilled-600M) |
| MT | MBart MT\*\* | `mbart_mt` | 53 languages | https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt | mBART: 512 tokens; mBART50: 1024 tokens |

Language code ref: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

[\*] Currently the application only downloads data for some of the supported languages (de, es, fr, nl, ru, zh, it, pt, lt, ja) for Opus MT. For further language support, please download the data and modify the configuration. This should be done before the setup is executed; otherwise you will have to rebuild the Opus Docker image with the right configuration.

[\*\*] For NLLB and MBart MT, the application currently only allows (de, es, fr, pt, ru, it, nl, uk, be, zh, ja, ba, lt, hy) and (de, es, fr, pt, ru, it, nl, uk, zh, ja, lt) respectively. Edit the `lang_code_map` in the respective component file to extend support to further languages.
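
Purely as a hypothetical illustration (the actual dictionary layout in the component files may differ, so check it before editing), extending such a mapping for NLLB, whose models expect FLORES-200 style codes, might look like this:

```python
# Hypothetical sketch of extending a lang_code_map; the real dictionary in
# the respective component file may be laid out differently. NLLB models
# expect FLORES-200 style codes such as 'deu_Latn'.
lang_code_map = {
    'de': 'deu_Latn',
    'es': 'spa_Latn',
    'fr': 'fra_Latn',
    # ... existing entries ...
    'tr': 'tur_Latn',  # newly added entry to extend Turkish support
}
```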

## Configuration

The application uses a configuration file, `configuration.ini`, that allows users to form pipelines from their chosen combination of components.

A sample pipeline configuration would look like this:

```ini
# unique pipeline section title
[EAMT Pipeline 2]
# pipeline name (can be non-unique as well)
name = babelscape-mgenre-libre
# ordered list of component ids in the pipeline
components = ["babelscape_ner", "mgenre_el", "libre_mt"]
# path name (without /) that will be used to query this pipeline at localhost:6100/<path>
path = pipeline_bmgl
```

The pipeline configuration allows joining any existing components together, as long as they follow the I/O formatting rules. The component IDs can be found in the table above.
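
Conceptually, a pipeline is just sequential composition over a shared dictionary; the following minimal sketch only illustrates that idea (the framework's actual dispatcher is internal, and the component objects are assumed to expose a `process_input` method as described under Component I/O Formatting below):

```python
# Minimal sketch of pipeline chaining: each component's process_input
# receives the previous component's output dictionary, which is why any
# components that follow the I/O formatting rules can be joined freely.
def run_pipeline(components, text):
    data = {'text': text}
    for comp in components:              # e.g. [babelscape_ner, mgenre_el, libre_mt]
        data = comp.process_input(data)  # one stage's output feeds the next
    return data
```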

**Important:** To be memory efficient, the application only initializes the components mentioned in the configured pipelines. To save memory, please comment out the configuration of any pipelines you do not need.

## Docker Setup

Before running the setup script, please make sure you have the proper Docker permissions. You can test this using the following command (without `sudo`):

```shell
docker run hello-world
```

If that works, you can proceed normally; otherwise, you must perform the steps to manage Docker as a non-root user.

Also, please make sure that your `docker-compose` version is 1.28.0 or above, as it is needed to support Docker Compose profiles.

Download and set up the data using the following command (this requires 150 GB of free storage and can take a few hours to finish):

```shell
./setup_data.sh
```

Then, build the Docker image:

```shell
docker build -t naive-eamt .
```

Finally, start the containers using the start script:

```shell
./start_docker_containers.sh
```

To stop, use the stop script:

```shell
./stop_docker_containers.sh
```

## Logs

The logs are maintained in the `logs/neamt.log` file.

## Query

The configured pipelines can be queried through an HTTP POST request, e.g.:

```shell
curl --location --request POST 'http://localhost:6100/pipeline_bmgl' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'query=Ist Hawaii der Geburtsort von Obama'
```

The output format depends on the last component in the queried pipeline.

The application also accepts custom pipelines submitted along with the HTTP request. However, the components must be pre-initialized.

A query with a custom pipeline would look something like this:

```shell
curl --location --request POST 'http://localhost:6100/custom-pipeline' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'query=Ist Hawaii der Geburtsort von Obama' \
--data-urlencode 'components=spacy_ner, mag_el, opus_mt'
```

**Optional parameters:** The NEAMT application also allows its users to configure the following optional parameters:

- `target_lang` (default: `en`): language code (e.g. `en`, `de`, `fr`) of the target language for the machine translation
- `placeholder` (default: `00`): string value that will be used as a placeholder, in concatenation with a running number
- `replace_before` (default: `False`): boolean value; if set to `True`, the application will replace placeholders with target labels before sending the text for machine translation

Sample query with optional parameters:

```shell
curl --location --request POST 'http://porque.cs.upb.de:6100/custom-pipeline' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'components=babelscape_ner, mgenre_el, libre_mt' \
--data-urlencode 'query=Wer ist älter, Lionel Messi oder Cristiano Ronaldo?' \
--data-urlencode 'placeholder=wd:res' \
--data-urlencode 'replace_before=False' \
--data-urlencode 'target_lang=ru'
```
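
The same request can also be sent from Python; here is a minimal equivalent using the `requests` library (same endpoint and parameters as the curl example above, with the host swapped to localhost):

```python
import requests

# POST the form-encoded parameters to the custom-pipeline endpoint.
resp = requests.post(
    'http://localhost:6100/custom-pipeline',
    data={
        'components': 'babelscape_ner, mgenre_el, libre_mt',
        'query': 'Wer ist älter, Lionel Messi oder Cristiano Ronaldo?',
        'placeholder': 'wd:res',
        'replace_before': 'False',
        'target_lang': 'ru',
    },
)
print(resp.text)
```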

## Caching

To enable caching, set `redis_enabled` in `configuration.ini` to `yes`. Then, `start_docker_containers.sh` should start Redis automatically. If Redis is running separately, set the address of the Redis instance in the `redis_host` option. Outputs of components in pipelines are cached independently and are reused as long as the input to a component stays exactly the same.
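
The cache layout itself is internal to the framework; conceptually, though, each component's output can be keyed on a digest of the component ID plus its exact serialized input, so that any change in the input results in a cache miss. A hedged sketch of that idea, assuming the `redis-py` client:

```python
import hashlib
import json

import redis

r = redis.Redis(host='localhost', port=6379)  # or the configured redis_host

def cached_process(component_id, process_fn, input_dict):
    # Key on the component ID plus the exact (canonically serialized) input,
    # so the cached output is reused only for identical inputs.
    digest = hashlib.sha256(
        json.dumps(input_dict, sort_keys=True).encode()).hexdigest()
    key = f'{component_id}:{digest}'
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)           # cache hit: skip the component
    output = process_fn(input_dict)      # cache miss: run the component
    r.set(key, json.dumps(output))
    return output
```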

## Customized Components

### Component I/O Formatting

**NER:** For the components that strictly perform the task of named entity recognition, the expected input is a dictionary containing text in natural language (en, de, fr, es). The output should be a dictionary containing the string and information about the annotated entities. The following is an example:

Input:

```json
{
  "text": "Ist Hawaii der Geburtsort von Obama?"
}
```

Output:

```json
{
  "text": "Ist Hawaii der Geburtsort von Obama?",
  "lang": "de",
  "ent_mentions": [
      {
          "start": 4,
          "end": 10,
          "surface_form": "Hawaii"
      },
      {
          "start": 30,
          "end": 35,
          "surface_form": "Obama"
      }
  ]
}
```

**EL:** For the components performing only the entity linking task, the expected input is the output of the NER step. The output should be the same dictionary with additional information about the entity mentions. Carrying on with the example above, the following is a sample output:

Output:

```json
{
  "text": "Ist Hawaii der Geburtsort von Obama?",
  "lang": "de",
  "kb": "wd",
  "ent_mentions": [
      {
          "start": 4,
          "end": 10,
          "surface_form": "Hawaii",
          "link": "Q68740"
      },
      {
          "start": 30,
          "end": 35,
          "surface_form": "Obama",
          "link": "Q76"
      }
  ]
}
```

**MT:** For the components performing the machine translation task, the expected input is the output of the EL step. The output is the translated natural language text in English (or the configured `target_lang`).

Output: `Is Hawaii the birth place of Barack Obama?`

Additionally, you can make use of the functions in `placeholder_util.py` to replace the entities with placeholder tokens and vice versa. The framework also provides dummy NER (`no_ner`) and EL (`no_el`) components that format the data according to the listed I/O format without performing any actual NER/EL. The dummy components can be used to build MT-only pipelines.
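
The concrete helpers in `placeholder_util.py` are not documented here; the following hypothetical sketch only illustrates the replacement idea (the default placeholder `00` concatenated with a running number, as described under the optional parameters above):

```python
# Hypothetical illustration of placeholder replacement; the real helper
# functions live in placeholder_util.py and may differ in names/signatures.
def replace_mentions(text, ent_mentions, placeholder='00'):
    # Assumes the mentions are sorted by their start offset.
    parts, last, mapping = [], 0, {}
    for i, mention in enumerate(ent_mentions):
        token = f'{placeholder}{i}'            # e.g. '000', '001', ...
        parts.append(text[last:mention['start']])
        parts.append(token)
        mapping[token] = mention['surface_form']
        last = mention['end']
    parts.append(text[last:])
    return ''.join(parts), mapping

def restore_mentions(translated_text, mapping):
    # Substitute the entities (surface forms or target labels) back in.
    for token, surface in mapping.items():
        translated_text = translated_text.replace(token, surface)
    return translated_text
```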

**Combination:** If your custom component is a combination of consecutive components in the pipeline, then you must follow the input/output formats accordingly: your combined component must comply with the input format at its point of entry and with the output format at its point of exit.

### How to add a new component?

To add your own custom component, follow these steps (a minimal skeleton is sketched after this list):

- Add your dependencies to `requirements.txt`
- Create a new Python file in the `component/` directory
- Your Python file must have a `process_input` function that receives the input as per its placement in the pipeline (see Component I/O Formatting)
- Make sure to load the necessary resources (models, data etc.) into memory only inside the `__init__` function; this helps keep the application memory efficient
- Import your component in `start.py` and add an ID-to-class mapping inside `comp_map`
- Create a new pipeline with your component in `configuration.ini`
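
A minimal, hypothetical skeleton of such a component (class and file names are illustrative; only the `__init__`/`process_input` contract comes from the steps above):

```python
# component/my_ner.py -- hypothetical skeleton of a custom NER component.
class MyNer:
    def __init__(self):
        # Load models/resources only here: the framework instantiates a
        # component only when a configured pipeline actually uses it.
        self.model = None  # e.g. load your NER model

    def process_input(self, input_dict):
        text = input_dict['text']
        # ... run recognition and annotate mentions per the NER output format:
        input_dict['lang'] = 'de'  # detected or assumed source language
        input_dict['ent_mentions'] = [
            # {'start': ..., 'end': ..., 'surface_form': ...}
        ]
        return input_dict
```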

Additional steps for Docker-based components:

- If your component has a Docker image, create a service using your Docker image in `docker-compose.yml`
- Assign a new Docker profile to your component to ensure that it is not loaded unnecessarily
- Modify the logic in `find_profiles.py` to accommodate your component

### How to pass a custom parameter as input?

By default, the framework passes all the extra parameters in the input on to the components. However, make sure that your custom parameter name does not clash with the preexisting parameters. To avoid conflicts with the existing parameters, it is good practice to add a custom prefix to your parameter (e.g. `orgname_custompara1`).
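
Inside a component, such a parameter then arrives in the input dictionary alongside the regular fields; a small hypothetical sketch (the parameter name is the illustrative one from above):

```python
# Hypothetical component snippet: reading a prefixed custom parameter
# that was passed along with the HTTP request.
def process_input(input_dict):
    custom_val = input_dict.get('orgname_custompara1', 'some-default')
    # ... use custom_val to adjust this component's behaviour ...
    return input_dict
```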