Machine Translation

This project is a part of an end-to-end machine translation pipeline that will accept English text as input and return the French translation.

Description

This project takes English text, converts it to sequences of integers using English and French vocabularies, and passes those sequences to a model that returns a probability distribution over possible French translations. The model reaches an accuracy above 97%.
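To illustrate the text-to-integers step described above, here is a minimal sketch in plain Python. It is not the project's preprocessing code: the helper names (`build_vocab`, `encode`, `pad`), the whitespace tokenization, and the choice of id 0 for padding are assumptions made for illustration.

```python
from collections import Counter

PAD_ID = 0  # assumption: id 0 is reserved for padding


def build_vocab(sentences):
    """Map each word to an integer id, most frequent words first (hypothetical helper)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Ids start at 1 because 0 is reserved for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(sentence, vocab):
    """Convert a sentence to a list of word ids, skipping unknown words."""
    return [vocab[w] for w in sentence.lower().split() if w in vocab]


def pad(sequences, length=None):
    """Right-pad (or truncate) every sequence to a common length."""
    length = length or max(len(seq) for seq in sequences)
    return [seq[:length] + [PAD_ID] * (length - len(seq)) for seq in sequences]


english = ["the cat sleeps", "the dog runs fast"]
vocab = build_vocab(english)
encoded = pad([encode(s, vocab) for s in english])
print(vocab)
print(encoded)
```

Padding every sentence to the same length is what allows a batch of sentences to be fed to the model as one rectangular array.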

Dataset

This project uses WMT, one of the most common datasets for machine translation.

Install Prerequisites

This project requires Python 3 and the following Python libraries installed:

Network Architecture

After tokenizing and preprocessing the text, we pass it to a word embedding layer, then through two Bidirectional LSTM layers with 256 units each, and finally to a TimeDistributed dense layer with a softmax activation that produces a probability distribution over the French vocabulary at each output position.
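The softmax output can be turned back into text by picking the most probable word at each output position. The sketch below shows this greedy decoding in plain Python. The function name `probabilities_to_text`, the toy vocabulary, the probability values, and the use of id 0 for padding are all assumptions for illustration, not the project's code.

```python
def probabilities_to_text(probabilities, index_to_word, pad_id=0):
    """Greedy decoding: take the argmax at each time step and join the words."""
    words = []
    for step in probabilities:
        # Index of the largest probability at this time step.
        best_id = max(range(len(step)), key=lambda i: step[i])
        if best_id != pad_id:  # assumption: pad_id marks padding and is skipped
            words.append(index_to_word[best_id])
    return " ".join(words)


# Toy example: 4 time steps over a vocabulary of 3 words plus padding.
index_to_word = {1: "le", 2: "chat", 3: "dort"}
probs = [
    [0.05, 0.80, 0.10, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.10, 0.80],
    [0.90, 0.04, 0.03, 0.03],
]
translation = probabilities_to_text(probs, index_to_word)
print(translation)
```

For a Keras model, each row of `probs` would correspond to one time step of the prediction for a single sentence.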

Final Model Code

After training this model for 10 epochs, we reached an accuracy of 98% on both the training and validation sets.

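from keras.layers import (Bidirectional, Dense, Embedding, LSTM,
                          RepeatVector, TimeDistributed)
from keras.losses import sparse_categorical_crossentropy
from keras.models import Sequential
from keras.optimizers import Adam


def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build a model that incorporates embedding, encoder-decoder, and bidirectional RNN layers
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built and compiled, but not trained
    """
    learning_rate = 5e-3

    model = Sequential()
    # Map each English word id to a 256-dimensional vector
    model.add(Embedding(english_vocab_size, 256,
                        input_length=input_shape[1]))

    # Encoder: summarize the whole input sequence into a single vector
    model.add(Bidirectional(LSTM(256)))

    # Repeat the encoder summary once per output time step
    model.add(RepeatVector(output_sequence_length))

    # Decoder: produce one hidden state per output time step
    model.add(Bidirectional(LSTM(256, return_sequences=True)))

    # Probability distribution over French words at each time step
    model.add(TimeDistributed(Dense(french_vocab_size,
                                    activation='softmax')))

    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])

    # summary() prints the layer table itself and returns None
    model.summary()
    return model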

You can find all model trials in the notebook, machine_translation.ipynb.
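As a rough sanity check on the size of the final model above, the number of trainable parameters per layer can be estimated by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector) and concatenated bidirectional outputs. The vocabulary sizes are hypothetical placeholders, not the project's actual values.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM: 4 gates with input, recurrent and bias weights."""
    return 4 * ((input_dim + units) * units + units)


def count_params(english_vocab_size, french_vocab_size, embed_dim=256, units=256):
    """Estimate per-layer parameter counts for the architecture in model_final."""
    return {
        # One embed_dim-sized vector per English word id.
        "embedding": english_vocab_size * embed_dim,
        # Bidirectional wrapper holds a forward and a backward LSTM.
        "encoder": 2 * lstm_params(embed_dim, units),
        # Decoder input is the concatenated encoder output (2 * units).
        "decoder": 2 * lstm_params(2 * units, units),
        # Dense layer: weights from 2 * units inputs plus one bias per French word.
        "dense": (2 * units + 1) * french_vocab_size,
    }


# Hypothetical vocabulary sizes, chosen only for illustration.
params = count_params(english_vocab_size=200, french_vocab_size=350)
total = sum(params.values())
print(params)
print(total)
```

Comparing such an estimate with the table printed by `model.summary()` is a quick way to confirm that the layers are wired as intended.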

Authors

Ahmed Abd-Elbakey Ghonem - Github

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

3ba2ii/Machine-Translation-NLP