Machine translation using natural language processing

Machine translation is the translation of text or speech by a computer with no human involvement. It is an active research topic, and several methods have been developed, including rule-based, statistical and example-based machine translation. Neural networks have brought a leap forward in machine translation. This paper discusses the building of a deep neural network that functions as part of an end-to-end translation pipeline. The completed pipeline accepts English text as input and returns the French translation. The project has three main parts: preprocessing, creation of the models, and running the model on English text.


Introduction
Machine translation (MT) is a domain of computational linguistics that explores the use of software to translate text or speech from one language to another. At its simplest, machine translation substitutes words in one language for words in another, but that alone does not assure a good translation. A more sophisticated approach, and a growing field, uses statistical and neural techniques to address the recognition of whole phrases. In this translation of text from one language to another there is no human involvement; the machine performs the conversion. There are three types of machine translation system: rule-based, statistical and neural. Rule-based translation is the conventional method, combining language and grammar rules with the support of dictionaries. This work focusses on building an end-to-end machine translation pipeline. We discuss multiple existing architectures and finally propose a hybrid model to achieve a more powerful system for machine translation from English to French.

Part 1: Dataset
The first step in any deep learning project is to investigate the dataset that will be used to train and evaluate the pipeline. The datasets most commonly used for machine translation are from WMT (the website dedicated to research in statistical machine translation). Due to time constraints, a dataset with a small vocabulary is used here, which allows training in a reasonable time.
The data is loaded from files containing English sentences and their French translations, and the first two lines of each file are printed. In the figure above, punctuation has been delimited using spaces and the text has been converted to lowercase. The complexity of the vocabulary determines the complexity of the problem. The complexity of the dataset is given below.
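One way to measure the complexity of a vocabulary is to count the total and unique words per language. The following is a minimal sketch of that count. The helper name describe and the two sample sentences are hypothetical illustrations, not rows from the dataset used in this work.

```python
from collections import Counter


def describe(sentences, name):
    """Print and return the total and unique word counts for a list of sentences."""
    # Punctuation is assumed to be space-delimited already, so split() is enough.
    words = [word for sentence in sentences for word in sentence.split()]
    counts = Counter(words)
    print(f"{len(words)} {name} words, {len(counts)} unique")
    print("Most common:", [word for word, _ in counts.most_common(3)])
    return len(words), len(counts)


# Hypothetical sample sentences, for illustration only.
english = ["new jersey is cold in winter .", "paris is warm in june ."]
total, unique = describe(english, "English")
```

The same helper can be run on the French side to compare the two vocabularies.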

Part 2: Pre-process the data
Raw text is not fed to the model directly. It is first converted into sequences of integers by two preprocessing steps: tokenization and padding.

Tokenize (Implementation)
A neural network can only predict on text once the text is in a form it can process. Text such as "dog" is a sequence of ASCII character encodings, but the network itself is a series of multiplication and addition operations, so the text must be converted to numbers.
Each character or each word can be assigned a number, conventionally called character ids and word ids, respectively. A word-level model uses word ids and generates a text prediction for every word. This has lower complexity than a character-level model.
Keras has a utility class known as Tokenizer. This class allows a text corpus to be vectorized by turning each text into either a sequence of integers or a vector in which the coefficient for each token could be binary or based on word count. Run tokenize on sample data and show the output for debugging.
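The mapping from words to word ids can be illustrated without Keras. The sketch below is a plain-Python stand-in, not the Keras Tokenizer itself. It assigns ids by descending word frequency and starts at 1, leaving 0 free for padding. Unlike the Keras class, it does not filter punctuation. The function name tokenize and the sample sentences are assumptions for illustration.

```python
from collections import Counter


def tokenize(sentences):
    """Return (sequences of word ids, word -> id mapping) for a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # Most frequent word gets id 1; id 0 is left unused so it can mark padding.
    word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}
    sequences = [[word_index[w] for w in s.lower().split()] for s in sentences]
    return sequences, word_index


# Hypothetical sample data for debugging.
sample = ["The quick brown fox", "the lazy dog"]
sequences, word_index = tokenize(sample)
for text, ids in zip(sample, sequences):
    print(text, "->", ids)
```

With the Keras class, the corresponding calls are fit_on_texts followed by texts_to_sequences, and the mapping is available as the word_index attribute.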

Padding (Implementation)
When batching sequences of word ids together, each sequence needs to be the same length. Since the sentences vary in length, padding is added at the end of each sentence to make them the same length. Keras provides the pad_sequences function for this purpose. The padding helper takes x, the list of sequences, and length, the length to pad to, and returns a padded NumPy array of sequences. The output of padding is given below. Fig. 4. Output of padding.
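The effect of padding at the end of each sequence can be shown with a small NumPy stand-in. This is a sketch of the behaviour, not the Keras implementation. The name pad, the default of padding to the longest sequence, and the choice to cut over-long sequences at the end are assumptions. With Keras itself, padding at the end corresponds to calling pad_sequences with padding='post', since its default is to pad at the start.

```python
import numpy as np


def pad(x, length=None):
    """Pad each sequence in x with trailing zeros up to `length`."""
    if length is None:
        # Assumption: default to the length of the longest sequence.
        length = max(len(seq) for seq in x)
    padded = np.zeros((len(x), length), dtype=np.int32)
    for row, seq in enumerate(x):
        trimmed = seq[:length]  # in this sketch, longer sequences are cut at the end
        padded[row, :len(trimmed)] = trimmed
    return padded


# Hypothetical word-id sequences of different lengths.
padded = pad([[1, 2, 3, 4], [1, 5, 6]])
print(padded)
```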

Pre-process Pipeline
A preprocess function combines these steps into a single pipeline. It takes two parameters, x (the feature list of sentences) and y (the label list of sentences), and returns a tuple of preprocessed x, preprocessed y, x tokenizer and y tokenizer. Keras's sparse_categorical_crossentropy function requires the labels to be in three dimensions.
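The pipeline can be sketched end to end as follows. The helpers tokenize and pad are plain-Python stand-ins for the Keras utilities, and the pipeline returns word-to-id dictionaries where the original returns tokenizer objects. Adding a trailing axis of size 1 is one common way to give the labels three dimensions. It is assumed here rather than taken from the original code, as are the sample sentences.

```python
from collections import Counter

import numpy as np


def tokenize(sentences):
    """Stand-in tokenizer: ids by descending frequency, starting at 1."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
    return [[word_index[w] for w in s.lower().split()] for s in sentences], word_index


def pad(x, length=None):
    """Stand-in padding: trailing zeros up to `length` (default: longest sequence)."""
    length = length or max(len(seq) for seq in x)
    out = np.zeros((len(x), length), dtype=np.int32)
    for row, seq in enumerate(x):
        out[row, :len(seq[:length])] = seq[:length]
    return out


def preprocess(x, y):
    """Tokenize and pad features and labels; give the labels a third dimension."""
    x_ids, x_index = tokenize(x)
    y_ids, y_index = tokenize(y)
    x_pad, y_pad = pad(x_ids), pad(y_ids)
    # (samples, time steps) -> (samples, time steps, 1) for the sparse loss.
    y_pad = y_pad.reshape(*y_pad.shape, 1)
    return x_pad, y_pad, x_index, y_index


# Hypothetical sentence pairs, for illustration only.
english = ["new jersey is cold", "paris is warm"]
french = ["new jersey est froid", "paris est chaud"]
x_pad, y_pad, x_index, y_index = preprocess(english, french)
print(x_pad.shape, y_pad.shape)
```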

Models
In this work, we have experimented with various neural network architectures: a simple Recurrent Neural Network, a simple RNN with embedding, a Bidirectional RNN and an Encoder-Decoder RNN. We have compared the functioning of these four architectures. The final architecture was built to outperform all four models.
The neural network does not output French text directly; it outputs logits over word ids. The function logits_to_text bridges the gap between the logits from the neural network and the French translation. It uses the tokenizer to turn the logits into French words.
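A minimal sketch of such a function is shown below. It takes the most likely word id at each time step and looks it up in the inverted word index. Passing a plain word-to-id dictionary instead of a tokenizer object, and showing id 0 as the token "<PAD>", are assumptions made to keep the example self-contained. The logits are made-up values.

```python
import numpy as np


def logits_to_text(logits, word_index):
    """Turn a (time steps, vocabulary) array of logits into a sentence."""
    index_to_word = {idx: word for word, idx in word_index.items()}
    index_to_word[0] = "<PAD>"  # assumption: id 0 is the padding id
    # argmax over the vocabulary axis picks the most likely word id per time step.
    return " ".join(index_to_word[i] for i in np.argmax(logits, axis=1))


# Hypothetical French word index and made-up logits for four time steps.
french_index = {"paris": 1, "est": 2, "chaud": 3}
logits = np.array([
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.1, 0.1, 0.2, 0.6],
    [0.9, 0.0, 0.1, 0.0],
])
text = logits_to_text(logits, french_index)
print(text)
```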

Simple RNN Implementation
A basic RNN is built and trained on x and y. The build function takes a tuple giving the input shape, the length of the output sequence, the number of unique English words in the dataset and the number of unique French words in the dataset. It returns a built Keras model.
Before training, the layers are built and the input is reshaped to work with a basic recurrent neural network. The network is then trained with the batch size reduced from 1024 to 100, and the prediction is printed.
The layers and parameters are given in the figure below. Fig. 5. Architecture of a Simple RNN.
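A build function with this signature might look like the sketch below. The use of a GRU as the recurrent layer, the 64 units, the Adam optimizer and the sample sizes are all assumptions; the original layer choices are those shown in Fig. 5. The sketch assumes TensorFlow's bundled Keras is installed and that the input has already been reshaped to [samples, time steps, 1].

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import GRU, Dense, TimeDistributed
from tensorflow.keras.models import Sequential


def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """Build a basic RNN that emits one distribution over French words per time step."""
    # output_sequence_length and english_vocab_size are unused in this sketch;
    # they are kept so the signature matches the one described in the text.
    model = Sequential([
        Input(shape=input_shape[1:]),          # (time steps, 1)
        GRU(64, return_sequences=True),        # assumed recurrent layer and size
        TimeDistributed(Dense(french_vocab_size, activation="softmax")),
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam", metrics=["accuracy"])
    return model


# Hypothetical sizes: 21 time steps, 200 English words, 345 French words.
model = simple_model((None, 21, 1), 21, 200, 345)
model.summary()
```

Training would then call model.fit with batch_size=100 on the reshaped inputs and the three-dimensional labels.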

RNN with Embedding
An embedding is a vector representation of a word that lies close to similar words in n-dimensional space, where n is the size of the embedding vectors. Instead of feeding the raw word ids to the network, word embeddings are used. The input is reshaped before training the neural network. The vocabulary size passed to the embedding layer is increased by 1 to avoid an index error. After this, the batch size is reduced to 100. The number of epochs is 10 and the validation split is 0.2. The prediction is printed. Fig. 6. Architecture of RNN with Embedding.
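An embedding layer is, in effect, a lookup table with one row per word id. The NumPy sketch below shows the lookup and why one extra row is needed. If word ids run from 1 to the vocabulary size and 0 marks padding, the table needs vocabulary size + 1 rows. The sizes and the random matrix are hypothetical; in a trained model the rows are learned.

```python
import numpy as np

vocab_size = 6       # hypothetical number of unique words; ids run from 1 to 6
embedding_dim = 4    # hypothetical size n of each embedding vector

rng = np.random.default_rng(0)
# One row per id, plus row 0 for the padding id: vocab_size + 1 rows in total.
embedding_matrix = rng.normal(size=(vocab_size + 1, embedding_dim))

# Padded word-id sequences of shape (samples, time steps).
padded_ids = np.array([[2, 3, 1, 4], [5, 1, 6, 0]])

# Indexing with the id array replaces every id by its embedding vector.
embedded = embedding_matrix[padded_ids]
print(embedded.shape)  # (samples, time steps, embedding_dim)
```

With only vocab_size rows, the lookup for the highest word id would fall outside the table, which is the index error the extra row avoids.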

Bidirectional RNNs
The RNN outputs depend on present inputs and the memory of the previous outputs. But

Encoder-Decoder Models
The model is made up of an encoder and a decoder. The encoder creates a matrix representation of the sentence. The decoder takes this matrix as input and predicts the translation as output. The steps involved in the encoder model are as follows:
1. Reverse the input sequence order for improved accuracy.
2. Add an adapter to fit the 2D output to the required 3D input shape [samples, time steps, features].
The decoder is implemented after this. The neural network is trained and the prediction is printed. Fig. 8. Architecture of encoder-decoder model.
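The two encoder steps can be sketched in NumPy as follows. The encoder output here is a made-up array standing in for a real encoder's final state, and repeating it once per output time step is one way to build the adapter. In Keras, the RepeatVector layer performs this repetition, though whether the original model uses it is an assumption.

```python
import numpy as np

# Padded word-id sequences of shape (samples, time steps); values are hypothetical.
padded_ids = np.array([[2, 3, 1, 4], [5, 1, 6, 0]])

# Step 1: reverse the order of each input sequence along the time axis.
reversed_ids = padded_ids[:, ::-1]

# Made-up encoder output: one feature vector per sentence, shape (samples, features).
context = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])

# Step 2: adapter from 2D to 3D. The same context vector is repeated for every
# output time step, giving shape (samples, time steps, features).
output_time_steps = 5
decoder_input = np.repeat(context[:, np.newaxis, :], output_time_steps, axis=1)

print(reversed_ids)
print(decoder_input.shape)
```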

Proposed model
In the proposed model, the input is first embedded, after which a bidirectional encoder is added.
An adapter then fits the encoder's 2D output to the 3D input shape [samples, time steps, features] required by the GRU.
Finally, the decoder is implemented. The architecture is given below.
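The sequence of layers described above could be assembled in Keras roughly as follows. This is a sketch under stated assumptions, not the exact proposed network. The layer sizes, the GRU encoder, the RepeatVector adapter, the Adam optimizer and the extra output class for the padding id are all assumptions, as are the example sizes. It assumes TensorFlow's bundled Keras is installed.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import (GRU, Bidirectional, Dense, Embedding,
                                     RepeatVector, TimeDistributed)
from tensorflow.keras.models import Sequential


def proposed_model(input_length, output_length, english_vocab_size,
                   french_vocab_size, embedding_dim=128, units=128):
    """Embedding -> bidirectional encoder -> 2D-to-3D adapter -> GRU decoder."""
    model = Sequential([
        Input(shape=(input_length,)),
        # +1 row so that padding id 0 and the highest word id both fit.
        Embedding(english_vocab_size + 1, embedding_dim),
        # Encoder: reads the sentence in both directions, 2D output (samples, 2 * units).
        Bidirectional(GRU(units)),
        # Adapter: repeat the 2D encoding to (samples, output_length, 2 * units).
        RepeatVector(output_length),
        # Decoder: one hidden state per output time step.
        GRU(units, return_sequences=True),
        # Assumption: +1 output class so that padding id 0 can be predicted.
        TimeDistributed(Dense(french_vocab_size + 1, activation="softmax")),
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam", metrics=["accuracy"])
    return model


# Hypothetical sizes: 15 input steps, 21 output steps, 200 English and 345 French words.
model = proposed_model(15, 21, 200, 345)
model.summary()
```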

Results
The final sentence translations are returned in French with an accuracy of 96.71%. Samples of the returned French translations are given below.

Conclusions
Thus, we have built a deep neural network that functions as part of an end-to-end machine translation pipeline, accepting English text as input and returning the French translation. The proposed network performed far better than the other architectures discussed previously in terms of validation loss. The accuracy was 96.71%. A random seed was fixed for reproducibility, a 3-fold cross-validation test harness was set up, and the model was compiled and evaluated.