Project Proposal
Machine translation under atypical use.
Machine translation is the task of converting text from a source language to a target language using machine learning algorithms.
The inherent complexity of language means that a purely rule-based approach would be tedious and inefficient, hence
the need for machine learning / neural network based methods. Neural machine translation models can capture much of the ambiguity and flexibility of
human language, but they still lack predictability: they are trained on corpora of 'formal' usage and perform poorly on text 'in the wild'.
In this project we will therefore try to build models that are trained on standard corpora but also perform well on distributionally shifted
data, i.e. text containing profanities, punctuation errors, grammatical mistakes, slang, etc.
Dataset
For training we will use standard datasets such as the WMT'20 English-Russian corpus, the English-Russian Newstest'19 set, and the
corpus of news data collected from the GlobalVoices news service. The text in these corpora is mostly formal. To evaluate on shifted data
we will use the Reddit corpus prepared for the WMT'19 robustness challenge. This dataset is annotated by expert human translators and supplied by the Shifts challenge team (NeurIPS 2021).
It is tagged with the following anomalies:
punctuation anomalies, capitalisation anomalies, fluency anomalies, slang anomalies, emoji anomalies and tag anomalies.
Robustness and uncertainty of the predictions will be evaluated jointly through the area under error-retention curves.
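As a concrete illustration, below is a minimal Python sketch of how an error-retention curve and its area might be computed, assuming per-sentence errors (e.g. 100 − GLEU) and corresponding uncertainty scores are already available; `errors` and `uncertainties` are hypothetical arrays, and replacing rejected predictions with zero-error references is our simplification of the challenge metric.

```python
import numpy as np

def error_retention_auc(errors, uncertainties):
    """Area under the error-retention curve (lower is better).

    At retention fraction k/N the k least-uncertain predictions are kept and
    the rest are assumed corrected (error 0), so the mean error is taken over
    all N examples at each retention level.
    """
    errors = np.asarray(errors, dtype=float)
    order = np.argsort(uncertainties)                     # most confident first
    cumulative = np.concatenate([[0.0], np.cumsum(errors[order])]) / len(errors)
    retention = np.arange(len(errors) + 1) / len(errors)  # 0, 1/N, ..., 1
    return np.trapz(cumulative, retention)

# toy usage: errors could be 100 - sentence GLEU, uncertainty from an ensemble
print(error_retention_auc([5.0, 40.0, 12.0, 60.0], [0.1, 0.8, 0.2, 0.9]))
```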
Previous methods and baselines.
For the baseline we use an ensemble of 3 transformer-big models trained on the WMT'20 En-Ru corpus. Most previous work on robust models has been done on small-scale image classification
problems rather than on multi-modal problems like translation, where multiple correct output sentences are possible for one input sentence.
Such approaches have since been extended to structured prediction tasks like ours. They can be broadly characterised as ensemble- and sampling-based methods, Prior Networks,
and temperature scaling.
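As an illustration of the last of these, here is a minimal temperature-scaling sketch (our own toy example, not code from any of the cited works): a single temperature $T$ is fitted on held-out logits by minimising the negative log-likelihood, and the logits are then rescaled by it.

```python
import torch

def temperature_scale(logits, labels, max_iter=100):
    """Fit one temperature T on held-out logits/labels by minimising the NLL,
    then return calibrated probabilities softmax(logits / T) and T itself."""
    log_t = torch.zeros(1, requires_grad=True)  # optimise log T so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    T = log_t.exp().item()
    return torch.softmax(logits / T, dim=-1), T

# toy usage with random held-out logits and labels
probs, T = temperature_scale(torch.randn(32, 10), torch.randint(0, 10, (32,)))
```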
Our approach and Schedule.
Up to midway: perform statistical analysis of the dataset, try existing ensemble-based methods and existing techniques for robust learning.
After midway: try to include ELMo word embeddings, implement a paper on max-min robust optimisation, and assess and optimise the models.
Our objective is to improve on the baseline performance, come up with additional measures for uncertainty estimation, and get onto the evaluation leaderboards.
Relevant Papers
- Wang, Y., Wu, L., Xia, Y., Qin, T., Zhai, C., & Liu, T.-Y. (2020). Transductive Ensemble Learning for Neural Machine Translation.
Proceedings of the AAAI Conference on Artificial Intelligence, 34(04), 6291-6298.
- Michel, P., Hashimoto, T. B., & Neubig, G. (2021). Modeling the Second Player in Distributionally Robust Optimization. arXiv preprint arXiv:2103.10282.
- Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proc. of NAACL.
- Malinin, A., Band, N., Ganshin, A., Chesnokov, G., Gal, Y., Gales, M., Noskov, A., Ploskonosov, A., Prokhorenkova, L., Provilkov, I., Raina, V., Raina, V., Roginskiy,
D., Shmatova, M., Tigas, P., & Yangel, B. (2021). Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks. arXiv preprint arXiv:2107.07455.
Project Final
Synopsis of papers
Upon reading the papers, only a few were found to be directly useful for our use case.
1. Malinin, A., Band, N., Ganshin, A., Chesnokov, G., Gal, Y., Gales, M., Noskov, A., Ploskonosov, A., Prokhorenkova, L., Provilkov, I., Raina, V., Raina, V., Roginskiy, D., Shmatova, M., Tigas, P., & Yangel, B. (2021). Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks. arXiv preprint arXiv:2107.07455.
- Recall: we want models that are robust to distributional shift, i.e. they perform well on data 'in the wild', which can be very different from the training data.
- Additionally, we want the model to give reliable uncertainty estimates for its predictions so that we know when to trust it.
- Datasets: WMT'20 En-Ru, Newstest'19, WMT'19 MTNT Reddit, GlobalVoices, Shifts Reddit. The authors provide the shifted dataset for exactly this purpose.
- For predictive performance, standard metrics can be used: BLEU, eGLEU, maxGLEU.
- The expected GLEU across multiple hypotheses is given by
$$
\text{eGLEU} = \frac{1}{N}\sum_{i=1}^{N}\sum_{h=1}^{H} \text{GLEU}_{i,h}\, w_{i,h}, \qquad \sum_{h=1}^{H} w_{i,h} = 1,
$$
while
$$
\text{maxGLEU} = \frac{1}{N}\sum_{i=1}^{N}\max_{h}\left[\text{GLEU}_{i,h}\right]
$$
- To assess the quality of uncertainty estimates we could measure the discriminatory power between 'in-domain' and 'out-of-domain' inputs using AUPR and AUROC, but this only evaluates shift detection rather than how well the uncertainty tracks translation error.
- Instead, the authors propose the area under error-retention curves for a joint assessment of robustness and uncertainty.
- The authors use an ensemble of transformer-big models trained with a fork of fairseq (a library for implementing sequence-to-sequence models in PyTorch). They use uncertainty estimates from a referenced paper (both the transformer model and the uncertainty estimates are discussed below).
- The joint assessment of the authors' baseline is given below:
|Data | R-AUC | F1-AUC | ROC-AUC |
|--------|---------|---------|----------|
| dev |33.22±0.48| 0.428±0.003| 68.9±0.28|
- Why do we need Transformers?
- Translation is an autoregressive task: given an input sequence we must produce an output sequence.
- RNNs are neural networks that unroll over discrete time steps and have been used for vec2seq, seq2vec and seq2seq tasks.
- They suffer from a few major drawbacks: they are slow to train and suffer from vanishing gradients, so they do not perform well on longer sentences.
- The LSTM block was introduced to solve the vanishing-gradient problem, but LSTMs are even slower to train.
- The underlying problem is their sequential nature, which does not allow parallelisation.
- Self-attention layers were added with the idea that not all words are equally important for translation, so the model should focus only on the relevant parts.
- The context needed for a good translation may lie far back in a sentence, which is where traditional RNNs/LSTMs fail.
- The attention function is a differentiable function of a query $Q$, key $K$ and value $V$, computed as (a short PyTorch sketch of this formula and the positional encoding below is given at the end of this list):
$$
\text{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V
$$
The positional encoding used in the original paper is
$$
\begin{aligned}
PE_{(pos,\,2i)} &= \sin\left(pos / 10000^{2i/d_{\text{model}}}\right) \\
PE_{(pos,\,2i+1)} &= \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)
\end{aligned}
$$
- Because of these significant improvements, transformers have become the basis for many other sequence prediction and NLP models (e.g. BERT, GPT-3).
- Baseline models from fairseq were used for the subsequent tasks.
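For reference, here is a short PyTorch sketch of the two formulas above (our own illustration, not the fairseq implementation; it assumes an even $d_{\text{model}}$):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (..., L_q, L_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                # attention weights
    return weights @ V, weights

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angles = pos / torch.pow(torch.tensor(10000.0), i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe
```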
3. Malinin, A., & Gales, M. (2020). Uncertainty Estimation in Autoregressive Structured Prediction. arXiv preprint arXiv:2002.07650.
- Their goal is to give unsupervised uncertainty estimates for autoregressive tasks at both the token and the sentence level.
- The authors distinguish two types of uncertainty: knowledge uncertainty and data uncertainty.
- Knowledge uncertainty captures the uncertainty due to the model's lack of knowledge about the data.
- Data uncertainty is the intrinsic uncertainty (noise) in the data.
- The authors take a Bayesian approach: they treat the model parameters $\boldsymbol{\theta}$ as a random variable with a prior $p(\boldsymbol{\theta})$.
- The posterior over parameters given the data, $p(\boldsymbol{\theta} \mid \mathcal{D})$, is intractable, so it is approximated with $q(\boldsymbol{\theta}) \approx p(\boldsymbol{\theta} \mid \mathcal{D})$.
- Using the chain rule of entropy, total uncertainty = knowledge uncertainty + data uncertainty (written out after this list).
- Taking inspiration from this, they use the mutual information between $y$ and $\boldsymbol{\theta}$ to quantify ensemble diversity.
- Next they use EPKL and reverse mutual information to measure model diversity/uncertainty.
- They explore some properties of the three measures and give Monte Carlo approximations for computing them.
- They also explore how different ways of combining ensemble predictions affect the uncertainty estimates.
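Written out, the decomposition referred to above (the standard information-theoretic identity, with $q(\boldsymbol{\theta})$ approximating the posterior) is

$$
\underbrace{\mathcal{H}\left[\,\mathbb{E}_{q(\boldsymbol{\theta})}\left[P(y \mid x, \boldsymbol{\theta})\right]\right]}_{\text{total uncertainty}}
=
\underbrace{\mathcal{I}\left[y, \boldsymbol{\theta} \mid x, \mathcal{D}\right]}_{\text{knowledge uncertainty}}
+
\underbrace{\mathbb{E}_{q(\boldsymbol{\theta})}\left[\mathcal{H}\left[P(y \mid x, \boldsymbol{\theta})\right]\right]}_{\text{data uncertainty}}
$$

The mutual-information term vanishes when all ensemble members agree, which is why it serves as a measure of knowledge uncertainty / ensemble diversity.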
4. Wang, Y., Wu, L., Xia, Y., Qin, T., Zhai, C., & Liu, T.-Y. (2020). Transductive Ensemble Learning for Neural Machine Translation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04), 6291-6298.
- While ensemble learning is effective in many scenarios, it has limitations in some cases.
- When the single models are already strong, ensemble-based techniques do not give much improvement.
- There is also a diminishing return as the number of models increases.
- The authors therefore propose a technique based on transductive learning to address these issues, and claim that it performs better than other ensemble-based techniques such as knowledge distillation.
- The key idea is to generate a synthetic dataset from the validation and test sets by producing their translations with the ensemble, fine-tune a single model on this dataset to minimise a loss function, and then choose the model that performs best on the validation set. The ensemble translations can be combined in two ways; the token-level combination is
$$
y^{(t)}=\underset{w \in \mathcal{V}_{t}}{\operatorname{argmax}} \frac{1}{M} \sum_{m=1}^{M} \log P\left(w \mid y^{(<t)}, x ; f_{m}\right)
$$
while the sentence-level combination is
$$
y=\underset{y^{\prime} \in \mathcal{T}(x)}{\operatorname{argmax}} \frac{1}{M} \sum_{m=1}^{M} \log P\left(y^{\prime} \mid x ; f_{m}\right)
$$
These represent token-level and sentence-level averaging of the ensemble, respectively (a minimal sketch of the token-level averaging follows the last equation below).
- The function to minimise during fine-tuning is then
$$
\min \sum_{(x, y) \in \mathcal{D}_{v} \cup \mathcal{D}_{t}} -\log P\left(y \mid x ; f_{0}\right)
$$
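To make the token-level combination concrete, here is a minimal sketch (our own illustration; `log_prob_next` is a hypothetical per-model interface, not a fairseq API):

```python
import torch

@torch.no_grad()
def ensemble_next_token(models, src, prefix):
    """Greedy token-level combination: average the M models' next-token
    log-probabilities and pick the argmax over the vocabulary."""
    # each model is assumed to expose log_prob_next(src, prefix) -> (vocab_size,) tensor
    log_probs = torch.stack([m.log_prob_next(src, prefix) for m in models])
    avg = log_probs.mean(dim=0)   # (1/M) * sum_m log P(w | y_<t, x; f_m)
    return int(avg.argmax())
```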
Misc details
After tokenising with the Moses tokeniser, BPE codes must be learnt; this produces subword tokens that enrich the vocabulary.
To run inference, the tokenised sentences must first be converted into the fairseq binary format for faster processing.
To choose the best hypothesis from the enormous space of all possible finite-length sentences we use beam search (a minimal sketch is given below).
In beam search we keep a fixed number of partial hypotheses (the beam) and at each step extend them with the tokens that maximise the cumulative log-likelihood.
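A bare-bones sketch of the idea (our own illustration, not the fairseq decoder; `log_prob_fn`, `bos` and `eos` are assumed/hypothetical):

```python
import torch

@torch.no_grad()
def beam_search(log_prob_fn, bos, eos, beam_size=5, max_len=50):
    """Keep `beam_size` partial hypotheses and repeatedly extend them with
    their best continuations, ranked by cumulative log-likelihood."""
    beams = [([bos], 0.0)]              # (token list, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = log_prob_fn(tokens)          # (vocab_size,) tensor
            top_lp, top_ix = log_probs.topk(beam_size)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((tokens + [ix], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:                   # every surviving hypothesis has ended
            break
    best = max(finished + beams, key=lambda c: c[1])
    return best[0], best[1]
```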
Revised Plan
- By the midway point we were able to draw inferences from the baseline model and the ensemble of models.
- The implementation of the baseline uncertainty estimates provided by the authors ran into many errors; attempts were made to correct them, but we were unable to pinpoint the cause.
- We therefore tried to write our own implementation, with limited success so far.
- In the next section we focus on how we can go about improving robustness and uncertainty estimates (tested on a synthetically created dataset).
Uncertainty/Robustness in Deep Learning
- We can motivate the decomposition into data and knowledge uncertainty with a linear model: we sample training points from two regions with different noise variance.
- Data uncertainty is high where the noise variance is high, while knowledge uncertainty is high in regions with few training examples.
- We train models with and without dropout enabled during inference (on the synthetic dataset) to see how they perform, and also compare various dropout rates (a minimal Monte Carlo dropout sketch follows this list).
- We train a model that has seen some examples from the validation set and derive uncertainty estimates from the difference between its predictions and those of a base model.
- We use the $\chi^2$ distance to compare the two distributions.
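A minimal Monte Carlo dropout sketch for the synthetic experiments above, assuming a generic PyTorch `model` containing dropout layers (our own toy illustration):

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    """Run several stochastic forward passes with dropout active and use the
    mean as the prediction and the variance as an uncertainty estimate."""
    model.train()   # keeps nn.Dropout active (in a real model, enable only the dropout layers)
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    model.eval()
    return samples.mean(dim=0), samples.var(dim=0)
```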
Results/Code:
What has been done so far:
- Baseline transformer: inference generated for the validation and test sets, giving log-likelihoods for the different tokens.
- Baseline uncertainty: the authors' implementation is not working; our own implementation will require more time, as some details were overlooked on the initial reading.
- Interactive translation and visualisation of attention vectors were implemented.
- Varying dropout rates and a method for uncertainty estimation were explored on the synthetic dataset.
The Colab file will be updated with the code/results, and the notebook used for learning PyTorch and fairseq will be added to the same repository, after trying to fix the errors (those that can be fixed) before the 19th.