Project Proposal

Machine translation under atypical use.

Machine translation is the task of converting text from a source language into a target language using machine learning algorithms. The inherent complexity of language means that a rule-based approach would be tedious and inefficient, hence the need for machine learning / neural network based methods. Neural machine translation methods can model the ambiguity and flexibility of human language, but they still lack predictability: they are trained on corpora of 'formal use' and perform poorly on text 'in the wild'. In this project we will therefore try to build methods/models that are trained on standard corpora but also perform well on distributionally shifted data, i.e. text containing profanities, punctuation errors, grammatical mistakes, slang, etc.

Dataset

For training we will use standard datasets such as the WMT'20 English-Russian corpus, the English-Russian Newstest'19 set, and a corpus of news data collected from the GlobalVoices news service. The text in these corpora is mostly formal. To evaluate on a shifted dataset we will use the Reddit corpus prepared for the WMT'19 robustness challenge. This dataset is annotated by expert human translators and supplied by the Shifts challenge team (NeurIPS 2021). It is tagged with the following anomalies: punctuation anomalies, capitalisation anomalies, fluency anomalies, slang anomalies, emoji anomalies and tag anomalies. Robustness and uncertainty of the predictions will be evaluated through the area under error-retention curves.
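For reference, below is a minimal Python/NumPy sketch of how we understand the area under an error-retention curve to be computed: predictions are sorted by the model's uncertainty, the most uncertain ones are assumed to be handed to an oracle (their error set to zero), and the mean error is integrated over the retention fraction. The function and variable names are our own, not from the challenge code.

```python
import numpy as np

def retention_auc(errors, uncertainties):
    """Area under the error-retention curve (lower is better).

    errors: per-sentence error values (e.g. 100 - sentence BLEU)
    uncertainties: per-sentence uncertainty scores from the model
    At retention fraction r, the (1 - r) most uncertain predictions are
    assumed to be corrected by an oracle, so their error becomes 0.
    """
    errors = np.asarray(errors, dtype=float)
    order = np.argsort(uncertainties)          # most confident first
    sorted_errors = errors[order]
    n = len(errors)
    # mean error over the whole set when retaining the k most confident examples
    cum_error = np.cumsum(sorted_errors) / n   # rejected examples contribute 0
    retention = np.arange(1, n + 1) / n
    return np.trapz(cum_error, retention)
```

A lower area means the model's uncertainty estimates rank its own errors well.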

Previous methods and baselines.

For the baseline we use an ensemble of 3 transformer-big models trained on the WMT'20 En-Ru corpus. Some of the previous work on robust models has been done on small-scale image classification problems, but not on multi-modal problems like translation, where multiple correct output sentences are possible for one input sentence. Such approaches have since been extended to structured prediction tasks like ours. They can be characterised as ensemble- and sampling-based methods, Prior Networks, and temperature scaling.
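To make the ensemble idea concrete, the sketch below (our own illustration, not the baseline's actual code) shows how token-level uncertainties can be derived from an ensemble: the entropy of the averaged predictive distribution gives total uncertainty, and its gap to the average member entropy gives knowledge (model) uncertainty.

```python
import numpy as np

def token_uncertainties(member_probs):
    """Ensemble-based token-level uncertainty (illustrative sketch).

    member_probs: array of shape (M, V) - the predictive distribution over
    the vocabulary V from each of the M ensemble members at one decoding step.
    Returns total uncertainty (entropy of the mean prediction) and
    knowledge uncertainty (mutual information between prediction and member).
    """
    mean_probs = member_probs.mean(axis=0)
    total = -np.sum(mean_probs * np.log(mean_probs + 1e-12))
    expected_entropy = -np.mean(
        np.sum(member_probs * np.log(member_probs + 1e-12), axis=1))
    knowledge = total - expected_entropy   # mutual information
    return total, knowledge
```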

Our approach and Schedule.

Up to the midway point: perform statistical analysis of the datasets, and try existing ensemble-based methods and existing techniques for robust learning. After midway: try to include ELMo word embeddings, implement a paper on max-min robust optimisation, and assess and optimise the models. Our objective will be to improve on the baseline performance, try to come up with additional measures for uncertainty estimation, and get onto the evaluation leaderboards.

Relevant Papers

  1. Wang, Y., Wu, L., Xia, Y., Qin, T., Zhai, C., & Liu, T.-Y. (2020). Transductive Ensemble Learning for Neural Machine Translation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04), 6291-6298.
  2. Michel, P., Hashimoto, T. B., & Neubig, G. (2021). Modeling the Second Player in Distributionally Robust Optimization. arXiv preprint arXiv:2103.10282.
  3. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proc. of NAACL.
  4. Malinin, A., Band, N., Ganshin, A., Chesnokov, G., Gal, Y., Gales, M., Noskov, A., Ploskonosov, A., Prokhorenkova, L., Provilkov, I., Raina, V., Raina, V., Roginskiy, D., Shmatova, M., Tigas, P., & Yangel, B. (2021). Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks. arXiv preprint arXiv:2107.07455.

Project Final

Synopsis of papers


Upon reading the papers, only a few were found to be useful for our use case.

1. Malinin, A., Band, N., Ganshin, A., Chesnokov, G., Gal, Y., Gales, M., Noskov, A., Ploskonosov, A., Prokhorenkova, L., Provilkov, I., Raina, V., Raina, V., Roginskiy, D., Shmatova, M., Tigas, P., & Yangel, B. (2021). Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks. arXiv preprint arXiv:2107.07455.


2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).

3. Malinin, A., & Gales, M. (2020). Uncertainty Estimation in Autoregressive Structured Prediction. arXiv preprint arXiv:2002.07650.

4. Wang, Y., Wu, L., Xia, Y., Qin, T., Zhai, C., & Liu, T.-Y. (2020). Transductive Ensemble Learning for Neural Machine Translation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04), 6291-6298.


Misc details

After tokenising with the Moses tokeniser, BPE codes must be learnt; this produces subword tokens that enrich our vocabulary. Before running inference, the tokenised sentences must be converted into fairseq's binary format for faster processing. To choose the best hypothesis from the very large space of all possible sentences of finite length, we use beam search.
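A minimal sketch of these preprocessing steps in Python, using the sacremoses and subword-nmt packages (our actual pipeline may instead use the original Moses and fairseq command-line scripts; 'bpe.codes' is a placeholder for codes learnt on the training corpus):

```python
from sacremoses import MosesTokenizer
from subword_nmt.apply_bpe import BPE

mt = MosesTokenizer(lang="en")

with open("bpe.codes") as codes:        # BPE codes learnt beforehand on the training data
    bpe = BPE(codes)

def preprocess(sentence: str) -> str:
    # Moses tokenisation followed by BPE segmentation into subword units
    tokenized = mt.tokenize(sentence, return_str=True)
    return bpe.process_line(tokenized)

print(preprocess("Machine translation under atypical use."))
# The BPE-segmented text is then binarised with fairseq-preprocess before inference.
```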

In beam search we keep track of a fixed number of partial hypotheses; at each step every hypothesis is extended with candidate tokens, and only the extensions with the highest cumulative log-likelihood are kept. The highest-scoring completed hypothesis is returned.
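The following toy Python sketch illustrates the procedure; `next_token_log_probs` is a hypothetical stand-in for the model's per-step distribution, not a fairseq API.

```python
def beam_search(next_token_log_probs, bos, eos, beam_size=5, max_len=50):
    """Toy beam search. `next_token_log_probs(prefix)` returns {token: log_prob}."""
    beams = [([bos], 0.0)]                      # (prefix, cumulative log-likelihood)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_token_log_probs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # keep only the best `beam_size` partial hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)                      # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])    # highest-scoring hypothesis
```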

Revised Plan

Uncertainty/Robustness in Deep Learning

Results/Code: What has been done so far:

  1. Baseline transformer: inference generated for the validation and test sets, giving log-likelihood probabilities for the different tokens.
  2. Baseline uncertainty: the authors' implementation is not working; our own implementation is in progress but will require more time, as some details were overlooked on the initial reading.
  3. Interactive translation and visualisation of attention vectors were implemented.
  4. Varying dropout rates and a method for uncertainty estimation were explored on a synthetic dataset (see the sketch after this list). The Colab file in this same repository will be updated with the code/results and the notebook used for learning PyTorch and fairseq, after trying to fix the errors (those that can be fixed) before the 19th.
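For reference, a minimal sketch of the Monte-Carlo dropout procedure we experimented with on the synthetic data (the architecture, dropout rate and sample count here are illustrative, not the exact notebook settings):

```python
import torch
import torch.nn as nn

# toy regression model with dropout layers
model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=50):
    """Keep dropout active at test time and average several stochastic passes."""
    model.train()                       # enables dropout during inference
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)   # prediction, uncertainty

x = torch.linspace(-3, 3, 100).unsqueeze(1)
mean, std = mc_dropout_predict(model, x)
```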