ACL 2021 Highlights

This post discusses my highlights of ACL 2021, including challenges in benchmarking, machine translation, model understanding, and multilingual NLP.

ACL 2021 took place virtually from 1–6 August 2021. Here are my highlights from the conference:

NLP benchmarking is broken

Many talks and papers made reference to the current state of NLP benchmarking, which has seen existing benchmarks largely outpaced by rapidly improving pre-trained models.

My favourite resources on this topic from the conference are:

I've also written a longer blog post that provides a broader overview of different perspectives, challenges, and potential solutions to improve benchmarking in NLP.

NLP is all about pre-trained Transformers

This should come as no surprise, but it is still interesting to see that the 14 "hot" topics of 2021 (see below) included five pre-trained models (BERT, RoBERTa, BART, GPT-2, XLM-R) as well as a general "Language models" topic. These models are essentially all variants of the same Transformer architecture.

Roberto Navigli discussing "hot" topics at ACL 2018 vs ACL 2021 during the opening session

This serves as a useful reminder that the community is overfitting to a particular setting and that it may be worthwhile to look beyond the standard Transformer model (see my recent newsletter for some inspiration).

There were a few papers that sought to improve the general Transformer architecture for processing long and short documents respectively:

Machine translation

As in past years, machine translation was one of the most popular tracks of the conference, just behind the general ML track in terms of the number of submissions, as can be seen below.

The distribution of submissions per track (with 3350 submissions in total).

Three of the top six papers are on MT:

  • Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers. Marie et al. investigate how credible the evaluation in a large number of MT papers actually is. They find that almost all papers used BLEU, and 74.3% used BLEU exclusively. 108 new MT metrics have been proposed in the last decade, but none are used consistently. Unsurprisingly, most papers do not perform statistical significance testing. An increasing number of papers copy scores from previous work, and scores are sometimes reported with different variants of the BLEU script, making them incomparable. They provide the following guidelines for MT evaluation: do not use BLEU exclusively, perform statistical significance testing, do not copy numbers from prior work, and compare systems on the same pre-processed data (a minimal sketch of such an evaluation follows this list). More recently, in 2022, Benjamin Marie wrote additional posts analysing shortcomings in the MT evaluation of prominent papers.
  • Neural Machine Translation with Monolingual Translation Memory. Cai et al. combine neural networks with a non-parametric memory.
  • Vocabulary Learning via Optimal Transport for Neural Machine Translation. Xu et al. frame vocabulary learning as an optimal transport problem. They propose marginal utility as a measure of vocabulary quality.
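
To make those evaluation guidelines concrete, here is a minimal sketch in Python. It assumes sacrebleu is installed and that refs.txt, system_a.txt, and system_b.txt are hypothetical reference and system-output files; it reports two metrics computed with a standard tool and runs a simple paired bootstrap resampling test. It is an illustration, not the exact protocol recommended by Marie et al.

```python
import random
import sacrebleu  # assumed installed: pip install sacrebleu

# Hypothetical files: one segment per line, already detokenised.
refs = [line.strip() for line in open("refs.txt")]
sys_a = [line.strip() for line in open("system_a.txt")]
sys_b = [line.strip() for line in open("system_b.txt")]

# Report more than one metric, computed with a standard tool rather than a
# custom re-implementation, so that scores are comparable across papers.
for name, hyps in [("A", sys_a), ("B", sys_b)]:
    bleu = sacrebleu.corpus_bleu(hyps, [refs]).score
    chrf = sacrebleu.corpus_chrf(hyps, [refs]).score
    print(f"system {name}: BLEU {bleu:.1f}, chrF {chrf:.1f}")

# Simple paired bootstrap resampling: how often does A beat B when the
# test set is resampled with replacement?
random.seed(0)
n_resamples, a_wins = 1000, 0
for _ in range(n_resamples):
    sample = random.choices(range(len(refs)), k=len(refs))
    score_a = sacrebleu.corpus_bleu([sys_a[i] for i in sample],
                                    [[refs[i] for i in sample]]).score
    score_b = sacrebleu.corpus_bleu([sys_b[i] for i in sample],
                                    [[refs[i] for i in sample]]).score
    a_wins += score_a > score_b
print(f"A > B in {a_wins / n_resamples:.1%} of resamples "
      f"(p ≈ {1 - a_wins / n_resamples:.3f})")
```

Reporting the scores together with the tool's metric signature (which sacreBLEU prints alongside its output) makes it clear which variant and pre-processing were used.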

Within machine translation, there were a couple of papers that I particularly enjoyed:

There were also some papers that focused on machine translation for low-resource language varieties without using parallel data for those languages:

Understanding models

Gaining a better understanding of the behaviour of current models was another theme of the conference, with three out of the six outstanding papers falling in this area:

  • Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. This paper analyses fine-tuning through the lens of intrinsic dimension and shows that common pre-trained models have a very low intrinsic dimension. They also show that pre-training implicitly minimises the intrinsic dimension and that larger models tend to have lower intrinsic dimension. Intrinsic dimension is a very relevant concept for the evaluation and design of efficient pre-trained models, which we covered in an EMNLP 2022 tutorial (see the sketch after this list for the underlying reparametrisation).
  • Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering. This paper investigates the failure of active learning on VQA. The authors observe that the acquired examples are collective outliers, i.e., groups of examples that are hard or impossible for current models. Removing such hard outliers makes things easier for active learning.
  • UnNatural Language Inference. This paper changes the word order of NLI sentences to investigate if models "know syntax". They find that state-of-the-art NLI models are largely invariant to word order changes. They observe that some distributional information (POS neighbourhood) may be useful for performing well in the permuted setup. Unsurprisingly, human annotators struggle on the permuted sentences.
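
As a rough illustration of the intrinsic dimension idea, below is a minimal sketch of the random-subspace reparametrisation it relies on, written for a toy linear classifier in PyTorch rather than a pre-trained Transformer; the shapes, data, and training step are purely illustrative, and the paper's actual implementation uses a more memory-efficient projection.

```python
import torch

# Toy setup: a linear classifier with D parameters, fine-tuned only in a
# d-dimensional random subspace: theta = theta_0 + P @ theta_d.
D_in, n_classes, d = 768, 3, 16
torch.manual_seed(0)

# "Pre-trained" starting point theta_0 (here just a random init) for a
# weight matrix and bias, flattened into one vector of size D.
W0 = torch.randn(n_classes, D_in) * 0.02
b0 = torch.zeros(n_classes)
theta0 = torch.cat([W0.flatten(), b0])
D = theta0.numel()

# Fixed random projection from the d-dim subspace into the full space,
# and the only trainable tensor: the d-dimensional offset theta_d.
P = torch.randn(D, d) / d ** 0.5
theta_d = torch.zeros(d, requires_grad=True)

def forward(x):
    # Reconstruct the full parameters from the low-dimensional offset.
    theta = theta0 + P @ theta_d
    W = theta[: n_classes * D_in].view(n_classes, D_in)
    b = theta[n_classes * D_in:]
    return x @ W.T + b

# One illustrative training step on random data; only theta_d is updated.
x, y = torch.randn(32, D_in), torch.randint(0, n_classes, (32,))
opt = torch.optim.Adam([theta_d], lr=1e-2)
loss = torch.nn.functional.cross_entropy(forward(x), y)
loss.backward()
opt.step()
# The intrinsic dimension is then the smallest d at which this kind of
# subspace fine-tuning comes close to full fine-tuning performance.
```

The same idea carries over to a pre-trained Transformer; only the way the projection is stored and applied changes to keep memory manageable.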

I also enjoyed the following two papers that developed new methods and frameworks for understanding model behaviour:

Cross-lingual transfer and multilingual NLP

Beyond machine translation, I enjoyed the following papers on cross-lingual transfer and multilingual NLP:

Challenges in natural language generation

Natural language generation (NLG) is one of the most challenging settings for NLP. Some papers I enjoyed focused on the challenges of different NLG applications:

Virtual conference notes

Lastly, I want to share some brief notes to add to the ongoing conversation around formats for virtual conferences. I was mainly looking forward to attending the poster sessions, as these (along with the social interactions) are usually my highlight of a conference. There were two poster sessions in my timezone. Each consisted of a large number of tracks being presented at the same time, which left considerably less time to explore posters and talk to presenters from other areas.

Specific posters were hard to find, as the virtual poster spaces showed neither the title nor the authors of a poster. In addition, the space between posters was so small that audio from neighbouring posters with large crowds would get mixed together.

In the future, I would really like to see poster sessions that are:

  • larger in number and covering only a few tracks each;
  • spread out throughout the day and timezones;
  • easy to navigate and with enough space between posters.

Two other things that would have improved my virtual conference experience were a) a chat system that is more seamlessly integrated into the conference platform and b) a tighter integration between the conference platform and the ACL Anthology (linking to the papers in the Anthology would be nice).

Attending the Zoom sessions for paper presentations went well and I enjoyed watching the recordings of other talks and keynotes.