The 4 Biggest Open Problems in NLP

This is the second post based on the Frontiers of NLP session at the Deep Learning Indaba 2018. It discusses 4 major open problems in NLP.

This is the second blog post in a two-part series. The series expands on the Frontiers of Natural Language Processing session organized by Herman Kamper, Stephan Gouws, and me at the Deep Learning Indaba 2018. Slides of the entire session can be found here. The first post discussed major recent advances in NLP focusing on neural network-based methods. This post discusses major open problems in NLP. You can find a recording of the panel discussion this post is based on here.

In the weeks leading up to the Indaba, we asked NLP experts a number of simple but big questions. Based on the responses, we identified the four problems that were mentioned most often:

  1. Natural language understanding
  2. NLP for low-resource scenarios
  3. Reasoning about large or multiple documents
  4. Datasets, problems, and evaluation

We discussed these problems during a panel discussion. This article is mostly based on the responses from our experts (which are well worth reading) and the thoughts of my fellow panel members Jade Abbott, Stephan Gouws, Omoju Miller, and Bernardt Duvenhage. I will aim to provide context around some of the arguments for anyone interested in learning more.

Natural language understanding

I think the biggest open problems are all related to natural language understanding. [...] we should develop systems that read and understand text the way a person does, by forming a representation of the world of the text, with the agents, objects, settings, and the relationships, goals, desires, and beliefs of the agents, and everything else that humans create to understand a piece of text. Until we can do that, all of our progress is in improving our systems’ ability to do pattern matching.
– Kevin Gimpel

Many experts in our survey argued that the problem of natural language understanding (NLU) is central as it is a prerequisite for many tasks such as natural language generation (NLG). The consensus was that none of our current models exhibit 'real' understanding of natural language.

Innate biases vs. learning from scratch   A key question is which biases and structure we should build explicitly into our models to get closer to NLU. Similar ideas were discussed at the Generalization workshop at NAACL 2018, which Ana Marasovic reviewed for The Gradient and I reviewed here. Many responses in our survey mentioned that models should incorporate common sense. In addition, dialogue systems (and chatbots) were mentioned several times.

On the other hand, for reinforcement learning, David Silver argued that you would ultimately want the model to learn everything by itself, including the algorithm, features, and predictions. Many of our experts took the opposite view, arguing that you should actually build some understanding into your model. What should be learned and what should be hard-wired into the model was also explored in the debate between Yann LeCun and Christopher Manning in February 2018.

Program synthesis   Omoju argued that incorporating understanding is difficult as long as we do not understand the mechanisms that actually underlie NLU and how to evaluate them. She argued that we might want to take ideas from program synthesis and automatically learn programs based on high-level specifications instead. Ideas like this are related to neural module networks and neural programmer-interpreters.

She also suggested that we should look back to approaches and frameworks that were originally developed in the 80s and 90s, such as FrameNet, and merge these with statistical approaches. This should help us infer common-sense properties of objects, such as whether a car is a vehicle, has handles, etc. Inferring such common-sense knowledge has also been a focus of recent datasets in NLP.

Embodied learning   Stephan argued that we should use the information in available structured sources and knowledge bases such as Wikidata. He noted that humans learn language through experience and interaction, by being embodied in an environment. One could argue that there exists a single learning algorithm that if used with an agent embedded in a sufficiently rich environment, with an appropriate reward structure, could learn NLU from the ground up. However, the compute for such an environment would be tremendous. For comparison, AlphaGo required a huge infrastructure to solve a well-defined board game. The creation of a general-purpose algorithm that can continue to learn is related to lifelong learning and to general problem solvers.

While many people think that we are headed in the direction of embodied learning, we should thus not underestimate the infrastructure and compute that would be required for a full embodied agent. In light of this, waiting for a full-fledged embodied agent to learn language seems ill-advised. However, we can take steps that will bring us closer to this extreme, such as grounded language learning in simulated environments, incorporating interaction, or leveraging multimodal data.

Emotion   Towards the end of the session, Omoju argued that it will be very difficult to incorporate a human element relating to emotion into embodied agents. Emotion, however, is very relevant to a deeper understanding of language. On the other hand, we might not need agents that actually possess human emotions. Stephan stated that the Turing test, after all, is defined in terms of mimicry: sociopaths, despite having no emotions, can fool people into thinking they do. We should thus be able to find solutions that are not embodied and have no emotions themselves, but that understand the emotions of other people and help us solve our problems. Indeed, sensor-based emotion recognition systems have continuously improved, and we have also seen improvements in textual emotion detection systems.

Cognitive science and neuroscience   An audience member asked how much knowledge of neuroscience and cognitive science we are leveraging and building into our models. Knowledge of neuroscience and cognitive science can be a great source of inspiration and serve as a guideline to shape your thinking. As an example, several models have sought to imitate humans' ability to think fast and slow. AI and neuroscience are complementary in many directions, as Surya Ganguli illustrates in this post.

Omoju recommended taking inspiration from theories of cognitive science, such as the cognitive development theories of Piaget and Vygotsky. She also urged everyone to pursue interdisciplinary work. This sentiment was echoed by other experts. For instance, Felix Hill recommended going to cognitive science conferences.

NLP for low-resource scenarios

Dealing with low-data settings (low-resource languages, dialects (including social media text "dialects"), domains, etc.).  This is not a completely "open" problem in that there are already a lot of promising ideas out there; but we still don't have a universal solution to this universal problem.
– Karen Livescu

The second topic we explored was generalisation beyond the training data in low-resource scenarios. Given the setting of the Indaba, a natural focus was low-resource languages. The first question focused on whether it is necessary to develop specialised NLP tools for specific languages, or whether it is enough to work on general NLP.

Universal language model   Bernardt argued that there are universal commonalities between languages that could be exploited by a universal language model. The challenge then is to obtain enough data and compute to train such a language model. This is closely related to recent efforts to train a cross-lingual Transformer language model and cross-lingual sentence embeddings.

Cross-lingual representations   Stephan remarked that not enough people are working on low-resource languages. There are 1,250-2,100 languages in Africa alone, most of which have received scarce attention from the NLP community. The question of specialized tools also depends on the NLP task that is being tackled. The main issue with current models is sample efficiency. Cross-lingual word embeddings are sample-efficient as they only require word translation pairs or even only monolingual data. They align word embedding spaces sufficiently well to do coarse-grained tasks like topic classification, but don't allow for more fine-grained tasks such as machine translation. Recent efforts nevertheless show that these embeddings form an important building block for unsupervised machine translation.
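To make the alignment of word embedding spaces concrete, here is a minimal sketch of the orthogonal Procrustes mapping commonly used for cross-lingual word embeddings, assuming we already have monolingual embeddings and a small dictionary of word translation pairs. The language names and random matrices are placeholders, not the setup of any particular system mentioned above.

```python
import numpy as np

def procrustes_align(src, tgt):
    """Find an orthogonal matrix W such that src @ W approximates tgt.

    src, tgt: (n_pairs, dim) arrays with the embeddings of the source and
    target words of a bilingual dictionary, row-aligned so that row i of
    src is a translation of row i of tgt.
    """
    # Orthogonal Procrustes solution: W = U V^T, where U S V^T = SVD(src^T tgt)
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy example: random vectors stand in for real monolingual embeddings.
rng = np.random.default_rng(0)
dim, n_pairs = 300, 5000
src_emb = rng.normal(size=(n_pairs, dim))   # e.g. isiZulu word vectors (placeholder)
tgt_emb = rng.normal(size=(n_pairs, dim))   # their English translations (placeholder)
W = procrustes_align(src_emb, tgt_emb)
aligned = src_emb @ W                        # source vectors mapped into the target space
```

Once the spaces are aligned, nearest-neighbour lookup across languages is already enough for coarse-grained tasks such as the topic classification mentioned above.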

More complex models for higher-level tasks such as question answering, on the other hand, require thousands of training examples for learning. Transferring tasks that require actual natural language understanding from high-resource to low-resource languages is still very challenging. With the development of cross-lingual datasets for such tasks, such as XNLI, the development of strong cross-lingual models for such reasoning tasks should hopefully become easier.

Benefits and impact   Another question enquired whether, given that there is inherently only a small amount of text available for under-resourced languages, the benefits of NLP in such settings will also be limited. Stephan vehemently disagreed, reminding us that, as ML and NLP practitioners, we typically tend to view problems in an information-theoretic way, e.g. as maximizing the likelihood of our data or improving a benchmark. Taking a step back, the actual reason we work on NLP problems is to build systems that break down barriers. We want to build models that enable people to read news that was not written in their language, ask questions about their health when they don't have access to a doctor, etc.

Given the potential impact, building systems for low-resource languages is in fact one of the most important areas to work on. While one low-resource language may not have a lot of data, there is a long tail of low-resource languages; most people on this planet in fact speak a language that is in the low-resource regime. We thus really need to find a way to get our systems to work in this setting.

Jade opined that it is almost ironic that as a community we have been focusing on languages with a lot of data, as these are the languages that are well taught around the world. The languages we should really focus on are the low-resource languages where not much data is available. The great thing about the Indaba is that people are working on such low-resource languages and making progress. Given the scarcity of data, even simple systems such as bag-of-words models can have a large real-world impact. Etienne Barnard, one of the audience members, noted that he observed a different effect in real-world speech processing: users were often more motivated to use a system in English, if it worked for their dialect, than a system in their own language.
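To illustrate how simple such a system can be, below is a minimal bag-of-words classifier built with scikit-learn; the sentences, labels, and task are invented for the example and merely stand in for text in a low-resource language.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled sentences; in practice these would be texts in a
# low-resource language with task-specific labels.
texts = [
    "heavy rain expected over the weekend",
    "the team wins the cup final",
    "new vaccine rollout announced in rural clinics",
    "the striker scores twice in the derby",
]
labels = ["weather", "sport", "health", "sport"]

# Bag-of-words counts (unigrams and bigrams) fed into a linear classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["cup final postponed because of heavy rain"]))
```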

Incentives and skills   Another audience member remarked that people are incentivized to work on highly visible benchmarks, such as English-to-German machine translation, but incentives are missing for working on low-resource languages. Stephan suggested that incentives exist in the form of unsolved problems. However, skills are not available in the right demographics to address these problems. What we should focus on is to teach skills like machine translation in order to empower people to solve these problems. Academic progress unfortunately doesn't necessarily relate to low-resource languages. However, if cross-lingual benchmarks become more pervasive, then this should also lead to more progress on low-resource languages.

Data availability   Jade finally argued that a big issue is that no datasets are available for low-resource languages, such as languages spoken in Africa. If we create datasets and make them easily available, for example by hosting them on openAFRICA, that would incentivize people and lower the barrier to entry. It is often sufficient to make test data available in multiple languages, as this will allow us to evaluate cross-lingual models and track progress. Another data source is the South African Centre for Digital Language Resources (SADiLaR), which provides resources for many of the languages spoken in South Africa.

Reasoning about large or multiple documents

Representing large contexts efficiently. Our current models are mostly based on recurrent neural networks, which cannot represent longer contexts well. [...] The stream of work on graph-inspired RNNs is potentially promising, though it has only seen modest improvements and has not been widely adopted due to them being much less straightforward to train than a vanilla RNN.
– Isabelle Augenstein

Another big open problem is reasoning about large or multiple documents. The recent NarrativeQA dataset is a good example of a benchmark for this setting. Reasoning with large contexts is closely related to NLU and requires scaling up our current systems dramatically, until they can read entire books and movie scripts. A key question here, which we did not have time to discuss during the session, is whether we need better models or whether we should just train on more data.

Endeavours such as OpenAI Five show that current models can do a lot if they are scaled up to work with a lot more data and a lot more compute. With sufficient amounts of data, our current models might similarly do better with larger contexts. The problem is that supervision with large documents is scarce and expensive to obtain. Similar to language modelling and skip-thoughts, we could imagine a document-level unsupervised task that requires predicting the next paragraph or chapter of a book or deciding which chapter comes next. However, this objective is likely too sample-inefficient to enable learning of useful representations.
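As a sketch of what such a document-level objective could look like, the snippet below turns a list of paragraphs into "does this paragraph come next?" training pairs. The construction and negative sampling are illustrative assumptions, not a description of an existing dataset or of skip-thoughts.

```python
import random

def next_paragraph_examples(paragraphs, n_negatives=1, seed=0):
    """Create (context, candidate, label) triples for a
    'which paragraph comes next?' objective."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(paragraphs) - 1):
        context, true_next = paragraphs[i], paragraphs[i + 1]
        examples.append((context, true_next, 1))            # true continuation
        for _ in range(n_negatives):
            j = rng.randrange(len(paragraphs))
            while j == i + 1:                                # avoid the true continuation
                j = rng.randrange(len(paragraphs))
            examples.append((context, paragraphs[j], 0))     # randomly sampled negative
    return examples

book = ["First paragraph of a long novel.",
        "Second paragraph, continuing the scene.",
        "A paragraph from a much later chapter."]
for context, candidate, label in next_paragraph_examples(book):
    print(label, "|", context, "->", candidate)
```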

A more useful direction thus seems to be to develop methods that can represent context more effectively and are better able to keep track of relevant information while reading a document. Multi-document summarization and multi-document question answering are steps in this direction. Similarly, we can build on language models with improved memory and lifelong learning capabilities.
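One common way to cope with contexts that are too large to read in full, in the spirit of the multi-document question answering mentioned above, is to first retrieve only the most relevant passages and then pass those to a reader model. The sketch below uses TF-IDF similarity for that retrieval step; the documents and question are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_passages(question, passages, k=2):
    """Rank passages from one or more documents by TF-IDF similarity to the
    question and keep the top k for a downstream reader model."""
    vectorizer = TfidfVectorizer().fit(passages + [question])
    sims = cosine_similarity(
        vectorizer.transform([question]), vectorizer.transform(passages)
    )[0]
    ranked = sorted(zip(sims, passages), key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked[:k]]

docs = [
    "The heroine leaves her village after a flood destroys the school.",
    "Years later she returns to the village to rebuild the school.",
    "A subplot follows the local teacher and his garden.",
]
print(retrieve_passages("Why does the heroine return to the village?", docs))
```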

Datasets, problems, and evaluation

Perhaps the biggest problem is to properly define the problems themselves. And by properly defining a problem, I mean building datasets and evaluation procedures that are appropriate to measure our progress towards concrete goals. Things would be easier if we could reduce everything to Kaggle style competitions!
– Mikel Artetxe

We did not have much time to discuss problems with our current benchmarks and evaluation settings, but you will find many relevant responses in our survey. The final question asked what the most important NLP problems are that should be tackled for societies in Africa. Jade replied that the most important issue is to solve the low-resource problem. In particular, being able to use translation in education, so that people can access whatever they want to know in their own language, is tremendously important.

The session concluded with general advice from our experts on other questions that we had asked them, such as "What, if anything, has led the field in the wrong direction?" and "What advice would you give a postgraduate student in NLP starting their project now?" You can find responses to all questions in the survey.

Deep Learning Indaba 2019

If you are interested in working on low-resource languages, consider attending the Deep Learning Indaba 2019, which takes place in Nairobi, Kenya from 25-31 August 2019.

Credit: Title image text is from the NarrativeQA dataset. The image is from the slides of the NLP session.