NLP Highlights is a podcast discussing interesting recent work in natural language processing. The hosts are members of the AllenNLP team at the Allen Institute for AI. This website contains a page for each episode, with transcriptions.
A note: transcribing technical discussions involving non-native speakers is challenging! We have worked with our transcribers to try to make sure the technical terms and names used are correct, but there are surely errors that we missed. To improve the transcriptions, please open a pull request on the GitHub page for this website.
We invited Asli Celikyilmaz for this episode to talk about evaluation of text generation systems. We discussed the challenges in evaluating generated text, and covered human and automated metrics, with a discussion of recent developments in learning metrics. We also talked about some open research questions, including the difficulties in evaluating factual correctness of generated text.
Asli Celikyilmaz is a Principal Researcher at Microsoft Research.
Link to a survey co-authored by Asli on this topic: arxiv.org/abs/2006.14799
In this episode, Diyi Yang gives us an overview of using NLP models for social applications, including understanding social relationships, processes, roles, and power. As NLP systems are getting used more and more in the real world, they additionally have increasing social impacts that must be studied. We talk about how to get started in this field, what datasets exist and are commonly used, and potential ethical issues. We additionally cover two of Diyi's recent papers, on neutralizing subjective bias in text, and on modeling persuasiveness in text. Diyi Yang is an assistant professor in the School of Interactive Computing at Georgia Tech.
In this episode, we talked about Coreference Resolution with Marta Recasens, a Research Scientist at Google. We discussed the complexity involved in resolving references in language, the simplification of the problem that the NLP community has focused on by talking about specific datasets, and the complex coreference phenomena that are not yet captured in those datasets. We also briefly talked about how coreference is handled in languages other than English, and how some of the notions we have about modeling coreference phenomena in English do not necessarily transfer to other languages. We ended the discussion by talking about large language models, and to what extent they might be good at handling coreference.
We interviewed Sameer Singh for this episode, and discussed an overview of recent work in interpreting NLP model predictions, particularly instance-level interpretations. We started out by talking about why it is important to interpret model outputs and why it is a hard problem. We then dove into the details of three kinds of interpretation techniques: attribution-based methods, interpretation using influence functions, and generating explanations. Towards the end, we spent some time discussing how explanations of model behavior can be evaluated, and some limitations and potential concerns in evaluation methods. Sameer Singh is an Assistant Professor of Computer Science at the University of California, Irvine. Some of the techniques discussed in this episode have been implemented in the AllenNLP Interpret framework (details and demo here: https://allennlp.org/interpret).
We invited Yonatan Bisk to talk about grounded language understanding. We started off by discussing an overview of the topic, its research goals, and the challenges involved. In the latter half of the conversation, we talked about ALFRED (Shridhar et al., 2019), a grounded instruction following benchmark that simulates training a robot butler. The current best models built for this benchmark perform very poorly compared to humans. We discussed why that might be, and what could be done to improve their performance. Yonatan Bisk is currently an assistant professor at the Language Technologies Institute at Carnegie Mellon University. The data and the leaderboard for ALFRED can be accessed here: https://askforalfred.com/.
In this special episode, Carissa Schoenick, a program manager and communications director at AI2, interviewed Matt Gardner about AllenNLP. We chatted about the origins of AllenNLP, the early challenges in building it, and the design decisions behind the library. Given the release of AllenNLP 1.0 this week, we asked Matt what users can expect from the new release, and what improvements the AllenNLP team is working on for future versions.
We invited Marco Tulio Ribeiro, a Senior Researcher at Microsoft, to talk about evaluating NLP models using behavioral testing, a framework borrowed from Software Engineering. Marco describes three kinds of black-box tests that check whether NLP models satisfy certain necessary conditions. While breaking the standard IID assumption, this framework presents a way to evaluate whether NLP systems are ready for real-world use. We also discuss what capabilities can be tested using this framework, how one can come up with good tests, and the need for an evolving set of behavioral tests for NLP systems.
Marco’s homepage: https://homes.cs.washington.edu/~marcotcr/
We invited Fernando Pereira, a VP and Distinguished Engineer at Google, where he leads NLU and ML research, to talk about managing NLP research teams in industry. Topics we discussed include prioritizing research against product development and effective collaboration with product teams, dealing with potential research interest mismatch between individuals and the company, managing publications, hiring new researchers, and diversity and inclusion.
We invited Steven Cao to talk about his paper on multilingual alignment of contextual word embeddings. We started by discussing how multilingual transformers work in general, and then focused on Steven’s work on aligning word representations. The core idea is to start from a list of words automatically aligned from parallel corpora and to ensure the representations of the aligned words are similar to each other while not moving too far away from their original representations. We discussed the experiments on the XNLI dataset in the paper, the analysis, and the decision to do the alignment at the word level, comparing it to other possibilities such as aligning word pieces or higher-level encoded representations in transformers.
Steven Cao’s webpage: https://stevenxcao.github.io/
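The alignment idea described above (pull aligned word representations together, but keep them close to where they started) can be written as a two-term objective. The following is only a minimal sketch under stated assumptions: the function names, the use of squared Euclidean distance, and the `reg_weight` parameter are hypothetical illustrations, not the paper's actual formulation or code.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment_loss(source_vecs, target_vecs, original_target_vecs, reg_weight=1.0):
    """Toy alignment objective over a list of aligned word pairs.

    source_vecs[i] and target_vecs[i] are the current representations of an
    aligned word pair; original_target_vecs[i] is the target representation
    before any alignment training (the regularization anchor).
    """
    # Term 1: aligned words should have similar representations.
    align = sum(squared_distance(s, t) for s, t in zip(source_vecs, target_vecs))
    # Term 2: representations should not drift far from their originals.
    drift = sum(
        squared_distance(t, o) for t, o in zip(target_vecs, original_target_vecs)
    )
    return align + reg_weight * drift
```

With `reg_weight=0` the second term vanishes and nothing stops the representations from drifting arbitrarily far from the pre-trained ones; the regularizer is what keeps the model close to its starting point.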
We invited Jon Clark from Google to talk about TyDi QA, a new question answering dataset, for this episode. The dataset contains information-seeking questions in 11 languages that are typologically diverse, i.e., they differ from each other in terms of key structural and functional features. The questions in TyDi QA are information-seeking, like those in Natural Questions, which we discussed in the previous episode. In addition, the TyDi QA questions were collected in multiple languages using independent crowdsourcing pipelines, as opposed to some other multilingual QA datasets like XQuAD and MLQA, where English data is translated into other languages. The dataset and the leaderboard can be accessed at https://ai.google.com/research/tydiqa.
In this episode, Tom Kwiatkowski and Michael Collins talk about Natural Questions, a benchmark for question answering research. We discuss how the dataset was collected to reflect naturally-occurring questions, the criteria used for identifying short and long answers, how this dataset differs from other QA datasets, and how easy it might be to game the benchmark with superficial processing of the text. We also contrast the holistic design in Natural Questions to deliberately targeting specific linguistic phenomena of interest when building a QA dataset.
How do we know, in a concrete quantitative sense, what a deep learning model knows about language? In this episode, Ellie Pavlick talks about two broad directions to address this question: structural and behavioral analysis of models. In structural analysis, we often train a linear classifier for some linguistic phenomenon we'd like to probe (e.g., syntactic dependencies) while using the (frozen) weights of a model pre-trained on some tasks (e.g., masked language models). What can we conclude from the results of probing experiments? What does probing tell us about the linguistic abstractions encoded in each layer of an end-to-end pre-trained model? How well does it match classical NLP pipelines? How important is it to freeze the pre-trained weights in probing experiments? In contrast, behavioral analysis evaluates a model's ability to distinguish between inputs which respect vs. violate a linguistic phenomenon using acceptability or entailment tasks, e.g., can the model predict which is more likely: "dog bites man" vs. "man bites dog"? We discuss the significance of which format to use for behavioral tasks, and how easy it is for humans to perform such tasks.
Ellie Pavlick's homepage: https://cs.brown.edu/people/epavlick/
BERT Rediscovers the Classical NLP Pipeline, by Ian Tenney, Dipanjan Das, Ellie Pavlick https://arxiv.org/pdf/1905.05950.pdf?fbclid=IwAR3gzFibSBoDGdjqVu9Gq0mh1lDdRZa7dm42JuXXUfjG6rKZ44iHIOdV6jg
Inherent Disagreements in Human Textual Inferences by Ellie Pavlick and Tom Kwiatkowski https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00293
In this episode we invite Verena Rieser and Ondřej Dušek on to talk to us about the complexities of generating natural language when you have some kind of structured meaning representation as input. We talk about when you might want to do this, which is often in some kind of dialog system, but also when generating game summaries, and even in some language modeling work. We then talk about why this is hard, which in large part is due to the difficulty of collecting data, and how to evaluate the output of these systems. We then move on to discussing the details of a major challenge that Verena and Ondřej put on, called the end-to-end natural language generation challenge (E2E NLG). This was a dataset of task-based dialog generation focused on the restaurant domain, with some very innovative data collection techniques. They held a shared task with 16 participating teams in 2017, and the data has been further used since. We talk about the methods that people used for the task, and what we can learn today from what methods have been used on this data.
Verena's website: https://sites.google.com/site/verenateresarieser/
Ondřej's website: https://tuetschek.github.io/
The E2E NLG Challenge that we talked about quite a bit: http://www.macs.hw.ac.uk/InteractionLab/E2E/
In this episode, we invite Hao Tan and Mohit Bansal to talk about multi-modal training of transformers, focusing in particular on their EMNLP 2019 paper that introduced LXMERT, a vision+language transformer. We spend the first third of the episode talking about why you might want to have multi-modal representations. We then move to the specifics of LXMERT, including the model structure, the losses that are used to encourage cross-modal representations, and the data that is used. Along the way, we mention latent alignments between images and captions, the granularity of captions, and machine translation even comes up a few times. We conclude with some speculation on the future of multi-modal representations.
Hao's website: http://www.cs.unc.edu/~airsplay/
Mohit's website: http://www.cs.unc.edu/~mbansal/
LXMERT paper: https://www.aclweb.org/anthology/D19-1514/
In this episode, we talked to Emily Bender about the ethical considerations in developing NLP models and putting them in production. Emily cited specific examples of ethical issues, and talked about the kinds of potential concerns to keep in mind, both when releasing NLP models that will be used by real people, and also while conducting NLP research. We concluded by discussing a set of open-ended questions about designing tasks, collecting data, and publishing results, that Emily has put together towards addressing these concerns. Emily M. Bender is a Professor in the Department of Linguistics and an Adjunct Professor in the Department of Computer Science and Engineering at the University of Washington. She’s active on Twitter at @emilymbender.
In this episode we invite Sudha Rao to talk about question generation. We talk about different settings where you might want to generate questions: for human testing scenarios (rare), for data augmentation (has been done a bunch for SQuAD-like tasks), for detecting missing information / asking clarification questions, for dialog uses, and others. After giving an overview of the general area, we talk about the specifics of some of Sudha's work, including her ACL 2018 best paper on ranking clarification questions using EVPI. We conclude with a discussion of evaluating question generation, which is a hard problem, and of the exciting open questions in this research area.
Sudha's website: https://raosudha.weebly.com/
In this episode we talked with Victor Sanh and Thomas Wolf from HuggingFace about model distillation, and DistilBERT as one example of distillation. The idea behind model distillation is compressing a large model by building a smaller model, with far fewer parameters, that approximates the output distribution of the original model, typically for increased efficiency. We discussed how model distillation was typically done previously, and then focused on the specifics of DistilBERT, including the training objective, empirical results, ablations, etc. We finally discussed what kinds of information you might lose when doing model distillation.
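To make "approximates the output distribution of the original model" concrete, here is a minimal sketch of a generic soft-target distillation loss: the KL divergence between the temperature-softened teacher and student distributions. This is an illustration under stated assumptions, not DistilBERT's actual training objective; the function names and the default temperature are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened output distributions.

    The loss is zero when the student reproduces the teacher's distribution
    exactly, and grows as the two distributions diverge.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (fixed)
    log_q = log_softmax(student_logits, temperature)  # student (trained)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

A higher temperature flattens both distributions, so the student also gets a training signal from the classes the teacher considers unlikely rather than only from the top prediction.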
We talked to Brendan O’Connor for this episode about processing language in social media. Brendan started off by telling us about his projects that studied the linguistic and geographical patterns of African American English (AAE), and how obtaining data from Twitter made these projects possible. We then talked about how many tools built for standard English perform very poorly on AAE, and why collecting dialect-specific data is important. For the rest of the conversation, we discussed the issues involved in scraping data from social media, including ethical considerations and the biases that the data comes with.
Brendan O’Connor is an Assistant Professor at the University of Massachusetts, Amherst.
Warning: This episode contains explicit language (one swear word).
What exciting NLP research problems are involved in processing biomedical and clinical data? In this episode, we spoke with Dina Demner-Fushman, who leads NLP and IR research at the Lister Hill National Center for Biomedical Communications, part of the National Library of Medicine. We talked about processing biomedical scientific literature, understanding clinical notes, and answering consumer health questions, and the challenges involved in each of these applications. Dina listed some specific tasks and relevant data sources for NLP researchers interested in such applications, and concluded with some pointers to getting started in this field.
In this episode, Jonathan Frankle describes the lottery ticket hypothesis, a popular explanation of how over-parameterization helps in training neural networks. We discuss pruning methods used to uncover subnetworks (winning tickets) which were initialized in a particularly effective way. We also discuss patterns observed in pruned networks, stability of networks pruned at different time steps and transferring uncovered subnetworks across tasks, among other topics.
A recent paper on the topic by Frankle and Carbin, ICLR 2019: https://arxiv.org/abs/1803.03635
Jonathan Frankle’s homepage: http://www.jfrankle.com/
For our 100th episode, we invite AI2 CEO Oren Etzioni to talk to us about NLP startups. Oren has founded several successful startups, is himself an investor in startups, and helps with AI2's startup incubator. Some of our discussion topics include: What's the similarity between being a researcher and an entrepreneur? How do you transition from being a researcher to doing a startup? How do you evaluate early-stage startups? What advice would you give to a researcher who's thinking about a startup? What are some typical mistakes that you've seen startups make? Along the way, Oren predicts that we'll see a whole generation of startup companies based on the technology underlying ELMo, BERT, etc.
For this episode, we chatted with Neil Thomas and Roshan Rao about modeling protein sequences and evaluating transfer learning methods for a set of five protein modeling tasks. Learning representations using self-supervised pretraining objectives has shown promising results in transferring to downstream tasks in protein sequence modeling, just like it has in NLP. We started off by discussing the similarities and differences between language and protein sequence data, and how contextual embedding techniques are also applicable to protein sequences. Neil and Roshan then described a set of five benchmark tasks to assess the quality of protein embeddings (TAPE), particularly in terms of how well they capture the structural, functional, and evolutionary aspects of proteins. The results from the experiments they ran with various model architectures indicated that there was not a single best performing model across all tasks, and that there is a lot of room for future work in protein sequence modeling. Neil Thomas and Roshan Rao are PhD students at UC Berkeley.
Blog post: https://bair.berkeley.edu/blog/2019/11/04/proteins/
What function do the different attention heads serve in multi-headed attention models? In this episode, Lena Voita describes how to use attribution methods to assess the importance and contribution of different heads in several tasks, and describes a gating mechanism that, combined with an auxiliary loss, prunes the number of effective heads. Then, we discuss Lena’s work on studying the evolution of representations of individual tokens in transformer models.
Lena’s homepage: https://lena-voita.github.io/
Blog posts: https://lena-voita.github.io/posts/acl19_heads.html, https://lena-voita.github.io/posts/emnlp19_evolution.html
Papers: https://arxiv.org/abs/1905.09418, https://arxiv.org/abs/1909.01380
In this episode, we talk to Taylor Berg-Kirkpatrick about optical character recognition (OCR) on historical documents. Taylor starts off by describing some practical issues related to old scanning processes of documents that make performing OCR on them a difficult problem. Then he explains how one can build latent variable models for this data using unsupervised methods, the relative importance of various modeling choices, and summarizes how well the models do. We then take a higher level view of historical OCR as a Machine Learning problem, and discuss how it is different from other ML problems in terms of the tradeoff between learning from data and imposing constraints based on prior knowledge of the underlying process. Finally, Taylor talks about the applications of this research, and how these predictions can be of interest to historians studying the original texts.
In this episode, we chat with Luke Zettlemoyer about Question Answering as a format for crowdsourcing annotations of various semantic phenomena in text. We start by talking about QA-SRL and QAMR, two datasets that use QA pairs to annotate predicate-argument relations at the sentence level. Luke describes how this annotation scheme makes it possible to obtain annotations from non-experts, and discusses the tradeoffs involved in choosing this scheme. Then we talk about the challenges involved in using QA-based annotations for more complex phenomena like coreference. Finally, we briefly discuss the value of crowd-labeled datasets given the recent developments in pretraining large language models. Luke is an associate professor at the University of Washington and a Research Scientist at Facebook AI Research.
In this episode, we invite Yejin Choi to talk about common sense knowledge and reasoning, a growing area in NLP. We start by discussing a working definition of “common sense” and the practical utility of studying it. We then talk about some of the datasets and resources focused on studying different aspects of common sense (e.g., ReCoRD, CommonsenseQA, ATOMIC) and contrast implicit vs. explicit modeling of common sense, and what it means for downstream applications. To conclude, Yejin shares her thoughts on some of the open problems in this area and where it is headed in the future.
Yejin Choi’s homepage: https://homes.cs.washington.edu/~yejin/
In this episode, Aaron White tells us about the decompositional semantics initiative (Decomp), an attempt to re-think the prototypical approach to semantic representation and annotation. The basic idea is to decompose complex semantic classes such as ‘agent’ and ‘patient’ into simpler semantic properties such as ‘causation’ and ‘volition’, while embracing the uncertainty inherent in language by allowing annotators to choose answers such as ‘probably’ or ‘probably not’. In order to scale the collection of labeled data, each property is annotated by asking crowd workers intuitive questions about phrases in a given sentence.
Aaron White's homepage: http://aaronstevenwhite.io/
Decomp initiative page: http://decomp.io/
In this episode, we invite Alistair Johnson to discuss the main challenge in applying NLP/ML to clinical domains: the lack of data. We discuss privacy concerns, de-identification, synthesizing records, legal liabilities and data heterogeneity. We also discuss how the MIMIC dataset evolved over the years, how it is being used, and some of the under-explored ways in which it can be used.
Alistair’s homepage: http://alistairewj.github.io/
MIMIC dataset: https://mimic.physionet.org/
In this episode, we invite David Bamman to give an overview of computational humanities. We discuss examples of questions studied in computational humanities (e.g., characterizing fictionality, assessing novelty, measuring the attention given to male vs. female characters in the literature). We talk about the role NLP plays in addressing these questions and how the accuracy and biases of NLP models can influence the results. We also discuss understudied NLP tasks which can help us answer more questions in this domain such as literary scene coreference resolution and constructing a map of literature geography.
David Bamman's homepage: http://people.ischool.berkeley.edu/~dbamman/
LitBank dataset: https://github.com/dbamman/litbank
In this episode, we invite Jonathan Berant to talk about executable semantic parsing. We discuss what executable semantic parsing is and how it differs from related tasks such as semantic dependency parsing and abstract meaning representation (AMR) parsing. We talk about the main components of a semantic parser, how the formal language affects design choices in the parser, and end with a discussion of some exciting open problems in this space.
Jonathan Berant's homepage: http://www.cs.tau.ac.il/~joberant/
What is it like to do research in academia vs. industry? In this episode, we invite Jason Baldridge (UT Austin => Google) and Philip Resnik (Sun Microsystems => UMD) to discuss some of the aspects one may want to consider when planning a research career, including flexibility, security and intellectual freedom. Perhaps most importantly, we discuss how the career choices we make influence and are influenced by the relationships we forge. Check out the Careers in NLP Panel at NAACL'19 on Monday, June 3, 2019 for further discussion.
Careers in NLP panel @ NAACL'19: https://naacl2019.org/blog/careers-panel-survey/
Jason Baldridge's homepage: http://www.jasonbaldridge.com/
Philip Resnik's homepage: http://users.umiacs.umd.edu/~resnik/
In this episode, we invite Zhou Yu to give an overview of dialogue systems. We discuss different types of dialogue systems (task-oriented vs. non-task-oriented), the main building blocks and how they relate to other research areas in NLP, how to transfer models across domains, and the different ways used to evaluate these systems. Zhou also shares her thoughts on exciting future directions such as developing dialogue methods for non-cooperative environments (e.g., to negotiate prices) and multimodal dialogue systems (e.g., using video as well as audio/text).
Zhou Yu's homepage: http://zhouyu.cs.ucdavis.edu/
In this episode, we invite John Hewitt to discuss his take on how to probe word embeddings for syntactic information. The basic idea is to project word embeddings to a vector space where the L2 distance between a pair of words in a sentence approximates the number of hops between them in the dependency tree. The proposed method shows that ELMo and BERT representations, trained with no syntactic supervision, embed many of the unlabeled, undirected dependency attachments between words in the same sentence.
GitHub repository: https://github.com/john-hewitt/structural-probes
Blog post: https://nlp.stanford.edu/~johnhew/structural-probe.html
Twitter thread: https://twitter.com/johnhewtt/status/1114252302141886464
John's homepage: https://nlp.stanford.edu/~johnhew/
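The probing idea in this episode can be illustrated with a small sketch: compute the number of hops between every pair of words in a dependency tree, compute distances between the same words after a linear projection of their embeddings, and measure how far apart the two are. Everything below is a hypothetical toy (function names, the squared L2 distance, the mean absolute error, and the hand-picked embeddings are assumptions for illustration); the authors' actual implementation is in the GitHub repository linked above.

```python
from collections import deque


def tree_distances(heads):
    """All-pairs hop counts in a dependency tree.

    heads[i] is the index of token i's head, or -1 for the root.
    """
    n = len(heads)
    neighbors = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            neighbors[child].append(head)
            neighbors[head].append(child)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start}
        queue = deque([(start, 0)])
        while queue:  # breadth-first search from each token
            node, hops = queue.popleft()
            dist[start][node] = hops
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return dist


def project(matrix, vector):
    """Apply a linear map (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def probe_distance(matrix, u, v):
    """Squared L2 distance between two embeddings after projection."""
    pu, pv = project(matrix, u), project(matrix, v)
    return sum((a - b) ** 2 for a, b in zip(pu, pv))


def probe_loss(matrix, embeddings, heads):
    """Mean absolute gap between tree distances and projected distances."""
    gold = tree_distances(heads)
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(n):
            predicted = probe_distance(matrix, embeddings[i], embeddings[j])
            total += abs(gold[i][j] - predicted)
    return total / (n * n)


# Toy sentence of three tokens where token 1 is the root and heads the others.
heads = [1, -1, 1]
embeddings = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # hand-picked for illustration
identity = [[1.0, 0.0], [0.0, 1.0]]
loss = probe_loss(identity, embeddings, heads)
```

In a real probe the projection matrix would be learned by minimizing such a loss over a treebank while the embeddings stay fixed; here the matrix is simply given, and the toy embeddings are chosen so that the projected distances match the tree exactly.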
In this episode, Shi Feng joins us to discuss his recent work on identifying pathological behaviors of neural models for NLP tasks. Shi uses input word gradients to identify the least important word for a model's prediction, and iteratively removes that word until the model prediction changes. The reduced inputs tend to be significantly shorter than the originals, e.g., on SQuAD, an average of 2.3 words instead of 11.5 in the original inputs. We discuss possible interpretations of these results, and a proposed method for mitigating these pathologies.
Shi Feng's homepage: http://users.umiacs.umd.edu/~shifeng/
Joint work with Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez and Jordan Boyd-Graber
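The reduction loop described in this episode can be sketched in a few lines. This is a minimal sketch, not the authors' code: `predict` and `importance` are hypothetical callables supplied by the caller, and the toy keyword classifier with hand-written importance scores below stands in for a neural model and its input word gradients.

```python
def input_reduction(tokens, predict, importance):
    """Repeatedly drop the least important token while the prediction holds.

    predict(tokens) returns a label; importance(tokens, label) returns one
    score per token (in the episode's setting, derived from input gradients).
    Returns the shortest input found that still yields the original label.
    """
    original_label = predict(tokens)
    current = list(tokens)
    while len(current) > 1:
        scores = importance(current, original_label)
        drop = min(range(len(current)), key=scores.__getitem__)
        candidate = current[:drop] + current[drop + 1:]
        if predict(candidate) != original_label:
            break  # removing this word would change the prediction
        current = candidate
    return current


# Toy stand-ins (assumptions for illustration only).
def toy_predict(tokens):
    return "positive" if "great" in tokens else "negative"


def toy_importance(tokens, label):
    return [1.0 if token == "great" else 0.0 for token in tokens]


reduced = input_reduction(["the", "movie", "was", "great"], toy_predict, toy_importance)
```

In this toy case the result is unsurprising because the classifier really does depend on a single keyword; the pathology discussed in the episode is that real models can keep their prediction on reduced inputs that look meaningless to humans.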
In this episode, Byron Wallace tells us about interdisciplinary work between evidence-based medicine and natural language processing. We discuss extracting PICO frames from articles describing clinical trials and data available for direct and weak supervision. We also discuss automating the assessment of risks of bias in, e.g., random sequence generation, allocation concealment and outcome assessment, which has been used to help domain experts who need to review hundreds of articles.
Byron Wallace's homepage: http://www.byronwallace.com/
EBM-NLP dataset: https://ebm-nlp.herokuapp.com/
MIMIC dataset: https://mimic.physionet.org/
Cochrane database of systematic reviews: https://www.cochranelibrary.com/cdsr/about-cdsr
The bioNLP workshop at ACL'19 (submission due date was extended to May 10): https://aclweb.org/aclwiki/BioNLP_Workshop
The workshop on health text mining and information analysis at EMNLP'19: https://louhi2019.fbk.eu/
Machine learning for healthcare conference: https://www.mlforhc.org/
In this episode, Charles Sutton walks us through common sources of stress for researchers and suggests coping strategies to maintain your sanity. We talk about how pursuing a research career is similar to participating in a life-long international tournament, conflating research worth and self-worth, and how freedom can be both a blessing and a curse, among other stressors one may encounter in a research career.
Charles Sutton's homepage: https://homepages.inf.ed.ac.uk/csutton/
A series of blog posts Charles wrote on this topic: http://www.theexclusive.org/tag/stress%20in%20research/
In a recent Nature paper, Lingfei Wu (Ling) suggests that smaller teams of scientists tend to do more disruptive work. In this episode, we invite Ling to discuss their results, how they define disruption and possible reasons why smaller teams may be better positioned to do disruptive work. We also touch on robustness of the disruption metric, differences between research disciplines, and sleeping beauties in science.
Lingfei Wu’s homepage: https://www.knowledgelab.org/people/detail/lingfei_wu/
Note: Lingfei is on the job market for faculty positions at the intersection of social science, computer science and communication.
In this episode, we invite Sebastian Riedel to talk about knowledge base construction (KBC). Why is it an important research area? What are the tradeoffs between using an open vs. closed schema? What are popular methods currently used, and what challenges prevent the adoption of KBC methods? We also briefly discuss the AKBC workshop and its graduation into a conference in 2019.
Sebastian Riedel's homepage: http://www.riedelcastro.org/
AKBC conference: http://www.akbc.ws/2019/
In this episode, Yoav Artzi joins us to talk about visual reasoning. We start by defining what visual reasoning is, then discuss the pros and cons of different tasks and datasets. We discuss some of the models used for visual reasoning and how they perform, before ending with open questions in this young, exciting research area.
Yoav Artzi: https://yoavartzi.com/
CLEVR dataset: https://cs.stanford.edu/people/jcjohns/clevr/
Neural module networks: https://arxiv.org/abs/1511.02799
Neural models recently resulted in large performance improvements in various NLP problems, but our understanding of what and how the models learn remains fairly limited. In this episode, Tal Linzen and Afra Alishahi talk to us about BlackboxNLP, an EMNLP’18 workshop dedicated to the analysis and interpretation of neural networks for NLP. In the workshop, computer scientists and cognitive scientists joined forces to probe and analyze neural NLP models.
BlackboxNLP 2018 website: https://blackboxnlp.github.io/2018/
BlackboxNLP 2018 proceedings: https://aclanthology.info/events/ws-2018#W18-54
BlackboxNLP 2019 website: https://blackboxnlp.github.io/
Originally used to entice fierce competitions in arcade games, leaderboards recently made their way into NLP research circles. Leaderboards could help mitigate some of the problems in how researchers run experiments and share results (e.g., accidentally overfitting models on a test set), but they also introduce new problems (e.g., breaking author anonymity in peer reviewing). In this episode, Siva Reddy joins us to talk about the good, the bad, and the ugly of using leaderboards in science. We also discuss potential solutions to address some of the outstanding problems with existing leaderboard frameworks.
Software platforms for leaderboards: http://codalab.org/ https://leaderboard.allenai.org/
In this episode, Natalie Schluter talks to us about a data-driven analysis of career progression of male vs. female researchers in NLP through the lens of mentor-mentee networks based on ~20K papers in the ACL anthology. Directed edges in the network describe a mentorship relation from the last author on a paper to the first author, and author names were annotated for gender when possible. Interesting observations include an increase in the percentage of mentors (regardless of gender), and a widening gap between the fractions of mentors who are male and female since the early 2000s. By analyzing the number of years between a researcher’s first publication and the year in which they achieve mentorship status at threshold T, defined by publishing T or more papers as a last author, Natalie also found that female researchers tend to take much longer to become mentors. Another interesting finding is that in-gender mentorship is a strong predictor of the mentee’s success in becoming a mentor themselves. Finally, Natalie describes the biased preferential attachment model of Avin et al. (2015) and applies it to the gender-annotated mentor-mentee network in NLP, formally describing a glass ceiling in NLP for female researchers.
See also: Homophily and the glass ceiling effect in social networks, at ITCS 2015, by Chen Avin, Barbara Keller, Zvi Lotker, Claire Mathieu, David Peleg, and Yvonne-Anne Pignolet. www.semanticscholar.org/paper/Homophi…466478031b8ff
Apologies for the relatively poor audio quality on this one; we did our best.
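As a rough illustration of the "mentorship status at threshold T" measure described above, the sketch below counts the years between a researcher's first publication and the year of their T-th last-author paper. The function name, the input format, and how ties within a year are handled are hypothetical choices made here for illustration, not details taken from the paper.

```python
def years_to_mentorship(papers, threshold):
    """Years from first publication until the T-th last-author paper.

    papers is a list of (year, is_last_author) pairs for one researcher.
    Returns None if the researcher never reaches the threshold.
    """
    if not papers:
        return None
    first_year = min(year for year, _ in papers)
    last_author_count = 0
    for year, is_last_author in sorted(papers):
        if is_last_author:
            last_author_count += 1
            if last_author_count >= threshold:
                return year - first_year
    return None


# Hypothetical publication record, for illustration only.
record = [(2005, False), (2007, True), (2009, True), (2010, True)]
gap = years_to_mentorship(record, threshold=2)
```

Comparing the distribution of this quantity across groups of researchers is the kind of analysis the episode describes; the sketch only covers the per-researcher computation.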
Most NLP projects rely crucially on the quality of annotations used for training and evaluating models. In this episode, Matt and Ines of Explosion AI tell us how Prodigy can improve data annotation and model development workflows. Prodigy is an annotation tool implemented as a Python library, and it comes with a web application and a command line interface. A developer can define input data streams and design simple annotation interfaces. Prodigy can help break down complex annotation decisions into a series of binary decisions, and it provides easy integration with spaCy models. Developers can specify how models should be updated as new annotations come in, within an active learning framework.
Prodigy recipe scripts: https://github.com/explosion/prodigy-recipes
Twitter: https://twitter.com/_inesmontani https://twitter.com/honnibal
It's not uncommon for authors to be frustrated with the quality of peer reviews they receive in (NLP) conferences. In this episode, Noah A. Smith shares his advice on how to write good peer reviews. The structure Noah recommends for writing a peer review starts with a dispassionate summary of what a paper has to offer, followed by the strongest reasons the paper may be accepted, followed by the strongest reasons it may be rejected, and concludes with a list of minor problems (e.g., typos) which can be easily addressed in the camera-ready version. Noah stresses the importance of thinking about how the reviews we write could demoralize (junior) researchers, and how to be precise and detailed when discussing the weaknesses of a paper to help the authors see the path forward. Other questions we discuss in this episode include: How should you read a paper for reviewing purposes? How long does it take to review a paper, and how many papers should you review? What types of mistakes should you be on the lookout for while reviewing? How should you review pre-published work?
EMNLP 2018 paper by Dirk Hovy and Tommaso Fornaciari. https://www.semanticscholar.org/paper/Improving-Author-Attribute-Prediction-by-Linguistic-Hovy-Fornaciari/71aad8919c864f73108aafd8e926d44e9df51615
In this episode, Dirk Hovy talks about natural language as a social phenomenon which can provide insights about those who generate it. For example, this paper uses retrofitted embeddings to improve on two tasks: predicting the gender and age group of a person based on their online reviews. In this approach, author embeddings are first generated using Doc2Vec, then retrofitted such that authors with similar attributes are closer in the vector space. In order to estimate the retrofitted vectors for authors with unknown attributes, a linear transformation is learned which maps Doc2Vec vectors to the retrofitted vectors. Dirk also used a similar approach to encode geographic information to model regional linguistic variations, in another EMNLP 2018 paper with Christoph Purschke titled “Capturing Regional Variation with Distributed Place Representations and Geographic Retrofitting” [link: https://www.semanticscholar.org/paper/Capturing-Regional-Variation-with-Distributed-Place-Hovy-Purschke/6d9babd835d0cdaaf175f098bb4fd61fd75b1be0].
In this episode, we invite Hal Daumé to continue the discussion on reinforcement learning, focusing on how it has been used in NLP. We discuss how to reduce NLP problems into the reinforcement learning framework, and circumstances where it may or may not be useful. We discuss imitation learning, roll-in and roll-out, and how to approximate an expert with a reference policy.
Blog post by Alex Irpan titled "Deep Reinforcement Learning Doesn't Work Yet" https://www.alexirpan.com/2018/02/14/rl-hard.html
In this episode, Alex Irpan talks about limitations of current deep reinforcement learning methods and why we have a long way to go before they go mainstream. We discuss sample inefficiency, instability, the difficulty of designing reward functions, and overfitting to the environment. Alex concludes with a list of recommendations he found useful when training models with deep reinforcement learning.
ACL 2018 paper by Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Jakob Prange, Austin Blodgett, Sarah R. Moeller, Aviram Stern, Adi Bitan, Omri Abend. In this episode, Nathan discusses how the meaning of prepositions varies, proposes a hierarchy for classifying the semantics of function words (e.g., comparison, temporal, purpose), and describes empirical results using the provided dataset for disambiguating preposition semantics. Along the way, we talk about lexicon-based semantics, multilinguality and pragmatics.
Our first episode in a new format: broader surveys of areas, instead of specific discussions on individual papers. In this episode, we talk with Jordan Boyd-Graber about question answering. Matt starts the discussion by giving five different axes on which question answering tasks vary: (1) how complex is the language in the question, (2) what is the genre of the question / nature of the question semantics, (3) what is the context or knowledge source used to answer the question, (4) how much "reasoning" is required to answer the question, and (5) what's the format of the answer? We talk about each of these in detail, giving examples from Jordan's and others' work. In the end, we conclude that "question answering" is a format to study a particular phenomenon, not a "phenomenon" in itself. Sometimes it's useful to pose a phenomenon you want to study as a question answering task, and sometimes it's not. During the conversation, Jordan mentioned the QANTA competition; you can find that here: http://qanta.org. We also talked about an adversarial question creation task for Quiz Bowl questions; the paper on that can be found here: https://www.semanticscholar.org/paper/Trick-Me-If-You-Can%3A-Adversarial-Writing-of-Trivia-Wallace-Boyd-Graber/11caf090fef96605d6d67c7505572b1a26796971.
ACL 2018 paper by Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. Amrita and colleagues at IBM Research introduced a harder dataset for "reading comprehension", where you have to answer questions about a given passage of text. Amrita joins us on the podcast to talk about why a new dataset is necessary, what makes this one unique and interesting, and how well initial baseline systems perform on it. Along the way, we talk about the problems with using BLEU or ROUGE as evaluation metrics for question answering systems.
TACL 2018 paper (presented at ACL 2018) by David Jurgens, Srijan Kumar, Raine Hoover, Daniel A. McFarland, and Daniel Jurafsky. David comes on the podcast to talk to us about citation frames. We discuss the dataset they created by painstakingly annotating the "citation type" for all of the citations in a large collection of papers (around 2000 citations in total), then training a classifier on that data to annotate the rest of the ACL anthology. This process itself is interesting, including how exactly the citations are classified, and we talk about this for a bit. The second half of the podcast talks about the analysis that David and colleagues did using the (automatically) annotated ACL anthology, trying to gauge how the field has changed over time.
A shared task held in conjunction with a NAACL 2018 workshop, organized by Burr Settles and collaborators at Duolingo. Burr tells us about the shared task. The goal of the task was to predict errors that a language learner would make when doing exercises on Duolingo. We talk about the details of the data, why this particular data is interesting to study for second language acquisition, what could be better about it, and what systems people used to approach this task. We also talk a bit about what you could do with a system that can predict these kinds of errors to build better language learning systems.
NAACL 2018 paper, by Rachel Rudinger, Aaron Steven White, and Benjamin Van Durme. Rachel comes on to the podcast, telling us about what factuality is (did an event happen?), what datasets exist for doing this task (a few; they made a new, bigger one), and how to build models to predict factuality (turns out a vanilla biLSTM does quite well). Along the way, we have interesting discussions about how you decide what an "event" is, how you label factuality (whether something happened) on inherently uncertain text (like "I probably failed the test"), and how you might use a system that predicts factuality in some end task.
Paper by Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Sam comes on to tell us about GLUE. We talk about the motivation behind setting up a benchmark framework for natural language understanding, how the authors defined "NLU" and chose the tasks for this benchmark, a very nice diagnostic dataset that was constructed for GLUE, and what insight they gained from the experiments they've run so far. We also have some musings about the utility of general-purpose sentence vectors, and about leaderboards.
NAACL 2018 paper, by Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Jieyu comes on the podcast to talk about bias in coreference resolution models. This bias makes models rely disproportionately on gender when deciding whether "she" refers to a noun like "secretary" or "physician". Jieyu and her co-authors show that coreference systems do not actually exhibit much bias in standard evaluation settings (OntoNotes), perhaps because there is a broad document context to aid in making coreference decisions. But they then construct a really nice diagnostic dataset that isolates simple coreference decisions, and evaluates whether the model is using common sense, grammar, or gender bias to make those decisions. This dataset shows that current models are quite biased, particularly when it comes to common sense, using gender to make incorrect coreference decisions. Jieyu then tells us about some simple methods to correct the bias without much of a drop in overall accuracy.
Also, there was a very similar paper also published at NAACL 2018 that used similar methodology and constructed a similar dataset: https://www.semanticscholar.org/paper/Gender-Bias-in-Coreference-Resolution-Rudinger-Naradowsky/be2c8b5ec0eee2f32da950db1b6cf8cc4a621f8f.
AAAI 2018 paper by Noah Weber, Niranjan Balasubramanian, and Nathanael Chambers. Niranjan joins us on the podcast to tell us about his latest contribution in a line of work going back to Schank's scripts. This work tries to model sequences of events to get coherent narrative schemas, mined from large collections of text. For example, given an event like "She threw a football", you might expect future events involving catching, running, scoring, and so on. But if the event is instead "She threw a bomb", you would expect future events to involve things like explosions, damage, arrests, or other related things. We spend much of our conversation talking about why these scripts are interesting to study, and the general outline for how one might learn these scripts from text, and spend a little bit of time talking about the particular contribution of this paper, which is a better model that captures interactions among all of the arguments to an event.
Best reproduction paper at COLING 2018, by Wuwei Lan and Wei Xu. This paper takes a bunch of models for sentence pair classification (including paraphrase identification, semantic textual similarity, natural language inference / entailment, and answer sentence selection for QA) and compares all of them on all tasks. There's a very nice table in the paper showing the cross product of models and datasets, and how by looking at the original papers this table is almost empty; Wuwei and Wei fill in all of the missing values in that table with their own experiments. This is a very nice piece of work that helps us gain a broader understanding of how these models perform in diverse settings, and it's awesome that COLING explicitly asked for and rewarded this kind of paper, as it's not your typical "come look at my shiny new model!" paper. Our discussion with Wuwei and Wei covers what models and datasets the paper looked at, why the datasets can be treated similarly (and some reasons for why maybe they should be treated differently), the differences between the models that were tested, and the difficulties of reproducing someone else's model.
TACL 2018 paper by Jacob Buckman and Graham Neubig. Jacob tells us about marginalizing over latent structure in a sentence by doing a clever parameterization of a lattice with a model kind of like a tree LSTM. This lets you treat collocations as multi-word units, or allow words to have multiple senses, without having to commit to a particular segmentation or word sense disambiguation up front. We talk about how this works and what comes out. One interesting result that comes out of the sense lattice: learning word senses from a language modeling objective tends to give you senses that capture the mode of the "next word" distribution, like uses of "bank" that are always followed by "of". Helpful for local perplexity, but not really what you want if you're looking for semantic senses, probably.
NAACL 2018 demo paper, by Hao Fang, Hao Cheng, Maarten Sap, Elizabeth Clark, Ari Holtzman, Yejin Choi, Noah A. Smith, and Mari Ostendorf. Sounding Board was the system that won the 2017 Amazon Alexa Prize, a competition to build a social chatbot that interacts with users as an Alexa skill. Hao comes on the podcast to tell us about the project. We talk for a little bit about how Sounding Board works, but spend most of the conversation talking about what these chatbots can do - the competition setup, some example interactions, the limits of current systems, and how chatbots might be more useful in the future. Even the best current systems seem pretty limited, but the potential future uses are compelling enough to warrant continued research.
NAACL 2018 Outstanding Paper by Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. Both Elizabeth and Yangfeng come on the podcast to tell us about their work. This paper is an extension of an EMNLP 2017 paper by Yangfeng and co-authors that introduced a language model that included explicit entity representations. Elizabeth and Yangfeng take that model, improve it a bit, and use it for creative narrative generation, with some interesting applications. We talk a little bit about the model, but mostly about how the model was used to generate narrative text, how it was evaluated, and what other interesting applications there are of this idea. The punchline is that this model does a better job at generating coherent stories than other generation techniques, because it can track the entities in the story better.
We've been experimenting with how we record the audio, trying to figure out how to get better audio quality. Sadly, this episode was a failed experiment, and there is a background hiss that we couldn't get rid of. Bear with us as we work on this...
NAACL 2018 paper by James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. James tells us about his paper, where they created a dataset for fact checking. We talk about how this dataset relates to other datasets, why a new one was needed, how it was built, and how well the initial baseline does on this task. There are some interesting side notes on bias in dataset construction, and on how "fact checking" relates to "fake news" ("fake news" could mean that an article is actively trying to deceive or mislead you; "fact checking" here is just determining if a single claim is true or false given a corpus of assumed-correct reference material). The baseline system does quite poorly, and the lowest-hanging fruit seems to be in improving the retrieval component that finds relevant supporting evidence for claims. There's a workshop and shared task coming up on this dataset: http://fever.ai/. The shared task test period starts on July 24th - get your systems ready!
ACL 2018 paper by Omer Goldman, Veronica Latcinnik, Udi Naveh, Amir Globerson, and Jonathan Berant. Omer comes on to tell us about a class project (done mostly by undergraduates!) that made it into ACL. Omer and colleagues built a semantic parser that gets state-of-the-art results on the Cornell Natural Language Visual Reasoning dataset. They did this by using "abstract examples" - they replaced the entities in the questions and corresponding logical forms with their types, labeled about a hundred examples in this abstracted formalism, and used those labels to do data augmentation and train their parser. They also used some interesting caching tricks, and a discriminative reranker.
EMNLP 2017 paper by André F. T. Martins and Julia Kreutzer. André comes on the podcast to talk to us about the paper. We spend the bulk of the time talking about the two main contributions of the paper: how they applied the notion of "easy first" decoding to neural taggers, and the details of the constrained softmax that they introduced to accomplish this. We conclude that "easy first" might not be the right name for this - it's doing something that in the end is very similar to stacked self-attention, with standard independent decoding at the end. The particulars of the self-attention are inspired by "easy first", however, using a constrained softmax to enforce some novel constraints on the self-attention.
Upcoming JAIR paper by Sebastian Ruder, Ivan Vulić, and Anders Søgaard. Sebastian comes on to tell us about his survey. He creates a typology of cross-lingual word embedding methods, and we discuss why you might use cross-lingual embeddings (low-resource languages in particular), what information they capture (semantics? syntax? both?), how the methods work (lots of different ways), and how to evaluate the embeddings (best when you have an extrinsic task to evaluate on).
NAACL 2018 paper, by Matt Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Chris Clark, Kenton Lee, and Luke Zettlemoyer. In this episode, AI2's own Matt Peters comes on the show to talk about his recent work on ELMo embeddings, what some have called "the next word2vec". Matt has shown very convincingly that using a pre-trained bidirectional language model to get contextualized word representations performs substantially better than using static word vectors. He comes on the show to give us some more intuition about how and why this works, and to talk about some of the other things he tried and what's coming next.
In this episode, we take a more systems-oriented approach to NLP, looking at issues with writing deep learning code for NLP models. As a lot of people have discovered over the last few years, efficiently batching multiple examples together for fast training on a GPU can be very challenging with complex NLP models. James Bradbury comes on to tell us about Matchbox, his recent effort to provide a framework for automatic batching with PyTorch. In the discussion, we talk about why batching is hard, why it's important, how other people have tried to solve this problem in the past, and what James' solution to the problem is. Code is available here:
ICLR 2018 paper, by Antoine Bosselut, Omer Levy, Ari Holtzman, Corin Ennis, Dieter Fox, and Yejin Choi. This is not your standard NLP task. This work tries to predict which entities change state over the course of a recipe (e.g., ingredients get combined into a batter, so entities merge, and then the batter gets baked, changing location, temperature, and "cookedness"). We talk to Antoine about the work, getting into details about how the data was collected, how the model works, and what some possible future directions are.
NAACL 2018 paper, by Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato, from Facebook AI Research. In this episode we continue our theme from last episode on structured prediction, talking with Sergey and Myle about their paper. They did a comprehensive set of experiments comparing many prior structured learning losses, applied to neural seq2seq models. We talk about the motivation for their work, what turned out to work well, and some details about some of their loss functions. They introduced a notion of a "pseudo reference", replacing the target output sequence with the highest scoring output on the beam during decoding, and we talk about some of the implications there. It also turns out that minimizing expected risk was the best overall training procedure that they found for these structured models.
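To give a feel for what "minimizing expected risk" means, here is a minimal sketch: the model's scores over a small candidate set (e.g., the beam) are renormalized into a distribution, and the loss is the expected cost under that distribution. The function name, the inputs, and the idea of using something like 1 - BLEU as the cost are assumptions for illustration; this is not the authors' implementation.

```python
import math


def expected_risk(log_probs, costs):
    """Expected cost over a candidate set.

    log_probs: model log-probabilities of each candidate sequence.
    costs: task cost of each candidate against the reference
           (hypothetical values, e.g. 1 - sentence-level BLEU).
    """
    # Renormalize over the candidate set with a numerically stable softmax.
    shift = max(log_probs)
    weights = [math.exp(lp - shift) for lp in log_probs]
    total = sum(weights)
    return sum(w / total * c for w, c in zip(weights, costs))


# Three hypothetical candidates; the second has the lowest cost.
costs = [0.6, 0.2, 0.9]
print(expected_risk([-1.0, -1.0, -1.0], costs))
print(expected_risk([-3.0, -0.5, -3.0], costs))
```

Moving probability mass toward the low-cost candidate lowers the expected risk, which is the signal a gradient-based trainer would follow.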
EMNLP 2016 paper by Sam Wiseman and Sasha Rush. In this episode we talk with Sam about a paper from a couple of years ago on bringing back some ideas from structured prediction into neural seq2seq models. We talk about the classic problems in structured prediction of exposure bias, label bias, and locally normalized models, how people used to solve these problems, and how we can apply those solutions to modern neural seq2seq architectures using a technique that Sam and Sasha call Beam Search Optimization. (Note: while we said in the episode that BSO with beam size of 2 is equivalent to a token-level hinge loss, that's not quite accurate; it's close, but there are some subtle differences.)
NIPS 2017 paper by Vlad Niculae and Mathieu Blondel. Vlad comes on to tell us about his paper. Attentions are often computed in neural networks using a softmax operator, which maps scalar outputs from a model into a probability space over latent variables. There are lots of cases where this is not optimal, however, such as when you really want to encourage a sparse attention over your inputs, or when you have additional structural biases that could inform the model. Vlad and Mathieu have developed a theoretical framework for analyzing the options in this space, and in this episode we talk about that framework, some concrete instantiations of attention mechanisms from the framework, and how well these work.
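As an illustration of dense versus sparse attention, the sketch below compares softmax with sparsemax (Martins and Astudillo, 2016), a well-known sparse alternative. Sparsemax is used here as an example of the kind of operator under discussion; the code is a plain-Python sketch and is not taken from the paper or the episode.

```python
import math


def softmax(scores):
    """Dense attention: every input gets a strictly positive weight."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparse attention: Euclidean projection of scores onto the simplex."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # The support is the largest k for which this condition holds.
        if 1 + k * value > cumulative:
            tau = (cumulative - 1) / k
    # Scores at or below the threshold tau get exactly zero weight.
    return [max(s - tau, 0.0) for s in scores]


scores = [2.0, 1.0, 0.1]
print(softmax(scores))
print(sparsemax(scores))
```

Both outputs sum to one, but sparsemax can assign exactly zero to low-scoring inputs, which is the behavior you want when encouraging sparse attention.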
ACL 2017 paper, by Paramita Mirza, Simon Razniewski, Fariz Darari, and Gerhard Weikum. There's not a whole lot of work on numbers in NLP, and getting good information out of numbers expressed in text can be challenging. In this episode, Paramita comes on to tell us about her efforts to use distant supervision to learn models that extract relation cardinalities from text. That is, given an entity and a relation in a knowledge base, like "Barack Obama" and "has child", the goal is to extract _how many_ related entities there are (in this case, two). There are a lot of challenges in getting this to work well, and Paramita describes some of those, and how she solved them.
EMNLP 2017 paper by Bishan Yang and Tom Mitchell. Bishan tells us about her experiments on frame-semantic parsing / semantic role labeling, which is trying to recover the predicate-argument structure from natural language sentences, as well as categorize those structures into a pre-defined event schema (in the case of frame-semantic parsing). Bishan had two interesting ideas here: (1) use a technique similar to model distillation to combine two different model structures (her "sequential" and "relational" models), and (2) use constraints on arguments across frames in the same sentence to get a more coherent global labeling of the sentence. We talk about these contributions, and also touch on "open" versus "closed" semantics, in both predicate-argument structure and information extraction.
AAAI 2017 paper, by Dan Roth. In this episode we have a conversation with Dan about what he means by "incidental supervision", and how it's related to ideas in reinforcement learning and representation learning. For many tasks, there are signals you can get from seemingly unrelated data that will help you in making predictions. Leveraging the international news cycle to learn transliteration models for named entities is one example of this, as is the current trend in NLP of using language models or other multi-task signals to do better representation learning for your end task. Dan argues that we need to be thinking about this more explicitly in our research, instead of learning everything "end-to-end", as we will never have enough data to learn complex tasks directly from annotations alone.
How should you incorporate background knowledge into a neural net? A lot of people have been thinking about this problem, and Dirk Weissenborn comes on to tell us about his work in this area. Paper is with Tomáš Kočiský and Chris Dyer.
TACL 2017 paper by Jonathan K. Kummerfeld and Dan Klein. Jonathan tells us about his work on parsing algorithms that capture traces and null elements in sentence structure. We spend the first third of the conversation talking about what these are and why they are interesting - if you want to correctly handle wh-movement, or coordinating structures, or control structures, or many other phenomena that we commonly see in language, you really want to handle traces and null elements, but most current parsers totally ignore these phenomena. The second third of the conversation is about how the parser works, and we conclude by talking about some of the implications of the work, and where to go next - should we really be pushing harder on capturing linguistic structure when everyone seems to be going towards end-to-end learning on some higher-level task?
Jonathan also blogs about papers that he's reading; check out his paper summaries at http://jkk.name/
How robust is your NLP system? High numbers on common datasets can be misleading, as most systems are easily fooled by small modifications that would not be hard for humans to understand. Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily Bender organized a workshop trying to characterize this issue, inviting participants to either build robust systems, or try to break them with targeted examples. Allyson and Sudha come on the podcast to talk about the workshop. We cover the motivation of the workshop, what a "minimal pair" is, what tasks the workshop focused on and why, and what the main takeaways of the workshop were.
Anders talks with us about his line of work on doing NLP in languages where you have no linguistic resources other than a Bible translation or other religious works. He and his students have developed methods for annotation projection for both part of speech tagging and dependency parsing, aggregating information from many languages to predict annotations for languages where you have no training data. We talk about low-resource NLP generally, then dive into the specifics of the annotation projection method that Anders used, also touching on a related paper on learning cross-lingual word embeddings.
EMNLP 2017 paper by Ramakanth Pasunuru and Mohit Bansal. Ram and Mohit join us to talk about their work, which uses reinforcement learning to improve performance on a video captioning task. They directly optimize CIDEr, a popular image/video captioning metric, using policy gradient methods, then use a modified version of CIDEr that penalizes the model when it fails to produce a caption that is _entailed_ by the correct caption. In our discussion, we hit on what video captioning is, what typical models look like for this task, and how the entailment-based reward function is similar to other attempts to be smart about handling paraphrases when evaluating or training language generation models. Unfortunately, due to some technical issues, the audio recording is a little worse than usual for this episode. Our apologies.
Paper is by Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. In this episode, Kelvin tells us how to build a language model that starts from a prototype sentence instead of starting from scratch, enabling much more grammatical and diverse language modeling results. In the process, Kelvin gives us a really good intuitive explanation for how variational autoencoders work, we talk about some of the details of the model they used, and some of the implications of the work - can you use this for better summarization, or machine translation, or dialogue responses?
TACL 2017 paper, by Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. Most relation extraction work focuses on binary relations, like (Seattle, located in, Washington), because extracting n-ary relations is difficult. Nanyun (Violet) and her colleagues came up with a model to extract n-ary relations, focusing on drug-mutation-gene interactions, using graph LSTMs (a construct pretty similar to graph CNNs, which was developed around the same time). Nanyun comes on the podcast to tell us about her work.
Recent arxiv paper by Gábor Melis, Chris Dyer, and Phil Blunsom. Gábor comes on the podcast to tell us about his work. He performs a thorough comparison between vanilla LSTMs and recurrent highway networks on the language modeling task, showing that when both methods are given equal amounts of hyperparameter tuning, LSTMs perform better, in contrast to prior work claiming that recurrent highway networks perform better. We talk about parameter tuning, training variance, language model evaluation, and other related issues.
Isabelle Augenstein was the lead organizer of SemEval 2017 task 10, on extracting keyphrases and relations from scientific publications. In this episode we talk about her experience organizing the task, how the task was set up, and what the result of the task was. We also talk about some related work Isabelle did on multi-task learning for keyphrase boundary detection.
ACL 2017 best resource paper, by Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi
Alane joins us on the podcast to tell us about the dataset, which contains images paired with natural language descriptions of the images, where the task is to decide whether the description is true or false. Alane tells us about the motivation for creating the new dataset, how it was constructed, the way they elicited complex language from crowd workers, and why the dataset is an interesting target for future research.
In this episode we talk about a couple of recent papers that get at the issue of training variance, and why we should not just take the max from a training distribution when reporting results. Sadly, our current focus on performance in leaderboards only exacerbates these issues, and (in my opinion) encourages bad science.
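A small sketch of the reporting practice this argues for: run the same configuration with several random seeds and report the mean and standard deviation, not only the best run. The scores below are made-up numbers for illustration, and the function name is hypothetical.

```python
import statistics


def summarize_runs(scores):
    """Summarize one configuration trained with several random seeds."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "max": max(scores),
    }


# Hypothetical dev-set accuracies from five seeds of the same model.
seed_scores = [71.2, 72.8, 70.5, 74.1, 71.9]
summary = summarize_runs(seed_scores)
print(f"mean={summary['mean']:.1f} stdev={summary['stdev']:.1f} max={summary['max']:.1f}")
```

Reporting only the max hides the spread across seeds, so a difference between two systems that is smaller than that spread may just be noise.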
NIPS 2017 paper. We dig into the details of the Transformer, from the "attention is all you need" paper. Ashish and Jakob give us some motivation for replacing RNNs and CNNs with a more parallelizable self-attention mechanism, they describe how this mechanism works, and then we spend the bulk of the episode trying to get their intuitions for _why_ it works. We discuss the positional encoding mechanism, multi-headed attention, trying to use these ideas to replace encoders in other models, and what the self-attention actually learns. Turns out that the lower layers learn something like n-grams (similar to CNNs), and the higher layers learn more semantic-y things, like coreference.
Minor correction: Talking about complexity equations without the paper in front of you can be tricky, and Ashish and Jakob may have gotten some of the details slightly wrong when we were discussing computational complexity. The high-level point is that self-attention is cheaper than RNNs when the hidden dimension is higher than the sequence length. See the paper for more details.
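The high-level point can be checked with a back-of-the-envelope sketch. It assumes the per-layer costs given in the paper, roughly n^2 * d for self-attention and n * d^2 for a recurrent layer (n is the sequence length, d the hidden dimension), and ignores constant factors; the function names are made up for this example.

```python
def self_attention_ops(seq_len, hidden_dim):
    """Per-layer cost of self-attention, up to constants: n^2 * d."""
    return seq_len ** 2 * hidden_dim


def recurrent_ops(seq_len, hidden_dim):
    """Per-layer cost of a recurrent layer, up to constants: n * d^2."""
    return seq_len * hidden_dim ** 2


d = 512
for n in (50, 512, 2000):
    attention, recurrent = self_attention_ops(n, d), recurrent_ops(n, d)
    if attention < recurrent:
        verdict = "self-attention is cheaper"
    elif attention > recurrent:
        verdict = "the recurrent layer is cheaper"
    else:
        verdict = "the two are tied"
    print(f"n={n}, d={d}: {verdict}")
```

The ratio of the two costs is n / d, so self-attention wins exactly when the sequence is shorter than the hidden dimension.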
TACL 2017 paper by Rotem Dror, Gili Baumer, Marina Bogomolov, and Roi Reichart. Roi comes on to talk to us about how to make better statistical comparisons between two methods when there are multiple datasets in the comparison. This paper shows that there are more powerful methods available than the occasionally-used Bonferroni correction, and using the better methods can let you make stronger, statistically-valid conclusions. We talk a bit also about how the assumptions you make about your data can affect the statistical tests that you perform, and briefly mention other issues in replicability / reproducibility, like training variance.
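To see why Bonferroni can be overly conservative, the sketch below compares it with the Holm step-down procedure, a standard correction that is never less powerful. Holm is used here only as a familiar illustration; it should not be read as the specific method the paper proposes, and the p-values are invented.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values from comparing two systems on four datasets.
p_values = [0.01, 0.015, 0.03, 0.2]
print(bonferroni_reject(p_values))
print(holm_reject(p_values))
```

On these numbers Holm finds a significant difference on one more dataset than Bonferroni, at the same family-wise error rate.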
ACL 2017 paper by Jacob Andreas, Anca D. Dragan, and Dan Klein. Jacob comes on to tell us about the paper. The paper focuses on multi-agent dialogue tasks, where two learning systems need to figure out a way to communicate with each other to solve some problem. These agents might be figuring out communication protocols that are very different from what humans would come up with in the same situation, and Jacob introduces some clever ways to figure out what the learned communication protocol looks like - you find human messages that induce the same beliefs in the listener as the robot messages. Jacob tells us about this work, and we conclude with a brief discussion of the more general issue of interpreting neural models.
EMNLP 2017 paper by Nitish Gupta, Sameer Singh, and Dan Roth. Nitish comes on to talk to us about his paper, which presents a new entity linking model that both unifies prior sources of information into a single neural model, and trains that model in a domain-agnostic way, so it can be transferred to new domains without much performance degradation.
CoNLL 2017 paper, by Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. Roy comes on to talk to us about the paper. They analyzed the ROCStories corpus, which was created with three separate tasks on Mechanical Turk. They found that there were enough stylistic differences between the text generated from each task that they could get very good performance on the ROCStories cloze task just by looking at the style, ignoring the information you're supposed to use to solve the task. Roy talks to us about this finding, and about how hard it is to generate datasets that don't have some kind of flaw (hint: they all have problems).
ICLR 2017 paper by Hakan Inan, Khashayar Khosravi, Richard Socher, presented by Waleed. The paper presents some tricks for training better language models. It introduces a modified loss function for language modeling, where producing a word that is similar to the target word is not penalized as much as producing a word that is very different from the target (I've seen this in other places, e.g., image classification, but not in language modeling). They also give theoretical and empirical justification for tying input and output embeddings.
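Here is a toy sketch of the idea behind a similarity-aware loss: instead of a one-hot target, the target distribution is a softmax over embedding similarities to the true word, so probability mass placed on a near-synonym costs less than mass placed on an unrelated word. This is only an illustration and not the paper's exact formulation; the three-word vocabulary, the embeddings, and the temperature are all made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_target(embeddings, target_index, temperature=1.0):
    """Target distribution from embedding similarity to the true word."""
    target_vec = embeddings[target_index]
    return softmax([dot(vec, target_vec) / temperature for vec in embeddings])


def cross_entropy(target_dist, predicted_dist):
    return -sum(t * math.log(p) for t, p in zip(target_dist, predicted_dist))


# Hypothetical vocabulary: index 0 = "cat", 1 = "kitten", 2 = "car".
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
target = soft_target(embeddings, target_index=0, temperature=0.5)
one_hot = [1.0, 0.0, 0.0]

predicts_kitten = [0.1, 0.8, 0.1]  # wrong, but close in meaning
predicts_car = [0.1, 0.1, 0.8]     # wrong and unrelated

# The one-hot loss cannot tell the two mistakes apart; the soft target can.
print(cross_entropy(one_hot, predicts_kitten), cross_entropy(one_hot, predicts_car))
print(cross_entropy(target, predicts_kitten), cross_entropy(target, predicts_car))
```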
Paper by Ryan Cotterell and Jason Eisner, presented by Matt. This paper won the best paper award at ACL 2017. It's also quite outside the typical focus areas that you see at NLP conferences, trying to build generative models of vowel inventories in languages. That means we give quite a bit of setup, to try to help someone not familiar with this area understand what's going on. That makes this episode quite a bit longer than a typical non-interview episode.
ACL 2017 paper, by Yusuke Oda and others (including Graham Neubig) at Nara Institute of Science and Technology (Graham is now at Carnegie Mellon University).
Graham comes on to talk to us about neural machine translation generally, and about this ACL paper in particular. We spend the first half of the episode talking about major milestones in neural machine translation and why it is so much more effective than previous methods (spoiler: stronger language models help a lot). We then talk about the specifics of binary code prediction, how it's related to a hierarchical or class-factored softmax, and how to make it robust to off-by-one-bit errors.
Paper link: www.semanticscholar.org/paper/Neural-…84d5867e6358d
An example of the Language Log posts that we discussed: languagelog.ldc.upenn.edu/nll/?p=33613 (there are many more).
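As a rough picture of the binary code prediction idea from the episode with Graham, the sketch below gives each word id a fixed-length bit string, so the output layer only has to predict about log2(V) bits instead of a V-way softmax. The helper names and the plain binary encoding of the word id are assumptions for illustration; the paper's actual bit assignment may differ.

```python
def num_bits(vocab_size: int) -> int:
    """Bits needed to give every word id in [0, vocab_size) a distinct code."""
    return max(1, (vocab_size - 1).bit_length())


def word_to_bits(word_id: int, n_bits: int) -> list:
    """Most-significant-bit-first binary code for a word id."""
    return [(word_id >> k) & 1 for k in reversed(range(n_bits))]


def bits_to_word(bits: list) -> int:
    word_id = 0
    for bit in bits:
        word_id = (word_id << 1) | bit
    return word_id


def decode(bit_probs: list) -> int:
    """Threshold each independently predicted bit probability at 0.5."""
    return bits_to_word([1 if p >= 0.5 else 0 for p in bit_probs])


print(num_bits(50000))                  # 16 outputs instead of 50000
print(word_to_bits(5, 4))               # [0, 1, 0, 1]
print(decode([0.1, 0.9, 0.2, 0.8]))     # 5
# Nudging a single bit over the threshold yields a different word id,
# which is why robustness to one-bit errors matters.
print(decode([0.1, 0.9, 0.6, 0.8]))     # 7
```

The last two lines show the fragility that motivates the robustness discussion: one wrong bit silently selects another word, so some redundancy in the code is needed to recover from it.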
NIPS 2016 paper by Alexander Ratner and coauthors in Chris Ré's group at Stanford, presented by Waleed. The paper presents a method for generating labels for an unlabeled dataset by combining a number of weak labelers. This changes the annotation effort from looking at individual examples to constructing a large number of noisy labeling heuristics, a task the authors call "data programming". Then you learn a model that intelligently aggregates information from the weak labelers to create a weighted "supervised" training set. We talk about this method, how it works, how it's related to ideas like co-training, and when you might want to use it.
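To give a flavor of what writing labeling heuristics looks like, here is a small sketch with three hypothetical labeling functions for a sentiment task. Each one votes positive, negative, or abstains, and the votes are combined into a probabilistic label. The paper learns how reliable each labeling function is from unlabeled data; this sketch skips that step and simply assumes fixed accuracies, combined under a naive independence assumption with a uniform prior.

```python
import math

ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1


def lf_mentions_great(text):
    return POSITIVE if "great" in text.lower() else ABSTAIN


def lf_mentions_terrible(text):
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN


def lf_mentions_refund(text):
    return NEGATIVE if "refund" in text.lower() else ABSTAIN


# (labeling function, assumed accuracy); the accuracies are invented here,
# whereas data programming would estimate them.
LABELING_FUNCTIONS = [
    (lf_mentions_great, 0.9),
    (lf_mentions_terrible, 0.9),
    (lf_mentions_refund, 0.7),
]


def probabilistic_label(text, labeling_functions=LABELING_FUNCTIONS):
    """P(label = POSITIVE | votes): each vote adds +/- log(acc / (1 - acc))."""
    log_odds = 0.0
    for labeling_function, accuracy in labeling_functions:
        vote = labeling_function(text)
        log_odds += vote * math.log(accuracy / (1.0 - accuracy))
    return 1.0 / (1.0 + math.exp(-log_odds))


for review in ["Great phone", "I want a refund", "Great, but I want a refund", "It arrived"]:
    print(review, round(probabilistic_label(review), 3))
```

The resulting soft labels can then be used as weighted training targets for an ordinary supervised model.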
ACL 2017 paper by Yonatan Belinkov and others at MIT and QCRI. Yonatan comes on to tell us about their work. They trained a neural MT system, then learned models on top of the NMT representation layers to do morphology tasks, trying to probe how much morphological information is encoded by the MT system. We talk about the specifics of their model and experiments, insights they got from doing these experiments, and how this work relates to other work on representation learning in NLP.
ICLR 2017 paper, by Yoon Kim, Carl Denton, Luong Hoang, and Sasha Rush. Yoon comes on to talk with us about his paper. The paper shows how standard attention can be seen as an expected feature count computation, and can be generalized to other kinds of expected feature counts, as long as we have efficient, differentiable algorithms for computing those marginals, like the forward-backward and inside-outside algorithms. We talk with Yoon about how this works, the experiments they ran to test this idea, and interesting implications of their work.
Yoon also brought up a more recent paper by Yang Liu and Mirella Lapata that computes a very similar kind of structured attention, but does so much more efficiently. That paper is here: https://www.semanticscholar.org/paper/Learning-Structured-Text-Representations-Liu-Lapata/4435c3586364e8f8a2c8c9ee671c39d7df7e196c.
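Here is a toy sketch of the "attention as expected feature counts" view from Yoon's paper. Each source position gets a binary selection variable, neighboring variables interact through a linear chain, and the attention weight for a position is the marginal probability that it is selected, computed with forward-backward. The score parameterization and function names are assumptions for illustration, not the paper's model, and the computation is done in probability space, which is only safe for short sequences.

```python
import math


def chain_marginals(unary, pairwise):
    """Marginals P(z_i = 1) for a binary linear chain.

    unary[i][b] is the score of z_i = b; pairwise[a][b] is the score of
    the neighboring pair (z_i = a, z_{i+1} = b).
    """
    n = len(unary)
    pot = [[math.exp(s) for s in scores] for scores in unary]
    trans = [[math.exp(s) for s in row] for row in pairwise]

    # Forward pass: alpha[i][b] sums over all prefixes ending in z_i = b.
    alpha = [[0.0, 0.0] for _ in range(n)]
    alpha[0] = list(pot[0])
    for i in range(1, n):
        for b in (0, 1):
            alpha[i][b] = pot[i][b] * sum(alpha[i - 1][a] * trans[a][b] for a in (0, 1))

    # Backward pass: beta[i][a] sums over all suffixes after z_i = a.
    beta = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for a in (0, 1):
            beta[i][a] = sum(trans[a][b] * pot[i + 1][b] * beta[i + 1][b] for b in (0, 1))

    partition = sum(alpha[n - 1])
    return [alpha[i][1] * beta[i][1] / partition for i in range(n)]


def attend(values, marginals):
    """Context vector: sum over positions of P(z_i = 1) * value_i."""
    dim = len(values[0])
    return [sum(m * v[d] for m, v in zip(marginals, values)) for d in range(dim)]


unary = [[0.0, 1.0], [0.0, -1.0], [0.0, 0.0]]
no_coupling = [[0.0, 0.0], [0.0, 0.0]]
prefer_runs = [[0.0, 0.0], [0.0, 2.0]]  # reward selecting adjacent positions
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(chain_marginals(unary, no_coupling))
print(chain_marginals(unary, prefer_runs))
print(attend(values, chain_marginals(unary, prefer_runs)))
```

With no coupling, each marginal reduces to an independent sigmoid of the unary scores; the pairwise term is what makes the attention structured.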
ACL 2017 short paper, by Jonathan Herzig and Jonathan Berant. This is a nice, obvious-in-hindsight paper that applies a frustratingly-easy-domain-adaptation-like approach to semantic parsing, similar to the multi-task semantic dependency parsing approach we talked to Noah Smith about recently. Because there is limited training data available for complex logical constructs (like argmax, or comparatives), but the mapping from language onto these constructions is typically constant across domains, domain adaptation can give a nice, though somewhat small, boost in performance. NB: I felt like I struggled a bit with describing this clearly. Not my best episode. Hopefully it's still useful.
ACL 2016 outstanding paper, by Vered Shwartz, Yoav Goldberg and Ido Dagan. Waleed presents this paper, discussing hypernymy detection and the methods used in the paper. It's pretty similar to work in relation extraction and knowledge base completion, so we also talk a bit about connections to other methods we're familiar with. Encoding paths using an RNN like they do (and like Arvind Neelakantan did for KBC) improves recall substantially, at the cost of some precision, which makes intuitive sense.
ACL 2017 paper by Abigail See, Peter Liu, and Chris Manning. Matt presents the paper, describing the task (summarization on CNN/Daily Mail), the model (the standard copy + generate model that people are using these days, plus a nice coverage loss term), and the results (can't beat the extractive baseline, but coming close). It's a nice paper - very well written, interesting discussion section.
Abigail also has a very nice blog post where she describes her work in a less formal tone than the paper: http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html
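The copy + generate model and the coverage term from the summarization paper can be sketched in a few lines. The final word distribution mixes the vocabulary distribution with the attention over source tokens, weighted by a generation probability, and the coverage loss penalizes attending again to positions that have already received attention. The numbers and token lists below are invented, and the function names are just for illustration.

```python
from collections import defaultdict


def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on copies of w)."""
    dist = defaultdict(float)
    for word, prob in vocab_dist.items():
        dist[word] += p_gen * prob
    for weight, token in zip(attention, source_tokens):
        dist[token] += (1.0 - p_gen) * weight
    return dict(dist)


def coverage_loss(attention, coverage):
    """Sum over source positions of min(current attention, accumulated attention)."""
    return sum(min(a, c) for a, c in zip(attention, coverage))


source_tokens = ["germany", "beat", "argentina"]
vocab_dist = {"beat": 0.5, "won": 0.5}  # "germany" is out of vocabulary
attention = [0.7, 0.2, 0.1]

dist = final_distribution(0.6, vocab_dist, attention, source_tokens)
print(dist)  # "germany" gets probability only through copying

# No penalty on the first step; full penalty if the same positions are attended again.
print(coverage_loss(attention, [0.0, 0.0, 0.0]))
print(coverage_loss(attention, attention))
```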
An interview with Noah Smith. Noah tells us about his work with his students Hao Peng and Sam Thomson. We talk about what semantic dependency parsing is, the model that they used to approach the problem, how multi-task learning fits into this with a graph-based parser, and end with a little discussion about representation learning.
Maruan, Avinava Dubey and Eric Xing essentially put the post-hoc decision boundary explanations from the "Why Should I Trust You?" paper as a core component of a predictive model. Maruan comes on to tell us about it.
The recently-hyped paper that got "superhuman" performance on FAIR's CLEVR dataset. https://arxiv.org/abs/1706.01427
An interview with Tim Rocktäschel. https://arxiv.org/abs/1705.11040
An interview with Chris Dyer. https://www.semanticscholar.org/paper/Recurrent-Neural-Network-Grammars-Dyer-Kuncoro/1594d954abc650bce2db445c52a76e49655efb0c
In this episode we briefly say what we're up to with the podcast. No technical content, just a description of what each episode will look like, and why we're doing this.