TACL 2018 paper by Jacob Buckman and Graham Neubig. Jacob tells us about marginalizing over latent structure in a sentence by doing a clever parameterization of a lattice with a model kind of like a tree LSTM. This lets you treat collocations as multi-word units, or allow words to have multiple senses, without having to commit to a particular segmentation or word sense disambiguation up front. We talk about how this works and what comes out. One interesting result that comes out of the sense lattice: learning word senses from a language modeling objective tends to give you senses that capture the mode of the "next word" distribution, like uses of "bank" that are always followed by "of". Helpful for local perplexity, but not really what you want if you're looking for semantic senses, probably. https://www.semanticscholar.org/paper/Neural-Lattice-Language-Models-Buckman-Neubig/f36b961ea5106c19c341763bd9942c1f09038e5d
Hello and welcome to the NLP highlights podcast where we talk about interesting work in natural language processing.
This is Matt Gardner and Waleed Ammar. We are research scientists at the Allen Institute for artificial intelligence.
Today our guest is Jacob Buckman. Jacob did a master’s degree in the language technologies Institute to CMU with Graham Neubig, he then did a Google AI residency and in a couple of weeks he will be starting a PhD program at Johns Hopkins to work with Jason Eisner, it’s good to have you Jacob with us.
Thanks so much for having me. Pleasure to be here.
Today we are talking or we’re talking about paper, got accepted to TACL called Neural Lattice Language Models. If you’re familiar with language modeling, you would know that what we’re trying to do is get a model that given some history of words that we’ve seen so far is trying to predict the next word. Typically we think of this as or these days. Most of them are some kind of recurrent neural network, like an LSTM that keeps a linear state and then tries to predict the next word. But Jacob, you’re introducing a lattice language model. What is this and how’s it different from the linear LSTM?
So the main motivation behind this is to include the sorts of inductive bias that lets us sort of intuitively understand that language does not happen one word at a time or one character at a time, if you’re working at the character level. But instead there’s certain chunks of meaningful semantic knowledge that go together sometimes into multi-award tokens. Sometimes you can have other arrangements as well. Essentially just trying to break up the paradigm that the best way to represent a sentence is one word at a time. You can think about that as a sort of jumping off point for a flexible framework that allows all sorts of different lattice layouts that correspond to different inductive biases about the texts you’re trying to model.
Can you tell us what kind of inductive bias you’re talking about here?
Yeah. So the two we dive into in this work specifically one of them is this multi word token idea, where in general, the semantic content of a word is sometimes different. You know, in a non compositional way when combined with the words adjacent to it. This is especially essential in languages that don’t have sort of spaces breaking up words, for example, a Chinese where these non compositional chunks exist, but they don’t exist explicitly in a way that’s easy to extract from the text by, you know, just separating out all the spaces. So a second inductive bias that I also cover in this paper is this idea that words have multiple meanings. So, you know, you could have a felicitous word that has multiple different completely disparate meanings and in different contexts it is used in totally different ways. And we don’t necessarily want to wrap all of that up in a single word level concept. So there’s others that you’ve envisioned. But those are the sort of the most, in my opinion, intuitively appealing versions of this model. But yeah, it can. It can be expanded to other cases as well.
So what if I have a sentence like “I went to the white house” where here white house is something that has non compositional meaning in the United States. How does your language model treat this differently than just a linear LSTM?
So a linear LSTM would predict first the probability that you see the word. So after you know, “I walked into the,” you then want to predict the probability that you have the word white and then given that you’ve seen the word white, you then also want to predict to the probability that you’ve seen the word house given that entire proceeding context. But what that loses is this idea that actually white house together, the bigram is, is a different concept. So in a lattice language model context, we actually would break this up in two ways. First, we would do the, the unigram predictions, just as you know, just as described first predict white and predict house. But then additionally we would have this second alternative. What if we see, I walked into the, and then white house, the bigram all wrapped up in a, in a single prediction. And in order to get the actual probability of the overall sentence, which is what we’re concerned with when doing language modeling, we would actually want to marginalize over these two cases. We include both the probability that we get at one word at a time as well as the probability that we get four unigrams and then the bigram of white house. And by adding those two probabilities together, we get the overall probability of producing this string as a whole.
So then if I’m understanding this right, you’re basically doing joint segmentation and language modeling. Is that what’s going on where you’re marginalizing over the segmentation?
Exactly. in this particular lattice structure, you would be doing it jointly with a segmentation. But in contrast to, you know, standard segmentation tasks, we don’t have ground truth segmentations. We let the neural models sort of learn which segmentations it implicitly prefers based on, you know, this sort of end to end perplexity optimization task.
Nice. And how, how do you know which things are chunks, which things you should chunk together, I guess? I think there’s been a lot of work on like finding colocations in this kind of stuff. How do you probe this way?
So we don’t explicitly do this chunking at any point. The model is smooth, continuous, fully differentiable. So we basically are training it to consider all possible segmentations at all times as possibilities and letting the model assign probability to unigrams in some cases bigrams and other cases in such a way that allows it to get the best perplexity. So what we do at the end to sort of do a post-talk qualitative inspection of what chunks it tends to find is we just take a sort of a, a best guessed Greedy Approximation and say, okay, if it were to simply pick the chunk that it assigned to the highest probability to and ignore all the others, which chunks would it pick? But that actually doesn’t explicitly get segmented out during the language modeling process itself.
So you said all possible segmentations. Say I have a sentence with N words. Are you really considering segments up to length N, this sounds very, very nasty,
Right? So that would be exponential, it’d be an exponentially large set of possible segmentations if we were to actually explicitly do each one. But by using basically a little dynamic programming trick, it’s possible to get a lot faster and calculate the overall perplexity of this marginalization without actually doing each individual segmentation by itself.
So you’re doing dynamic programming to marginalize over segmentation in the context of a language model. Is that a fair characterization of everything that’s going on?
So this will take care of the computational difficulty or complexity about it will not take over the large number of parameters that you need to represent in this model,
Right. So in this particular lattice you run into the second issue of even the tokens themselves, is an exponential number of these. So in order to handle that, we borrow from sort of the simultaneous word and character level language modeling literature, this is a multiple level language modeling. And we basically say, okay, what we’re going to do is we’re going to represent the tokens themselves, the probability of drawing any individual bigram or trigram by using another lower level LSTM that makes predictions at that level. And then we’re going to compress them into a single unified compositional embedding by using another compressor LSTM, if you will. So in the end, we, the multi-award token version of this model does require sort of a lot of heavy duty multilevel language modeling machinery. But the general mathematical form of this model, for example, if we talk about the multiple embeddings per word version of this model, doesn’t in general require all of that.
Well, so there’s something I didn’t quite get here. The original reason why we wanted to represent white house together because we think it’s a different concept and we don’t think it’s compositional in meaning to add up or to compose the position of white and house. However, that’s exactly how you’re computing yet. Or maybe I misunderstood,
Right, no, you’re, you’re 100% correct. So the one other piece that that fits into all this is we actually do have a finite number of non compositional embeddings as well, and when actually let’s say we wanted to compute the overall embedding that represents white house. If white house is in our finite a priori sort of ad hoc set of things that might be useful non compositional embeddings then we can catenate our non compositional embedding for white house with the compositional embedding for white house generated by our lower level LSTM. However we have exponentially many bigrams and trigrams and not exponential amounts of memory to store all these parameters. So we just take in my example of the top 10,000 most common bigrams and say each of these bigrams is a candidate for non compositional embedding. So it’s true that we could miss non compositional information in sort of long-tail bigrams that are still non compositional. But for the most part it seems that this does a good job of capturing a lot of the non-com positionality that we want to see and of course direction for future research. Is allocating these non compositional embeddings in a more useful or intelligent way rather than sort of just doing this ad hoc and gram approach.
On the other hand, frequency based collocations actually seem like they make a lot of sense. I was just at ACL and Ryan Cotterell will give a talk about tradeoffs in linguistic complexity that made me think about this. Where I guess this linguists have known this for a long time I’m just thinking of Ryan’s talk because I saw it last week, but irregular forms are inversely proportional to word frequency, right? So non compositional meanings or nonstandard morphology really can only survive in very frequently seeing context. Otherwise new learners just won’t be exposed to it enough and it will find the non-competition meeting. So you’re right that there are better things that you can do than just frequency. It’s not a bad first approximation.
Right? Yeah, that was, that was our interpretation going into it as well. Especially in this specific lattice of multiple word tokens. It seemed reasonable.
Okay. So then summarizing again you’re doing a joint segmentation and language modeling and the benefit that you’re hoping is that you can memorize some non compositional stuff into the, when you, when you do a segmentation, you can non compositional things like white house, memorize a vector for it and do have a better language model because you’ve memorized these, these things, right? I feel like doing this joint segmentation and language modeling must’ve been done before. I can’t think of a paper off the top of my head, but so is your contribution figuring out how to do this with an LSTM like with in the neural setting? Or has this really not been done before?
So there’s been, there’s been a lot of it somewhat similar work. This end to end modeling at the multiple word level has not been done in this form before. And also the generalization to other, other structures of lattice, for example, this polysemous lattice that I also presented the paper and, and other things. That’s really the where the novelty comes. So there’s a more abstract, formal way of representing this type of problem that emerges all over the place. When you have a certain linguistic intuition that doesn’t correspond well to the, you know, one word at a time LSTM paradigm.
So then your contribution then is figuring out, or at least part of it is figuring out how to get this to work in a lattice with neural techniques, right? So the complication here is that in a linear LSTM you only have one input to the recurrence at every time step. But here you have this lattice structure which will, it’s like a directed graph and it will allow multiple inputs to the LSTM cell every time step you have to figure out how to handle that. You’re going to want to explain this issue and figure it out. Tell us about how you solved it.
Yeah, absolutely. So as you, as you summarize very well, the key issue is that even if we can marginalize over the probabilities of a certain prefix of the lattice, which allows us to calculate, you know, the log likelihood and the perplexity which allows us to get an a loss term essentially to minimize the issue is that actually computing the probability of the next word is still conditional on the exact history of words that led us to this point. So if we want to get the explicit exact correct, best possible prediction of the next word, we still then would need to marginalize over exponentially many possible prefixes possible segmentation histories. So what we do instead is we use a couple of different techniques. We actually explore different techniques for deciding how to best approximate this hidden state. So rather than having the true hidden state for each of the exponentially many histories, we have a single hidden state that summarizes or approximates all the different types of information we would get during the many possible histories we could have seen.
And we let the backdrop sort of decide which of these histories information content to prioritize. White house for example. Maybe if you know the white house, the white house came earlier in the sentence and then towards the end we were trying to come up with the next word is going to be or maybe to come after president or something. Then a unified single concept will help us later on predicted that presidents probably refers to the president of the United States and not to a different president. And that will help us with our prediction task down the line there. So, but maybe in a different context where we’re simply painting a house white, then we don’t want to actually use that compositional meaning and the president maybe refers to the president of a paint company or something. So holding both of these possibilities in sort of neural memory rather than having this a single historical chain of events that a standard LSTM holds is another key challenge in constructing these sorts of models. And in the paper we explore a couple of different ways to do this approximation and compare them against each other as well.
And it looked like the best performing one that basically computes the kind of attention over the inputs and weights them accordingly. So at least it gets a probability distribution that a lot of people think of these days. It’s kind of like an attention,
Right? It’s very similar in that we’re basically saying given a certain pre-fixed we know which chains of words are most likely to have gotten us here. So we know for example, whether at this point in the sentence we’re more likely to have reached this point by going through white house as a unified token or by going through white and houses two separate tokens and by re weighting the hidden sates that are generated by those two possibilities in proportion to the likelihood that we actually reached this point. By going through that token, we ended up seeing the best performance out of the techniques we tried.
So how big are these lattices? This sounds like it could really, really slow down language modeling.
Yeah. And it certainly does, which is, which is definitely a weakness. One of the actual biggest issues is in terms of computational efficiency is the fact that we actually need to have the same lattice structure for every sentence in every mini bash in order to do many bashed computation. So that means that we can’t only have sort of these, these non compositional steps in places where we have the non-competition embeddings. We actually need to have them everywhere because we need to duplicate the computation across everything in the batch anyway. So that plus the fact that we now have to sort of, we have the higher level LSTM as well as the lower level inner token predicting LSTM that allows us to represent exponentially many tokens. It does get very heavy weight and quite a bit slower, but it does correspond to a small amount of gains in perplexity even when you take all the additional parameters into account. The bigger issue is just, you know, the runtime as well as fitting it on the GPU if you want to use a larger model.
Yeah. we, we very often make tradeoffs performance versus runtime. Right. So it’s, that’s not a problem. I’m just curious. So how much is, is it actually slower? How much slower? Do you have the number?
Yeah, so looking at the baseline LSTM, vanilla LSTM, the largest model, which is sort of the state of the art sized LSTM takes about half a second per batch and running a bigram lattice of this size takes about four and a half seconds per batch. So about nine times slower. So definitely a significant slow down there.
I guess it’s not as bad as it could be, but that is still quite slow down. Interesting. We haven’t talked too much about the Multi sense lattices. Our examples have been about the collocations like white house. Can you give some more details on how you do the multi sense version?
So in terms of marginalization, the idea is very similar and that we have sort of multiple possible ways we could have rolled out this sentence besides one word at a time. And we want to marginalize over all the different ways we could have done it. And this multi sense-embedding what we have is, we still break it down to one word at a time, but we say that for any given word we could have predicted actually version one of this word or version two of this word. And we want to marginalize over both of those two senses. So let’s say we get to, and we can still even use the the white house example, but let’s say we get to the word white and I suppose white really has a pretty concrete meaning. Let’s say we we’re going to the bank. I walked, I walked into the bank and we want to know whether it’s a financial institution or a river bank. We would have one embedding representing each of those. And what we do is we send the probability of predicting bank river and bank financial institution and by adding, those two probabilities together, you get the overall probability predicting that, that sequence of tokens again.
Okay. So before you had something that looked a lot more like a directed a cyclic, like there was a lot more dense. I’ve had connections in this lattice going more places. But here basically you have a linear chain, but at every step you have to have two different arrows to the next node. Is that fair? So very difficult and very different kind of lattice.
The math and most of the code works out exactly the same.
Interesting. Can you combine these two? Does that work?
You certainly can, but unfortunately it requires a lot of memory and time because of course you’re doing now, let’s say you have two different possible meanings for each bigram or for each trigram. Now you just have a absolutely enormous amount of things to consider. Essentially and it blows up to the point where I didn’t attempted in this work.
So how well does all of this work? You want to tell us about the experiments that you ran?
Yeah, absolutely. So I ran a several experiments on both English and Chinese data. The English language experiments came from the billion word Corpus and the Chinese language experiments came from a subset of a news dataset Guangming Daily. And essentially what we found is that in pretty much all cases, even comparing, both to standard baselines and to sort of parameter added baselines where we augment the parameters counts of the baselines to make sure that they’re equivalent to our neural lattice versions. We see that both multi-word tokens and multi embedding words lead to fairly substantial gains in language modeling on these tasks. So one thing that was interesting qualitatively or so to, to give some numbers on English using multiple embeddings per word reduces the perplexity from a 48 down to 43 points. And on the Chinese task, it goes from a 40.2 32 point baseline when using multi token chunks.
And so one thing that’s interesting is that on English, where you sort of have these spaces elimiting things already, but you have a larger amount of publicity, you actually see much bigger gains from including multiple embeddings per word. And in contrast, in the Chinese experiments, when you have a sort of many more characters but no explicit spaces separating out the different semantic chunks, you actually see much bigger gains from modeling these in a multi token chunk contexts. So this is sort of very well aligned with our expectations and it’s, it’s it was cool to see it work out like that.
Nice. That, yeah, that’s interesting and you got some, I was a little bit interested surprised by what I saw in your qualitative discussion where you actually looked at what sense is the model learned? Can you tell us about what you found here?
Yeah, absolutely. It was a bit surprising to be honest. Somewhat disappointing to me too. But after thinking about it a bit, it’s sort of makes sense. So the unintuitive finding here here was that senses earned by the model do not correspond at all seemingly to semantic intuitions about the uses of this word. So actually to that give us, have, I should clarify that the nulti-sense lattice learns the different senses in a end-to-end fashion. So at no point do we explicitly inform the model what these two senses are meant to be. All we do is provide two different embeddings for the word or three or more depending on, you know, the number of modes you want for the embeddings we provide multiple initializations so initialize a different places so they, their training dynamics are different and we let the model learn end to end how best to represent information in these different embeddings to capture the maximum amount of probability of the training distribution.
So what we would expect, right? It sort of what I described before, if do have two embeddings for bank, then you want one of them to be embedding for bank, the financial institution and one of the embeddings to be for bank of the river for example. Or maybe one of them is for bank, like to bank a plane. If that is more common, these embeddings sort of get repurposed for a bizarrely explicit syntactic context almost at the entire absence of semantic information. So actually what we see is that the two embeddings for the word bank, one of them is explicitly to represent the word bank in the context, the bank of country name. So the bank of England, the Royal bank of Scotland, bank of Japan. And that is the only time in the entire Corpus this data, that this embedding for this sense of bank as used every other sense of bank including it’s a bank holiday, which corresponds to the same semantic sense of bank as the previous example.
All of these get bundled up in the other version of the bank embedding. And what this sort of means is that the model given the flexibility to do so chooses to overfit to sort of these syntactic irregularities in the data set where clearly, you know this was a, the billion word Corpus is sampled from, from news corpuses. So the bank of country name is probably a little bit overrepresented. And what it indicated to me was that in contrast to expectations, the things that we capture when training something like a learned embedding are likely capturing a lot more syntactic context than we often would like to believe. We often would like to believe that is capturing some deep semantic in summation and these results sort of indicate that that may not be completely true.
Yeah, I thought this was super interesting. I hadn’t really thought about this before. Your paper had made me think a lot, which is a great thing and I, it made me think of why this entropy loss in predicting the next word. Right? And so if it, so say there’s, when you, when you see banks there’s some distribution over next words. If you can pull out the mode of that distribution and assign it like the most frequent export, get a separate embedding for the word bank that only captured like that says explicitly, I always see this word next. You can dramatically lower your perplexity for that instance, which is the most frequent instance. And then just use the other embedding to capture all the rest, which is, it’s kind of obvious when you like think about the math, but it’s not something you would think about without seeing this. It’s fascinating,
Right? Yeah, it was. It was pretty surprising to me as well. Another thing that I found to be somewhat interesting is that certain words, there was only one mode, and this was actually, this actually came out in examining what the model learns. So for example, the word a word, rodham. Yeah. You know, Hillary Rodham Clinton is the only context that appeared in the entire dataset. And in this case, the model simply always preferred to use that single embedding. It ignored the other beddings that we made available to it. So in this context, I think that we’re seeing this behavior where if it can capture it in a single mode, it will, is happy to. It makes me interested to see whether we could use this behavior to actually improve neuro lattice language models by dynamically assigning more or fewer embeddings by picking off one mode at a time.
So in that sense, even though it’s a bit disappointing to see it taking this, you know, as you said, this, this optimizing cross entropy in an in a, in a way that doesn’t help us semantically, I’d be interested to see if we can maybe throw enough computation at it and let it keep picking off modes to the point where we’ve really gotten good coverage of the distribution. Maybe at that point we actually would be able to see the whole sort of the whole space of semantic meanings represented by unique modes. Of course, one semantic meaning would still have multiple modes from multiple syntactic contexts, most likely. But maybe we could do a, another form of clustering after that to group them together. That’s a future direction of research potentially.
Yeah. Interesting. Were there other cases besides just rodham that that only had a single sense that seems like pretty rare? Like you would only get this for very rare words.
Yeah. So it took me a bit of digging to find it. I believe there were some other ones, but I unfortunately forgot what they were. That was the example I included in the paper. But but right, as you said, it’s only very, rare very, unusual words. I think they were all names of people or places.
Yeah. I had another conversation at ACL recently thinking about like using language models to do word sense induction and looking at this paper makes me a little bit skeptical that that will work like we want it to. Right. Because we’re a lot more likely to find syntactic senses than semantic senses. And I’m, I’m sure people that work on word sense induction. I have found this out at one time ago. It’s just new to me.
Yeah, it was new to me as well. Yeah.
Interesting. Great. Jacob, thanks. This has been a really interesting conversation. Do you have any last thoughts before we conclude?
Yeah. I just want to say thank you so much for having me. I’m excited it to see what did this sort of research goes.
Great. Thanks for coming on. It’s nice talking to you.
You as well. Have a good one.