Hello and welcome to the NLP highlights podcast where we talk about interesting recent work in natural language processing.
This is Matt Gardner and Waleed Ammar. We are research scientists at the Allen Institute for Artificial Intelligence.
Today’s paper is A Syntactic Neural Model for General-Purpose Code Generation, by Pengcheng Yin and Graham Neubig at the Language Technologies Institute at Carnegie Mellon University.
This paper, I think of it as semantic parsing, because I’ve done semantic parsing and that’s what I’m familiar with. They didn’t position it quite that way, but the high-level point here is that they’re taking natural language descriptions of code and trying to generate the code from those descriptions, which I think is a fascinating task. When I think of semantic parsing, it’s mapping language onto executable representations, some kind of meaning representation that you can execute against something. And code generation is definitely in this same kind of framework.
It’s just a harder task, probably, than typical semantic parsing, because the more traditional semantic parsing tasks have a much more limited language that you parse into.
Right, this line of work seems like science fiction to me. I’ve seen a few papers recently on this topic, and it’s fascinating that we’re able to actually make progress on this kind of problem.
Yeah. Just to give an example of the kind of data that they’re working with. There’s this game called HearthStone. If you’re familiar with the game Magic the Gathering, this is a trading card game that people played, I don’t know, 20 years ago it started maybe, maybe longer. Wow. I’m old. I played this game as a kid and I’m just trying to remember how long ago it was. Anyway. so Blizzard recently made this new game called HearthStone, which is like Magic the Gathering but done on mobile devices.
And what they have is a dataset where people did an open-source implementation of HearthStone. Then you can take all of the cards in this trading card game, take the text description of each card, and try to predict from that text description the code for the class that implements that card in the open-source implementation. As a really simple example, there’s this card whose name is Brawl, it costs five, whatever; don’t worry too much about what that means. Its description is “destroy all minions except one chosen randomly.” And from that description you get a class Brawl that subclasses a superclass called SpellCard. It’s hard to see this in audio, but it’s got a constructor, it’s got a use method, and other stuff; I won’t go too much into the details. All of this is code that was generated by the system. It doesn’t get this one perfectly, but it gets pretty close semantically to what the card is trying to do.
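To give a rough feel for the target, here is a minimal sketch of the kind of class the system has to produce for Brawl. The SpellCard and Minion classes below are hypothetical simplifications for illustration, not the real open-source engine’s API:

```python
import random

class Minion:
    """Hypothetical stand-in for a minion in play."""
    def __init__(self, name):
        self.name = name
        self.alive = True

    def die(self):
        self.alive = False

class SpellCard:
    """Hypothetical base class for spell cards."""
    def __init__(self, name, mana_cost):
        self.name = name
        self.mana_cost = mana_cost

class Brawl(SpellCard):
    """'Destroy all minions except one, chosen randomly.'"""
    def __init__(self):
        super().__init__("Brawl", 5)

    def use(self, minions):
        # Pick one random survivor; every other minion is destroyed.
        if len(minions) > 1:
            survivor = random.choice(minions)
            for minion in minions:
                if minion is not survivor:
                    minion.die()
```

The constructor carries the template-like fields (name, cost), while the use method carries the actual semantics of the card text.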
There’s some assumption here about what entities or what classes exist in addition to the Brawl card. How does the model make assumptions about this?
Yeah, yeah. This is a whole lot of stuff to just generate, to hallucinate given only the description, and the only reason it’s able to do this is because it has a whole lot of data. The only things it’s seen are classes like this. So you’re essentially learning a template from all of the training data, and you’re filling in pieces of the template from what you see in the input. The name and the cost are pretty standard things that would be really easy to get just from this template; the hardest part is the use method, which actually has implementation code corresponding to the description.
I see. So the deck of cards is split into a training set and a test set.
Yes. Yes. Let’s see, in this dataset I think it was about 660 cards total. Finding the number here. Yeah, 533 of them are train, 66 are development, and 66 are test. And each card has one class associated with it; I’m pretty sure that’s true of this dataset, though I haven’t looked at it in detail.
Very interesting. So how is the code represented? What is the representation?
Yeah. So I think the first models that tried to do this used just a vanilla sequence-to-sequence-with-copying kind of neural net. The interesting contribution of this model is that they convert the code into the abstract syntax tree that a compiler would generate, and then they learn a sequence-to-tree style model where they encode the input description and decode a tree. There have been some previous sequence-to-tree methods that are essentially sequence-to-sequence but use history in some interesting way. The innovation of this paper is to actually have a grammar over the trees that it generates. So instead of generating sequences in a tree structure, they’re generating application rules, actions that they can take. This is actually pretty similar to shift-reduce kinds of parsers and the stuff we talked to Chris Dyer about a few episodes ago, where you encode the state of the parser, which in this case is just the input text, and then you predict a series of actions. And in this case those actions generate a tree, very similar to what we saw before, except instead of a syntax tree this is an abstract syntax tree that will be converted into actual code.
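As a concrete illustration of the representation (this snippet is ours, not from the paper), Python’s standard ast module will show you the abstract syntax tree that this kind of model decodes into:

```python
import ast

# Parse a one-line program into its abstract syntax tree.
tree = ast.parse("x = max(a, b)")

# The tree is made of typed nodes (Module, Assign, Call, Name, ...),
# exactly the kind of structure a tree decoder generates node by node.
print(ast.dump(tree))
```

Generating these typed nodes, rather than raw token strings, is what lets the model guarantee syntactically valid output.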
Note that you contrast the approach described in this paper, a sequence-to-tree model, with a sequence-to-sequence model. It’s not too surprising that Graham Neubig, someone who worked on machine translation for a long time, including string-to-tree methods, would be interested in this problem.
Yeah, I think it’s also a nice bridge. If you think of this in terms of semantic parsing: traditionally, semantic parsing has been done using some grammar formalism like CCG, combinatory categorial grammar. There you’re generating a tree, I guess from the bottom up, where you define a lexicon, each word has some syntax and semantics that go with it, and then you combine them using these bottom-up application rules. And it’s been hard to think of how to get a neural model into these CCG-style semantic parsers. So some recent work at ACL 2016 and other papers said, let’s forget about this and just do a sequence-to-sequence model that takes the text as input and outputs a sequence that is the logical form. The trouble there is that you’re not guaranteed to generate valid logical forms.
And so the hard thing to think about is how you get the niceness of the neural model, which is powerful and has these nice representations, combined with the tree-structured formalisms that we had in CCG. This is the same problem that this paper is trying to tackle. And I think this idea of encoding the sentence and decoding a tree inside of a grammar formalism makes a whole lot of sense. In fact, it makes so much sense that we did this in a paper we just submitted to EMNLP. We were trying to do semantic parsing on WikiTables, a dataset where some folks at Stanford took a bunch of tables from Wikipedia and had people on Mechanical Turk ask questions about the tables. So for example, you might get a table that has the results of which countries got how many medals in various years of the Olympics.
And one question might be: which country got the most gold medals in 2014? So you have to convert that question into a logical form that gets executed against the table. I guess you could do it other ways, but if you want to do this as a semantic parsing problem, that’s how you would do it. And independently of this work (I didn’t know about it until I read this paper yesterday), we developed a model whose backbone is almost identical to this. We have a biLSTM encoder that gets a representation of the input question, and then we decode a sequence of actions that are production rules over a grammar. The details of the grammar and how exactly that works are going to be different, because our data is very different from this code generation data, but the underlying framework is basically identical, which I thought was really interesting.
So what are the actions that this model relies on in order to construct the abstract syntax tree?
Yeah. So there are really only... you can think of it as just one action, but they split it into two. Given some type that you’re trying to generate, you do an application that converts that type into something the grammar says can be produced by that type. If you think back to the days of PCFGs (I guess some people still use these, but they’re not as common anymore), you had a series of grammar rules that say this nonterminal can go to these other nonterminals, and each rule has an associated weight. We still have basically the same thing, except what used to be a nonterminal is now a type, or a current representation of the tree, or whatever the state of the parser that you’re in. There is some set of valid actions, and you want to compute a distribution over those actions that’s parameterized by a neural net with access to all of the previous states and actions you’ve taken.
So these models are very similar in that they have this grammar that says, if I’m in this state, I can take this set of valid actions. But they’re very different because it’s no longer context free: you have this neural net that’s parameterizing on the entire previous history. So, I guess, the actions you can take are the application rules that are valid at any particular state in the tree. Either that means a non-terminal kind of production, where you go from a non-terminal to some other set of non-terminals, or you go from a pre-terminal to a terminal. Those are basically the two kinds of actions they have, and they call them ApplyRule, which is analogous to a non-terminal going to a set of non-terminals or pre-terminals, and GenToken, which is analogous to a pre-terminal going to an actual string.
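A toy sketch of these two action types (our illustration, not the paper’s implementation): ApplyRule expands the frontier non-terminal with a production’s right-hand side, GenToken attaches a terminal string, and replaying a depth-first sequence of actions deterministically rebuilds a tree:

```python
def decode(start_symbol, actions):
    """Replay a sequence of (ApplyRule | GenToken) actions into a tree."""
    root = {"sym": start_symbol, "children": []}
    frontier = [root]  # nodes still waiting to be expanded, depth-first
    for kind, payload in actions:
        node = frontier.pop(0)
        if kind == "ApplyRule":
            # payload: the production's right-hand side, e.g. ["Name", "Expr"]
            node["children"] = [{"sym": s, "children": []} for s in payload]
            frontier = node["children"] + frontier
        else:  # "GenToken": payload is a terminal string
            node["children"] = [{"sym": payload, "children": []}]
    return root

# Derivation of a tree for something like `x = 1`:
tree = decode("Assign", [
    ("ApplyRule", ["Name", "Expr"]),
    ("GenToken", "x"),
    ("GenToken", "1"),
])
```

In the real model a neural net scores which action to take at each step, but only actions the grammar allows at the current frontier node are candidates.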
So the set of rules is extracted only from the training set? All the rules that have been observed in the training set will be valid for you to pick from.
Actually, I’m not sure they even use the training set for this; it was a little hard for me to gather from the paper, so I might be wrong here. But I think they just know from valid Python what the valid rules are: you know from the compiler what’s a valid abstract syntax tree, so you can write that down by hand yourself, without looking at the data at all, and know what is valid to generate. In our work on WikiTables we didn’t have that, so we did in fact learn a set of valid rules for each state or type from the training data. I could be wrong here, they may have done this too, but it wasn’t clear to me from their paper.
So one action that has been important in previous papers that do parsing in a similar fashion is an action that consumes one word from the buffer. I wonder if there is such an action here.
Not really. You do have some encoding of the actions that you’ve already taken, which say that I’ve generated a particular word, and you could imagine encoding an attention over the input based on what you’ve generated. This is getting a little fuzzy; I don’t think they actually did this. I was thinking more of the neural checklist model by Chloé Kiddon and folks at UW, which tries to keep track of what parts of the input you’ve already covered and used an encoding of that as input to further actions. I think that’s the way you would handle it in this setting, but I don’t think they did that in this model. So they don’t really have some notion of popping something off a stack to know what parts of the input they’ve covered. You could imagine adding that here, but I don’t think they did.
From the paper, they seem to have a bidirectional RNN at the bottom that runs through the entire text description.
And that feeds into another hidden layer, which is then used to predict which rule to use.
Yeah. And they compute an attention: at each generation step, at each action they take, they use the current hidden state of the parser to compute an attention over the input sequence, and they use that as input to the part of the model that predicts the next action.
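As a sketch of that attention step (simplified dot-product attention; the paper’s exact scoring function may differ, and this snippet is our illustration):

```python
import numpy as np

def attend(encoder_states, decoder_hidden):
    """Weight each input position by its match with the decoder state.

    encoder_states: (num_words, dim) outputs of the bidirectional RNN.
    decoder_hidden: (dim,) current hidden state of the tree decoder.
    Returns a (dim,) context vector fed into next-action prediction.
    """
    scores = encoder_states @ decoder_hidden   # one score per input word
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ encoder_states            # weighted context vector

context = attend(np.random.randn(7, 4), np.random.randn(4))
```

The context vector changes at every decoding step, so different parts of the description can drive different parts of the tree.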
That’s very interesting. So how does this compare to neural machine translation methods, which also use an attention mechanism over the sequence of words in the source sentence?
Yeah. The big difference is the addition of this abstract syntax tree generation, right? You’re decoding to a tree instead of to a sequence, so you can encode a lot more constraints in your generative process. And they compared it, in fact, against a neural machine translation baseline, and as you would expect, they do way better. They have a metric that’s really strict: what’s the accuracy of the complete code that I generate, like an exact match on the entire class you’re supposed to generate. For the HearthStone dataset, the neural machine translation system got 1.5% accuracy, which is pretty abysmal. They also compare against a sequence-to-tree model that doesn’t have the grammar formalism encoded. I’m not sure what to call their own system; they just label it “our system” in the table.
But this grammar-constrained sequence-to-tree model gets 16.2% exact-match accuracy, which is way better than the neural machine translation system. They also have some experiments showing that careful use of unknown-token copying is one of the huge drivers of performance here: a name like Brawl in the card we talked about earlier has to be copied from the input to a few different places in the output. So being able to say “I’m copying a word from the input” is really important in these kinds of models.
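A minimal sketch of that copying idea (pointer-style mixing; our illustration, not the paper’s exact parameterization): the final distribution over the next token blends generating from a fixed vocabulary with copying a word straight from the input description:

```python
def token_distribution(p_copy, vocab_probs, copy_probs):
    """Mix a vocabulary distribution with a copy-from-input distribution.

    p_copy: probability (from the network) of copying rather than generating.
    vocab_probs: {word: prob} over the fixed output vocabulary.
    copy_probs: {input_word: prob} over words in the description.
    """
    final = {w: (1.0 - p_copy) * p for w, p in vocab_probs.items()}
    for w, p in copy_probs.items():
        # A rare name like "Brawl" gets its mass from the copy branch,
        # even if it never appears in the output vocabulary.
        final[w] = final.get(w, 0.0) + p_copy * p
    return final

dist = token_distribution(
    p_copy=0.6,
    vocab_probs={"self": 0.7, "super": 0.3},
    copy_probs={"Brawl": 0.9, "minions": 0.1},
)
```

Because both branches are proper distributions, the mixture is one too, and out-of-vocabulary card names stop being unrecoverable.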
Yeah, absolutely. Especially if you get rid of the buffer, the idea of consuming words from a buffer and adding them to the tree, that seems like an important mechanism.
Yeah. So this is a really interesting piece of work. I think that’s about all that we have to say. I guess it’s obvious that I think this is interesting because I did something very, very similar and I’m really excited about semantic parsing in general. I think this is a nice contribution.
Thanks for discussing this paper, Matt. Next time we’ll talk about a paper titled Relation Extraction with Matrix Factorization and Universal Schemas. It was published at NAACL 2013, written by Sebastian Riedel and other folks at the University of Massachusetts.