Empiricist practice has a deep flaw that is rarely discussed. At its origins, cryptanalysts were working on sources very restricted in content and vocabulary. After all, it was not so likely that Enigma traffic would have much outside the military order of the day, weather, and the like, or go full Jabberwocky. When you get enough token counts for traffic like that, you pretty much know everything you can know. It's what Harris noted so profoundly about the differences between technical languages and general language. Popular tasks of the empiricist era, from ATIS to PTB, were similarly restricted (travel, business news, ...). What this means is that typical count-based empiricist methods do much better on their own benchmarks than in real life. Just try to parse the Web (let alone social media or chat) with a PTB parser to see what I mean.
Where a lot of training data can be collected in the wild — most notably, parallel translation corpora — empiricist practice with enough counts limps along the long tail, although it's touch-and-go when the counts get small, as they always do.
Another way to see this is that empiricist dogma was protected from its own demise by a rather convenient choice of evaluations. Those of us who have worked hard to apply these methods to real data know well the struggle with small token counts, and the dismaying realization that fancier statistical methods (such as latent-variable models) are most often a waste of effort, because in practical situations a flat count-based model (or a linear model) can do as well as can be hoped with the data at hand. Those of us who thought a bit more about this started to realize that token counting and its variants could not generalize effectively across “similar” tokens. We tried many different recipes to alleviate the problem (e.g., class-based language models), but they were all ineffective or computationally infeasible (I know, I co-authored quite a few papers in that mode, including at least a couple of best papers; it goes to show the limited horizons of program committees).
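Since I mention class-based language models: the recipe was appealingly simple, which made its limited payoff all the more frustrating. Here is a minimal sketch in Python, assuming a precomputed word-to-class map (say, from Brown clustering); the function names and toy maximum-likelihood estimates are mine for illustration, not from any particular paper:

    from collections import Counter

    def train_class_bigram(tokens, word2class):
        """Class-based bigram LM in the style of Brown et al. (1992):
        p(w | w_prev) ~= p(C(w) | C(w_prev)) * p(w | C(w))."""
        classes = [word2class[w] for w in tokens]
        word_n = Counter(tokens)                       # c(w)
        class_n = Counter(classes)                     # c(C)
        class_bi = Counter(zip(classes, classes[1:]))  # c(C_prev, C)
        def prob(w_prev, w):
            c_prev, c = word2class[w_prev], word2class[w]
            p_cc = class_bi[(c_prev, c)] / class_n[c_prev]  # MLE p(C | C_prev)
            p_wc = word_n[w] / class_n[c]                   # MLE p(w | C)
            return p_cc * p_wc
        return prob

    # Toy usage: "dog" is rare, but borrows from the counts of its classmates.
    tokens = "the cat sat on the mat the dog sat on the rug".split()
    word2class = {"the": "DET", "on": "P", "sat": "V",
                  "cat": "N", "mat": "N", "dog": "N", "rug": "N"}
    p = train_class_bigram(tokens, word2class)
    print(p("the", "dog"))  # 0.25, despite a single occurrence of "dog"

The pooling helps exactly as far as the hard clustering is right, which is part of why these models plateaued: a word gets only as much generalization as its one assigned class allows.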
It's here that deep learners carrying warming GPUs descended from the Northern wastes to lay siege to the empiricist (bean) counters and their sacred metrics. First language modeling, then machine translation fell, in no little measure thanks to their ability to learn usage and meaning generalizations much better than counting could. The modularity of NN models made it easier to explore the model design space. Recurrent gated models (thank you Hochreiter and Schmidhuber!) managed history in a much more flexible and adaptable way than any of the history-counting tricks of the previous two decades. It was a rout.
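To make the contrast concrete, this is roughly the cell update that did the work: a minimal numpy sketch of the standard LSTM step, in textbook notation rather than any specific system's code. The gated, additive memory update is what lets the model keep or discard history adaptively, instead of committing to a fixed n-gram window:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, b):
        """One LSTM step. W has shape (4*d, len(x) + d), b has shape (4*d,)."""
        d = h_prev.shape[0]
        z = W @ np.concatenate([x, h_prev]) + b
        i = sigmoid(z[0:d])        # input gate: how much new content to write
        f = sigmoid(z[d:2*d])      # forget gate: how much old memory to keep
        o = sigmoid(z[2*d:3*d])    # output gate: how much memory to expose
        g = np.tanh(z[3*d:4*d])    # candidate content
        c = f * c_prev + i * g     # gated, additive memory update
        h = o * np.tanh(c)
        return h, c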
The excitement of the advance was irresistible. The numbers of researchers involved, of experiments run, and of papers published grew fast, by my estimate 4x from 2010 to 2017. Publication venues overflowed, and researchers burning ever more fossil fuel with their GPUs (morality tale warning!) turned to arXiv to plant ever more flags on their marches through newly conquered lands (not the most culturally apt of behaviors, it must be admitted).
But was the invasion so glorious? Very few of the standard tasks have the very large training sets of language modeling or translation that large-scale SGD depends on. Some tasks with carefully created training sets, like parsing, showed significant but less striking gains with deep learning. There are exciting results in transfer learning (such as the zero-shot translation results), but they rely on starting from models trained on a whole lot of data. However, when we get to tasks for which we have only evaluation data, where count-based models can still do decent work (clustering, generative models), deep learning does not yet have a superior answer.
For continuous outputs, GANs have made a lot of progress. At least, the pictures are stunning. But as I discovered when I worked on distributional clustering, stunning is very much in the eye of the beholder. Proxies like the word association tasks so popular in evaluating word embeddings are almost embarrassingly low-discrimination compared with the size of the models being evaluated. Sensible ways of evaluating GANs for text are even scarcer.
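For a sense of how coarse those proxies are, consider what a standard word-association evaluation boils down to: a whole embedding model summarized by one rank correlation against a few hundred human-rated word pairs. A sketch (the emb dictionary and the rated pair list are placeholders, not a real benchmark loader):

    import numpy as np
    from scipy.stats import spearmanr

    def word_association_score(emb, rated_pairs):
        """emb: dict word -> vector; rated_pairs: [(w1, w2, human_rating)].
        Returns a single scalar summarizing the whole embedding model."""
        def cos(u, v):
            return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        model = [cos(emb[w1], emb[w2]) for w1, w2, _ in rated_pairs]
        human = [r for _, _, r in rated_pairs]
        return spearmanr(model, human).correlation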
In defense of the Northern hordes, the empiricist burn-it-to-the-ground campaign left little standing that could promote a new way of life. Once the famous empiricist redoubts are conquered, or at least besieged, how does the campaign continue?
Idea! Let's go back to toy problems where we can create the test conditions easily, like the rationalists did back then (even if we don't realize we are imitating them). After all, Atari is not real life, but it still demonstrates remarkable RL progress. Let's make the Ataris of natural language!
But now the rationalists converted to empiricism (with the extra enthusiasm of the convert) complain bitterly. Not fair, Atari is not real life!
Of course it is not. But neither is PTB, nor any of the standard empiricist tasks, which try strenuously to imitate wild language (their funding depends on it!) but really fail, as Harris predicted back in the 1950s. Or even the best of descriptive linguistics, which leaves in the murk all those messy deviations from the nice combinatorics of the descriptive model.
12 comments:
I like Müller's (briefly) brave traveller from Winterreise a lot more than Schikaneder's archetypes. And the music is just as good, though in a different way.
Mut (Courage)
If the snow flies in my face,
I shake it off.
When my heart speaks in my breast,
I sing brightly and cheerfully.
I don't hear what it tells me,
I have no ears;
I don't feel what it laments,
lamenting is for fools.
Merrily out into the world,
against wind and weather!
If no God will be on earth,
we ourselves are gods!
A decade ago, Ken Church talked about the rationalism/empiricism "pendulum," in his words. I wonder what he would say after another swing of the pendulum over the past decade?
Ken Church Link: http://languagelog.ldc.upenn.edu/myl/ldc/swung-too-far.pdf
Quite excellent, sir! I've pondered the varied understandings within humanity (in time, space, and experience) of the word 'apple'. "AI" still feels like interacting with children or the inexperienced. What will it take for people to find technologies (magic?) to be on par with our complexity?
Beautiful essay, so very true.
You've got to get on FB, Fernando! That's where all the conversations are :-).
My response: https://www.facebook.com/saraswat/posts/10154848628252098?comment_id=10154848654822098&notif_t=like&notif_id=1497270182088850
Yann LeCun's response to Yoav's post: https://www.facebook.com/yann.lecun/posts/10154498539442143
I worry most about the training that students are getting these days. I commented on this issue 8 years ago (http://rws.xoba.com/newindex/ncfom.html), in the pre-deep-learning days, when it seemed that the field of NLP was already turning into Applied Machine Learning. Indeed, I have been calling it that ever since. Others (e.g., Kevin Knight) had made similar comments about students who know everything about MaxEnt but don't know what a noun phrase is. This will presumably only get worse before it gets better.
The easiest way for the pendulum to swing back would be for neural systems to fail to do what they are supposed to do. Take text-to-speech, for example, where a fully neural system might read text in an embarrassing or incomprehensible way. It won't be so easy to repair that by simply providing yet more data, with any kind of guarantee that the problem will be fixed or that new problems won't be introduced.
What about connecting language to the rest of cognition? Studying it in isolation only goes so far. I think James Allen's work, and the Open Dialogue work at MSR, are two examples of language in contexts that provide good laboratories for making progress. Learning by reading, evaluated via human-normed tests, is another. We evaluate our LbR systems by question-answering as much as possible, since that's ultimately one of the things that one wants to do with language.
I'm working on a reply in my next opinion piece for NLE. The last one was: https://www.cambridge.org/core/journals/natural-language-engineering/article/div-classtitleemerging-trends-i-did-it-i-did-it-i-did-it-but-div/E04A550C6DFF0154C684888B7B9F68EA.
Short answer: we basically agree on the past, but the future is harder to get right. There hasn't been as much movement toward rationalism as I had expected, and teaching has become even more focused on whatever happens to be hot recently, as Richard points out below.
I heard a really interesting interview on NPR recently about strong opinions and content-free debate. http://www.npr.org/2017/06/12/532242775/hue-1968-revisits-an-american-turning-point-in-the-war-in-vietnam
"You know, for me personally, you know, I grew up - I was in high school when this war was going on. I used to have these knock-down, drag-out fights with my dad over the war. I was against it, and he was in favor of it. And neither of us knew enough to have a really strong opinion. But we had them anyway."
The author then goes on to credit his father with teaching him to do his homework and get the facts, habits that made him successful in his career. It is OK to reject the classics, but not without reading them. Shameless plug for my course at Columbia in the fall: http://www.columbia.edu/~kc3109/topics_in_HLT_V3.htm. We will read the classics (and maybe reject them).
Richard: What's a noun phrase? :P
Seriously, Fernando, great essay. I'm grateful you were able to post this as a reasonable account in the face of the heightened emotions both on display and generated by Yoav's screed. It is helpful to take stock of the history of the field, even as we may await its next act.
Speaking of Yoav's screed and Yann LeCun's response, I think there is room for both of them to be right. It is perfectly fine for there to be a bazaar approach to research, and it is fine (and expected) to be highly critical of work that does not pass muster, either in terms of claims, methods, or evaluation methodology.
On one particular point you made, Fernando: as I know you're well aware, the bean-counter empiricists spent quite a while grappling with the expense of human-labeled data (for tasks other than MT, for which large amounts are available), hoping they would find that holy grail of semi-supervision, where a small amount of expensively obtained labeled data can be combined with an arbitrarily large amount of unlabeled (and therefore cheap) data. Arguably, the deep learning "hordes" have an approach that comes closer than any other in the last few decades to achieving this aim. Transfer learning of this form is perhaps the biggest boon provided by deep learning methods to NLP, and there is clearly much more interesting work to do on the subject.
So, after all, we are faced with philosophical rather than technical problems. Do you think philosophy (or even the history of philosophy) should be a central part of MEST curricula?
Great essay, Fernando! I still wonder which rationalist linguistic formalisms can be linked to cognitive mechanisms and which were simply brilliant rationalizations.
I think most of today's empiricists don't even know what phenomena were being modeled. Mining the old formalisms for insights that can be harvested or explained with the newer tools seems like a rich vein of research.
I really liked your example of how LSTM handled memory and long distance dependencies better than the methods computational linguists had been struggling with. I would love to see you write this story up in more detail.