Saturday, June 10, 2017

A (computational) linguistic farce in three acts


I had not blogged for 3 years. There are many plausible excuses, but the big reason is that it is easier to dash off a tweet or a short incidental social media post than to structure a complex argument, which uses mental resources that I need full time at work. But the argument about deep learning, natural language, publication styles and venues that Yoav Goldberg posted on Medium reminded me of something that one day (not today) I would like to get to: the complicated, sometimes hazy, often contentious history of the science and engineering of language as a computational process (I know, I know, even putting it that way could trigger many social scientists and philosophers, but this is just a blog post, not a treatise).

I call this a farce not in a derogatory way, but for its many misunderstandings and pratfalls, in the best traditions of comedic theatre, opera, and silent movies. Who has not stepped on a rhetorical rake in heated academic discussion may cast the first water balloon. After all, these debates issued from the very serious work of intellectual giants of the 50s and early 60s: Kleene, Shannon, Harris, McCulloch, McCarthy, Minsky, Chomsky, Miller, ... One day I'd love to see a careful, thoughtful intellectual history of the origins of AI in general and of the computational turn in language in particular, but we don't have one, so I'm free to make up my own comedic version.

Act One: The (Weak) Empire of Reason

Much of the work on computational models of language and language processing until the 80s was based on an implicit or explicit hope that relatively simple algorithms would capture much of what mattered. Researchers (including me) created models and algorithms that claimed to capture the “essential” phenomena in a modular, compositional way. Once that was done, practical applications would follow easily, since the nice combinatorics of compositionality would cover the infinitely many ways people express meaning.

That was nice, but there was the nagging problem that none of those models or systems could parse, let alone usefully interpret, most of the language occurring in the wild. Even back then, artificial neural network fans argued that those crisp formal models of language failed because they did not have enough “flex” at their joints. That led to some epic food fights, but the reality is that the NN models and algorithms, and above all the puny computers and datasets we played with back then, could not even match those carefully handcrafted rule systems.

One (temporary) escape from this mismatch between models and actual language was to turn ourselves into formal linguists (I did that too) and argue that we were using computational tools to investigate the core of language, leaving that wild mess of actual language for later decades when we'd have finally dug up the keys to the treasure house. This was a nice detour for both symbolic and neural-network researchers, and it had a not totally unreasonable methodological defense in that, say, physicists also investigated simplified systems (oh, that physics envy!). Of course, this sidestepped the uncomfortable feeling that language, as an evolved biological and social phenomenon, might not have a simple description at all. Incidentally, there's a bit of a parallel here with how biology and biomedical research went on a “simplicity” trip after the discovery of the genetic code (and even later after the sequencing of the human genome), only to keep being foiled by the daunting mess that evolution has left us. Nevertheless, I'd still argue that some of the descriptive models of language developed then capture the range of certain actual combinatorial possibilities in language at a level of detail that has not been bested. That's mainly a story for another time, except that the lure of simplified settings and models comes back in Act Three.

The field was very small back then. Everyone knew everyone, even those who might despise each other's work and say it loudly in ACL question periods. As a result, a few powerful arbiters of research taste set the tone for each sub-community. Combined with the limited means of research circulation then, that led to small, cohesive cliques. When such a group captured control of research resources (funding, plum academic or industrial roles), as did happen, alternative ideas did not have much room to grow.

Act Two: The Empiricist Invasion or, Who Pays the Piper Calls the Tune

The not insignificant research funding that computational research on language had received from the late 70s to the late 80s, combined with changes in the research funding climate (a whole interesting story in itself, but too long and twisted to go into here), created an opening for bold invaders to convince funders that the Emperor of Reason had been committing research in the altogether.

The empiricist invaders, in their way heirs to Shannon, Turing, Kullback, and I.J. Good, had been plying an effective if secretive trade at IDA and later at IBM and Bell Labs, looking at speech recognition and translation as cryptanalysis problems. (The history of the road from Bletchley Park to HMMs to IBM Model 2 is still buried in the murk of not fully declassified materials, but it would be awesome to write; I just found this about the early steps that could be a lot of fun.) They convinced funders, especially at DARPA, that the rationalist empire was hollow and that statistical metrics on (supposedly) realistic tasks were needed to drive computational language work to practical success, as had been happening in speech recognition (although by the light of today, that speech recognition progress was less impressive than it seemed then). It did not hurt the campaign that many of the invaders were closer to the DoD in their backgrounds, careers, and outlooks than egghead computational linguists (another story that could be expanded, but might make some uncomfortable). Anyway, I was there in meetings where the empiricist invaders, allied with funders, increasingly laid down the new rules of the game. As in the Norman invasion of England, a whole new vocabulary took over quickly with the new aristocracy.

In hindsight, the campaign of 1987-89 and the resulting new order were quite entertaining (even if they did not seem so to the invaded at the time) and brought new cultural devices that were objectively more effective in defining measurable progress, if quite stressful for funding recipients. Personally, I had already started my own journey from somewhat skeptical rationalism to somewhat skeptical empiricism, and would leave the Government-funded research world for the next 12 years, so the conflict was a great opportunity to develop a more distanced view of both the old and the new culture.

The empiricist ascendancy was fortunate (or prescient) in riding the growth of computing resources and text data that also enabled the Web explosion and the flooding of all this work with new resources for funding research, software development, and corpus creation. The metrics religion helped funders sell progress to the holders of the purse strings, and there were real (if not as extensive as sometimes claimed) practical benefits, especially in speech recognition and machine translation. As a result, the research community grew a lot (my top-of-the-head estimate is around 5x from 1990 to 2010).

One curious byproduct of the empiricist ascendancy that is relevant to the present conflict is that measurement became a virtue in itself, sometimes quite independently of what was really being measured. Many empiricist true believers just want the numbers, regardless of whether they correspond to anything relevant to actual language structure and use. Although Penn Treebank metrics are most often brought up for this critique, there are much worse offenders that I will omit in the interests both of not offending by naming without proper arguments and of not spending my whole weekend on this. In summary, a certain metric fetishism arose that still prevails today, for instance in conference reviewing, with the result that interesting models and observations are dismissed unless they improve one of the blessed metrics. Metrics became publishing gatekeepers, easy to apply without thinking, and promoted a kind of p-hacking culture that demeaned explanation and error analysis. Worst, for a practitioner, was that all the metrics are averages, when large deviations are what really matter if you are responsible for a product that should have a very low chance of doing something really bad.
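To make the point about averages concrete, here is a minimal sketch with made-up numbers. The two systems, the per-example cost scale, and the `summarize` helper are all hypothetical, chosen only to show how an average can hide the one failure a product owner cares about.

```python
# Hypothetical per-example error costs for two imaginary systems.
# 0 = correct output, 1 = ordinary mistake, 100 = a really bad failure.
system_a = [1] * 100 + [0] * 900   # 100 ordinary mistakes in 1000 examples
system_b = [100] * 1 + [0] * 999   # a single catastrophic failure in 1000

def summarize(costs):
    """Return (average cost, worst single cost) for a list of per-example costs."""
    return sum(costs) / len(costs), max(costs)

mean_a, worst_a = summarize(system_a)
mean_b, worst_b = summarize(system_b)

# Both averages come out the same; only the worst case tells the systems apart.
print(f"A: mean={mean_a:.3f} worst={worst_a}")
print(f"B: mean={mean_b:.3f} worst={worst_b}")
```

An averaged leaderboard metric would rank these two systems as equal, even though only one of them ever does something really bad.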

Which brings us to the final act.

Act Three: The Invaders get Invaded or, The Revenge of the Spherical Cows

Empiricist practice has a deep flaw that is rarely discussed. At its origins, cryptanalysts were working on sources very restricted in content and vocabulary. After all, it was not so likely that Enigma traffic would have much outside the military order of the day, weather, and the like, or go full Jabberwocky. When you get enough token counts for traffic like that, you pretty much know everything you can know. It's what Harris profoundly noted in the differences between technical languages and general language. Popular tasks of the empiricist era, from ATIS to PTB, were similarly restricted (travel, business news, ...). What this means is that typical count-based empiricist methods do much better on their own benchmarks than in real life. Just try to parse the Web (let alone social media or chat) with a PTB parser to see what I mean.

Where a lot of training data can be collected in the wild — most notably, parallel translation corpora — empiricist practice with enough counts limps along the long tail, although it's touch-and-go when the counts get small, as they always do.

Another way to see this is that empiricist dogma was protected from its own demise by a rather convenient choice of evaluations. Those of us who have worked hard to apply these methods to real data know well the struggle with small token counts, and the dismaying realization that fancier statistical methods (such as latent-variable models) are most often a waste of effort because in practical situations, a flat count-based model (or a linear model) can do as well as can be hoped with the data at hand. Those of us who thought a bit more about this started to realize that token counting and its variants could not generalize effectively across “similar” tokens. We tried many different recipes to alleviate the problem (e.g., class-based language models), but they were all ineffective or computationally infeasible (I know, I co-authored quite a few papers in that mode, including at least a couple of best papers, which goes to show the limited horizons of program committees).
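A toy illustration of the generalization problem, and of the class-based recipe, might look like the sketch below. Everything in it is an assumption made for illustration: the tiny corpus, the hand-assigned `word_class` map (real class-based models induce the classes from data, which is where the computational cost comes in), and the unsmoothed maximum-likelihood estimates. It only shows the idea of sharing counts across "similar" tokens, not that the recipe works well at scale.

```python
from collections import Counter

# Hypothetical toy corpus and hand-assigned word classes (assumptions).
corpus = "fly on monday . fly on monday . leave on friday . arrive tuesday".split()
word_class = {
    "monday": "DAY", "tuesday": "DAY", "friday": "DAY",
    "fly": "VERB", "leave": "VERB", "arrive": "VERB",
    "on": "PREP", ".": "PUNCT",
}
classes = [word_class[w] for w in corpus]

# Word-level counts.
bigrams = Counter(zip(corpus, corpus[1:]))
history = Counter(corpus[:-1])      # how often each word occurs as a left context
unigrams = Counter(corpus)

# Class-level counts.
class_bigrams = Counter(zip(classes, classes[1:]))
class_history = Counter(classes[:-1])
class_unigrams = Counter(classes)

def word_bigram_prob(prev, word):
    """Unsmoothed count-based estimate of p(word | prev)."""
    seen = history[prev]
    return bigrams[(prev, word)] / seen if seen else 0.0

def class_bigram_prob(prev, word):
    """Class-based estimate: p(class(word) | class(prev)) * p(word | class(word))."""
    c_prev, c_word = word_class[prev], word_class[word]
    seen = class_history[c_prev]
    if not seen:
        return 0.0
    p_class = class_bigrams[(c_prev, c_word)] / seen
    p_word = unigrams[word] / class_unigrams[c_word]
    return p_class * p_word

# "on tuesday" never occurs, so raw counting assigns it zero probability,
# while the class model borrows evidence from "on monday" and "on friday".
print(word_bigram_prob("on", "tuesday"))
print(class_bigram_prob("on", "tuesday"))
```

The class model gives the unseen pair a nonzero probability only because someone decided in advance that "tuesday" behaves like "monday"; finding such groupings automatically is the hard part.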

It's here that deep learners carrying warming GPUs descended from the Northern wastes to lay siege to the empiricist (bean) counters and their sacred metrics. First language modeling, then machine translation fell, in no little measure thanks to their ability to learn usage and meaning generalizations much better than counting could. The modularity of NN models made it easier to explore model design space. Recurrent gated models (thank you Hochreiter and Schmidhuber!) managed history in a much more flexible and adaptable way than any of the history-counting tricks of the previous two decades. It was a rout.

The excitement of the advance was irresistible. Researchers involved, experiments, and papers grew fast, in my estimate 4x from 2010 to 2017. Publication venues overflowed, and researchers burning ever more fossil fuel with their GPUs (morality tale warning!) turned to arXiv to plant ever more flags on their marches through newly conquered lands (not the most culturally apt of behaviors, it must be admitted).

But was the invasion so glorious? Very few of the standard tasks have the very large training sets of language modeling or translation that large-scale SGD depends on. Some tasks with carefully created training sets, like parsing, showed significant but not as striking gains with deep learning. There are exciting results in transfer learning (such as the zero-shot translation results), but they rely on starting from models trained on a whole lot of data. However, when we get to tasks for which we have only evaluation data, where count-based models can still do decent work (clustering, generative models), deep learning does not yet have a superior answer.

For continuous outputs, GANs have made a lot of progress. At least, the pictures are stunning. But as I discovered when I worked on distributional clustering, stunning is very much in the eye of the beholder. Proxies like the word association tasks so popular in evaluating word embeddings are almost embarrassingly low-discrimination compared with the size of the models being evaluated. Sensible ways of evaluating GANs for text are even scarcer.

In defense of the Northern hordes, the empiricist burn-it-to-the-ground campaign left little standing that could promote a new way of life. Once the famous empiricist redoubts are conquered or at least laid siege to, how does the campaign continue?

Idea! Let's go back to toy problems where we can create the test conditions easily, like the rationalists did back then (even if we don't realize we are imitating them). After all, Atari is not real life, but it still demonstrates remarkable RL progress. Let's make the Ataris of natural language!

But now the rationalists converted to empiricism (with the extra enthusiasm of the convert) complain bitterly. Not fair, Atari is not real life!

Of course it is not. But neither is PTB, nor any of the standard empiricist tasks, which try strenuously to imitate wild language (their funding depends on it!) but really fail, as Harris predicted back in the 1950s. Or even the best of descriptive linguistics, which leaves in the murk all those messy deviations from the nice combinatorics of the descriptive model.


The mysticism of Mozart's The Magic Flute makes me queasy, and honestly the opera is longer than it should be (at least on an uncomfortable concert hall seat). But the music, and the ultimate message! The main protagonists struggle for and eventually reach enlightenment along their different paths. We are very far from Dann ist die Erd' ein Himmelreich, und Sterbliche den Göttern gleich (thank you neural MT for checking my quotation from the original), but we have been struggling long enough in our own ways to recognize the need for coming together with better ways of plotting our progress.


Chris Brew said...

I like Müller's (briefly) brave traveller from Winterreise a lot more than Schikaneder's archetypes. And the music is just as good, though in a different way.


Fliegt der Schnee mir ins Gesicht,
Schüttl' ich ihn herunter.
Wenn mein Herz im Busen spricht,
Sing' ich hell und munter.

Höre nicht, was es mir sagt,
Habe keine Ohren;
Fühle nicht, was es mir klagt,
Klagen ist für Toren.

Lustig in die Welt hinein
Gegen Wind und Wetter !
Will kein Gott auf Erden sein,
Sind wir selber Götter !

Peter Norvig said...

A decade ago, Ken Church talked about the rationalism/empiricism "pendulum," in his words. I wonder what he would say after another swing of the pendulum over the past decade?

Peter Norvig said...

Ken Church Link:

Unknown said...

Quite excellent sir! I've pondered the varied understandings within humanity (in time, space, and experience) of the word 'apple'. "AI" still feels like interacting with children or the inexperienced. What will it take for people to find technologies (magic?) to be on par with our complexity?

Unknown said...

Beautiful essay, so very true.

Vijay Saraswat said...

You got to get on FB, Fernando! That's where all the conversations are :-).

My response:

Yann LeCun's response to Yoav's post:

Richard Sproat said...

I worry most about the training that students are getting these days. I commented on this issue 8 years ago, in the pre-deep-learning days, when it seemed that the field of NLP was already turning into Applied Machine Learning. Indeed, I have been calling it that ever since. Others (e.g. Kevin Knight) had made similar comments about students who know everything about MaxEnt, but don't know what a noun phrase is. This will presumably only get worse before it gets better.

The easiest way for the pendulum to swing back would be for neural systems to fail to do what they are supposed to do. Take text-to-speech, for example, where a fully neural system might read text in an embarrassing or incomprehensible way. It won't be so easy to repair that by simply providing yet more data, with any kind of guarantee that the problem will be fixed, or that new problems would not be introduced.

Conser said...

What about connecting language to the rest of cognition? Studying it in isolation only goes so far. I think James Allen's work, and the Open Dialogue work at MSR, are two examples of language in contexts that provide good laboratories for making progress. Learning by reading, evaluated via human-normed tests, is another. We evaluate our LbR systems by question-answering as much as possible, since that's ultimately one of the things that one wants to do with language.

Kenneth Church said...

I'm working on a reply in my next opinion piece for NLE. The last one was:

Short answer: we basically agree on the past, but the future is harder to get right. There hasn't been as much movement toward rationalism as I had expected, and teaching has become even more focused on whatever happens to be hot recently, as Richard points out below.

I heard a really interesting interview on NPR recently about strong opinions and content-free debate.

"You know, for me personally, you know, I grew up - I was in high school when this war was going on. I used to have these knock-down, drag-out fights with my dad over the war. I was against it, and he was in favor of it. And neither of us knew enough to have a really strong opinion. But we had them anyway."

The author then goes on to credit his father with teaching him to do his homework and get the facts, and doing so taught him to be successful in his career. It is ok to reject the classics, but not without reading them. Shameless plug for my course at Columbia in the fall: We will read the classics (and maybe reject them).

Unknown said...

Richard: What's a noun phrase? :P

Seriously, Fernando, great essay. I'm grateful you were able to post this as a reasonable account in the face of the heightened emotions both on display and generated by Yoav's screed. It is helpful to take stock of the history of the field, even as we may await its next act.

Speaking of Yoav's screed and Yann LeCun's response, I think there is room for both of them to be right. It is perfectly fine for there to be a bazaar approach to research, and it is fine, and expected, to be highly critical of work that does not pass muster, either in terms of claims, methods or evaluation methodology.

On one particular point you made, Fernando: as I know you're well aware, the bean counter empiricists spent quite a while grappling with the expense of human-labeled data (for tasks other than MT, for which large amounts are available), hoping they would find that holy grail of semi-supervision, where a small amount of expensively obtained data can be combined with an arbitrarily large amount of unlabeled (and therefore cheap) data. Arguably, the deep learning "hordes" have an approach that comes closer than any other in the last few decades to achieving this aim. Transfer learning of this form is perhaps the biggest boon provided by deep learning methods to NLP, and there is clearly much more interesting work to do on the subject.

=CitizenGreek= said...

So, after all, we are faced with philosophical rather than technical problems. Do you think philosophy (or even history of philosophy) should be a central part of MEST curricula?

Barney said...

Great essay, Fernando! I still wonder which rationalist linguistic formalisms can be linked to cognitive mechanisms and which were simply brilliant rationalizations.

I think most of today's empiricists don't know what phenomena were even being modeled. Mining the old formalisms for insights that can be harvested or explained with the newer tools seems like a rich vein of research.

I really liked your example of how LSTM handled memory and long distance dependencies better than the methods computational linguists had been struggling with. I would love to see you write this story up in more detail.