Saturday, June 10, 2017

A (computational) linguistic farce in three acts

Prologue

I had not blogged for three years. Many plausible excuses, but the big reason is that it is easier to dash off a tweet or a short incidental social media post than to structure a complex argument, which uses mental resources that I need full time at work. But the argument about deep learning, natural language, publication styles, and venues that Yoav Goldberg posted on Medium reminded me of something that one day (not today) I would like to get to: the complicated, sometimes hazy, often contentious history of the science and engineering of language as a computational process (I know, I know, even putting it that way could trigger many social scientists and philosophers, but this is just a blog post, not a treatise).

I call this a farce not in a derogatory way, but for its many misunderstandings and pratfalls, in the best traditions of comedic theatre, opera, and silent movie. Let whoever has not stepped on a rhetorical rake in heated academic discussion cast the first water balloon. After all, these debates issued from the very serious work of intellectual giants of the 50s and early 60s: Kleene, Shannon, Harris, McCulloch, McCarthy, Minsky, Chomsky, Miller, ... One day I'd love to see a careful, thoughtful intellectual history of the origins of AI in general and of the computational turn in language in particular, but we don't have one, so I'm free to make up my own comedic version.

Act One: The (Weak) Empire of Reason

Much of the work on computational models of language and language processing until the 80s was based on an implicit or explicit hope that relatively simple algorithms would capture much of what mattered. Researchers (including me) created models and algorithms that claimed to capture the “essential” phenomena in a modular, compositional way. Once that was done, practical applications would follow easily, since the nice combinatorics of compositionality would cover the infinitely many ways people express meaning.

That was nice, but there was the nagging problem that none of those models or systems could parse, let alone usefully interpret, most of the language occurring in the wild. Even back then, artificial neural network fans argued that those crisp formal models of language failed because they did not have enough “flex” at their joints. That led to some epic food fights, but the reality is that the NN models and algorithms, and especially the puny computers and datasets we played with back then, could not even match those carefully handcrafted rule systems.

One (temporary) escape from this mismatch between models and actual language was to turn ourselves into formal linguists (I did that too) and argue that we were using computational tools to investigate the core of language, leaving that wild mess of actual language for later decades, when we'd have finally dug up the keys to the treasure house. This was a nice detour for both symbolic and neural-network researchers, and it had a not totally unreasonable methodological defense in that, say, physicists also investigated simplified systems (oh, that physics envy!). Of course, this sidestepped the uncomfortable feeling that language, as an evolved biological and social phenomenon, might not have a simple description at all. Incidentally, there's a bit of a parallel here with how biology and biomedical research went on a “simplicity” trip after the discovery of the genetic code (and again later after the sequencing of the human genome), only to keep being foiled by the daunting mess that evolution has left us. Nevertheless, I'd still argue that some of the descriptive models of language developed then capture certain combinatorial possibilities of actual language at a level of detail that has not been bested. That's mainly a story for another time, except that the lure of simplified settings and models comes back in Act Three.

The field was very small back then. Everyone knew everyone, even those who might despise each other's work and say so loudly in ACL question periods. As a result, a few powerful arbiters of research taste set the tone for each sub-community. Combined with the limited means of research circulation at the time, that led to small, cohesive cliques. When such a group captured control of research resources (funding, plum academic or industrial roles), as did happen, alternative ideas did not have much room to grow.

Act Two: The Empiricist Invasion or, Who Pays the Piper Calls the Tune

The not insignificant research funding that computational research on language had received from the late 70s to the late 80s, combined with changes in the research funding climate (a whole interesting story in itself, but too long and twisted to go into here), created an opening for bold invaders to convince funders that the Emperor of Reason had been committing research in the altogether.

The empiricist invaders were in their way heirs to Shannon, Turing, Kullback, and I.J. Good, who had been plying an effective if secretive trade at IDA and later at IBM and Bell Labs, looking at speech recognition and translation as cryptanalysis problems. (The history of the road from Bletchley Park to HMMs to IBM Model 2 is still buried in the murk of not fully declassified materials, but it would be awesome to write; I just found this about the early steps, which could be a lot of fun.) They convinced funders, especially at DARPA, that the rationalist empire was hollow and that statistical metrics on (supposedly) realistic tasks were needed to drive computational language work to practical success, as had been happening in speech recognition (although by the light of today, that speech recognition progress was less impressive than it seemed at the time). It did not hurt the campaign that many of the invaders were closer to the DoD in their backgrounds, careers, and outlooks than egghead computational linguists (another story that could be expanded, but might make some uncomfortable). Anyway, I was there in meetings where the empiricist invaders, allied with funders, increasingly laid down the new rules of the game. As in the Norman invasion of England, a whole new vocabulary took over quickly with the new aristocracy.

In hindsight, the campaign of 1987-89 and the resulting new order were quite entertaining (even if they did not seem so to the invaded at the time) and brought new cultural devices that were objectively more effective in defining measurable progress, if quite stressful for funding recipients. Personally, I had already started my own journey from somewhat skeptical rationalism to somewhat skeptical empiricism, and would leave the Government-funded research world for the next 12 years, so the conflict was a great opportunity to develop a more distanced view of both the old and the new culture.

The empiricist ascendancy was fortunate (or prescient) in riding the growth of computing resources and text data that also enabled the Web explosion, and with it a flood of new resources for research funding, software development, and corpus creation. The metrics religion helped funders sell progress to the holders of the purse strings, and there were real (if not as extensive as sometimes claimed) practical benefits, especially in speech recognition and machine translation. As a result, the research community grew a lot (my top-of-the-head estimate is around 5x from 1990 to 2010).

One curious byproduct of the empiricist ascendancy that is relevant to the present conflict is that measurement became a virtue in itself, sometimes quite independently of what was really being measured. Many empiricist true believers just want the numbers, regardless of whether they correspond to anything relevant to actual language structure and use. Although Penn Treebank metrics are most often brought up for this critique, there are much worse offenders that I will omit, in the interests both of not offending by naming without proper arguments and of not spending my whole weekend on this. In summary, a certain metric fetishism arose that still prevails today, for instance in conference reviewing, with the result that interesting models and observations are dismissed unless they improve one of the blessed metrics. Metrics became publishing gatekeepers, easy to apply without thinking, and promoted a kind of p-hacking culture that demeaned explanation and error analysis. Worst of all, for a practitioner, is that all the metrics are averages, when large deviations are what really matter if you are responsible for a product that should have a very low chance of doing something really bad.
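To make the averaging point concrete, here is a toy sketch in Python; every number below is invented, not drawn from any real system. Two hypothetical systems share essentially the same mean per-example error, but one of them hides a small fraction of catastrophic failures that only a look at the tail reveals.

```python
# Toy illustration (made-up numbers): two hypothetical systems with the same
# average error but very different tails. A leaderboard that only reports the
# mean cannot tell them apart; a product owner worried about rare catastrophic
# outputs very much can.
import random

random.seed(0)

# System A: consistently mediocre; per-example error near 0.10.
errors_a = [random.gauss(0.10, 0.02) for _ in range(10_000)]

# System B: usually much better, but 1% of the time it fails badly.
errors_b = [random.gauss(0.05, 0.01) if random.random() > 0.01 else 5.0
            for _ in range(10_000)]

def mean(xs):
    return sum(xs) / len(xs)

def percentile(xs, p):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]

for name, errs in [("A", errors_a), ("B", errors_b)]:
    print(name,
          f"mean={mean(errs):.3f}",
          f"p99={percentile(errs, 99):.3f}",
          f"worst={max(errs):.3f}")
```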

Which brings us to the final act.

Act Three: The Invaders get Invaded or, The Revenge of the Spherical Cows

Empiricist practice has a deep flaw that is rarely discussed. At its origins, cryptanalysts were working on sources very restricted in content and vocabulary. After all, it was not so likely that Enigma traffic would have much outside the military order of the day, the weather, and the like, or go full Jabberwocky. When you get enough token counts for traffic like that, you pretty much know everything you can know. It's what Harris so profoundly noted in the differences between technical languages and general language. Popular tasks of the empiricist era, from ATIS to PTB, were similarly restricted (travel, business news, ...). What this means is that typical count-based empiricist methods do much better on their own benchmarks than in real life. Just try to parse the Web (let alone social media or chat) with a PTB parser to see what I mean.
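Here is a minimal sketch of that mismatch, using a few invented ATIS-flavored snippets as stand-ins for real corpora: a vocabulary (and hence a count table) harvested from the benchmark domain covers held-out benchmark text reasonably well and collapses on text from the wild.

```python
# Minimal sketch of the domain-mismatch problem for count-based models:
# the vocabulary learned from a narrow "benchmark" domain covers held-out text
# from that domain far better than text from the wild. The corpora here are
# tiny made-up stand-ins; with real data the same script would just read files.

benchmark_train = """
the flight from boston to denver departs at noon
show me fares from boston to denver on tuesday
list all flights from denver to boston after five pm
""".split()

benchmark_test = "show me flights from boston to denver on friday".split()

wild_test = "lol that parser totally faceplanted on my group chat smh".split()

vocab = set(benchmark_train)

def oov_rate(tokens, vocab):
    """Fraction of tokens never seen in training: a pure count-based model
    has no counts for them, hence nothing to say about them."""
    unseen = sum(1 for t in tokens if t not in vocab)
    return unseen / len(tokens)

print("in-domain OOV rate:     ", oov_rate(benchmark_test, vocab))
print("out-of-domain OOV rate: ", oov_rate(wild_test, vocab))
```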

Where a lot of training data can be collected in the wild — most notably, parallel translation corpora — empiricist practice with enough counts limps along the long tail, although it's touch-and-go when the counts get small, as they always do.

Another way to see this is that empiricist dogma was protected from its own demise by a rather convenient choice of evaluations. Those of us who have worked hard to apply these methods to real data know well the struggle with small token counts, and the dismaying realization that fancier statistical methods (such as latent-variable models) are most often a waste of effort, because in practical situations a flat count-based model (or a linear model) can do as well as can be hoped with the data at hand. Those of us who thought a bit more about this started to realize that token counting and its variants could not generalize effectively across “similar” tokens. We tried many different recipes to alleviate the problem (e.g., class-based language models), but they were all ineffective or computationally infeasible (I know, I co-authored quite a few papers in that mode, including at least a couple of best papers, which goes to show the limited horizons of program committees).
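For readers who never met them, here is a toy sketch of the class-based idea (in the spirit of Brown et al.'s class-based n-gram models), with a hand-built word-to-class map standing in for induced classes: the model can assign probability to a word bigram it has never seen, provided the corresponding class bigram has counts.

```python
# Toy sketch of a class-based bigram model: back off from word pairs to class
# pairs, so a never-seen word bigram still gets probability when its class
# bigram has counts. Classes are hand-assigned here purely for illustration;
# real systems had to induce them, which is where much of the cost lived.
from collections import Counter

corpus = "he drove to boston she flew to denver he flew to chicago".split()

word_class = {
    "he": "PRON", "she": "PRON",
    "drove": "VERB", "flew": "VERB",
    "to": "PREP",
    "boston": "CITY", "denver": "CITY", "chicago": "CITY",
}

classes = [word_class[w] for w in corpus]
class_bigram = Counter(zip(classes, classes[1:]))   # counts of c_{i-1} -> c_i
class_history = Counter(classes[:-1])               # how often each class is a history
word_count = Counter(corpus)
class_count = Counter(classes)

def prob(prev_word, word):
    """p(word | prev_word) ~= p(class(word) | class(prev_word)) * p(word | class(word))"""
    c_prev, c_w = word_class[prev_word], word_class[word]
    p_cc = class_bigram[(c_prev, c_w)] / class_history[c_prev]
    p_wc = word_count[word] / class_count[c_w]
    return p_cc * p_wc

word_bigram = Counter(zip(corpus, corpus[1:]))
print("count('she drove'):", word_bigram[("she", "drove")])   # 0: never seen
print("p(drove | she):    ", prob("she", "drove"))             # > 0 via classes
```

The sharing is real but coarse: every word in a class is smeared together, and, as the text says, inducing good classes at useful scale was where these recipes turned ineffective or computationally infeasible.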

It's here that deep learners carrying warming GPUs descended from the Northern wastes to lay siege to the empiricist (bean) counters and their sacred metrics. First language modeling, then machine translation fell, in no small measure thanks to their ability to learn usage and meaning generalizations much better than counting could. The modularity of NN models made it easier to explore the model design space. Recurrent gated models (thank you, Hochreiter and Schmidhuber!) managed history in a much more flexible and adaptable way than any of the history-counting tricks of the previous two decades. It was a rout.
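For the curious, here is a stripped-down sketch of the gated recurrent update that parenthetical alludes to (GRU-flavored rather than the original LSTM, with random, untrained weights): instead of a fixed window of counted history, gates decide at every step, per dimension, how much of the running state to keep and how much to overwrite.

```python
# Stripped-down gated recurrent update (GRU-flavored). Weights are random and
# untrained; real models learn them. The point is the mechanism: gates blend
# the old state with a candidate, so "history" is a learned, adaptive summary
# rather than a fixed n-token window of counts.
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden/input size

Wz, Uz = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # update gate
Wr, Ur = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # reset gate
Wh, Uh = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # candidate state

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x):
    z = sigmoid(Wz @ x + Uz @ h)           # how much new information to let in
    r = sigmoid(Wr @ x + Ur @ h)           # how much old state feeds the candidate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde        # interpolate old state and candidate

h = np.zeros(d)
for x in rng.normal(size=(5, d)):           # five toy "word vectors"
    h = gru_step(h, x)
print(h)
```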

The excitement of the advance was irresistible. The numbers of researchers, experiments, and papers grew fast, by my estimate 4x from 2010 to 2017. Publication venues overflowed, and researchers burning ever more fossil fuel with their GPUs (morality tale warning!) turned to arXiv to plant ever more flags on their marches through newly conquered lands (not the most culturally apt of behaviors, it must be admitted).

But was the invasion so glorious? Very few of the standard tasks have the very large training sets of language modeling or translation that large-scale SGD depends on. Some tasks with carefully created training sets, like parsing, showed significant, but not as striking, gains with deep learning. There are exciting results in transfer learning (such as the zero-shot translation results), but they rely on starting from models trained on a whole lot of data. However, when we get to tasks for which we have only evaluation data, where count-based models can still do decent work (clustering, generative models), deep learning does not yet have a superior answer.

For continuous outputs, GANs have made a lot of progress. At least, the pictures are stunning. But as I discovered when I worked on distributional clustering, stunning is very much in the eye of the beholder. Proxies like the word-association tasks so popular for evaluating word embeddings have almost embarrassingly low discrimination compared with the size of the models being evaluated. Sensible ways of evaluating GANs for text are even scarcer.
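For concreteness, here is a sketch of the word-association style of evaluation I have in mind: score word pairs by cosine similarity of their embeddings and rank-correlate the scores with human judgments. Everything below (the embeddings, the pairs, the "human" ratings) is made up; the point is how few bits such a test extracts from a model with millions of parameters.

```python
# Sketch of word-association evaluation for embeddings: cosine similarity on a
# handful of word pairs, Spearman-correlated with human ratings. All data here
# is invented; random vectors stand in for trained embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "car", "truck", "banana"]
emb = {w: rng.normal(size=50) for w in vocab}    # stand-in for trained embeddings

pairs = [("cat", "dog"), ("car", "truck"), ("cat", "banana"), ("dog", "truck")]
human = [8.5, 8.0, 1.5, 2.0]                     # invented similarity judgments

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

model = [cosine(emb[a], emb[b]) for a, b in pairs]
rh, rm = ranks(human), ranks(model)
n = len(pairs)
# Spearman rank correlation (no ties), on just four numbers.
spearman = 1 - 6 * sum((a - b) ** 2 for a, b in zip(rh, rm)) / (n * (n ** 2 - 1))
print("Spearman correlation on", n, "pairs:", spearman)
```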

In defense of the Northern hordes, the empiricist burn-it-to-the-ground campaign left little standing that could promote a new way of life. Once the famous empiricist redoubts are conquered or at least laid siege to, how does the campaign continue?

Idea! Let's go back to toy problems where we can create the test conditions easily, like the rationalists did back then (even if we don't realize we are imitating them). After all, Atari is not real life, but it still demonstrates remarkable RL progress. Let's make the Ataris of natural language!

But now the rationalists who converted to empiricism (with the extra enthusiasm of the convert) complain bitterly: not fair, Atari is not real life!

Of course it is not. But neither is PTB, nor any of the standard empiricist tasks, which try strenuously to imitate wild language (their funding depends on it!) but really fail, as Harris predicted back in the 1950s. Nor even the best of descriptive linguistics, which leaves in the murk all those messy deviations from the nice combinatorics of the descriptive model.

Epilogue

The mysticism of Mozart's The Magic Flute makes me queasy, and honestly the opera is longer than it should be (at least from an uncomfortable concert hall seat). But the music, and the ultimate message! The main protagonists struggle for and eventually reach enlightenment along their different paths. We are very far from Dann ist die Erd' ein Himmelreich, und Sterbliche den Göttern gleich (roughly, "then the earth is a kingdom of heaven, and mortals are like the gods"; thank you, neural MT, for checking my quotation against the original), but we have been struggling long enough in our own ways to recognize the need for coming together with better ways of plotting our progress.