Empiricist practice has a deep flaw that is rarely discussed. At its origins, cryptanalysts were working on sources very restricted in content and vocabulary. After all, it was not so likely that Enigma traffic would have much outside the military order of the day, the weather, and the like, or go full Jabberwocky. When you get enough token counts for traffic like that, you pretty much know everything you can know. It's what Harris noted so profoundly about the differences between technical languages and general language. Popular tasks of the empiricist era, from ATIS to PTB, were similarly restricted (travel, business news, ...). What this means is that typical count-based empiricist methods do much better on their own benchmarks than in real life. Just try to parse the Web (let alone social media or chat) with a PTB-trained parser to see what I mean.
Where a lot of training data can be collected in the wild — most notably, parallel translation corpora — empiricist practice with enough counts limps along the long tail, although it's touch-and-go when the counts get small, as they always do.
Another way to see this is that empiricist dogma was protected from its own demise by a rather convenient choice of evaluations. Those of us who have worked hard to apply these methods to real data know well the struggle with small token counts, and the dismaying realization that fancier statistical methods (such as latent-variable models) are most often a waste of effort, because in practical situations a flat count-based or linear model can do as well as can be hoped with the data at hand. Those of us who thought a bit more about this started to realize that token counting and its variants could not generalize effectively across “similar” tokens. We tried many different recipes to alleviate the problem (e.g., class-based language models), but they were all ineffective or computationally infeasible (I know, I co-authored quite a few papers in that mode, including at least a couple of best papers; goes to show the limited horizons of program committees).
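To make the sparsity point concrete, here is a toy sketch of the class-based idea, with made-up words and hand-assigned classes of my own invention (real systems induced the classes distributionally, e.g. with Brown clustering): a raw bigram estimate assigns zero probability to any unseen word pair, while pooling counts over classes lets "dog" borrow statistics from "cat".

```python
from collections import Counter

# Toy illustration of why class-based language models were appealing:
# bigram counts for individual words are sparse, but counts pooled over
# word classes are not.

corpus = "the cat sat on the mat a dog sat on a rug".split()

# Hypothetical hand-assigned classes; real systems induced them from
# distributional statistics.
word_class = {"the": "DET", "a": "DET", "cat": "ANIMAL", "dog": "ANIMAL",
              "sat": "VERB", "on": "PREP", "mat": "SURFACE", "rug": "SURFACE"}

bigrams = Counter(zip(corpus, corpus[1:]))
class_bigrams = Counter((word_class[a], word_class[b]) for a, b in zip(corpus, corpus[1:]))
class_unigrams = Counter(word_class[w] for w in corpus)
word_counts = Counter(corpus)

def p_word_bigram(prev, w):
    # Plain relative-frequency estimate: zero for any unseen word pair.
    return bigrams[(prev, w)] / word_counts[prev]

def p_class_bigram(prev, w):
    # p(w | prev) ~= p(class(w) | class(prev)) * p(w | class(w)):
    # "dog" borrows statistics from "cat" because they share a class.
    cp, cw = word_class[prev], word_class[w]
    p_class = class_bigrams[(cp, cw)] / class_unigrams[cp]
    p_emit = word_counts[w] / class_unigrams[cw]
    return p_class * p_emit

print(p_word_bigram("the", "dog"))   # 0.0: "the dog" never occurs in the toy corpus
print(p_class_bigram("the", "dog"))  # 0.25: the DET -> ANIMAL class pair was seen
```

The catch, of course, was that inducing good classes at scale was computationally heavy, and flat class assignments still missed much of the variation that actually mattered.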
It's here that deep learners carrying warming GPUs descended from the Northern wastes to lay siege to the empiricist (bean) counters and their sacred metrics. First language modeling, then machine translation fell, in no little measure thanks to their ability to learn usage and meaning generalizations much better than counting could. The modularity of NN models made it easier to explore model design space. Recurrent gated models (thank you Hochreiter and Schmidhuber!) managed history in a much more flexible and adaptable way than any of the history-counting tricks of the previous two decades. It was a rout.
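For readers who have not met one, here is a minimal sketch, in PyTorch, of the kind of gated recurrent language model meant here; it is my own illustration of the general idea, not any particular published system. Instead of cutting history off at a fixed n-gram order, the LSTM carries a learned state that summarizes the whole prefix.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer indices.
        # The LSTM state is an adaptive summary of the entire prefix,
        # not a fixed-width window of recent tokens.
        hidden_states, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden_states)   # next-token logits at every position

# Toy usage: predict each next token from all preceding ones.
model = LSTMLanguageModel(vocab_size=10000)
tokens = torch.randint(0, 10000, (2, 20))      # two sequences of 20 token ids
logits = model(tokens[:, :-1])                 # condition on the prefix
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 10000), tokens[:, 1:].reshape(-1))
loss.backward()                                # the usual SGD update would follow
```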
The excitement of the advance was irresistible. The number of researchers involved, experiments run, and papers written grew fast, by my estimate 4x from 2010 to 2017. Publication venues overflowed, and researchers burning ever more fossil fuel with their GPUs (morality tale warning!) turned to arXiv to plant ever more flags on their marches through newly conquered lands (not the most culturally apt of behaviors, it must be admitted).
But was the invasion so glorious? Very few of the standard tasks have the very large training sets of language modeling or translation that large-scale SGD depends on. Some tasks with carefully created training sets, like parsing, showed significant but less striking gains with deep learning. There are exciting results in transfer learning (such as the zero-shot translation results), but they rely on starting from models trained on a whole lot of data. However, when we get to tasks for which we have only evaluation data, where count-based models (clustering, generative models) can still do decent work, deep learning does not yet have a superior answer.
For continuous outputs, GANs have made a lot of progress. At least, the pictures are stunning. But as I discovered when I worked on distributional clustering, stunning is very much in the eye of the beholder. Proxies like the word association tasks so popular in evaluating word embeddings are almost embarrassingly low-discrimination compared with the size of the models being evaluated. Sensible ways of evaluating GANs for text are even scarcer.
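To see what "low-discrimination" means in practice, here is a sketch of the word-association style proxy; the words, ratings, and vectors below are made up, and real benchmarks such as WordSim-353 contain only a few hundred pairs. The whole evaluation reduces to a rank correlation between cosine similarities and human ratings, a handful of numbers probing models with millions of parameters.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
vocab = {"cat": 0, "dog": 1, "car": 2, "truck": 3}
embeddings = rng.normal(size=(len(vocab), 300))        # stand-in for trained vectors

# Hypothetical human similarity judgments on a tiny set of word pairs.
human_ratings = {("cat", "dog"): 7.8, ("car", "truck"): 8.1,
                 ("cat", "car"): 1.2, ("dog", "truck"): 1.5}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(embeddings[vocab[a]], embeddings[vocab[b]])
                for a, b in human_ratings]
correlation, _ = spearmanr(model_scores, list(human_ratings.values()))
print(f"Spearman correlation with human ratings: {correlation:.2f}")
```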
In defense of the Northern hordes, the empiricist burn-it-to-the-ground campaign left little standing that could promote a new way of life. Once the famous empiricist redoubts are conquered or at least laid siege to, how does the campaign continue?
Idea! Let's go back to toy problems where we can create the test conditions easily, like the rationalists did back then (even if we don't realize we are imitating them). After all, Atari is not real life, but it still demonstrates remarkable RL progress. Let's make the Ataris of natural language!
But now the rationalists converted to empiricism (with the extra enthusiasm of the convert) complain bitterly. Not fair, Atari is not real life!
Of course it is not. But neither is PTB, nor any of the standard empiricist tasks, which try strenuously to imitate wild language (their funding depends on it!) but really fail, as Harris predicted back in the 1950s. Or even the best of descriptive linguistics, which leaves in the murk all those messy deviations from the nice combinatorics of the descriptive model.