Here are two quotes from Zellig Harris's Language and Information that I keep coming back to when trying to sort out the confusions of natural language processing (NLP) and search. Discussing language in general:
But natural language has no external metalanguage. We cannot describe the structure of natural language in some other kind of system, for any system in which we could identify the elements and meanings of a given language would have to have already the same essential structure of words and sentences as the language to be described.
Discussing science sublanguages:
Though the sentences of a sublanguage are a subset of the sentences of, say, English, the grammar of the sublanguage is not a subgrammar of English. The sublanguage has important constraints which are not in the language: the particular word subclasses, and the particular sentence types made by these. And the language has important constraints which are not followed in the sublanguage. Of course, since the sentences of the sublanguage are also sentences of the language, they cannot violate the constraints of the language, but they can avoid the conditions that require those constraints. Such are the likelihood differences among arguments in respect to operators; those likelihoods may be largely or totally disregarded in sublanguages. Such also is the internal structure of phrases, which is irrelevant to their membership in a particular word class of a sublanguage (my emphasis).
Recently, we found clear empirical evidence for this last point, and indirect evidence for the more general one, in the failure of several teams to achieve significant domain adaptation from newswire parsing to the parsing of biochemistry abstracts.
In general, discussions of natural language processing in search fail to distinguish between search in general text material and search in narrow technical domains. Both rule-based and statistical methods perform very differently in the two kinds of search, and the reason is implicit in Harris's analysis: the very different distributional properties of general language and technical sublanguages.
Although Harris did not put it this way, the distributions in sublanguages are very sharp and light-tailed, while in general language they are heavy-tailed (Zipfian). Both manual lexicon and rule construction and most of the machine learning methods applied to text fail to capture the long tail of general text. The paradoxical effect is that "deeper" analysis leads to more errors, because analysis systems are overconfident in their analyses and in the resulting classifications or rankings.
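To make the contrast concrete, here is a minimal sketch (an illustration of the distributional point, not anything from Harris) that counts word frequencies in a plain-text corpus file and reports how much of the vocabulary sits in the tail. The file path is a placeholder; on a large general-English corpus this shows a Zipfian heavy tail, while a narrow technical corpus shows a much lighter one.

```python
import re
import sys
from collections import Counter

def rank_frequency(path):
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"[a-z]+", f.read().lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    hapaxes = sum(1 for c in counts.values() if c == 1)
    print(f"tokens: {total}, distinct types: {len(counts)}")
    # In general English text, roughly half the distinct types occur
    # only once, however much text you add; the tail never closes.
    print(f"types occurring once (the tail): {hapaxes / len(counts):.1%}")
    for rank, (word, count) in enumerate(counts.most_common(10), start=1):
        # Zipf: rank * frequency is roughly constant at the head.
        print(f"{rank:2d} {word:15s} {count:8d} rank*freq={rank * count}")

if __name__ == "__main__":
    rank_frequency(sys.argv[1])  # e.g. python zipf.py corpus.txt
```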
In contrast, in technical sublanguages there is hope that both rule-based and machine learning methods can achieve very high coverage. Additional resources, such as reference-book tables of contents, thesauri, and other hierarchical classifications, provide relatively stable side information to help the automation. Recently, I had the opportunity to spend some time with Peter Jackson and his colleagues at Thomson and to see some of the impressive results they have achieved in large-scale automatic classification of legal documents and in document recommendation. The law is very interesting in that it has a very technical core but connects to just about every area of human activity, and thus to a wide range of language. However, Harris's distributional observations still apply to the technical core, and can be exploited by skilled language engineers to achieve much better accuracy than would be possible with the same methods on general text.
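As a toy illustration of that kind of side information, the sketch below uses an invented three-entry thesaurus (placeholders, not any real legal resource) to map raw tokens to stable thesaurus heads before feature counting. This is one simple way a hierarchical resource can anchor classification in a sublanguage, where the vocabulary is sharp enough for such mappings to have high coverage.

```python
from collections import Counter

THESAURUS = {
    # head term      variants observed in documents (invented examples)
    "contract":     {"contract", "contracts", "agreement", "covenant"},
    "negligence":   {"negligence", "negligent", "carelessness"},
    "damages":      {"damages", "compensation", "restitution"},
}
HEAD_OF = {v: head for head, variants in THESAURUS.items() for v in variants}

def features(text):
    """Replace each known variant with its thesaurus head; drop the rest."""
    return Counter(HEAD_OF[t] for t in text.lower().split() if t in HEAD_OF)

print(features("The agreement was void and restitution was ordered"))
# Counter({'contract': 1, 'damages': 1})
```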
More speculatively, the long tail in general language may have a lot to do with the statistical properties of the graph of relationships among words. Harris again:
At what point do words get meaning? One should first note something that may not be immediately obvious, and that is that meanings do not suffice to identify words. They can give a property to words that are already identified, but they don't identify words. Another way of saying this is that, as everybody who has used Roget's Thesaurus knows, there is no usable classification and structure of meanings per se, such that we could assign the words of a given language to an a priori organization of meanings. Meanings over the whole scope of language cannot be arranged independently of the stock of words and their sentential relations. They can be set up independently only for kinship relations, for numbers, and for some other strictly organized parts of the perceived world.
Rule-based and parametric machine learning methods in NLP are based on the assumption that language can be "carved at the joints" and reduced to the free combination of a relatively small (compared to the number of distinct tokens) number of factors. Although David Weinberger in Everything is Miscellaneous does not write about NLP, his arguments are directly applicable here. Going further, to the extent that general search works, it is because it is non-parametric: the ranking of documents in response to a query is mostly determined by the particular terms in the query and documents and their distributions, not by some abstract parametric model of ranking. If and when we can do machine learning and NLP this way accurately and efficiently, we may have a real hope of changing general search significantly. In the meanwhile, our parametric methods have a good chance in sublanguages that matter, like law or biomedicine. The work I mentioned above already demonstrates this.
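To make the non-parametric point concrete, here is a minimal sketch of term-based ranking with plain TF-IDF over a toy corpus (the documents and query are invented stand-ins). The score is computed directly from the term statistics of the collection at hand, with no fitted model parameters.

```python
import math
from collections import Counter

docs = [
    "the court held that the contract was void",
    "the enzyme binds the receptor protein",
    "the contract specifies delivery of the protein samples",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))

def score(query, doc):
    tf = Counter(doc)
    # Sum over query terms: term frequency weighted by inverse document frequency.
    return sum(tf[t] * math.log(N / df[t]) for t in query.split() if t in df)

query = "contract protein"
ranked = sorted(range(N), key=lambda i: score(query, tokenized[i]), reverse=True)
for i in ranked:
    print(f"{score(query, tokenized[i]):.3f}  {docs[i]}")
```

Nothing here is learned: change the corpus and the ranking changes with it, which is exactly the sense in which term-based retrieval is non-parametric.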