One of the basic paradigms of text mining, and a simple though constraining architecture, is the one-document-at-a-time pipeline. A document comes in, the machinery turns, and results pop out. However, this is limiting. It fails to leverage redundancy - the great antidote to the illusion that perfection is required at every step.
This is a puzzling assertion. Search ranking techniques like TFIDF and PageRank work on the collection as a whole, and exploit redundancy by aggregating term occurrences and links. Current text-mining pipelines look at extracted elements as a whole for reference resolution, for instance. Everyone I know working on these questions is working hard to exploit redundancy as much as possible. However, I still believe what I wrote in a review paper seven years ago:
While language may be redundant with respect to any particular question, and a task-oriented learner may benefit greatly from that redundancy [...], it does not follow that language is redundant with respect to the set of all questions that a language user may need to decide.
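To make the aggregation point concrete, here is a minimal TF-IDF sketch (the toy documents and the particular weighting formula are mine, for illustration): a term's weight in a document grows with its local frequency but is discounted by how many documents in the collection contain it, which is exactly the kind of collection-wide aggregation of term occurrences mentioned above.

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term in each document by term frequency times
    inverse document frequency, aggregated over the whole collection."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc.split()))
    scores = []
    for doc in docs:
        tf = Counter(doc.split())
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

docs = ["data mining over youtube data",
        "youtube video about data mining",
        "speech recognition over data"]
scores = tfidf(docs)
# 'data' occurs in every document, so its idf is log(1) = 0:
# redundancy across the collection discounts uninformative terms.
```

The same discounting is what makes redundancy useful for ranking: terms a query shares with only a few documents carry most of the signal.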
Matt then raises the critical issue of a system's confidence in its answers:
The key to cracking the problem open is the ability to measure, or estimate, the confidence in the results. With this in mind, given 10 different ways in which the same information is presented, one should simply pick the results which are associated with the most confident outcome - and possibly fix the other results in that light.
The fundamental question, then, is whether confidence can be reliably estimated when we are dealing with heavy-tailed distributions. However large the corpus, most questions have very few responsive answers, and estimating confidence from very few instances is problematic. In other words, redundancy is much less effective for very specific queries, which are an important fraction of all queries, and exactly those for which NLP would be most useful if it could be applied reliably. This is also one of the reasons why speaker-independent large-vocabulary speech recognition with current methods is so hard: however big a corpus you have for language modeling, many of the events you care about do not occur often enough to yield reliable probabilities. Dictation systems work because they can adapt to the speech and content of a single user. But search engines have to respond to all comers.
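A back-of-the-envelope calculation shows why few instances make confidence shaky. Under a simple binomial model (my assumption, for illustration; the counts are invented), the relative standard error of a frequency estimate shrinks roughly as one over the square root of the number of occurrences, so rare events stay unreliable no matter how big the corpus:

```python
import math

def relative_std_error(count, total):
    """Relative standard error of the estimate p_hat = count/total,
    using the binomial approximation sqrt(p * (1 - p) / total) / p."""
    p = count / total
    return math.sqrt(p * (1 - p) / total) / p

# A frequent event: 10,000 occurrences in a billion-token corpus.
head = relative_std_error(10_000, 10**9)   # about 1% relative error
# A rare event in the same corpus: 3 occurrences.
tail = relative_std_error(3, 10**9)        # nearly 60% relative error
```

Growing the corpus helps the head of the distribution, but the heavy tail keeps supplying events with only a handful of occurrences.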
And as for the issue of 'understanding every query', this is where the issue of what Barney calls a grunting pidgin language comes in. For example, I saw recently someone landing on this blog via the query - to Google - 'youtube data mining'. As the Google results page suggested, this cannot be 'understood' in the same way that a query like 'data mining over youtube data' can. Does the user want to find out about data mining over YouTube data, or a video on YouTube about data mining?
That's a cute example, but Matt forgets the more likely natural-language query, 'data mining in youtube', which is a nice grammatical noun phrase ambiguous in exactly the way he describes. Language users exploit much shared knowledge and context to understand such ambiguous queries. Even the most optimistic proponents of current NLP methods would be hard-pressed to argue that the query I suggested and its multitude of relatives can be disambiguated reliably by their methods. Sure, you could argue that users will learn to be more careful with their language, as Matt suggests, but all the evidence from the long line of work on natural-language interfaces to databases, from the early 70s to the early 90s, suggests that is not the case. Our knowledge of language is mostly implicit, and it is difficult even for professional linguists to identify all the possible analyses of a complex phrase, let alone all of its possible interpretations in context. That makes it difficult for a user of a language-interpretation system to figure out how to reformulate a query to coax the system toward the corner of semantic space they have in mind — if they can even articulate what they have in mind.
So what's to be done?
- Shallow NLP methods can be effective in recognizing specific types of entities and relationships that can improve search. I mentioned an example from my work before, but a lot more is possible and will be exploited over the next few years. Global inference methods for disambiguation and reference resolution are starting to be quite promising.
- In the medium term, there might be reliable ways to move from 'bags of words' to 'bags of features' that include contextual, syntactic and term-distribution evidence. The rough analog here is BLAST, which allows you to search efficiently for biological sequences that approximately match a pattern of interest, except that the pattern would be a multidimensional representation of a query.
- There are many difficult longer-term research questions in this area, but underlying many of them is the single question of how to do reliable inference with heavy-tailed data. Somehow, we need to be able to look at the data at different scales so that rare events are aggregated into more frequent clusters as needed for particular questions; a single clustering granularity is not enough.
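For the 'bags of features' point above, one minimal sketch (the feature names and documents are invented for illustration) is nearest-neighbour search under cosine similarity over sparse feature vectors that mix lexical, syntactic and contextual evidence - loosely in the spirit of BLAST's approximate matching, but over a multidimensional representation rather than a sequence:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, collection, threshold=0.5):
    """Return (score, doc_id) pairs whose feature vectors approximately
    match the query, best match first."""
    hits = [(cosine(query, feats), doc_id)
            for doc_id, feats in collection.items()]
    return sorted((h for h in hits if h[0] >= threshold), reverse=True)

# Hypothetical features mixing words with syntactic evidence.
collection = {
    "d1": {"word:data": 1.0, "word:mining": 1.0, "head:mining": 1.0},
    "d2": {"word:youtube": 1.0, "word:video": 1.0},
}
query = {"word:data": 1.0, "word:mining": 1.0}
hits = search(query, collection)
```

A real system would need an index to avoid scoring every document, which is exactly the efficiency problem BLAST solves for sequences.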
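And for the last point, one simple way to 'look at the data at different scales' is count-based backoff through a cluster hierarchy: when an event itself is too rare to estimate from, walk up to the first ancestor cluster with enough observations. The hierarchy and counts below are invented for illustration:

```python
def backoff(event, counts, parent, min_count=5):
    """Walk up a cluster hierarchy until a node has at least min_count
    observations; return that node (None if even the root is too rare)."""
    node = event
    while node is not None and counts.get(node, 0) < min_count:
        node = parent.get(node)
    return node

# A rare word backs off to its cluster; a frequent one stands alone.
counts = {"astrolabe": 2, "instrument": 40, "telescope": 100}
parent = {"astrolabe": "instrument", "telescope": "instrument",
          "instrument": None}
coarse = backoff("astrolabe", counts, parent)   # -> "instrument"
fine = backoff("telescope", counts, parent)     # -> "telescope"
```

The granularity is chosen per event and per question, rather than fixing one clustering for the whole collection.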