Saturday, February 3, 2007

Why NLP Is A Disruptive Force

Why NLP Is A Disruptive Force:
Fernando is still skeptical about the potential of NLP to play a major role in search.
I'm not skeptical about the potential of NLP. I'm skeptical about the approaches I'm reading and hearing about.
I may be putting words in Fernando's mouth, but I believe the reason he states this is because he is assessing its impact against the standard search interaction (type words in a box, get a list of URLs back). This is missing the point.
I'm not making that assumption. My group at AT&T Labs built one of the earliest Web question-answering systems, back in 1999, which identified interesting entities that might be answers to typed queries, as well as URLs for the pages that contained mentions of those entities. I understand the potential of answering queries with entities and relationships derived not only from text but also from structured data. That's why, for example, I have contributed to the Fable system, which sorts through the genes and proteins mentioned in MEDLINE abstracts responsive to a search query, linking them to standard gene and protein databases.
When one is dealing with language, one is dealing at a higher level of abstraction. Rather than sequences of characters (or tokens - what we might rudely refer to as words) we are dealing with logical symbols. Rather than the primary relationships being before and after (as in this word is before that word) we can capture relationships that are either grammatical (pretty interesting) or semantic (extremely interesting). With this ability to transform simple text into logical representations one has (had to) resolve a lot of ambiguity.
That's exactly where we differ. We do not yet have “this ability to transform simple text into logical representations.” To the extent that our methods reach for that ability, they are very fragile and narrow. Current methods that are robust and general can only rely on shallow distributional properties of words and texts, barely beyond bag-of-words search methods. That's why I believe that NLP successes will arise bottom-up, by improving current search where it can be improved (word-sense disambiguation, recognition of multiword terms, term normalization, recognition of simple entity attributes and relations), and not top-down with a total replacement of current search methods.
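To make the "bottom-up" point concrete, here is a minimal sketch (my own illustration, not from either post) of one of the improvements listed above: layering a simple term-normalization table on top of an ordinary bag-of-words inverted index, so that spelling and naming variants map to a single index term. The normalization table and documents are invented for the example; a real system would need much larger lexicons and morphological rules.

```python
from collections import defaultdict

# Hypothetical normalization table: surface variants -> canonical term.
# Real systems would derive this from lexicons, morphology, and curated synonym lists.
NORMALIZE = {
    "colour": "color",
    "colors": "color",
    "nfkb": "nf-kappa-b",
    "nf-kb": "nf-kappa-b",
}

def normalize(token):
    """Lowercase a token and map it to its canonical form if one is known."""
    token = token.lower()
    return NORMALIZE.get(token, token)

def build_index(docs):
    """Bag-of-words inverted index over normalized terms: term -> set of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

docs = {
    1: "the colour of the sky",
    2: "primary colors in painting",
    3: "NF-kB signalling pathway",
}
index = build_index(docs)

# A query for "color" now matches both the British spelling and the plural,
# and a query for "NFKB" matches the hyphenated variant in doc 3.
print(sorted(index[normalize("color")]))  # -> [1, 2]
print(sorted(index[normalize("NFKB")]))   # -> [3]
```

The point of the sketch is that nothing here requires parsing or semantic interpretation: it is a shallow, robust refinement of existing indexing, which is exactly the kind of incremental improvement argued for above.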
I'm claiming that changes to the back end will enable fundamental changes to how 'results' are served.
The Fable system I mentioned is a simple example of exactly that. A back-end entity tagger and normalizer recognizes mentions of genes and proteins. This is nontrivial because the same gene or protein may be mentioned using a variety of terms and abbreviations. We can aggregate the normalized mentions to rank genes/proteins according to their co-occurrence with search terms, and we can link the mentions to gene and protein databases. We are working to improve the system to support richer queries that find the genes and proteins involved in important biological relationships and processes. However, I do not believe that these particular methods generalize robustly and cost-effectively to more general search. In particular, some of the methods we use rely on supervised machine learning, which requires costly manual annotation of training corpora. That will not scale to the Web in general.
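The aggregation step described above can be sketched in a few lines (this is my toy reconstruction, not Fable's actual code; the tagger output, document IDs, and gene IDs are all made up). Given the documents responsive to a search query and the normalized mentions a back-end tagger has produced for them, ranking genes by mention count is straightforward; the hard, error-prone part is producing the normalized IDs in the first place.

```python
from collections import Counter

# Hypothetical tagger/normalizer output: doc ID -> list of
# (surface mention, canonical gene ID). In a Fable-like system this mapping
# is nontrivial, since one gene may appear under many names and abbreviations.
TAGGED_MENTIONS = {
    "doc1": [("p53", "TP53"), ("TP53", "TP53")],
    "doc2": [("p53", "TP53"), ("BRCA1", "BRCA1")],
    "doc3": [("BRCA-1", "BRCA1")],
}

def rank_genes(responsive_docs):
    """Rank canonical gene IDs by how often they occur in the responsive docs."""
    counts = Counter()
    for doc_id in responsive_docs:
        for _surface, gene_id in TAGGED_MENTIONS.get(doc_id, []):
            counts[gene_id] += 1
    return counts.most_common()

# Suppose docs 1 and 2 are the ones responsive to the user's query.
print(rank_genes(["doc1", "doc2"]))  # -> [('TP53', 3), ('BRCA1', 1)]
```

Note that the two surface forms "p53" and "TP53" are counted together only because the tagger already normalized them; if that supervised component makes mistakes, the aggregated ranking silently inherits them, which is the scaling worry raised above.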

The situation gets much more complicated with “relationships that are either grammatical (pretty interesting) or semantic (extremely interesting).” Current large-coverage parsers make many mistakes (typically one per 10 words in newswire, one per 7 words in biomedical text). It gets much worse for semantic relationships. The best co-reference accuracy I know about is around 80%, that is, one in five references is resolved to the wrong antecedent. These mistakes compound. That is, the proposed back-end would create a lot of trash, which would be difficult to separate from the good stuff. We can do interesting research on these issues, but we are very far from generic, robust, scalable solutions.
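The compounding can be made concrete with a bit of arithmetic. The 1-error-per-10-words and 80% figures come from the paragraph above; the 20-word sentence length and the simplifying assumption that word-level errors are independent are mine.

```python
# Error rates quoted above: roughly 1 parsing mistake per 10 words of newswire,
# and about 80% coreference accuracy (1 in 5 antecedents wrong).
per_word_parse_error = 1 / 10
coref_accuracy = 0.8

# Assume a 20-word sentence and (simplistically) independent word-level errors.
sentence_len = 20
p_clean_parse = (1 - per_word_parse_error) ** sentence_len
print(round(p_clean_parse, 3))  # -> 0.122: only ~12% of sentences parse error-free

# If a downstream semantic step also needs one coreference link resolved
# correctly, the chance of a fully clean analysis drops further.
p_clean_pipeline = p_clean_parse * coref_accuracy
print(round(p_clean_pipeline, 3))  # -> 0.097
```

Even under these rough assumptions, fewer than one sentence in ten survives the pipeline without an error somewhere, which is the "lot of trash" problem in numbers.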
If you change the game (e.g. by changing the way results are provided) then the notion of quality has been disrupted. I'm not sure what the costs are that Fernando is referring to. CPU (e.g. time to process all content)? Response time?
Users have expectations of quality for current search. A “disruptive” search engine that does not meet those expectations while offering something more will be in trouble. As for costs, putting NLP algorithms in the back-end requires a lot more time to process the content and a lot more storage than current indexing methods. In addition, mistakes in back-end processing will be costly: if your parser or semantic interpreter messes up (and they all do, often), you have to reprocess a huge amount of material, which is not easy to do while trying to keep up with the growth of the Web. All of these costs have to be paid for with improved advertising revenue, or some new revenue source. I do not have detailed data to come up with actual estimates, but I would be very surprised if the costs of a search engine with an NLP back-end were less than twice those of a state-of-the-art non-NLP search engine. And that's probably lowballing it a lot.
