Thursday, February 15, 2007

Back To The Future: NLP, Search, Google and Powerset

Battelle then goes on to speculate about how these capabilities might surface in the Google UI. The last sentence in the above quote seems so close - at least in terms of vision - to some of the current wave of NLP search debate that it provokes the question: what happened to this project? Did Google try and fail? If you read it closely, you'll see that Norvig is talking about some key NLP concepts:
  • Entities (typed concepts expressed in short spans of text, generally noun phrases)
  • Ontologies (e.g., Java IS_A programming language)
  • Relationships (between entities)
I mean - couldn't you build a next gen search engine on such wonderful ideas?
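
To make those three bullets concrete, here is a minimal, purely illustrative sketch in Python of how entities, a toy ontology, and relations between entities might be represented as data. The class names, the IS_A links, and the example facts are all made up for the illustration; they are not how any actual search engine stores things.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Entity:
        text: str      # the span as it appeared in the document, usually a noun phrase
        etype: str     # entity type, e.g. "ProgrammingLanguage"

    @dataclass(frozen=True)
    class Relation:
        subject: Entity
        predicate: str  # e.g. "IS_A"
        obj: Entity

    # A toy ontology: IS_A links from more specific to more general types.
    ontology = {"ProgrammingLanguage": "Language", "Language": "Thing"}

    java = Entity("Java", "ProgrammingLanguage")
    pl = Entity("programming language", "Concept")
    facts = [Relation(java, "IS_A", pl)]

    def ancestors(etype):
        """Follow IS_A links upward through the toy ontology."""
        while etype in ontology:
            etype = ontology[etype]
            yield etype

    print(list(ancestors("ProgrammingLanguage")))  # ['Language', 'Thing']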

I have no privileged information on what any search engine is doing in this area. But I've been doing research on entity extraction, parsing, and some relation extraction for the last eight years. It is very hard to create general, robust entity extractors. The most accurate current methods depend on labeled training data, and do not transfer well to domains different from those in which they were trained. For instance, a gene/protein extractor that performs at a state-of-the-art level on biomedical abstracts does terribly on biomedical patents (we did the experiment). Methods that do not depend on labeled data are less domain dependent, but do not do very well on anything.
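
To give a feel for that domain gap, here is a deliberately crude, runnable caricature (all data invented): a "supervised" extractor that simply memorizes the entity strings seen in labeled training data looks perfect in-domain and falls apart on a new domain whose entities it has never seen. Real extractors use features and statistical models rather than memorization, but they suffer a milder version of the same failure.

    # Invented toy data: sentences paired with the gene/protein names they mention.
    train_abstracts = [("BRCA1 mutations are common", {"BRCA1"}),
                       ("p53 regulates apoptosis", {"p53"})]
    test_patents = [("claims a method involving HER2", {"HER2"}),
                    ("wherein p53 is inhibited", {"p53"})]

    def train(labeled):
        # "Training" here is just memorizing every entity string in the labeled data.
        return {e for _, ents in labeled for e in ents}

    def extract(model, text):
        return {tok for tok in text.split() if tok in model}

    def recall(model, labeled):
        found = sum(len(extract(model, text) & ents) for text, ents in labeled)
        total = sum(len(ents) for _, ents in labeled)
        return found / total

    model = train(train_abstracts)
    print(recall(model, train_abstracts))  # 1.0 in-domain
    print(recall(model, test_patents))     # 0.5 out-of-domain: HER2 was never seen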

Matt has proposed before that redundancy - the same information presented many times - can be exploited to correct the mistakes of individual components. I noted that this argument ignores the long tail. We cannot recover from mistakes on entities that occur just a few times, but unfortunately those are often the most important entities, particularly in technical domains. The common entities and facts are already known; it's the long tail of rarer, less-known ones that could lead to a new hypothesis.
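
The redundancy argument, and where it breaks down, fits in a few lines (all names and data invented): voting over many noisy copies of the same extraction washes out individual errors, but an entity mentioned only once gets no such protection.

    from collections import Counter

    def vote(extractions):
        """Keep the most frequent surface form among noisy extractions of one entity."""
        return Counter(extractions).most_common(1)[0][0]

    # A head entity: mentioned many times, so a couple of extraction errors get outvoted.
    head = ["Google", "Google", "Gooogle", "Google", "Googel", "Google"]
    print(vote(head))  # "Google" - redundancy corrects the mistakes

    # A tail entity: mentioned once, and that one extraction happens to be wrong.
    tail = ["BRCA-l"]  # garbled form of "BRCA-1"; there is nothing to vote it down
    print(vote(tail))  # "BRCA-l" - the error simply survives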

I don't believe that there is a secret helicopter to lift us all over the long uphill slog of experimenting with incrementally better extraction methods, occasionally discovering a surprising nugget, such as a method for adapting extractors between domains more effectively. This is how speech recognition grew from a lab curiosity to a practically useful if still limited technology, and I see no reason why Matt's three bullets above should be any different.

I am enthusiastic about research in this area, because I've seen significant progress in the last ten years. But I'm not convinced that the current methods are nearly general enough for a "next gen search engine." Matt's three bullets are not yet "wonderful ideas" ready for deployment, just "promising research areas." We should not forget the "AI winter" of 20 years ago that followed much over-promising and under-delivering, and quite a bit of investor disappointment.
