Monday, January 1, 2007

Powerset In The New York Times

Powerset In The New York Times: A nice little article summarizing the playing field for novel search going into 2007. (Via DataMining).

It's good to see Barney and his colleagues in the Times. However, I didn't think much of the article. As is unfortunately common in the MSM, there is no substance in the story, except for who invested and how much. What is "natural language search" (NLS), in terms that would make sense to the average reader of the business section of the Times? If current search engines do not use NLS, is it just because they are too fat and distracted? Or are there technical, let alone scientific, reasons for the lack of NLS? The writer missed the opportunity to illustrate the issues and challenges with some concrete examples, for instance some of those that Barney discussed in his blog a while ago.

The fundamental question about NLS is whether the potential gains from a deeper analysis of queries and indexed documents are greater than the losses from getting the analysis wrong. The history of using deeper analysis to improve speech recognition accuracy does not give much cause for optimism. Even after many years of effort, the improvements are modest at best. And language modeling for speech recognition is a pretty simple task compared with answering natural language queries, which may require deeper inference involving a wide range of background knowledge.

A second question is whether users would be happy with NLS. Natural language is what we use to communicate with each other. Our use of natural language involves subtle expectations about our interlocutors, including their ability to talk back intelligently. If the interlocutor doesn't seem to live up to those expectations, we may prefer a simpler, maybe more predictable, mode of communication.

I believe that natural-language processing (NLP) can help improve search even in the short-to-medium term. But those improvements are more likely to be incremental, as for instance when search engines become better able to recognize a wider range of entities and relationships in indexed pages that can be used to answer queries more precisely. As search engines start moving in this direction, users may gain confidence in the effectiveness of richer queries. NLS will be the end result of a long evolution, not something completely designed from the beginning.

Yes, I've been reading yet another book on evolution, Sean B. Carroll's The Making of the Fittest. Computer scientists and software engineers can benefit a lot from reflecting on the evolutionary processes that led to the most complex and adaptable information processing systems known. We still believe too much in upfront design, and not enough in quick search, testing, and selection.

1 comment:

Anonymous said...

Great posts, Fernando. I found your blog via Matt Hurst's posts relating to your comments. After reading backwards through your posts, I found myself wanting to share a similar response that I posted on Matt's blog about NLP systems. So here it is:

"I'd argue that there is a temporal axis to relevancy that is not being considered by NLP nor many other systems (I only know of one that does consider this). The temporal axis would help address finding the answer to a question like "show me the baddest movies of 1950". Had this question been asked in 1950 the answer would have been different than when asked today. The term "bad" has only recently evolved in jargon to mean "good". Hence, in 1950 the response might have correctly illicited movies that were undesirable or poorly made, while today that same query should correctly illicit movies that are really good. However, none of the proposed technologies has a notion of the temporal axis required to provide context to NLP queries. Hence, why I'm somewhat pessimistic about Powerset's hype, well that and the fact that many people have a tough time expressing themselves in writing, so how will an algorithm capture what a person means when the person doesn't have the tools (ability to write properly) to convey this. Call it a UI problem more than an algorithmic one."