Tuesday, January 2, 2007

Natural Language Search

Matt Hurst, in Natural Language Search:
Fernando writes clearly about what the article could have been about, though I fear that his expectations for the intersection between journalists, their audience and this particular subject may be too high.

I disagree. If Times science writers like Natalie Angier or Nicholas Wade can write deeply and clearly about the most difficult scientific questions for the Times readership, I see no reason why the same expectations of quality should not apply to technology writing.


Then Matt gets into the substance:


Fernando writes:
The fundamental question about NLS is whether the potential gains from a deeper analysis of queries and indexed documents are greater than the losses from getting the analysis wrong.
I actually disagree with this as being the right question. I see NLP as being a strong contender for changing the utility of the web, and our interfaces into it (a.k.a. search engines), from the discovery of documents to the discovery of knowledge and information. Yes, that will be backed by documents, but they won't be the primary 'result'. For example, when I ask 'who invented the elevator?' I don't mean 'find me documents that, with a high probability, contain text that will answer the question: who invented the elevator?'. I really mean: who invented the elevator?
NLS has the potential to come back with the result: Elisha Graves Otis.

I think that this distinction between "knowledge and information" and "documents" is a red herring, for several reasons:

  • Useful information is information in context. I don't just want to know a so-called fact, I want to know who stated it, where, and how. Your example is exactly the kind that people working in NLQA use all the time, and also the kind that is pretty useless except for trivia games and bad middle-school essays.

  • Even when we want to aggregate information across documents, we want the documents directly accessible to assess provenance and thus the quality of the extracted information. For an example, check out the gene lister on Fable; the sketch after this list illustrates the general idea of keeping provenance attached to an extracted answer.

  • When the information is not explicitly stated in a document, I am very skeptical of current methods for drawing inferences involving information scattered among documents and other sources. There's good research going on here, such as the textual entailment challenges, but it is very far from what we would want to rely on for a practical search engine.
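
To make the provenance point concrete, here is a minimal sketch, in Python, of what a search result that is more than a bare fact might look like: the extracted answer never travels without the passages and documents that back it. All of the names here (Evidence, Answer, best_answer) are hypothetical, not the API of any existing system.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Evidence:
    doc_url: str   # source document the reader can open and judge
    snippet: str   # the passage that states or implies the fact
    score: float   # retrieval/extraction confidence

@dataclass
class Answer:
    text: str                 # the extracted "fact", e.g. "Elisha Graves Otis"
    evidence: List[Evidence]  # provenance: who stated it, where, and how

def best_answer(candidates: List[Tuple[str, Evidence]]) -> Answer:
    """Pick the best-supported candidate answer, keeping every supporting
    passage attached so the result is never a context-free string."""
    grouped: Dict[str, List[Evidence]] = {}
    for text, ev in candidates:
        grouped.setdefault(text, []).append(ev)
    text, evidence = max(grouped.items(),
                         key=lambda kv: sum(e.score for e in kv[1]))
    return Answer(text=text,
                  evidence=sorted(evidence, key=lambda e: -e.score))

# Hypothetical extractor output for "who invented the elevator?"
candidates = [
    ("Elisha Graves Otis",
     Evidence("http://example.org/otis",
              "Elisha Otis demonstrated the safety elevator in 1854.", 0.9)),
    ("Archimedes",
     Evidence("http://example.org/ancient-lifts",
              "Primitive hoists are sometimes attributed to Archimedes.", 0.4)),
]
result = best_answer(candidates)
print(result.text, "backed by", [e.doc_url for e in result.evidence])
```

The only point of the sketch is that an Answer never appears without its Evidence; judging how well the documents actually support the fact is left to the reader, or to a tool like Fable's gene lister.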


Finally, Matt complains rightly about this medium:
Ok - now allow me to complain about something else. I posted about the NYT article and Fernando, I assume, read my post and wrote his. I have now written a follow up and we have all linked to each other nicely. However, consider how annoying it is to follow this 'conversation.' Fernando could have left a comment on my post. I could have left a comment on his (though he actually has them turned off). The fact that there are multiple ways for this discussion to flow and there are no integrated mechanisms for readers (or writers) to tune in to the discussion makes a lie of the whole 'conversations in the blogosphere' proposition. It's been a problem for a long time and is an element of a theme which I think will be important next year - the efficiency of social media.

I didn't leave a comment on Matt's blog for two reasons:

  1. I want my web writing in one place, so that whoever gets my feed can see it.

  2. I dislike the editing environment of blog comment boxes.


I don't allow comments on my new Blogger blog because I have a low opinion of the S/N ratio in blog comments. I totally agree with Matt that this is not good. What I would like is a means for creating unified discussion threads in a distributed fashion. That is, Matt writes, I comment, he comments back, someone else chimes in, all in our own blogs, but a virtual thread is established that can be easily read by anyone.
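
As a thought experiment on what such a distributed thread could look like: if every post carried a stable permalink plus a pointer to the post it replies to (trackbacks and the Atom threading extension already gesture in this direction), any feed reader could reassemble the cross-blog conversation into a single tree. The sketch below is illustrative only; the Post fields and permalinks are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Post:
    url: str                           # permalink on the author's own blog
    author: str
    title: str
    in_reply_to: Optional[str] = None  # permalink of the post being answered

def build_thread(posts: List[Post]) -> Dict[Optional[str], List[Post]]:
    """Index posts by the permalink they reply to, forming a virtual thread."""
    children: Dict[Optional[str], List[Post]] = {}
    for post in posts:
        children.setdefault(post.in_reply_to, []).append(post)
    return children

def print_thread(children: Dict[Optional[str], List[Post]],
                 parent: Optional[str] = None, depth: int = 0) -> None:
    """Walk the tree so the whole cross-blog exchange reads top to bottom."""
    for post in children.get(parent, []):
        print("  " * depth + f"{post.author}: {post.title} <{post.url}>")
        print_thread(children, post.url, depth + 1)

# An exchange like the one described in this post, with made-up permalinks.
posts = [
    Post("http://example.org/matt/nls", "Matt Hurst", "Natural Language Search"),
    Post("http://example.org/fernando/nls", "Fernando Pereira",
         "Natural Language Search", in_reply_to="http://example.org/matt/nls"),
    Post("http://example.org/matt/nls-follow-up", "Matt Hurst", "Follow-up",
         in_reply_to="http://example.org/fernando/nls"),
]
print_thread(build_thread(posts))
```

Nothing here requires a central service: each blog only has to publish its own posts and reply links, which is exactly the kind of distributed mechanism I am asking for.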
