Some five years ago, I was talking with George Doddington at a Human Language Technology (HLT) meeting in Arden House about the difficulty of building accurate, robust natural-language relation extraction systems, even for limited sets of relations. I commented that the problem is that the input-output function implemented by such a system is not ``naturally'' observed. I meant that (almost) no one does relation extraction for a living and writes down the result. We all read text and take notes, and some of those notes are about relations expressed in the text. But there are no large collections of text paired with all the relations expressed in the text, or even all the relations from a prescribed set. Indeed, one might wonder if the function from texts to sets of expressed relations ``exists'' at all, in the sense that it is not obvious that people ever perform such a function in their heads, except as an evanescent step in their interpretation of and response to language.
In contrast, parallel corpora consisting of a text and its translation are widely available, simply because translation is something that is done anyway for a practical purpose, independently of any machine translation effort. Similarly, parallel corpora of spoken language and its transcription are created for a variety of practical purposes, from helping the deaf to court records. That is, the input-output function implemented by a machine translation or speech recognition system is explicitly available to us by example, and those examples can be used to validate proposed implementations or to train implementations with machine learning methods.
Anybody who has ever been involved in efforts to annotate text with syntactic or semantic information for use as ``ground truth'' in NLP work knows how difficult it is to get consistent, accurate results even from skilled human annotators. The problem is that those annotations are not ``natural.'' They are theoretical constructs. Annotators need to be instructed in annotation with instruction manuals many pages long. Even then, many unforeseen situations arise in which reasonable annotators can differ greatly. That's one reason why it has been so difficult to develop usable relation extraction systems. If people can only agree on extracted relations 70% of the time, how can we expect people and programs to agree more often? A short paragraph may specify some 20 relations, and 0.7^20 ≈ 0.0008, that is, a vanishingly small chance of getting them all correct. That's not because people cannot agree on the meaning of the paragraph left to their own devices, but because the annotation procedure cannot be specified precisely enough to be reproduced faithfully by multiple annotators. If it could, then we could just as well use it as the specification for a program. The same problem arises in natural language search. There are no extensive records of the intended input-output function.
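The arithmetic behind that pessimistic estimate is easy to check. A minimal sketch, assuming each of the paragraph's relations is agreed on independently with the same probability:

```python
# Probability that two annotators agree on every relation in a paragraph,
# assuming each of n relations is agreed on independently with probability p.
# The independence assumption is a simplification for illustration.
def full_agreement_probability(p: float, n: int) -> float:
    return p ** n

print(full_agreement_probability(0.7, 20))  # ~0.0008
```

Even raising per-relation agreement to 90% only brings full agreement to about 12% (0.9^20 ≈ 0.12), which underlines how demanding whole-paragraph consistency is.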
PARC's methods of language analysis and interpretation are the best I know of in their class. But they suffer from the critical limitation that their output is a theoretical construct and not something observable. Without a plentiful supply of input-output examples, it is extremely difficult to judge the accuracy of the method across a broad range of inputs, let alone use machine learning methods to learn how to choose among possible outputs.
The situation is different for keyword-based search, because the input-output function is in a sense trivial: return all the documents containing the query terms. The only subtlety, though a very important one, is the order in which to return documents, so that the most relevant are returned first. Relevance judgments are relatively easy to elicit in bulk, compared with trying to figure out whether an entity or relation is an appropriate answer to a natural language question across a wide range of domains.
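That trivial input-output function can be written down directly. The toy below is only an illustration of the idea, not any real engine's implementation; the documents and the term-frequency scoring are invented for the example:

```python
# Toy keyword search: return all documents containing every query term,
# ordered by a crude relevance score (here, total term frequency).
# Documents and scoring are invented for illustration only.
def search(query, documents):
    terms = query.lower().split()
    results = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        # Keep only documents containing every query term.
        if all(t in words for t in terms):
            score = sum(words.count(t) for t in terms)
            results.append((doc_id, score))
    # Most relevant (highest score) first.
    return [doc_id for doc_id, _ in sorted(results, key=lambda r: -r[1])]

docs = {
    "a": "machine translation of text",
    "b": "translation notes on translation",
    "c": "speech recognition",
}
print(search("translation", docs))  # ['b', 'a']
```

The hard, human-dependent part is exactly what this sketch fakes with term frequency: deciding which matching documents are actually most relevant.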
The availability of large ``natural'' parallel corpora has enabled the relatively fast progress of statistical speech recognition and statistical machine translation over the last 20 years. Those systems are still pretty bad by human standards, but they are often usable. Machine translation is ``possibly the hardest AI problem known to man'' only if you expect human performance.
So, is natural language search impossible? No, I think there are ways to proceed. The critical question is whether we can find plentiful ``natural'' correspondences between questions and answers that we can use to bootstrap systems. There are some interesting ideas around on possible sources, and I expect more to emerge over the next few years.