The Surface/Symbol Divide: This approach to knowledge discovery is fixed at the surface level of text (and, to be precise, at the surface level of the representation language of documents). Consequently, the system's performance highlights both what is good about statistical surface techniques (little training required, which is often the case for systems that combine document structure, textual data, and high-precision seed input; works in (m)any language(s); fast) and what is bad about them (no real knowledge of language). (Via Data Mining.)
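To make "the surface level of text" concrete, here is a minimal sketch of the kind of seed-driven surface technique the quote seems to describe: patterns are induced as literal strings around high-precision seed pairs and then reapplied to raw text, with no parsing, tagging, or other linguistic analysis. The seed pairs, corpus, and pattern scheme below are invented for illustration, not taken from the system being discussed.

```python
import re

# Hypothetical high-precision seed pairs (city, country) for illustration.
seeds = [("Paris", "France"), ("Tokyo", "Japan")]

# A toy corpus: the technique only ever sees raw strings.
corpus = [
    "Paris is the capital of France.",
    "Tokyo is the capital of Japan.",
    "Ottawa is the capital of Canada.",
]

# Step 1: induce purely surface-level patterns from sentences where both
# members of a seed pair co-occur; the "pattern" is just the literal text
# with the seed strings replaced by capture slots.
patterns = set()
for x, y in seeds:
    for sentence in corpus:
        if x in sentence and y in sentence:
            patterns.add(
                re.escape(sentence)
                .replace(re.escape(x), r"(\w+)", 1)
                .replace(re.escape(y), r"(\w+)", 1)
            )

# Step 2: reapply the induced patterns to extract new pairs from unseen text.
extracted = set()
for pattern in patterns:
    for sentence in corpus:
        m = re.match(pattern, sentence)
        if m and m.groups() not in seeds:
            extracted.add(m.groups())

print(extracted)  # {('Ottawa', 'Canada')} - extracted with no model of language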
What is "real knowledge of language"? Where does it come from? Why is it unobtainable with statistical techniques? For all we know, a somewhat more sophisticated statistical inference procedure might eliminate some of the errors that Matt highlights (I have some ideas that are too tentative to discuss). More generally, given how quickly our understanding of language acquisition is changing, how can anyone say for sure what "real knowledge of language" entails? It's time to retire the essentialism of "colorless green ideas".