Comments on Earning My Turns: The Surface/Symbol Divide

Fernando Pereira, 2007-08-14 16:35 (-07:00)

Regarding William's point (b): for this task, you do not need POS tags, but a way of classifying tokens based on use and context that carries the relevant information, which is the distinction between proper names and other lexical categories. Of course, that assumes some morphological preprocessing, which might be challenging for highly inflected languages.

William Cohen, 2007-08-14 14:41 (-07:00)

Fernando says: "...parts-of-speech can be distinguished fairly well by unsupervised "surface" statistical methods. This system does not do this because there are only so many hours in the day of even the brightest graduate student."

Also because a) it's mostly finding semi-structured pages (lists and tables and such) where POS tagging would be less reliable, and b) because it's language-independent.

I generally agree with both of you. I think the specific problem Matt points out can be (and probably will be) fixed with some additional analysis.

But I think the overall question remains - there are clear limitations to this sort of technique. It's very much oriented toward exploiting redundancy across sites, and it doesn't usually work for things that aren't relatively popular and well-known named entities. This is a sort of wisdom-of-crowds result, and finesses the problem of doing any real understanding of any particular page. In fact, it will get poor results on many of the hundreds of pages it processes - it works because it's aggregating information across many poorly-understood information sources.

Fernando Pereira, 2007-08-14 05:51 (-07:00)

I too thought "(real (knowledge of language))" is what you meant. My general point is that "knowledge of language" has the same lack of explanatory power as other essences like "vital force". More specifically, you must know that parts-of-speech can be distinguished fairly well by unsupervised "surface" statistical methods. This system does not do this because there are only so many hours in the day of even the brightest graduate student.

Matthew Hurst, 2007-08-13 21:25 (-07:00)

Fernando - I think you have mis-parsed 'no real knowledge of language' in this case. It is not ((real knowledge) of language); it is (real (knowledge of language)). It is a subtle difference, but it is a difference. You should, then, be asking 'what is "knowledge of language"?' In this case, it would be the ability to distinguish parts of speech - clearly, this system is not capable of doing this.