Tuesday, August 14, 2007

South America

Lake Nahuel Huapi

I'm leaving tomorrow for Buenos Aires on my way to two weeks of skiing in Argentine Patagonia and in the Chilean Andes near Santiago de Chile. My friends down there say that they haven't seen ski conditions this good in a long time. Off to finish packing.
Ski Arpa

Monday, August 13, 2007

The Surface/Symbol Divide

The Surface/Symbol Divide: This approach to knowledge discovery is fixed at the surface level of text (and the surface level of the representation language of documents, to be complete). Consequently, the performance of the system highlights both what is good about statistical surface techniques (little training required - which is often the case for systems that work with both document structure, textual data and high precision seed input; works in (m)any language(s); fast) and what is bad (has no real knowledge of language). (Via Data Mining.)

What is "real knowledge of language"? Where does it come from? Why is it unobtainable with statistical techniques? For all we know, a somewhat more sophisticated statistical inference procedure might get rid of some of the errors that Matt highlights (I have some ideas that are too tentative to discuss). More generally, given how quickly our understanding of language acquisition is changing, how can anyone say surely what "real knowledge of language" entails? It's time to retire the essentialism of "colorless green ideas".

Thursday, August 9, 2007

Echoes from the dance of the elephants

Echoes from the dance of the elephants: A few days ago, I learned that I was the author of a chapter in a book whose existence I had previously not suspected, and that as a result, a medium-sized European publishing conglomerate had paid a not-entirely-trivial sum of money to a much larger European publishing conglomerate. This makes me feel, in a small way, like an athlete who learns that he has been traded from one team to another. Except that I don't have to move. [...] This sort of publishing has become a strange ceremonial dance among business conglomerates, the libraries of research universities, and the governments who pay the library costs. It plays almost no role at all in actual scientific and scholarly communication, at least in the fields that I work in. [...] The libraries who buy these publications are mostly, in the end, funded by taxpayers. Certainly in the U.S., the budgets of university research libraries form part of the overhead that universities charge on government research grants (which of course also pay for much if not most of the research whose results are published or reprinted in these volumes). In general, research libraries are wonderful institutions, more than worth what they cost; but the process that we're talking about is driving their costs way up, with little benefit to anyone except the publishing conglomerates. (Via Language Log.)

A couple of months ago, I linked to a critique of academic libraries by Clay Shirky:

Academic libraries, which in earlier days provided a service, have outsourced themselves as bouncers to publishers like Reed-Elsevier; their principal job, in the digital realm, is to prevent interested readers from gaining access to scholarly material.

Adam Corson-Finnerty from the Penn Libraries commented on my post, criticizing that "slam" on academic libraries. The Penn Libraries are outstanding, and they have been very progressive in their development and adoption of appropriate technologies, but all academic libraries have to seriously ask themselves whose interests they are serving when they continue "business as usual" with the rent-seekers in the academic publishing cartel. The example Mark discusses shows another facet of the problem. The only reason Routledge publishes such useless collections is that a few hundred sleepwalking academic librarians are willing to write a big check from a very strained acquisitions budget. If any faculty member asks their library to buy such a wasteful collection, the librarian should push back, awkward as that might be. Libraries need to not only embrace open access, institutional archiving, and self-archiving, but lead by example and persuasion. The Penn Libraries have done more than most in these areas, but we all need to do more to retake control of the diffusion of our intellectual production.

Wednesday, August 8, 2007

Why do online-only OA journals use PDF?

Why do online-only OA journals use PDF?: Andy Powell, Open, online journals != PDF ? eFoundations, August 6, 2007. Speaking of the International Journal of Digital Curation (IJDC):

Odd though, for a journal that is only ever (as far as I know) intended to be published online, to offer the articles using PDF rather than HTML. Doing so prevents any use of lightweight 'semantic' markup within the articles, such as microformats, and tends to make re-use of the content less easy.
(Via Open Access News.)

Doesn't seem so hard to figure out. HTML is awful for mathematics and scientific graphics. Just compare our recent paper in HTML and in PDF, even though the math in the PDF version, produced from PLoS's required Word format, is not as readable as it was in our original LaTeX.

Tuesday, August 7, 2007

Would you rather be a theorist or an experimentalist?

Why it’s OK not to be Sean: There's an old chestnut that theorists are judged by their best paper and observers/experimentalists by their worst. (Via Cosmic Variance.)

Must be an old chestnut for physicists; I hadn't heard it before. I thought for a moment that playing theorist might get me a free lunch, but then I realized that my best paper would be compared with the best papers of all theorists. I think I'll continue to take my chances with the experimentalists...

Monday, August 6, 2007

Evo-devo and computation

Sci Foo recap: If I were to do it all again, I'd offer up an intro to evo-devo, in particular because some of the more gung-ho genomics talks seemed so oblivious to the difficulties of the fancier projects they were saying would be in our future. I really think the organismal-form-from-DNA problem is going to make the protein folding problem look trivial, and this is especially going to be true if the DNA Mafia is going to pretend the developmental biologists don't exist. (Via Pharyngula.)

A computer science point of view makes this point easier to understand. At the genomic level, evo-devo focuses on the evolution of the switches that control gene expression spatially and temporally in development. To a first, discrete approximation, these switches form Boolean combinations of transcription factors (themselves the expressions of genes) that gate the expression of another gene. There are also feedbacks and delays in the system. So, we have a pretty powerful computational device, and we know that recreating (learning) such a system from its behavior is extremely hard even in relatively simple cases (finite-state machines).
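
To make the "Boolean switches with feedback and delay" picture concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the three genes A, B and C, their rules, and the synchronous update are not taken from any real regulatory network.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]  # gene name -> is the gene currently expressed?

# Hypothetical three-gene network (illustrative only, not real biology).
# Each rule is a Boolean combination of transcription factors, which are
# themselves gene products, gating the expression of another gene.
RULES: Dict[str, Callable[[State], bool]] = {
    "A": lambda s: not s["C"],         # C represses A (negative feedback)
    "B": lambda s: s["A"],             # A activates B
    "C": lambda s: s["A"] and s["B"],  # C needs both A and B
}


def step(state: State) -> State:
    """One synchronous update: every gene reads the previous state,
    which models a one-step delay between cause and effect."""
    return {gene: rule(state) for gene, rule in RULES.items()}


def trajectory(state: State, n: int) -> List[State]:
    """Return the start state followed by n successive updates."""
    states = [state]
    for _ in range(n):
        state = step(state)
        states.append(state)
    return states


if __name__ == "__main__":
    start = {"A": True, "B": False, "C": False}
    for t, s in enumerate(trajectory(start, 5)):
        print(t, "".join(g if on else "-" for g, on in s.items()))
```

Even this toy network cycles through five distinct states before returning to its start. An observer who sees only the sequence of expression patterns would have to recover the three rules from it, which is the learning problem described above in miniature.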

We might hope that the system is constrained in ways that make it easier to reconstruct from behavior than the worst-case results suggest. But I see no functional reason why that should be the case. "Easy to reverse engineer" doesn't seem to have an evolutionary advantage, and it may actually be disadvantageous, in that it could facilitate the evolution of parasites and other attackers. (Think of the defensive advantages of encrypted communication.)