Thursday, December 27, 2007

an open access victory

an open access victory: in a word ... wow (Via tingilinde.)

A very important benefit of this opening will be to allow full indexing and analysis of the past literature. Sometimes we have the illusion that the latest publication is the one that matters, but in many cases discovery is bumpy and drawn out, so the ability to find and synthesize the whole history of a topic is very important in assessing the current state of knowledge. The potential of automated biomedical literature mining has just become that much greater.

Tuesday, December 25, 2007

Climate change for skeptical environmentalists

Climate change for skeptical environmentalists: A science teacher in Independence, Oregon lays out the scenarios Bishop Berkeley style, with stunning simplicity. He points out that even if we don't know whether global climate disruption is real, we can decide whether or not to act. And when we weigh the risks of acting against the risks of not acting, even if climate change might not be happening, our course is clear. This video has had over 2.9 million views already. Nice.


Now that you've seen it (go ahead, watch it), imagine the same argument laid out in a written essay. As compelling? No way. As easily accessible by millions of people? Not. The medium is the message. Barriers to videos that would raise the threshold for little gems like this would be socially irresponsible. (Via isen.blog.)

The argument in the video seems to rely implicitly on the naive assumption that uncertainty has to be maximal, that is, that the two outcomes have equal probability. But a skeptic exploiting that is forced to assign probabilities that make the expected losses from action greater than the expected losses from inaction, or to assert that the worst-case scenario is much less bad than suggested. Either move requires the skeptic to deny a lot of evidence. Which is what skeptics are doing. I like the effort in the movie, but without hard numbers, it is always possible for the skeptics to flip the movie's qualitative calculus towards inaction.
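To make the point concrete, here is a minimal sketch of the expected-loss arithmetic with entirely hypothetical probabilities and costs; it shows only how the ranking of action versus inaction flips with the numbers one assumes.

    # Toy expected-loss comparison for "act" vs. "don't act".
    # All probabilities and costs below are hypothetical, for illustration only.

    def expected_loss(p_change, loss_if_change, loss_if_no_change):
        """Expected loss of a policy, given the probability that disruption is real."""
        return p_change * loss_if_change + (1 - p_change) * loss_if_no_change

    # Acting costs something either way; not acting is catastrophic only if
    # climate disruption is real (arbitrary units).
    print(expected_loss(0.5, 10, 10), expected_loss(0.5, 100, 0))    # 10.0 50.0: acting wins
    # A determined skeptic can flip the ranking by assuming a low enough probability
    # or a mild enough worst case:
    print(expected_loss(0.05, 10, 10), expected_loss(0.05, 100, 0))  # 10.0 5.0: inaction "wins"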

Friday, December 21, 2007

New music

I've been too busy to blog much. My 30 GB iPod is full. But I can't stop getting new music:

Monday, December 10, 2007

Sentiment Mining: The Truth

Sentiment Mining: The Truth: Nathan Gilliat (an excellent blogger) posts about BuzzLogic's new partnership with KDPaine, which will deliver sentiment scores to BuzzLogic's clients. There are a number of approaches to delivering sentiment analysis, including many automated approaches and some manual ones. Customers are still skeptical of automated methods and generally more comfortable with manual methods. Paine writes:

"Computers can do a lot of things well, but differentiating between positive and negative comments in consumer generated media isn’t one of them,” explained Katie Delahaye Paine, CEO of KDPaine & Partners. “The problem with consumer generated media is that it is filled with irony, sarcasm and non-traditional ways of expressing sentiment. That’s why we recommend a hybrid solution. Let computers do the heavy lifting, and let humans provide the judgment."
This kind of statement is particularly unhelpful. Let's break it down. (Via Data Mining.)

Worth reading the whole post, where Matt gives a nice summary of strengths and weaknesses of statistical NLP/machine learning methods for sentiment classification that is also relevant to other applications.

This reminded me of a study I heard about in my AT&T days comparing automatic speech recognition and keypad entry for phone-based services. The conventional wisdom was that speech recognition would have to be worse, given how bad automatic speech recognizers are compared with human operators. Except that users made more mistakes with the keypad than the speech recognizer made with their utterances. It is also common to assume that automated text information extraction systems “must” be worse than human annotators, but I know of at least one comparison between outsourced manual extraction and automatic extraction where, again, human performance was worse than machine performance. People are just not that good at those tasks on average, although a person interested in a particular task instance will typically do much better than the best program.

The assumption that algorithms “must” be less accurate than people doesn't seem to be based on solid empirical evidence. However, when customers talk about accuracy, what they really mean is trust. We are more willing to trust human annotators because we (think that we) can understand how they perform the task, and we feel we could, at least in principle, query them about their reasoning if we doubted their conclusions. Whether this trust is warranted is another matter. Going by how often irony and sarcasm are misinterpreted in online communication (thus all those emoticons), we may not be such good modelers of the judgments of others.

Monday, December 3, 2007

Roll over, Beethoven: Deutsche Grammophon ditches DRM

Roll over, Beethoven: Deutsche Grammophon ditches DRM: Deutsche Grammophon is one of the most respected classical music labels in the world, and it just happens to be a subsidiary of Universal Music Group. With DG dropping DRM in favor of MP3, has Universal finally made up its mind about DRM? (Via Ars Technica.)

This is interesting, although why, oh why, don't they provide AAC too? Still, I may need a bigger iPod.

Thursday, November 29, 2007

bumps and valleys...

bumps and valleys...:

[Image: Alta (Google Maps terrain view)]

I was using Google maps for its satellite view and noticed a terrain view tab ... (Via tingilinde.)

How come Steve chose one of the most popular near-backcountry skiing destinations for this example, the Catherine's/Lake Mary/Wolverine area NE of Alta? Is he trying to rub in the lack of snow there until a few days ago?

Monday, November 26, 2007

Search Engines: A Google-Yahoo Comparison

Search Engines: A Google-Yahoo Comparison: The most surprising result concerns the use of Wikipedia. That use was marginal in December 2005 (see the study). At the time, across the full first page of 10 results, Google returned 2% of its links from Wikipedia and Yahoo 4%. For the first link alone, Google returned no Wikipedia results (at least in our sample) and Yahoo 7%. [...] The average rating users give when the result comes from Wikipedia is nearly a point higher, for Google as for Yahoo, than the rating given to other results. Pushing Wikipedia hard is therefore a winning strategy at little cost. It is, however, dangerous. The day users realize that, thanks for example to Firefox's search bar, they can search Wikipedia directly when they want encyclopedic information, Wikio for news and blogs, Allociné for movies, and so on, the concept (old-fashioned, to my mind) of the general-purpose search engine will be in trouble. We are beginning to see its limits. (Via Technologies du Langage.)

Interesting study, but I disagree with this conclusion, at least in the absence of further experimental investigation. While Wikipedia may be a convenient first stop with high average credibility, having a diversity of sources on the first search page is very important. Often I search for a term I already understand in order to find more specialized resources; in those cases the Wikipedia entry adds little that I don't already know. Encyclopedias are good as a first entry point into a subject, but not that good for detail, associated material, or timeliness.

This Climate Goes to Eleven

This Climate Goes to Eleven: Gerard H. Roe and Marcia B. Baker, "Why Is Climate Sensitivity So Unpredictable?", Science 318 (2007): 629--632 [...] Roe and Baker's argument is simple but ingenious and compelling. The climate system contains a lot of feedback loops. This means that the ultimate response to any perturbation or forcing (say, pumping 20 million years of accumulated fossil fuels into the air) depends not just on the initial reaction, but also how much of that gets fed back into the system, which leads to more change, and so on. [...] Suppose, just for the sake of things being tractable, that the feedback is linear, and the fraction fed back is f. [...] What happens, Roe and Baker ask, if we do not know the feedback exactly? Suppose, for example, that our measurements are corrupted by noise --- or even, with something like the climate, that f is itself stochastically fluctuating. The distribution of values for f might be symmetric and reasonably well-peaked around a typical value, but what about the distribution for G? Well, it's nothing of the kind. Increasing f just a little increases G by a lot, so starting with a symmetric, not-too-spread distribution of f gives us a skewed distribution for G with a heavy right tail. (Via Three-Toed Sloth.)

Interesting study. Besides the scary aspects of this finding relative to climate change, there's the more academic question of whether these types of processes can explain skew distributions we find in other fields.
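To see the mechanism at work, here is a small Monte Carlo sketch of the Roe and Baker point; the gain formula G = G0 / (1 - f) follows from the linear-feedback assumption in the quoted passage, while the particular mean and spread chosen for f are made-up illustrative values.

    # Monte Carlo sketch: a symmetric, well-peaked distribution for the feedback
    # fraction f produces a skewed, heavy-right-tailed distribution for the gain
    # G = G0 / (1 - f).  The numbers (mean 0.65, sd 0.13) are illustrative only.
    import random
    import statistics

    random.seed(0)
    G0 = 1.2  # reference sensitivity, arbitrary units

    samples = []
    while len(samples) < 100_000:
        f = random.gauss(0.65, 0.13)
        if 0.0 <= f < 1.0:              # keep f in the physically sensible range
            samples.append(G0 / (1.0 - f))

    samples.sort()
    print("median G:          ", round(statistics.median(samples), 2))
    print("95th percentile G: ", round(samples[int(0.95 * len(samples))], 2))
    print("99th percentile G: ", round(samples[int(0.99 * len(samples))], 2))
    # The upper tail stretches far above the median: small uncertainty about f
    # near 1 translates into huge uncertainty about G on the high side.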

Friday, November 23, 2007

Art, Science & Truth: Deep Nonsense

Art, Science & Truth: Jonah Lehrer: Reading Jonah Lehrer's Proust Was a Neuroscientist is something like watching Jacoby Ellsbury in the Red Sox outfield. [... ] Lehrer's stylish little book is a brief for art in an age of science. He stands with artists, for starters, because as he argues in eight signal lives, they hit the target first, about brain science in particular: poet Walt Whitman's intuition of "the body electric," for example; or novelist George Eliot's confrontation with systems thinking (Herbert Spencer, in person, and the invented Casaubon in Middlemarch) and her elevation of the indeterminacy of real life; or Paul Cezanne's methodical discovery of our eye's part (and our imagination's) in completing the experience of a painting. [...]

Scientists describe our brain in terms of its physical details; they say we are nothing but a loom of electrical cells and synaptic spaces. What science forgets is that this isn't how we experience the world. (We feel like the ghost, not like the machine.) It is ironic but true: the one reality science cannot reduce is the only reality we will ever know. This is why we need art. By expressing our actual experience, the artist reminds us that our science is incomplete, that no map of matter will ever explain the immateriality of our consciousness.
Jonah Lehrer, Proust Was a Neuroscientist, page xii.
(Via Open Source.)

I listened to this podcast at the gym today. I must have worked out harder to burn off the irritation with so much flim-flam. Science is Chris Lydon's weakest area by far. He's too willing to accept mystical pieties from his subjects that he would probe sharply in an interview about Iraq or Emerson.

Proust and Musil provided important refuges from my research when I was in graduate school. Reading À la Recherche du Temps Perdu in the original required such concentration that the difficulties with my work were erased for a while. But neither Proust nor Musil was really outside my most serious research concerns. Proust on Elstir or Musil on Moosbrugger raised tantalizing questions about perception, consciousness, and free will. So I was ready to be sympathetic towards Lehrer's book, which was on my “to read” list. No more.

In the interview, Lehrer talks in hushed tones about the “essential mystery” of individual experience that cognitive science will “never” answer. Lydon seems almost relieved that there's some mystical core left after all.

Lydon doesn't realize that Lehrer's mystery is trivial, a result of confusion between the particular and the general.

What cognitive science seeks is a general account of cognitive mechanisms. What art provides are particular accounts of experience, valuable exactly because of their particularity. A general account of cognitive processes cannot predict particulars any more than the logic diagram for this Intel Core Duo can predict what instruction will execute next. That instruction is determined by a combination of the processor, the contents of memory, and events in the outside world, like the keys I tap and the packets that arrive on the network interface.

Even if we had a complete wiring diagram of someone's brain, we could not predict the next neuron firings, let alone the next action of the subject, because we don't know the contents of memory, encoded in the states of synapses and of individual neurons (such as feedback-stabilized patterns of gene expression), nor what particular sensory events will happen next.

More generally, Lehrer seems to be totally oblivious to the huge 20th-century discovery that unpredictability is the rule for sufficiently powerful computing devices. The unsolvability of the halting problem is just the most extreme case of unpredictability: no general method can predict in finite time whether an arbitrary program will halt. A good pseudo-random number generator is unpredictable if we do not know its seed. Thinking of individual experience as a unique bit stream, it is not surprising that individual behavior is so unpredictable: we all have different seeds. In addition, cryptographic arguments show that a combinatorial circuit of sufficient complexity cannot be reconstructed from a polynomially-sized sample of its input-output behavior.
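Here is a trivial sketch of the seed point using Python's standard pseudo-random generator (the seed values are arbitrary): the mechanism is identical in both runs, yet without the hidden seed the mechanism alone predicts nothing about the particular stream.

    # Identical "wiring" (the generator code), different hidden seeds, completely
    # different behavior.  Seeds 42 and 43 are arbitrary.
    import random

    def behavior(seed, steps=10):
        rng = random.Random(seed)                      # same mechanism every time
        return [rng.randint(0, 9) for _ in range(steps)]

    print(behavior(42))   # one "individual"
    print(behavior(43))   # same general account, different particulars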

If Lehrer wanted to puzzle over a real question that matters in this argument, he could have asked about our current lack of proof for the cryptographic assumptions used in the above argument. Now, there's a mystery. Not an “essential” one, we hope, but certainly a resistant one.

It is somewhat depressing that even highly educated people like Lehrer are so ignorant of the amazing discoveries on the limits of computation since 1936, and what they may imply for our understanding of the mind; and that they seem ready to go all weak at the knees with mystical copouts as soon as the opportunity presents itself.

To admire Proust or Musil I need no mystery: it is enough that they could create compelling experiences that illuminate the uniqueness and weirdness in all of us, which will stand however much we know about brains and minds, not because of any mystery, but because computation has limits. Unpredictability makes us free.

Update: Complementary claims of nonsense.

Thursday, November 22, 2007

In memoriam Maurice Gross. (arXiv:0711.3452v1 [cs.CL])

In memoriam Maurice Gross. (arXiv:0711.3452v1 [cs.CL]): Maurice Gross (1934-2001) was both a great linguist and a pioneer in natural language processing. This article is written in homage to his memory. (Via cs updates on arXiv.org.)

I met Maurice Gross a few times. He had long-standing connections with Penn, and he was a charming host when I visited his lab for a thesis defense. After the event, we had a memorable dinner at a local restaurant, where Maurice amused us with stories about his country house, wine-making, and I'm sure many other topics I don't remember now. He recommended that I try clafoutis for dessert; I had never had it before, and it was superb. This experience agrees well with Eric Laporte's account.

Maurice Gross was right and ahead of his time in his focus on the local grammar of lexical items, and in recognizing the combinatorial uniqueness of individual lexical items, in contrast with the very impoverished tag sets of standard generative grammar. The use in his laboratory of finite-state transducers for local grammars was highly original. However, I'm less sure that their specific approach is sufficient. The local grammar approach imposes extreme constraints on the interactions between lexical items, and seems too brittle to handle natural variation. Local grammars are better as compact summaries of observations than as models of the interactions and variations that may occur. Lexicalized TAG has a similar flavor of local grammar, but it allows greater combinatorial flexibility and generalization. Still, both the overall view of language and the specific methods that Maurice Gross pioneered deserve continued study. In our rush to build theories and systems, we keep forgetting that language is much more an assemblage of particulars than the neat result of a few general principles. We need to savor it slowly, as if we were sitting at dinner with Maurice.
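To give a flavor of the finite-state style of local grammar (this is only a toy illustration of mine, not Maurice Gross's formalism or lexicon-grammar tables), here is a tiny acceptor for the English pattern "take ... into account"; its failure on a legitimate word-order variant also illustrates the brittleness worry just mentioned.

    # A toy local grammar for "take <NP> into account", in the finite-state style.
    # Invented for illustration; not from Gross's lexicon-grammar tables.

    TAKE_FORMS = {"take", "takes", "took", "taken", "taking"}

    def matches_take_into_account(tokens):
        """Accept TAKE <one or more tokens> 'into' 'account'."""
        if not tokens or tokens[0] not in TAKE_FORMS:
            return False
        i = 1
        start_np = i
        while i < len(tokens) and tokens[i] != "into":   # crude stand-in for an NP
            i += 1
        if i == start_np:                                # require a non-empty object
            return False
        return tokens[i:i + 2] == ["into", "account"]

    print(matches_take_into_account("took the objections into account".split()))   # True
    print(matches_take_into_account("take into account the objections".split()))   # False:
    # the shifted-object variant is fine English, but this rigid local grammar
    # misses it, which is exactly the brittleness concern above.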

Wednesday, November 21, 2007

Hands on with Kindle

Hands on with Kindle: [...] The layout? Not so great. Forced justification with apparently no hyphenation dictionary or hinting in the format. That's a huge failure. On a private list, I noted that, "Justification without hyphenation is like taxation without representation." That is, the poor letters and word groupings have no input into how they're displayed, which makes for a poor republic.

(Via TidBITS.)


That's all we need to know about the Kindle. Thanks, Glenn Fleishman!

Thursday, November 15, 2007

Edgar Bronfman, Jr. Reported to Talk Straight

Edgar Bronfman, Jr. Reported to Talk Straight: Edgar Bronfman, Jr.'s efforts to become a powerful figure in the entertainment world, via money derived from his father's Seagram empire, are long-standing. [...] The boss of Warner Music has made a rare public confession that the music industry has to take some of the blame for the rise of p2p file sharing.

"We used to fool ourselves,' he said. "We used to think our content was perfect just exactly as it was. We expected our business would remain blissfully unaffected even as the world of interactivity, constant connection and file sharing was exploding. And of course we were wrong. How were we wrong? By standing still or moving at a glacial pace, we inadvertently went to war with consumers by denying them what they wanted and could otherwise find and as a result of course, consumers won."
(Via The Patry Copyright Blog.)

It only took ten years for him to start getting a clue. Efforts to work with him and other music executives on digital distribution go back much further than the start of Apple's music store. It's a sad reflection on their stewardship of the business that instead of leading the charge to digital, they dug in their heels for a decade and only started waking up when p2p and Apple pushed them into a corner. I'm sure their shareholders are delighted.

making bicycling safe in the us

making bicycling safe in the us: Important stuff - what Berkeley has done. It isn't the Netherlands, but far better than most US cities (except for Davis, CA perhaps) (Via tingilinde.)

I'd be a bit more sympathetic to such initiatives if it were not that, in over six years in Philly as a pedestrian and public-transportation user, most of the times I've come close to injury have been due to cyclists switching between roadway and sidewalk at high speed, riding against traffic on one-way streets, and riding through red lights.

There's a holier-than-thou attitude among cycling activists that really rankles. Sharing the road applies to all road users, not just motorists.

Tuesday, November 13, 2007

Cristina Branco

CRISTINA BRANCO canta SHAKESPEARE: My sister links to a YouTube clip of Cristina Branco singing Se a Alma te Reprova, a translation of Shakespeare's Sonnet 136, which is also included in the album Sensus (also available as an Amazon MP3). For me she's the best Portuguese singer in a long time, musically more creative and emotionally more subtly expressive than others who are better known. (Via aguarelas de Turner.)

Tuesday, November 6, 2007

Jim Lehrer sticks up for traditional media

Jim Lehrer sticks up for traditional media: "The bloggers are talkers, commentators, not reporters. The talk-show hosts are reactors, commentators, not reporters," Lehrer said. "The search engines can search but do not report. All of them, every single one of them, have to have the news in order to exist and thrive."

Let's grant him that. The question then is: how will serious reporters outside public broadcasting make a living as the paper media shrivel and the broadcast media replace news with infotainment? Even if online advertising were a sufficient revenue source to support the reporters, how would the revenue flow to reporters without the current (obsolescent) news production mechanisms? Maybe some new form of syndication, but we don't have those mechanisms in place yet (open access scientific publication suffers from a parallel problem). We'd better start researching the design of new syndication mechanisms.

Friday, November 2, 2007

Open Source ML Software Track in JMLR

Open Source ML Software Track in JMLR: What a great idea. (Via Cranial Darwinism.)

I agree, but then I had a teeny bit to do with developing the prospectus for this new JMLR track, which came out of a workshop at last year's NIPS. At the workshop, I argued that resources follow academic recognition and citation, so we need a means to have software peer-reviewed and cited to attract more resources to open-source development efforts in machine learning.

Wednesday, October 31, 2007

Amazon 2, iTunes Store 0

Another album, Cristina Branco's Ulisses, that Amazon has in DRM-free 256kb MP3 while the iTunes Store has only DRM-laden 128kb AAC. I used to have this album via iTunes, but a mistake when trying to rebuild my files after a disk crash sent it to /dev/null.

Saturday, October 27, 2007

Hard Road West

I've been reading Hard Road West: History and Geology along the Gold Rush Trail by Keith Heyer Meldahl. If you are interested in the mountains west of the Rockies, you need to read this book. The jacket blurb mentions John McPhee, but this is a very different book from McPhee's geology series. McPhee is a superior prose stylist who seeks to distill his encounters with the landscape and its professional observers into concise, striking, almost choppy prose. Meldahl is a less refined writer, but his geology is detailed, leisurely, brimming with an enthusiasm for the rocks that contrasts with McPhee's almost clinical detachment. In spirit — not in style — Meldahl reminds me of Stegner's John Wesley Powell biography Beyond the Hundredth Meridian. There's a sense of place, a love for a harsh landscape, that is very powerful in both books.

Amazon vs iTunes, round two

Last month I tried the new Amazon MP3 service but I found a bug that caused files to be delivered at 160kb compression instead of the advertised 256kb. A few days ago, Amazon contacted me saying that the bug was fixed and asking me to try again to download the same album, Anouar Brahem's Le Voyage de Sahar. I just went through the process, which was a bit messy because I had to download the .amz files for each track individually and then get the real tracks downloaded by the Amazon downloader. In addition, there was a bit of confusion between the old, 160kb copies and the new 256kb ones, but all is well now and I'm listening to another Brahem wonder.

Out of curiosity, I checked the iTunes Store for this album. They have it, but with DRM at 128kb AAC, not their DRM-free, higher-quality iTunes Plus. What's with that? Between DRM-free 256kb MP3 and DRM-encumbered 128kb AAC, it's no contest. I prefer AAC other things being equal, but here things are definitely not equal.

Thursday, October 25, 2007

Help Scott Aaronson (and me) manage email

Halloween Special: My Inbox: Most Respected Profeser Sir Dr. Scot Andersen: I wish to join your esteemed research group. I have taken two courses in Signal Processing at the Technical College of Freedonia; thus, it is clear that I would be a perfect fit for your laboratory [...] (Via Shtetl-Optimized.)

There's a small but growing market here for a faculty social email filtering system that shares good responses to these types of messages and automatically selects one to reply with. A good project for a graduate student suffering from procrastinitis?
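A minimal sketch of what such a system might do, with entirely invented canned replies and keyword lists: score each stored reply by crude word overlap with the incoming message and pick the best match. A real version would share and refine the reply pool across many faculty inboxes.

    # Hypothetical canned-reply picker: choose the stored response whose keyword
    # set overlaps most with the incoming message.  Replies and keywords invented.

    CANNED_REPLIES = {
        "admissions": "Thank you for your interest. Admissions decisions are made by the "
                      "department, not by individual faculty; please apply through the "
                      "graduate admissions office.",
        "reprint":    "The paper you mention is available from my publications page.",
        "review":     "I'm sorry, but my reviewing queue is full for the next few months.",
    }

    KEYWORDS = {
        "admissions": {"admission", "apply", "join", "research", "group", "phd", "student"},
        "reprint":    {"paper", "copy", "pdf", "reprint", "article"},
        "review":     {"review", "manuscript", "submission", "referee", "journal"},
    }

    def pick_reply(message):
        words = set(message.lower().split())
        best = max(KEYWORDS, key=lambda topic: len(words & KEYWORDS[topic]))
        return CANNED_REPLIES[best]

    print(pick_reply("I wish to join your esteemed research group as a PhD student"))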

Monday, October 22, 2007

Soviet Telecoms

Mossberg:
That’s why I refer to the big cellphone carriers as the “Soviet ministries.” Like the old bureaucracies of communism, they sit athwart the market, breaking the link between the producers of goods and services and the people who use them.

Read the whole thing.

Commitment

Commitment: Isabella Bannerman frames the problem:

(Via Language Log.)

Now she tells us.

Saturday, October 20, 2007

Listening to Orchestra Baobab

A night at club baobab: Senegalese dance music of the 70's. From their early career, poor recording quality. Not the polish and variety of Specialist in All Styles, one of my top choices in West African music, but lively and warm-hearted.

More fun with VPE

More fun with VPE: From an Andy Gill article on D.A. Pennebaker's rockumentary Don't Look Back (on Bob Dylan), in The Independent of 4/27/07 (the Joan in question is Joan Baez):

[quote from Pennebaker:] "I guess I tried to make that film as true to my vision of him as I could make it. But as a storyteller, I wanted there to be stories in it."
Pennebaker was aided in this regard by Bob Neuwirth, a singer and painter who served as Dylan's tour manager. Neuwirth had proved himself Dylan's equal in droll acerbity - he's the one who jokes, "Joan's wearing one of those see-through blouses you don't even want to!" - and he clearly saw part of his job as providing entertaining moments for the camera.
Yes, one of those see-through blouses you don't even want to, with a Verb Phrase Ellipsis (VPE) on the edge. (Via Language Log.)

Back in the 90s I spent a lot of time puzzling over ellipsis with my friends Mary Dalrymple and Stu Shieber. Our semantic theory of ellipsis went against the prevailing syntactic theories of the time. The examples in this (as usual) amusing and erudite post by Arnold Zwicky are the kinds of syntactic-theory-busting nuggets we were looking for back then. It's tough digging, but someone has to do it.

Friday, October 19, 2007

Someone else's automata birthday

On Being Twice a Square:
It's not my birthday (thanks friends, but I'm in between doubled squares), I just love the graphic on this cake:


(Via Recursivity.)

Thursday, October 18, 2007

Fishing Expedition

Fishing Expedition: A favorite phrase employed by some proposal reviewers is to accuse the PI’s of proposing a ‘fishing expedition’, meaning that the PI’s don’t really know what they are going to find but are hoping to find something. This is typically meant as a devastating negative comment: that is, one should have a more certain prediction of the outcome of the research or the research should not be done. [...] A previous trend in proposal-trashing involved the phrase 'stamp collecting' -- i.e., mindlessly collecting and organizing data. What's the next hobby-related pejorative, once reviewers tire of 'fishing expedition'? My personal preference would be 'zorbing'. Any other suggestions? (Via FemaleScienceProfessor.)

I'd suggest grid crunching.


Mobile phone use backed on planes

Mobile phone use backed on planes: Passengers could soon be using their mobile phones on planes flying through European airspace. (Via BBC News | Technology | UK Edition.)

Are they also going to add air marshals to stop irate passengers from gagging overly talkative mobile users? Airlines agreed to stop on-board smoking for the sake of safety and health, and smoking addicts dealt with it. But the tobacco companies were not as smart as the telecoms. If they had proposed a special on-board smoking fee to be divided between them and the airlines, we'd probably still have on-board smoking. Because it is evident that the extra charges for on-board calls will be nicely divided between the telecoms and the airlines, at the expense of all the passengers who do not suffer from mobile addiction.

Saturday, October 13, 2007

In Praise of Yeast

In Praise of Yeast: [...] One of the best studied of all genetic circuits in the world is the one yeast uses to feed on a sugar called galactose. [...] So how did this elegant circuit evolve? (Via The Loom.)

To a programmer, this story is a wonderful example of refactoring, where one class (one gene) that has grown to do double duty is broken into two separate classes (two copies of the gene) that can then be changed to specialize on one task each. The more I read about evo-devo, the more I notice uncanny parallels with the processes in large, long-lived software projects, in which change is highly constrained by history and context. The main difference is that the programmers who change the code believe that the changes they make direct the code base towards increased fitness, whereas random mutations don't need to bother with the pretense. (wink)
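For the programmers, here is a sketch of the analogy with invented class names: the ancestral gene product does double duty, and after duplication each copy can be specialized without breaking the other job.

    # Refactoring analogy for gene duplication; names and numbers are invented.

    class AncestralGalProtein:
        """One gene product doing two jobs: regulation and metabolism."""
        def sense(self, galactose_present):
            return galactose_present          # regulatory role
        def metabolize(self, galactose):
            return 0.9 * galactose            # enzymatic role (toy stand-in)

    # After duplication, each copy evolves (is refactored) for a single task.
    class GalSensor:
        def sense(self, galactose_present):
            return galactose_present          # free to be tuned for regulation only

    class GalEnzyme:
        def metabolize(self, galactose):
            return 0.99 * galactose           # free to be tuned for catalysis only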

Total Music, Uh-Huh

★ Total Music, Uh-Huh: BusinessWeek has a story — “Universal Music Takes on iTunes” — regarding a supposed proposal from Universal Music chief Doug Morris to create a music-industry-owned subscription service called Total Music. BusinessWeek twists the story into pretzel-like contortions to present this scheme as clever and reasonable.

While the details are in flux, insiders say Morris & Co. have an intriguing business model: get hardware makers or cell carriers to absorb the cost of a roughly $5-per-month subscription fee so consumers get a device with all-you-can-eat music that’s essentially free.
[...] In and of itself, Total Music is not a ridiculous notion, just like regular pay-by-the-month subscription services aren’t ridiculous notions. But we all know that device makers aren’t going to eat the cost — they’re going to pass it along to consumers. A Total Music music player is going to cost somewhere around $100 more than a similar player without Total Music. And it’s not like subscription services haven’t been tried before. (Via Daring Fireball.)

This model might make sense when players have effectively infinite storage — $90 (presumably the quoted $5 monthly fee amortized over the device's expected life) is not an absurd premium for a device preloaded with all the music ever recorded — but for the time being, it makes no sense. Subscription services have not worked so far, so why would they work now? Also, "Total" is not likely to be that total. Why pay $90 to the major labels when you care most about independent artists and labels, who are unlikely to buy into a system that dilutes their brand for a trivial fraction of the shared pot?

Monday, October 8, 2007

Infinite Storage for Music

Infinite Storage for Music: Last week I spoke on a panel called “The Paradise of Infinite Storage”, at the “Pop [Music] and Policy” conference at McGill University in Montreal. The panel’s title referred to an interesting fact: sometime in the next decade, we’ll see a $100 device that fits in your pocket and holds all of the music ever recorded by humanity. [...] in a world of infinite storage, no searching is needed, and filesharers need only communicate with their friends. If a user has a new song, it will be passed on immediately to his friends, who will pass it on to their friends, and so on. Songs will “flood” through the population this way, reaching all of the P2P system’s participants within a few hours — with no search, and no communication with strangers. Copyright owners will be hard pressed to fight such a system. (Via Freedom to Tinker.)

Read the whole argument. Hard to disagree with. A corollary is that traditional (C) protection will be impossible except in a police state, and very difficult even in one (I have personal experience of police state inefficiency).
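The flooding claim is easy to check with a toy gossip simulation; the random friendship graph and the eight-friends-per-person figure below are my own illustrative assumptions, not numbers from the quoted post.

    # Toy simulation of friend-to-friend "flooding" of a new song, with no search.
    # The friendship graph (8 random friends per person) is an illustrative assumption.
    import random

    random.seed(1)
    N = 10_000
    friends = {i: random.sample(range(N), 8) for i in range(N)}

    has_song = {0}          # one person starts with the new song
    frontier = {0}
    rounds = 0
    while frontier:
        rounds += 1
        new = set()
        for person in frontier:
            for friend in friends[person]:
                if friend not in has_song:
                    has_song.add(friend)
                    new.add(friend)
        frontier = new

    print(f"reached {len(has_song)} of {N} people in {rounds} rounds")
    # Virtually the whole population has the song after a handful of rounds of
    # purely local, friend-to-friend copying.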

One possible outcome is that recorded music will become an advertising medium, and live performances, prized for their rarity and uniqueness, will be what really rakes in the cash. Sort of like high fashion, which is not protected by copyright but prospers from the status value of owning the "real thing".

Tuesday, October 2, 2007

The Technical-Social Contract

The Technical-Social Contract: We think we understand the rules of commerce. Manufacturers and sellers advertise; we buy or not, as we choose. We have an intuitive understanding of how advertising works, up to and including a rather vague notion that advertisers try to target "suitable" customers. Similarly, manufacturers and sellers have an understanding of how people buy and use their products. However, technology has been changing what's possible for all parties, frequently faster than people's assumptions. This mismatch, between what we expect and what is happening, is at the root of a lot of high-tech conflict, ranging from peer-to-peer file-sharing to Apple's iPhone. (Via SMBlog -- Steve Bellovin's Blog.)

A brief and lucid explanation of the main issues behind the file-sharing and iPhone fights. Highly recommended, with a bonus of links to very interesting supporting material.

Wednesday, September 26, 2007

Update on Amazon MP3

Amazon just emailed me that they had verified this bug and that they are working on fixing it. In the meanwhile, they are reimbursing me for the album I bought. That's good customer service! I hope that they fix the problem soon because I want that album and probably others. Now if they also offered AAC...

Unsolicited Advice, IV: How to Be a Good Graduate Student

Unsolicited Advice, IV: How to Be a Good Graduate Student: Past installments of Unsolicited Advice dealt with such mechanical topics as how to choose an undergraduate school or graduate school, or how to get into graduate school. (Hell if I know how to get into undergraduate schools.) Now we step fearlessly into somewhat more treacherous territory: how to be a good graduate student. As always, this is one idiosyncratic viewpoint, and others should be offered in the comments. (Via Cosmic Variance.)

Excellent advice. Even though it is from the point of view of physics graduate study, it applies well to computer science.

Tuesday, September 25, 2007

Possible bug in Amazon's music downloader

When I downloaded that Anouar Brahem album using Amazon's custom album downloader, somehow the tracks were stored at 160kb MP3 instead of 256kb MP3, the advertised compression rate. This puzzled me for a while, but I suspect that it may be a bad interaction with my iTunes CD import settings, which are 160kb AAC. Somehow, the Amazon downloader seems to have grabbed that rate and used it for my downloads, even though MP3 rates and AAC rates have nothing to do with each other. I've reported the problem to Amazon; I'll blog when I hear from them.

This is not surprising, it's a beta after all.


★ The Amazon MP3 Store and Amazon MP3 Downloader

★ The Amazon MP3 Store and Amazon MP3 Downloader: The new Amazon MP3 Store looks like no previous iTunes Store rival. The music is completely DRM-free, encoded at a very respectable 256 kbps, includes a ton of songs from major record labels, and offers terrific software support for Mac OS X. (Via Daring Fireball.)

I browsed around a bit in jazz and African music. Much interesting backlist material, but it lacks more recent work by favorite artists like David Holland or Toumani Diabaté. Surprisingly, for another favorite artist, Anouar Brahem, it has several albums that I didn't know about, including a 2006 release from ECM that I had missed!

I would prefer 160 or 192kb AAC rather than 256kb MP3, to save space on my iPod. But it is notable that we finally have a large-catalog, non-subscription, non-DRM, non-proprietary online music store. I'll buy that Brahem album to test it out, and I'll keep paying attention. This could finally be the beginning of what we've been waiting for since the late 90s.


Monday, September 24, 2007

Search Quality is Brand Quality

Search Quality is Brand Quality: Greg Linden points to 'The Effect of Brand Awareness on the Evaluation of Search Engine Results', a paper by Bernard J. Jansen et al:

[...] Based on average relevance ratings, there was a 25% difference between the most highly rated search engine and the lowest, even though search engine results were identical in both content and presentation. We discuss implications for search engine marketing and the design of empirical studies measuring search engine quality.[...]

Greg summarizes some of the implications of branding with respect to competition in the search space. However, one might also consider branding in the context of a single provider. [...] Consider the challenge ahead of companies like Powerset and Hakia - which are attempting to bring another fundamental shift to search. Much of the criticism has been leveled at assumed issues with the technology. However, this is not the only battle ground. Establishing brand where there is none is a huge barrier to entry.

The cited study implies nothing of the kind. It ignores the well-known effect that when everything important (in this case, actual differences among search results) is removed from a stimulus, other variables that are still present dominate the response, even though those variables might be masked in a realistic situation. That is, the experimental design assumed but did not test additivity. A good study would have presented all combinations of brand and objective (double blind) search quality to the subjects. Then we would really know how much brand biases assessment of search quality.
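For concreteness, here is a sketch of the kind of factorial design being suggested, with an entirely made-up rating model: cross brand label with blinded result quality, and check whether the brand effect is the same at both quality levels (additivity) or not (an interaction).

    # Sketch of a 2x2 brand-by-quality design.  The rating model is invented;
    # only the design and the additivity check matter here.
    import random
    import statistics

    random.seed(0)

    def simulated_rating(brand, quality):
        base = {"low": 3.0, "high": 4.2}[quality]          # hypothetical quality effect
        bias = {"unknown": 0.0, "big-name": 0.3}[brand]    # hypothetical brand bias
        return base + bias + random.gauss(0, 0.5)

    cells = {}
    for brand in ("unknown", "big-name"):
        for quality in ("low", "high"):
            cells[(brand, quality)] = statistics.mean(
                simulated_rating(brand, quality) for _ in range(500))

    for cell, mean_rating in sorted(cells.items()):
        print(cell, round(mean_rating, 2))

    # Additivity check: if the brand effect differs a lot between quality levels,
    # there is an interaction that a brand-only manipulation cannot reveal.
    effect_low = cells[("big-name", "low")] - cells[("unknown", "low")]
    effect_high = cells[("big-name", "high")] - cells[("unknown", "high")]
    print("brand effect at low/high quality:", round(effect_low, 2), round(effect_high, 2))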

Friday, September 21, 2007

Extreme functional programming

Mathematical Markings:

[Image: the Y combinator] (Via The Loom.)

I'm fond of the Y combinator myself, and of Shannon's entropy, ... but this is a kind of dedication that I can't understand. On the other hand, two knee surgeries from skiing are probably more interference with bodily integrity than many of these self-decorators would accept...

Thursday, September 20, 2007

Doc Searls: don’t count on ads

Doc Searls: don’t count on ads: Because I am always behind reading my feeds (aren’t you?) I only just read this post by Doc Searls from a week ago. Coming from a slightly different angle, using his increasingly valuable VRM argument, Doc’s “Toward a New Ecology of Journalism” arrives at a similar place to where I ended up earlier this week in the Times Select discussion:

…The larger trend to watch over time is the inevitable decline in advertising support for journalistic work, and the growing need to find means for replacing that funding — or to face the fact that journalism will become largely an amateur calling, and to make the most of it.

This trend is hard to see. While rivers of advertising money flow away from old media and toward new ones, both the old and the new media crowds continue to assume that advertising money will flow forever. This is a mistake. Advertising remains an extremely inefficient and wasteful way for sellers to find buyers. I’m not saying advertising isn’t effective, by the way; just that massive inefficiency and waste have always been involved, and that this fact constitutes a problem we’ve long been waiting to solve, whether we know it or not.

Google has radically improved the advertising process, first by making advertising accountable (you pay only for click-throughs) and second by shifting advertising waste from ink and air time to pixels and server cycles. Yet even this success does not diminish the fact that advertising itself remains inefficient, wasteful and speculative. Even with advanced targeting and pay-per-click accountability, the ratio of ‘impressions’ to click-throughs still runs at lottery-odds levels.

…The result will be a combination of two things: 1) a new business model for much of journalism; or 2) no business model at all, because much of it will be done gratis, as its creators look for because effects — building reputations and making money because of one’s work, rather than with one’s work. Some bloggers, for example, have already experienced this….

Just don’t expect advertising to fund the new institutions in the way it funded the old.

I think this is right, though the long-term-ness of the vision will have most hard-headed business people smirking their disbelief as they point to corporate-media revenue numbers with long strings of zeroes dangling from them.

Great observations. We need news. We read news all the time. We just don't have a good way to pay for that need. I agree that direct advertising revenue may not be able to provide full support for high-quality news. On the other hand, if all news disappeared tomorrow, we'd have to find a way to bring it back, just as if all search engines or DNS disappeared tomorrow. Much internet traffic (not by volume, but by attention) involves news. So those who benefit from that traffic — ISPs, search engines, social networking sites — had better find ways to keep news going. Some form of advanced syndication may be important here. The fact that search engines are increasingly making deals with news providers like the AP suggests that they see this.

Monday, September 10, 2007

iPhone critique

more thoughts on the new mobile: The worst feature by far (ignoring being forced to use one carrier) is the glacial molasses-in-January EDGE wireless service from AT&T. Don't even think of using the web browser using EDGE unless you have pressing needs to see one page (I'm talking about 1-2 minutes for a NY Times page to render). Performance in WiFi is spiffy - so it isn't the device. If you really need on-the-go web, forget this until a 3G alternative arrives. (Via tingilinde.)

EDGE seems to compete with other network uses in a really bad way. On my T-Mobile data service I sometimes get decent Web performance, but it becomes useless at busy times in places where there are a lot of calls going on, for instance airports. Packet losses of over 50% and ping times of over 5 seconds are not unusual in those situations.


Saturday, September 8, 2007

TR: Argentina and Chile

Here are some of the photos and a movie with some more explanation.

Tuesday, September 4, 2007

After ditching Apple, NBC opts for flex pricing and more DRM with Amazon

After ditching Apple, NBC opts for flex pricing and more DRM with Amazon: Showing us that it's not all about Hulu, NBC inks a download deal with Amazon just days after the public spat between Apple and NBC. What's Unbox got that Apple doesn't? Flexible pricing and less "flexible" DRM. (Via Ars Technica.)

Why am I not surprised?

Monday, September 3, 2007

Using del.icio.us as a Writing Summarization Tool

Using del.icio.us as a Writing Summarization Tool: Jeremy Zawodny:

It occurs to me that with a sufficient number of people bookmarking an article and selecting a short passage from it, I have a useful way to figure out what statement(s) most resonated with those readers (and possibly a much larger audience). It’s almost like a human powered version of Microsoft Word’s document summarization feature.
(Via Daring Fireball.)

Training material for automatically-trained document summarizers?

Music Subscriptions, DRM, and iPods

Music Subscriptions, DRM, and iPods: Music producer and would-be savior of the record industry Rick Rubin, in yesterday’s New York Times Magazine:

Quoted already in my earlier post on this topic.
[...] But here’s the problem with subscription-based music: you can’t have it without DRM. Because without DRM, what’s to stop someone from subscribing for one month, downloading every song they might ever want, then unsubscribing but keeping the music? And the thing with DRM is that people hate it, because it restricts what they can do and where they can play their music. To argue that subscriptions are the future of music is to argue that DRM is the future of music, and the evidence points to the contrary. (Via Daring Fireball.)

Exactly. It's not only wasteful and impractical as I noted before, it's also anti-user.

Argentina and Chile

Here's a selection of photos from my recent trip, unedited and with minimal commentary. I may be able to put together a more detailed report later.

Execs: Future of music is subscription

Execs: Future of music is subscription: Rick Rubin, founder of Def Jam Recordings and now co-head of Columbia Records, is arguing that the future of music sales lies in a direction beyond the iPod and iTunes. In his conception, people would pay for subscriptions, but with more generous options than available on the likes of Napster.

"You'd pay, say, $19.95 a month, and the music will come anywhere you'd like," he says. "In this new world, there will be a virtual library that will be accessible from your car, from your cellphone, from your computer, from your television. Anywhere. The iPod will be obsolete, but there would be a Walkman-like device you could plug into speakers at home."
(Via MacNN.)

The big jukebox in the sky. Didn't we hear that before from the telecoms sometime back in the late 90s? Mr. Rubin doesn't raise the obvious issues that killed the idea before, nor does the NYT's reporter.

  • The device and communications costs of getting digital music on demand to devices, especially portable devices, over the air are always going to be much higher than the costs of downloading over wired connections and storing in local memory. As anyone who has a digital wireless data plan in the US knows, you pay a lot for mediocre service; even in the most advanced wireless countries, the situation is not much better.
  • The music industry resents Apple, but Apple is a pussycat compared with the telecom oligopoly that it would have to depend on for wireless on demand distribution. Apple may have current market dominance, but it has many serious and deep-pocketed competitors like Microsoft. The telecoms have locked up the spectrum in Washington, and they will continue to extract monopoly rents from it for the indefinite future.

The fact is, the music industry has lost control of the means of distribution, and it will never regain it. Its current form was an accident of physically embodied sound reproduction. The music industry should try to learn from some of the more clued in content producers in news -- the wire services -- who are finding new distribution and revenue means for their content by striking deals with search engines like Yahoo! and Google.

Sunday, September 2, 2007

The Tale of the Mechanical Virus

The Tale of the Mechanical Virus: Sean McBride: Soon afterwards, I noticed that other people’s Macs were refusing to project as well. Person after person would plug their Mac into the projector, but to no avail. What was even stranger was that the affliction only seemed to affect Mac users. PC laptop users laughed at us as they projected with impunity. (Via Daring Fireball.)

Back from South America

Returned yesterday from skiing on six mountains, scoring some memorable descents, taking lots of pictures (to be posted), and carrying a definitely not pleasant respiratory bug now being subdued by modern pharma thanks to a great new medical service, Penn Urgent Care.

One lesson from this trip is that if it feels worse than a typical cold, it could be or become pneumonia, and that the lowered blood oxygen from the condition doesn't mix with high-altitude steep skiing. (Low oxygen is supposed to affect first the basal ganglia circuitry involved in complex motor decision-making.) A big bruise on my left shin proves the point.

Tuesday, August 14, 2007

South America

[Image: Lake Nahuel Huapi]

I'm leaving tomorrow for Buenos Aires on my way to two weeks of skiing in Argentine Patagonia and in the Chilean Andes near Santiago de Chile. My friends down there say that they haven't seen ski conditions this good in a long time. Off to finish packing.

[Image: Ski Arpa]

Monday, August 13, 2007

The Surface/Symbol Divide

The Surface/Symbol Divide: This approach to knowledge discovery is fixed at the surface level of text (and the surface level of the representation language of documents, to be complete). Consequently, the performance of the system highlights both what is good about statistical surface techniques (little training required - which is often the case for systems that work with both document structure, textual data and high precision seed input; works in (m)any language(s); fast) and what is bad (has no real knowledge of language). (Via Data Mining.)

What is "real knowledge of language"? Where does it come from? Why is it unobtainable with statistical techniques? For all we know, a somewhat more sophisticated statistical inference procedure might get rid of some of the errors that Matt highlights (I have some ideas that are too tentative to discuss). More generally, given how quickly our understanding of language acquisition is changing, how can anyone say surely what "real knowledge of language" entails? It's time to retire the essentialism of "colorless green ideas".

Thursday, August 9, 2007

Echoes from the dance of the elephants

Echoes from the dance of the elephants: A few days ago, I learned that I was the author of a chapter in a book whose existence I had previously not suspected, and that as a result, a medium-sized European publishing conglomerate had paid a not-entirely-trivial sum of money to a much larger European publishing conglomerate. This makes me feel, in a small way, like an athlete who learns that he has been traded from one team to another. Except that I don't have to move. [...] This sort of publishing has become a strange ceremonial dance among business conglomerates, the libraries of research universities, and the governments who pay the library costs. It plays almost no role at all in actual scientific and scholarly communication, at least in the fields that I work in. [...] The libraries who buy these publications are mostly, in the end, funded by taxpayers. Certainly in the U.S., the budgets of university research libraries form part of the overhead that universities charge on government research grants (which of course also pay for much if not most of the research whose results are published or reprinted in these volumes). In general, research libraries are wonderful institutions, more than worth what they cost; but the process that we're talking about is driving their costs way up, with little benefit to anyone except the publishing conglomerates. (Via Language Log.)

A couple of months ago, I linked to a critique of academic libraries by Clay Shirky:

Academic libraries, which in earlier days provided a service, have outsourced themselves as bouncers to publishers like Reed-Elsevier; their principal job, in the digital realm, is to prevent interested readers from gaining access to scholarly material.

Adam Corson-Finnerty from the Penn Libraries commented on my post, criticizing that "slam" on academic libraries. The Penn Libraries are outstanding, and they have been very progressive in their development and adoption of appropriate technologies, but all academic libraries have to seriously ask themselves whose interests they are serving when they continue "business as usual" with the rent-seekers in the academic publishing cartel. The example Mark discusses shows another facet of the problem. The only reason Routledge publishes such useless collections is that a few hundred sleepwalking academic librarians are willing to write a big check from very strained acquisitions budgets. If any faculty member asks their library to buy such a wasteful collection, the librarian should push back, awkward as that might be. Libraries need to not only embrace open access, institutional archiving, and self-archiving, but lead by example and persuasion. The Penn Libraries have done more than most in these areas, but we all need to do more to retake control of the diffusion of our intellectual production.

Wednesday, August 8, 2007

Why do online-only OA journals use PDF?

Why do online-only OA journals use PDF?: Andy Powell, Open, online journals != PDF ?  eFoundations, August 6, 2007.  Speaking of the International Journal of Digital Curation (IJDC):


Odd though, for a journal that is only ever (as far as I know) intended to be published online, to offer the articles using PDF rather than HTML.  Doing so prevents any use of lightweight 'semantic' markup within the articles, such as microformats, and tends to make re-use of the content less easy.
(Via Open Access News.)

Doesn't seem so hard to figure out. HTML is awful for mathematics and scientific graphics. Just compare our recent paper in HTML and in PDF, even though the math in the PDF version, produced from PLoS's required Word format, is not as readable as it was in our original LaTeX.

Tuesday, August 7, 2007

Would you rather be a theorist or an experimentalist?

Why it’s OK not to be Sean: There's an old chestnut that theorists are judged by their best paper and observers/experimentalists by their worst. (Via Cosmic Variance.)

Must be an old chestnut for physicists; I hadn't heard it before. I thought for a moment that playing theorist might get me a free lunch, but then I realized that my best paper would be compared with the best papers of all theorists. I think I'll continue to take my chances with the experimentalists...

Monday, August 6, 2007

Evo-devo and computation

Sci Foo recap: If I were to do it all again, I'd offer up an intro to evo-devo, in particular because some of the more gung-ho genomics talks seemed so oblivious to the difficulties of the fancier projects they were saying would be in our future. I really think the organismal-form-from-DNA problem is going to make the protein folding problem look trivial, and this is especially going to be true if the DNA Mafia is going to pretend the developmental biologists don't exist. (Via Pharyngula.)

A computer science point of view makes this point easier to understand. At the genomic level, evo-devo focuses on the evolution of the switches that control gene expression spatially and temporally in development. To a first, discrete approximation, these switches form Boolean combinations of transcription factors (themselves the expressions of genes) that gate the expression of another gene. There are also feedbacks and delays in the system. So we have a pretty powerful computational device, and we know that recreating (learning) such a system from its behavior is extremely hard even in relatively simple cases (finite-state machines).
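Here is a toy discrete version of such a switching network, with an invented three-gene wiring; it only shows how Boolean combinations of transcription factors plus feedback already generate behavior whose update rules are not obvious from the trajectories.

    # Toy Boolean gene-regulation network with feedback; the wiring is invented.

    def step(state):
        a, b, c = state["A"], state["B"], state["C"]
        return {
            "A": b and not c,    # A is activated by B unless repressed by C
            "B": a or c,         # B is switched on by either A or C
            "C": not a,          # C feeds back negatively from A
        }

    state = {"A": True, "B": False, "C": False}
    for t in range(8):
        print(t, state)
        state = step(state)
    # Recovering the update rules above from observed state sequences alone is an
    # instance of the hard problem of learning finite-state machines from behavior.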

We might hope that the system is constrained in ways that make it easier to reconstruct from behavior than the worst-case results suggest. But I see no functional reason why that should be the case. "Easy to reverse engineer" doesn't seem to have an evolutionary advantage, and it may actually be disadvantageous, in that it could facilitate the evolution of parasites and other attackers. (Think of the defensive advantages of encrypted communication).

Monday, July 30, 2007

What are scientific meetings for

Anatomy of a Paper: Part I, Inspiration: I will never understand how people can suggest replacing conferences or seminar visits with talks broadcast over the internet. That’s like trying to improve a restaurant experience by making sure the plates and cutlery are really shiny, and doing away with the food entirely. Conferences aren’t about talks, although those are occasionally interesting. [...] They’re about the ongoing low-level interaction between the participants at meals and coffee breaks. That’s where the ideas get created! Then you can each go home and apply yourself to the nitty-gritty work of turning those ideas into papers. (Via Cosmic Variance.)

Thursday, July 26, 2007

Math courses may help with science (AP)

Math courses may help with science (AP): Students who had more math courses in high school did better in all types of science once they got to college, researchers say. (Via Yahoo! News - Science.)

Next after these messages: people who learn how to swim do better in water sports, study shows.

Wednesday, July 25, 2007

More on Pat Schroeder's comments on the NIH policy

More on Pat Schroeder's comments on the NIH policy: William Walsh, Schroeder follows Dezenhall's script, Issues in Scholarly Communication, July 24, 2007.  Excerpt:

There's a nice story on the NIH proposal this morning in Inside Higher Ed. (See Peter Suber's comments on it.) In it, Pat Schroeder, president of the AAP, seems to be following the script laid out for publishers by pricey consultant Eric Dezenhall.
Schroeder, of the publishers’ association, acknowledged that opinion in higher education has shifted in favor of open access. But she said that was based on a lack of knowledge. “Any time you tell somebody they are going to get something for free, they think ‘yahoo.’ ” The problem, she said, is that “no one understands what publishers do.” If academics realized what publishers did with the money they charge — in terms of running peer review systems — they would fear endangering them.

(Via Open Access News.)

My experiences with the peer-review systems of the open access journals JMLR, BMC Bioinformatics, and PLoS Computational Biology are all much better than those I've had with many closed access journals over the years. The quality of a peer review system comes from the commitment and skill of the scientific editors and from a well-chosen workflow system, not from paper pushers at headquarters, who in some cases serve mainly to slow down the process.

Sunday, July 22, 2007

TSA follies

I flew from SFO to PHL yesterday on United, checking one bag. At the bag claim in PHL, at first I thought my bag hadn't made it. But it had, except that it had on it an unexpected bright green TSA-approved lock, which I could not open. After talking with the pleasant United baggage man, we figured out what might have happened: the TSA bag screeners took the lock from another bag using their master key and put it on my (possibly similar) bag by mistake. Unfortunately, it was after 11 pm and the TSA office in PHL was closed. Today I called them, but ended up in voice-mail hell. One quick trip to the local Home Depot for bolt cutters later, the silly lock has gone to join the choir invisible of vacuous security precautions.

Given how trivial it was to remove the lock with bolt cutters, why are people wasting their money on these?

Saturday, July 21, 2007

EU Google Competitor Project Gets Aid Worth $166 Million

EU Google Competitor Project Gets Aid Worth $166 Million: [...] Dow Jones reports: "The aim is to develop new search technologies for the next generation Internet, including 'semantic technologies which try to recognize the meaning of content and place it in its proper context.' The semantic Web has been considered the next evolution of the Internet at least since Tim Berners-Lee, widely considered a creator of the current version of the Internet, published an article describing it in 2001. In theory, a semantic Web could receive a user request for information about fishing, for example, and automatically narrow the results according to the user's individual needs rather than blanket the user with pages related to numerous aspects of fishing. The Commission's funding approval Thursday immediately sparked talk of building a potential European challenger to Web search leader Google Inc." (Via Slashdot.)

I fear that the EU is suffering from magical thinking of the kind identified by Drew McDermott in his classic Artificial Intelligence and Natural Stupidity (not available online), which should be required reading for everyone studying or investing in this area. Calling some technology "semantic" doesn't make it so. All search engines try to "recognize the meaning of content and place it in its proper context." It's just that doing so accurately and efficiently in general is extremely hard. Significant progress depends on unpredictable research advances, not on predictable development efforts. Putting around 1,000 person-years into a focused project like this creates false expectations and actually hurts basic research in the field.

Competition in search is good. The major search engines have substantial research efforts, as could be seen for instance from their publications at the recent natural-language processing conferences in Prague, and there are several startups exploring new approaches in the field. More research in this area is good. But the EU should have learned from the limited success of big initiatives, from EUROTRA to the framework programs, that major advances cannot be willed by bureaucratic fiat.

The seeds of current search technology were not in major coordinated development efforts, but in academic research at schools like Berkeley, CMU, Cornell, and Stanford, and in unpredictable benefits from industrial research at Bell Labs, IBM, and PARC in areas like machine learning and information retrieval. None of this work came from a big grand plan, but rather from the initiative of researchers and research managers in exploiting the resources available to them (and it could be argued that the current funding climate in the US, which puts greater emphasis on top-down initiatives and applicability than before, may well reduce the creativity of the research system here). The most important effect of these efforts was not in technologies, but in creating opportunities for creative people (students, faculty, researchers) to play with new ideas and recognize their potential. Without institutional reform in Europe to open up comparable opportunities through increased flexibility in education, research, and funding, much of this $166 million will end up as institutional welfare payments to hidebound universities and corporations, as has been the case for much of the previous EU investments in research and development.

Monday, July 9, 2007

RIM's CEO sees iPhone as "dangerous"

RIM's CEO sees iPhone as "dangerous": Research in Motion head Jim Balsillie believes the iPhone could be potentially toxic for the cellphone industry, according to a recent interview. The head of the BlackBerry firm points out that while AT&T has obtained a multi-year contract for the device, the terms leave the carrier out of much of the sales process and give it little influence over the customization of the phone's hardware or software. [...] "It's a dangerous strategy. It's a tremendous amount of control," he says. "And the more control of the platform that goes out of the carrier, the more they shift into a commodity pipe." (Via MacNN.)

I'm scared. Imagine, carriers having to focus on sending packets to their destination instead of forcing on us awfully designed, crippled, restrictive applications and services. Carriers providing an arena for unfettered innovation rather than keeping us frozen in 1960s communication and software models. That can't be allowed. After all, what would that do to proprietary platforms like RIM's?

Friday, July 6, 2007

Automated versus Human Judgments

Automated versus Human Judgments: A couple of posts provoke an interesting discussion: William Cohen points to the issue of the popularity contest approach to ranking which may have undesirable consequences [...] As for the issue of automated ranking of web pages. The problem cited above exposes the frailty of addressing a content problem (finding a document whose text is appropriate) via an orthogonal structural solution. The structural solution (counting links and propagating results) may do well in some domains where it is regarded as a proxy for measurements of 'authority', however, the ambiguity in the structure cannot be determined, leading to the type of problem William cites. This is where solutions like Powerset come in. (Via Data Mining.)

Thanks for the link to William's blog, which I didn't know about. Regarding this particular search ranking issue, where related lexical items have very different contexts of usage and associated sentiments — negative vs. neutral or positive — I'm curious what NLP methods, embodied in Powerset's system or in any other system, or even in early research prototypes, Matt would recommend as a solution. The problem is not one of syntax, semantics, or even pragmatics and local discourse, as can easily be seen from several controversies in this country where a word is considered derogatory when used by some people but friendly or even complimentary when used by others; people can and will get into hot water when they breach those invisible but very real boundaries. There's a lot more in the context and charge of writing than any of our current automated methods can discern, whether they use global statistics or local structure. It's not a matter of ambiguity — the denotation of the terms is not in question — but one of association and rhetorical force: what ideas and feelings are triggered in the minds of different readers and writers by particular terms as a result of their social and cultural backgrounds and their (lack of) sensitivity.

The original post by Lauren Weinstein that triggered this thread was about the visible global impact of search rankings, but William's discussion suggests a less global but possibly more powerful effect in search personalization: a personalization algorithm could become a strong reinforcer of prejudice without the counter-pressure of critical discussion that globally visible search rankings receive.

Thursday, July 5, 2007

david pogue, the musical

david pogue, the musical: offered without comment


(Via tingilinde.)

I like the at&t bit...

Elsevier invites Google and Google Scholar to index its journals

Elsevier invites Google and Google Scholar to index its journals: Peter Brantley, Science Direct-ly into Google, O'Reilly Radar, July 3, 2007.  Excerpt: ScienceDirect (SD) is a compendium of scientific, technical, and medical (STM) literature from Reed Elsevier [...] Ale de Vries, the SD product manager, informs me in an email: “About Google/Google Scholar: we're making good progress. As you may be aware, we did a pilot with some journals on SD first, and now we are working to get them all indexed. We're making good progress there - it's a lot of content to be crawled, but going along nicely. Both Google Scholar and main Google are gradually covering more and more of our journals. ” (Via Open Access News.)

Several other closed-access journals are already indexed by Google Scholar. For an academic user like me, it is very convenient to have unified search across all scholarly sources, open or closed, rather than having to search my institutional e-resource index and the open-access literature separately. It is plausible that accesses to closed resources have been dropping as a fraction of all accesses as open resources become more available. Certainly, I am much more likely to search on Google Scholar than on SD even though Penn gives me access to it. For Reed Elsevier, Google indexing will bring more traffic into SD from users like me, which will help them justify their high subscription prices to budget-pressed academic libraries.

Wednesday, July 4, 2007

Old Lisbon tour

From Olhares Sobre Lisboa (Via aguarelas de Turner.)

My sister blogs at aguarelas de Turner and she just linked to a beautiful photo tour along the route of tram 28 in Lisbon. Memory sleeps for a long time, and then the right images wake it up with a start.

Monday, July 2, 2007

Paper comments

I've posted a revised list of ACL and EMNLP-CoNLL highlights with brief comments on each paper.

Sunday, July 1, 2007

Prague

Spent the last week in Prague for ACL and EMNLP-CoNLL (paper highlights). I didn't take my camera (last-minute packing issues). Spent most of the time listening to talks, talking to colleagues, and fine-tuning talks, with just one Sunday to look around and several dinners in town. Impressions: large, beautiful historic city center, the best preserved I know in Europe, perfect for walking around; ugly concrete block periphery; excellent public transportation; mediocre to Fawlty-Toweresque service in hotels and restaurants.

Thursday, June 21, 2007

More on data catalysis

In an earlier entry on Patrick Pantel's data catalysis proposal, I wrote

Our problem is not the lack of particle accelerators, but the lack of the organizational and funding processes associated with particle accelerators.

Mark Liberman disagrees:

However, I continue to believe that Patrick is addressing an important set of issues. As both Patrick and Fernando observe, the hardware that we need is not prohibitively expensive. But there remain significant problems with data access and with infrastructure design.

On the infrastructure side, let's suppose that we've got $X to spend on some combination of compute servers and file servers. What should we do? Should we buy X/5000 $5K machines, or X/2000 $2K machines, or what? Should the disks be local or shared? How much memory does each machine need? What's the right way to connect them up? Should we dedicate a cluster to Hadoop and map/reduce, and set aside some other machines for problems that don't factor appropriately? Or should we plan to use a single cluster in multiple ways? What's really required in the way of on-going software and hardware support for such a system?

These are issues that a mildly competent team with varied expertise can figure out if they have the resources to start with. The real problem is how to assemble and maintain such a team and infrastructure in an academic environment in which funding is unpredictable. I generally disagree with Field of Dreams "if you build it, they will come" projects. While we can all agree that some combination of compute and storage servers and distributed computing software can be very useful for large-scale machine learning and natural-language processing experiments, the best way to configure them depends on the specific projects that are attempted. Generic infrastructure hardware, software, and management efforts are not really good at anything in particular, and they have a way of sucking resources to perpetuate themselves independently of the science they claim to serve.

I'd rather see our field push for funding opportunities that pay attention in a balanced way to both the science and the needed infrastructure.

Tuesday, June 19, 2007

Val Thorens Tarps Glacier

Val Thorens Tarps Glacier: Local ski instructor Philippe Martin summed up the problem “This map, it is 20 years old, you see all these areas marked as glaciers, come back this summer and you will see they are gone”. Val Thorens may be the highest ski resort in Europe but even its lofty summits and twinkling glaciers are suffering from the changing climate. The resort is fighting back. From the spring of 2008 the Savoie resort, on the edge of the Vanoise National Park, plans to cover part of its glacier with a giant tarpaulin. (Via www.PisteHors.com.)

For anyone who doesn't care about skiing, what about drinking water?


Monday, June 18, 2007

Data Catalysis

Data Catalysis: I'm back in Philadelphia, after a quick jaunt to Kyoto for ISUC2007. One of the most interesting presentations there was Patrick Pantel's "Data Catalysis: Facilitating Large-Scale Natural Language Data Processing" [...] Patrick's idea, as I understand it, is not to create yet another supercomputer center. Instead, his goal is a model that other researchers can replicate, seen as a part of "large data experiments in many computer science disciplines ... requiring a consortium of projects for studying the functional architecture, building low-level infrastructure and middleware, engaging the research community to participate in the initiative, and building up a library of open-source large data processing algorithms". (Via Language Log.)

I've just read Patrick's proposal. As Mark noted in his blog, I've made some early comments about the instrumental limitations of current academic natural language processing and machine learning research.

However, I'm worried about grid-anything. In other fields, expensive grid efforts have been more successful at creating complex computational plumbing and bureaucracies than at delivering new science.

Most of the computational resources of search engines are needed for service delivery, not for data analysis. As far as I can see, a few dozen up-to-date compute and storage servers, costing much less than $1M, would be enough for the largest-scale projects that anyone in academia would want to do. With the publication of MapReduce, the development of Hadoop (both discussed by Patrick), and free cluster management software like Grid Engine, we have all the main software pieces to start working now.
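For concreteness, here is a minimal sketch (mine, not part of Patrick's proposal) of the kind of job this software stack handles: a streaming word count over a large text collection, with a mapper and a reducer that read stdin and write stdout. The corpus name and the pipeline below are placeholders; the same two functions can be handed to Hadoop Streaming on a cluster.

#!/usr/bin/env python
# Local dry run of a streaming word count:
#   cat corpus.txt | python wordcount.py map | sort | python wordcount.py reduce
import sys

def mapper():
    # emit (token, 1) pairs, one per line, tab-separated
    for line in sys.stdin:
        for token in line.split():
            print("%s\t1" % token.lower())

def reducer():
    # input arrives sorted by token, so counts can be accumulated in one pass
    current, count = None, 0
    for line in sys.stdin:
        token, n = line.rstrip("\n").split("\t")
        if token != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = token, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()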

So, I don't think the main bottleneck is (for the moment) computation and storage. The bottlenecks are elsewhere:

  • Good problems: A good problem is one where validation data is plentiful as a byproduct of natural language use. MT is the obvious example: translation happens whether there is MT or not, so parallel texts are plentiful. Sentiment classification is another: people cannot resist labeling movies, consumer gadgets, and the like with thumbs-up or thumbs-down (see the sketch after this list). However, much of what gets done and published in NLP today — for good but short-sighted reasons — exploits laboriously created training and test sets, with many deleterious results, including implicit hill-climbing on the task, and results that just show the ability of a machine-learning algorithm to reconstruct the annotators' theoretical prejudices.
  • Project scale: Our main limitation is in models and algorithms for complex tasks, not in platforms. Typical NSF grant budgets cannot support the concurrent exploration and comparison of several alternatives for the various pieces of a complex NLP system. MT, which Patrick knows well from his ISI colleagues, is a good example: the few successful players have major industrial or DoD funding; the rest are left behind not for lack of software platforms, but for lack of people. Beyond the research itself, the programming effort needed goes past what typical academic funding can support, as my friend Andrew McCallum has found out in his Rexa project.
  • Funding criteria: There is a striking difference between how good venture capitalists decide to invest and how NSF or NIH decide to invest. A good venture capitalist looks first for the right team, then for the right idea. A good team can redirect work quickly when the original idea hits a roadblock, but an idea can kill a venture if the team is too fond of it. In contrast, recent practices in research proposal review focus on the idea, not the team. I understand the worry that focusing on the team can foster "old boys network" effects or worse; and of course, bad ideas cannot be rescued by any team. But the best ideas are often too novel to be accepted fully by reviewers who have their own strong views of what are the most promising directions. It is not surprising that the glory days of innovation in important areas of computer science (such as networking, VLSI design, operating systems) coincided with when funding agencies gave block grants to groups with a track record and let them figure out for themselves how to best organize their research.
  • Who buys the hardware: Research grants are too small to support the necessary computers and storage, while infrastructure grants need a very broad justification, which requires a large community of intended users. My experience with infrastructure grants is that the equipment bought on those terms ends up suboptimal or even obsolete by the time the intended research gets funded, while the research won't get funded if it is predicated on equipment that is not yet available.
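Here is the minimal sketch promised above for the sentiment example: the ratings that people attach to reviews anyway serve as free training labels for a simple Naive Bayes classifier, with no annotators in the loop. The reviews below are invented; a real experiment would harvest thousands of them.

from collections import defaultdict
import math

# (text, star rating) pairs gathered from the web; ratings >= 4 count as positive
reviews = [
    ("a gem , funny and moving", 5),
    ("crisp screen , great battery life", 4),
    ("dull , predictable and way too long", 1),
    ("flimsy build , died after a week", 2),
]

counts = {"pos": defaultdict(int), "neg": defaultdict(int)}
totals = {"pos": 0, "neg": 0}
for text, stars in reviews:
    label = "pos" if stars >= 4 else "neg"   # the rating is the label
    for w in text.split():
        counts[label][w] += 1
        totals[label] += 1

vocab = set(counts["pos"]) | set(counts["neg"])

def log_score(text, label):
    # Laplace-smoothed unigram log-likelihood under the given class
    # (class priors omitted since the two classes are balanced here)
    return sum(math.log((counts[label][w] + 1.0) / (totals[label] + len(vocab)))
               for w in text.split())

test = "funny but way too long"
print("pos" if log_score(test, "pos") > log_score(test, "neg") else "neg")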

Data catalysis is a nice name for a very important approach to NLP research, but I've learned a lot about the current limits of academic research since I made the comment that Mark mentioned. Our problem is not the lack of particle accelerators, but the lack of the organizational and funding processes associated with particle accelerators.

Saturday, June 16, 2007

how about an order of haggis with that?

how about an order of haggis with that?: yummmm (Via tingilinde.)

I enjoyed five years in Edinburgh without tasting one of those, but the fish and chips was plenty greasy, not to mention the leaden roly-poly pudding at the student cafeteria.

On the other hand, kippers or rhubarb crumble...

Research blog

I'll be blogging at Structured Learning about the research papers I am reading and writing.

Do we need a Repositories Plan B?

Do we need a Repositories Plan B?: Andy Powell, Repository Plan B? eFoundations, June 15, 2007.  Excerpt:

"The most successful people are those who are good at Plan B." -- James Yorke, mathematician
[...] Imagine a world in which we talked about 'research blogs' or 'research feeds' rather than 'repositories', in which the 'open access' policy rhetoric used phrases like 'resource outputs should be made freely available on the Web' rather than 'research outputs should be deposited into institutional or other repositories', and in which accepted 'good practice' for researchers was simply to make research output freely available on the Web with an associated RSS or Atom feed.

Wouldn't that be a more intuitive and productive scholarly communication environment than what we have currently? ...

Since [arXiv], we have largely attempted to position repositories as institutional services, for institutional reasons, in the belief that metadata harvesting will allow us to aggregate stuff back together in meaningful ways.

Is it working?  I'm not convinced.  Yes, we can acknowledge our failure to put services in place that people find intuitively compelling to use by trying to force their use thru institutional or national mandates?  But wouldn't it be nicer to build services that people actually came to willingly?

In short, do we need a repositories plan B?

(Via Open Access News.)

arXiv has RSS feeds, which I rely on. Research blogs like Machine Learning (Theory) recommend interesting papers from time to time. But what I would like is feeds for the new papers and readings of researchers whose work I want to follow. Institutional repositories aggregate material in the wrong way for this, and authors lack convenient tools to generate feeds automatically as they post new papers.

I think I'll start a blog that just lists the papers I have found interesting or have recently written.
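A sketch of the missing tool: a few lines that turn a hand-maintained list of papers into an RSS feed anyone could subscribe to. The titles and URLs are placeholders, and a real version would escape XML and add publication dates.

# write a minimal RSS 2.0 feed for a plain list of papers
papers = [
    ("A paper I found interesting this week", "http://example.org/paper1.pdf"),
    ("A draft I just posted", "http://example.org/draft.pdf"),
]

items = "\n".join(
    "  <item><title>%s</title><link>%s</link></item>" % (title, url)
    for title, url in papers
)

feed = """<?xml version="1.0"?>
<rss version="2.0">
<channel>
  <title>Papers I am reading and writing</title>
  <link>http://example.org/</link>
%s
</channel>
</rss>""" % items

open("papers.xml", "w").write(feed)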

Friday, June 15, 2007

Testing MarsEdit

I edited The High Sierra from 33,000 feet to upload and insert a picture of Mount Dana. Worked perfectly. MarsEdit 1.2 rocks!

Thursday, June 14, 2007

Cost-free copies

Clay Shirky demolishes the prejudices of print scarcity:

In a world where copies have become cost-free, people who expend their resources to prevent access or sharing are forgoing the principal advantages of the new tools, and this dilemma is common to every institution modeled on the scarcity and fragility of physical copies. Academic libraries, which in earlier days provided a service, have outsourced themselves as bouncers to publishers like Reed-Elsevier; their principal job, in the digital realm, is to prevent interested readers from gaining access to scholarly material.

Back to MarsEdit

MarsEdit 1.2 works very well with the new Blogger API. Unlike ecto, which I had been using while MarsEdit was being updated, it supports Blogger labels and image upload. And the editing interface is much nicer than ecto's. Thank you, Daniel!

Police net

I don't believe in the death penalty... for people: If there were a death penalty for corporations, AT&T may have just earned it. [...] Imagine, they have designs of selling access to movies and stuff over the Internet, so they decide to join with the MPAA and the RIAA to spy on and prosecute their customers. (Via Scripting News).

In a police state (as I know from personal experience), your every move, word, writing, communication, relationship is open to surveillance and potentially suspect.

What AT&T is considering is the net equivalent of a police state. Even if we accepted the overbearingly expansive view of copyright that the RIAA and MPAA are trying to impose on us, no automated method can distinguish reliably between deliberate infringement and guilt by association. How many innocent people would be caught in the dragnet and treated as criminals because their computers are zombies?

The excellent arguments in this paper apply directly to network-embedded infringement detection and enforcement.

The High Sierra from 33,000 feet

[photo: Mount Dana and the High Sierra from the plane]

Flying again from SFO to PHL, the Sierra from 33,000 feet has lost most of its snow. Mount Dana's two main couloirs are still full of white, but granite is now the dominant note.

No mountains in my travel plans until August 18's arrival in Bariloche. Work can obscure their absence for days, but they always return. Looking west of Mono Lake for Dana (picture: in midwinter from the plateau) and Tioga Pass is an irresistible draw, not a simple pleasure. Each memorable turn costs a blur of breaths and steps on an icy slope.