Thursday, June 21, 2007

More on data catalysis

In an earlier entry on Patrick Pantel's data catalysis proposal, I wrote

Our problem is not the lack of particle accelerators, but the lack of the organizational and funding processes associated with particle accelerators.

Mark Liberman disagrees:

However, I continue to believe that Patrick is addressing an important set of issues. As both Patrick and Fernando observe, the hardware that we need is not prohibitively expensive. But there remain significant problems with data access and with infrastructure design.

On the infrastructure side, let's suppose that we've got $X to spend on some combination of compute servers and file servers. What should we do? Should we buy X/5000 $5K machines, or X/2000 $2K machines, or what? Should the disks be local or shared? How much memory does each machine need? What's the right way to connect them up? Should we dedicate a cluster to Hadoop and map/reduce, and set aside some other machines for problems that don't factor appropriately? Or should we plan to use a single cluster in multiple ways? What's really required in the way of on-going software and hardware support for such a system?
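To make the arithmetic in the first of those questions concrete, here is a minimal sketch. The budget figure is purely an illustrative assumption (no value of X is given in the discussion), and `machines_for_budget` is a hypothetical helper; only the two unit prices come from the quoted passage. It ignores networking, storage, and support costs, which are exactly the open questions listed above.

```python
# Illustrative only: the budget is an assumed value for X, not a figure
# from the discussion. The unit prices are the two mentioned in the quote.

def machines_for_budget(budget_usd: int, unit_price_usd: int) -> int:
    """Return how many whole machines a budget buys at a given unit price."""
    return budget_usd // unit_price_usd

budget = 500_000  # assumed budget X, in dollars (hypothetical)

# Map each quoted unit price to the number of machines it would buy.
options = {price: machines_for_budget(budget, price) for price in (5_000, 2_000)}

for price, count in options.items():
    print(f"${price:,} machines: {count}")
```

The machine count is the easy part; which point on that trade-off is right still depends on the workload.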

These are issues that a mildly competent team with varied expertise can figure out if they have the resources to start with. The real problem is how to assemble and maintain such a team and infrastructure in an academic environment in which funding is unpredictable. I generally disagree with Field of Dreams "if you build it, they will come" projects. While we can all agree that some combination of compute and storage servers and distributed computing software can be very useful for large-scale machine learning and natural-language processing experiments, the best way to configure them depends on the specific projects that are attempted. Generic infrastructure hardware, software, and management efforts are not really good at anything in particular, and they have a way of sucking resources to perpetuate themselves independently of the science they claim to serve.

I'd rather see our field push for funding opportunities that pay attention in a balanced way to both the science and the needed infrastructure.

Tuesday, June 19, 2007

Val Thorens Tarps Glacier

Val Thorens Tarps Glacier: Local ski instructor Philippe Martin summed up the problem “This map, it is 20 years old, you see all these areas marked as glaciers, come back this summer and you will see they are gone”. Val Thorens may be the highest ski resort in Europe but even its lofty summits and twinkling glaciers are suffering from the changing climate. The resort is fighting back. From the spring of 2008 the Savoie resort, on the edge of the Vanoise National Park, plans to cover part of its glacier with a giant tarpaulin. (Via

For anyone who doesn't care about skiing, what about drinking water?

Monday, June 18, 2007

Data Catalysis

Data Catalysis: I'm back in Philadelphia, after a quick jaunt to Kyoto for ISUC2007. One of the most interesting presentations there was Patrick Pantel's "Data Catalysis: Facilitating Large-Scale Natural Language Data Processing" [...] Patrick's idea, as I understand it, is not to create yet another supercomputer center. Instead, his goal is a model that other researchers can replicate, seen as a part of "large data experiments in many computer science disciplines ... requiring a consortium of projects for studying the functional architecture, building low-level infrastructure and middleware, engaging the research community to participate in the initiative, and building up a library of open-source large data processing algorithms". (Via Language Log.)

I've just read Patrick's proposal. As Mark noted in his blog, I've made some early comments about the instrumental limitations of current academic natural language processing and machine learning research.

However, I'm worried about grid-anything. In other fields, expensive grid efforts have been more successful at creating complex computational plumbing and bureaucracies than at delivering new science.

Most of the computational resources of search engines are needed for service delivery, not for data analysis. As far as I can see, a few dozen up-to-date compute and storage servers, costing much less than $1M, would be enough for the largest-scale projects that anyone in academia would want to do. With the publication of MapReduce, the development of Hadoop (both discussed by Patrick), and free cluster management software like Grid Engine, we have all the main software pieces to start working now.

So, I don't think the main bottleneck is (for the moment) computation and storage. The bottlenecks are elsewhere:

  • Good problems: A good problem is one where validation data is plentiful as a result of natural language use. MT is the obvious example. Translation happens whether there is MT or not, so parallel texts are plentiful. Sentiment classification is another example. People cannot resist labeling movies, consumer gadgets, with thumbs-up or thumbs-down. However, much of what gets done and published in NLP today — for good but short-sighted reasons — exploits laboriously created training and test sets, with many deleterious results, including implicit hill-climbing on the task, and results that just show the ability of a machine-learning algorithm to reconstruct the annotators' theoretical prejudices.
  • Project scale: Our main limitations are models and algorithms for complex tasks, not platforms. Typical NSF grant budgets cannot support the concurrent exploration and comparison of several alternatives for the various pieces of a complex NLP system. MT, which Patrick knows well from his ISI colleagues, is a good example: the few successful players have major industrial or DoD funding, the rest are left behind not for a lack of software platforms, but for a lack of people resources. In addition to research, the programming effort needed goes beyond what is supportable with typical academic funding, as my friend Andrew McCallum has found out in his Rexa project.
  • Funding criteria: There is a striking difference between how good venture capitalists decide to invest and how NSF or NIH decide to invest. A good venture capitalist looks first for the right team, then for the right idea. A good team can redirect work quickly when the original idea hits a roadblock, but an idea can kill a venture if the team is too fond of it. In contrast, recent practices in research proposal review focus on the idea, not the team. I understand the worry that focusing on the team can foster "old boys network" effects or worse; and of course, bad ideas cannot be rescued by any team. But the best ideas are often too novel to be accepted fully by reviewers who have their own strong views of what are the most promising directions. It is not surprising that the glory days of innovation in important areas of computer science (such as networking, VLSI design, operating systems) coincided with the period when funding agencies gave block grants to groups with a track record and let them figure out for themselves how to best organize their research.
  • Who buys the hardware: Research grants are too small to support the necessary computers and storage, while infrastructure grants need a very broad justification, with a large community of intended users. My experience with infrastructure grants is that the equipment bought on those terms ends up being suboptimal or even obsolete by the time that the intended research gets funded; while the research won't get funded if it is predicated on equipment that is not yet available.

Data catalysis is a nice name for a very important approach to NLP research, but I've learned a lot about the current limits of academic research since I made the comment that Mark mentioned. Our problem is not the lack of particle accelerators, but the lack of the organizational and funding processes associated with particle accelerators.

Saturday, June 16, 2007

how about an order of haggis with that?

how about an order of haggis with that?: yummmm (Via tingilinde.)

I enjoyed five years in Edinburgh without tasting one of those, but the fish and chips was plenty greasy, not to mention the leaden roly-poly pudding at the student cafeteria.

On the other hand, kippers or rhubarb crumble...

Research blog

I'll be blogging at Structured Learning what research papers I am reading and writing.

Do we need a Repositories Plan B?

Do we need a Repositories Plan B?: Andy Powell, Repository Plan B? eFoundations, June 15, 2007.  Excerpt:

"The most successful people are those who are good at Plan B." -- James Yorke, mathematician
[...] Imagine a world in which we talked about 'research blogs' or 'research feeds' rather than 'repositories', in which the 'open access' policy rhetoric used phrases like 'research outputs should be made freely available on the Web' rather than 'research outputs should be deposited into institutional or other repositories', and in which accepted 'good practice' for researchers was simply to make research output freely available on the Web with an associated RSS or Atom feed.

Wouldn't that be a more intuitive and productive scholarly communication environment than what we have currently? ...

Since [arXiv], we have largely attempted to position repositories as institutional services, for institutional reasons, in the belief that metadata harvesting will allow us to aggregate stuff back together in meaningful ways.

Is it working?  I'm not convinced.  Yes, we can acknowledge our failure to put services in place that people find intuitively compelling to use by trying to force their use thru institutional or national mandates?  But wouldn't it be nicer to build services that people actually came to willingly?

In short, do we need a repositories plan B?

(Via Open Access News.)

arXiv has RSS feeds, which I rely on. Research blogs like Machine Learning (Theory) recommend interesting papers from time to time. But what I would like is to have feeds for the new papers and readings of researchers whose work I want to follow. Institutional repositories aggregate material in the wrong way for this, and authors lack convenient tools to generate feeds automatically as they post new papers.

I think I'll start a blog that just lists the papers I have found interesting or have recently written.

Friday, June 15, 2007

Testing MarsEdit

I edited The High Sierra from 33,000 feet to upload and insert a picture of Mount Dana. Worked perfectly. MarsEdit 1.2 rocks!

Thursday, June 14, 2007

Cost-free copies

Clay Shirky demolishes the prejudices of print scarcity:

In a world where copies have become cost-free, people who expend their resources to prevent access or sharing are forgoing the principal advantages of the new tools, and this dilemma is common to every institution modeled on the scarcity and fragility of physical copies. Academic libraries, which in earlier days provided a service, have outsourced themselves as bouncers to publishers like Reed-Elsevier; their principal job, in the digital realm, is to prevent interested readers from gaining access to scholarly material.

Back to MarsEdit

MarsEdit 1.2 works very well with the new Blogger API. Unlike ecto, which I had been using while MarsEdit was being updated, it supports Blogger labels and image upload. And the editing interface is much nicer than ecto's. Thank you, Daniel!

Police net

I don't believe in the death penalty... for people: If there were a death penalty for corporations, AT&T may have just earned it. [...] Imagine, they have designs of selling access to movies and stuff over the Internet, so they decide to join with the MPAA and the RIAA to spy on and prosecute their customers. (Via Scripting News).

In a police state (as I know from personal experience), your every move, word, writing, communication, relationship is open to surveillance and potentially suspect.

What AT&T is considering is the net equivalent of a police state. Even if we accepted the overbearingly expansive view of copyright that the RIAA and MPAA are trying to impose on us, no automated method can distinguish reliably between deliberate infringement and guilt by association. How many innocent people would be caught in the dragnet and treated as criminals because their computers are zombies?

The excellent arguments in this paper apply directly to network-embedded infringement detection and enforcement.

The High Sierra from 33,000 feet


Flying again from SFO to PHL, the Sierra from 33,000 feet has lost most of its snow. Mount Dana's two main couloirs are still full of white, but granite is now the dominant note.

No mountains in my travel plans until August 18's arrival in Bariloche. Work can obscure their absence for days, but they always return. Looking West of Mono Lake for Dana (picture: in midwinter from the plateau) and Tioga Pass is an irresistible draw, not a simple pleasure. Each memorable turn costing a blur of breaths and steps on an icy slope.

Monday, June 4, 2007

Study: Music, tech search terms riskiest (AP)

Study: Music, tech search terms riskiest (AP): Search terms related to music and technology are most likely to return sites with spyware and other malicious code, a new study finds. The most interesting passage of the story comes near the end:

Nonetheless, McAfee found it slightly safer to use search engines overall. Although about 4 percent of search results lead to sites deemed risky, that's down from 5 percent a year ago. [...] Risks are greater when clicking on keyword ads that make up much of search companies' revenues: According to McAfee, 7 percent of such links produce risky sites, down from 8.5 percent a year ago.

I bet there was a lot of tweaking in achieving these improvements, showing again that search is not some Leibnizian calculus ratiocinator.

Sunday, June 3, 2007

In DRM we trust: world collection societies wring hands over P2P copying

In DRM we trust: world collection societies wring hands over P2P copying: The world's collection societies gathered in Brussels this week to discuss how artists could get paid in a digital world. The result: compulsory licensing is (still) out, DRM is (still) in.

The combination of greed, technical ignorance, and anger about losing a cozy life will keep the music industries on the path to destruction, even when the life raft of compulsory licensing is at hand. Eric Batiste, who runs the umbrella licensing organization CISAC:

I agree, it's very difficult to compete with free... but we need more compelling offerings as well as better enforcement. The killer app is not there yet.

The "killer app" is the golden-egg goose of this industry, and just as real. As for better enforcement, it would be interesting to know what fraction of scarce law enforcement resources these guys think they are entitled to in a world where real mayhem is all too frequent.

Batiste also compared casual infringement to speeding - we all want to do it, but we know we'll get caught. This is a good analogy. To judge from the proportion of speeders on any freeway I've been on in the last ten years, it's obvious that the perceived odds of getting caught are in the speeder's favor.

Tweaking Is Innovating

Search Journalism: Tweaking Is Not Innovating: It sounds like the proverbial person pulling the blanket down to cover his feet, only to get a cold head - then pulling the blanket up over his head only to get cold feet.

Actually, it sounds more like biological evolution. Like our big heads with big brains make us better at solving problems, but make childbirth more dangerous. Or a mutation that confers resistance to malaria is implicated in sickle cell anemia. In a complex adversarial environment there are no magic bullets. Every potentially beneficial change has costs as well. I think Hansell did a pretty good job of capturing the "struggle for existence" of a search engine.

Saturday, June 2, 2007

Open Access CL Proposal

Open Access CL Proposal: Following up on the Whence JCLR discussion, Stuart Shieber, Fernando Pereira, Ryan McDonald, Kevin Duh and I have just submitted a proposal for an open access version of CL to the ACL exec committee, hopefully to be discussed in Prague. In the spirit of open access, you can read the official proposal as well as see discussion that led up to it on our wiki. Feel free to email me with comments/suggestions, post them here, or bring them with you to Prague!

If you agree with our proposal, please express your support on our blogs and, even better, at the ACL business meeting in Prague.

Friday, June 1, 2007

Anger over DRM-free iTunes tracks

Anger over DRM-free iTunes tracks: Apple is facing questions over user data placed in DRM-free tracks sold through iTunes.

What were they thinking? If a competitor had wanted to sabotage the positive news of DRM-free tracks, they could not have done better than this. The only plausible theory floating around is that this would allow Apple to partially detect casual exchange of music files, since determined redistributors could easily tamper with the embedded information. But it is hard to see how whatever benefits Apple would derive from this would be worth the bad publicity.

Alternatively, this could be an EMI imposition, not impossible given the habitual cluelessness of the majors. But if so Apple should have resisted, if nothing else to preserve their good name.

Why should we worry if we are not set on copyright infringement? There are all sorts of innocent ways for tracks to get to someone, for example when someone (family member, friend) gives them a used computer. An insecure tagging method is open to all sorts of mischief. And the natural expectation is that I should be able to give some DRM-free tracks as I can give a CD. Sure, that might not be what current draconian interpretations of copyright allow, but it is a natural expectation for non-technical people.

Personally I don't need to worry, I'm still buying CDs.