Monday, June 18, 2007

Data Catalysis

Data Catalysis: I'm back in Philadelphia, after a quick jaunt to Kyoto for ISUC2007. One of the most interesting presentations there was Patrick Pantel's "Data Catalysis: Facilitating Large-Scale Natural Language Data Processing" [...] Patrick's idea, as I understand it, is not to create yet another supercomputer center. Instead, his goal is a model that other researchers can replicate, seen as a part of "large data experiments in many computer science disciplines ... requiring a consortium of projects for studying the functional architecture, building low-level infrastructure and middleware, engaging the research community to participate in the initiative, and building up a library of open-source large data processing algorithms". (Via Language Log.)

I've just read Patrick's proposal. As Mark noted in his blog, I've made some early comments about the instrumental limitations of current academic natural language processing and machine learning research.

However, I'm worried about grid-anything. In other fields, expensive grid efforts have been more successful at creating complex computational plumbing and bureaucracies than at delivering new science.

Most of the computational resources of search engines are needed for service delivery, not for data analysis. As far as I can see, a few dozen up-to-date compute and storage servers, costing much less than $1M, would be enough for the largest-scale projects that anyone in academia would want to do. With the publication of MapReduce, the development of Hadoop (both discussed by Patrick), and free cluster management software like Grid Engine, we have all the main software pieces to start working now.

So, I don't think the main bottleneck is (for the moment) computation and storage. The bottlenecks are elsewhere:

  • Good problems: A good problem is one where validation data is plentiful as a result of natural language use. MT is the obvious example. Translation happens whether there is MT or not, so parallel texts are plentiful. Sentiment classification is another example. People cannot resist labeling movies and consumer gadgets with thumbs-up or thumbs-down. However, much of what gets done and published in NLP today (for good but short-sighted reasons) exploits laboriously created training and test sets, with many deleterious results, including implicit hill-climbing on the task, and results that just show the ability of a machine-learning algorithm to reconstruct the annotators' theoretical prejudices.
  • Project scale: Our main limitations are models and algorithms for complex tasks, not platforms. Typical NSF grant budgets cannot support the concurrent exploration and comparison of several alternatives for the various pieces of a complex NLP system. MT, which Patrick knows well from his ISI colleagues, is a good example: the few successful players have major industrial or DoD funding; the rest are left behind not for a lack of software platforms, but for a lack of people resources. In addition to research, the programming effort needed goes beyond what is supportable with typical academic funding, as my friend Andrew McCallum has found out in his Rexa project.
  • Funding criteria: There is a striking difference between how good venture capitalists decide to invest and how NSF or NIH decide to invest. A good venture capitalist looks first for the right team, then for the right idea. A good team can redirect work quickly when the original idea hits a roadblock, but an idea can kill a venture if the team is too fond of it. In contrast, recent practices in research proposal review focus on the idea, not the team. I understand the worry that focusing on the team can foster "old boys network" effects or worse; and of course, bad ideas cannot be rescued by any team. But the best ideas are often too novel to be accepted fully by reviewers who have their own strong views of which directions are the most promising. It is not surprising that the glory days of innovation in important areas of computer science (such as networking, VLSI design, operating systems) coincided with the period when funding agencies gave block grants to groups with a track record and let them figure out for themselves how to best organize their research.
  • Who buys the hardware: Research grants are too small to support the necessary computers and storage, while infrastructure grants need a very broad justification, with a large community of intended users. My experience with infrastructure grants is that the equipment bought on those terms ends up being suboptimal or even obsolete by the time that the intended research gets funded, while the research won't get funded if it is predicated on equipment that is not yet available.

Data catalysis is a nice name for a very important approach to NLP research, but I've learned a lot about the current limits of academic research since I made the comment that Mark mentioned. Our problem is not the lack of particle accelerators, but the lack of the organizational and funding processes associated with particle accelerators.


steve said...

There has traditionally been a problem with toy-driven lust. It is far easier to articulate hammer and tong approaches than to attempt to create an infrastructure that would focus on understanding fundamental issues.

Bill said...

As a researcher/manager/cook/bottle-washer at The Texas Advanced Computing Center, a member of the NSF's TeraGrid project and an organization that's about to deploy the world's largest supercomputer, I thought I'd point out that any US-based academic can apply for time on TeraGrid systems. It's free! They're quite capable and regularly being refreshed to keep up with the latest in hardware. I even know of one UT Austin computational linguistics researcher who's already using our systems. Don't let the "Grid" in TeraGrid confuse you. Your research need not have any relation to or utilize any of the Grid services of the resources we provide. All of the NSF's computational resources are part of the TeraGrid project, which serves as the umbrella organization for providing supercomputing, networking, storage, etc. resources to the national research community.

Coming from a graduate school background where we generally eschewed the use of the resources at the big centers, I can understand the desire to have one's own computers, but having moved to helping run a center, I think we have a lot to offer.

You can apply for a DAC allocation (read "start-up"--this used to stand for something, but I think the meaning has been lost) at any time, and are likely to see approval in a few weeks' time. Larger-scale allocations are reviewed and approved quarterly, and the largest-scale awards are reviewed and approved twice a year. These allocations are generally granted (though sometimes they are reduced in size to fit within available resources). Very few are completely rejected. I don't think any start-up/DAC allocations are ever rejected.

I don't want to come across too much like an evangelist, but you seem a little confused about the purpose of the infrastructure that's been built and/or the purpose that the NSF funding has gone to. We'd love to have some more computational linguists using our machines. We're always looking for new/emerging communities to work with.

Fernando Pereira said...

@bill: Centralized computing is great when we have a clear computational goal that fits well what the center offers. That is, for the "production" stage of a research project. However, much of experimental machine learning and natural-language processing research is about finding useful models. Relying on a remote computing resource inevitably slows down the translation from model to experimental implementation.

There's more to say, which I'll blog later.

Bill said...

Considering that I watch and/or participate with remote users who develop codes for their science problems on our machines on a daily basis, I must respectfully disagree. We have a wide variety (thousands) of users who develop at least the large-scale portions of their applications on our systems. They do often manage quite well.

If you don't want to come and use our systems, that's fine, I'll leave you be. However, if you feel that there are limitations that prevent you from doing your work in a center environment, I'd be extremely interested to hear about them.