Data Catalysis: I'm back in Philadelphia, after a quick jaunt to Kyoto for ISUC2007. One of the most interesting presentations there was Patrick Pantel's "Data Catalysis: Facilitating Large-Scale Natural Language Data Processing" [...] Patrick's idea, as I understand it, is not to create yet another supercomputer center. Instead, his goal is a model that other researchers can replicate, seen as a part of "large data experiments in many computer science disciplines ... requiring a consortium of projects for studying the functional architecture, building low-level infrastructure and middleware, engaging the research community to participate in the initiative, and building up a library of open-source large data processing algorithms". (Via Language Log.)
I've just read Patrick's proposal. As Mark noted in his blog, I've made some early comments about the instrumental limitations of current academic natural language processing and machine learning research.
However, I'm worried about grid-anything. In other fields, expensive grid efforts have been more successful at creating complex computational plumbing and bureaucracies than at delivering new science.
Most of the computational resources of search engines are needed for service delivery, not for data analysis. As far as I can see, a few dozen up-to-date compute and storage servers, costing much less than $1M, would be enough for the largest-scale projects that anyone in academia would want to do. With the publication of MapReduce, the development of Hadoop (both discussed by Patrick), and free cluster management software like Grid Engine, we have all the main software pieces to start working now.
So, I don't think the main bottleneck is (for the moment) computation and storage. The bottlenecks are elsewhere:
- Good problems: A good problem is one where validation data is plentiful as a result of natural language use. MT is the obvious example. Translation happens whether there is MT or not, so parallel texts are plentiful. Sentiment classification is another example. People cannot resist labeling movies and consumer gadgets with thumbs-up or thumbs-down. However, much of what gets done and published in NLP today, for good but short-sighted reasons, exploits laboriously created training and test sets, with many deleterious results, including implicit hill-climbing on the task, and results that just show the ability of a machine-learning algorithm to reconstruct the annotators' theoretical prejudices.
- Project scale: Our main limitations are models and algorithms for complex tasks, not platforms. Typical NSF grant budgets cannot support the concurrent exploration and comparison of several alternatives for the various pieces of a complex NLP system. MT, which Patrick knows well from his ISI colleagues, is a good example: the few successful players have major industrial or DoD funding, while the rest are left behind not for a lack of software platforms, but for a lack of people resources. In addition to research, the programming effort needed goes beyond what is supportable with typical academic funding, as my friend Andrew McCallum has found out in his Rexa project.
- Funding criteria: There is a striking difference between how good venture capitalists decide to invest and how NSF or NIH decide to invest. A good venture capitalist looks first for the right team, then for the right idea. A good team can redirect work quickly when the original idea hits a roadblock, but an idea can kill a venture if the team is too fond of it. In contrast, recent practices in research proposal review focus on the idea, not the team. I understand the worry that focusing on the team can foster "old boys network" effects or worse; and of course, bad ideas cannot be rescued by any team. But the best ideas are often too novel to be accepted fully by reviewers who have their own strong views of which directions are the most promising. It is not surprising that the glory days of innovation in important areas of computer science (such as networking, VLSI design, operating systems) coincided with the period when funding agencies gave block grants to groups with a track record and let them figure out for themselves how to best organize their research.
- Who buys the hardware: Research grants are too small to support the necessary computers and storage, while infrastructure grants need a very broad justification, which requires a large community of intended users. My experience with infrastructure grants is that the equipment bought on those terms ends up being suboptimal or even obsolete by the time that the intended research gets funded, while the research won't get funded if it is predicated on equipment that is not yet available.
Data catalysis is a nice name for a very important approach to NLP research, but I've learned a lot about the current limits of academic research since I made the comment that Mark mentioned. Our problem is not the lack of particle accelerators, but the lack of the organizational and funding processes associated with particle accelerators.