Thursday, June 21, 2007

More on data catalysis

In an earlier entry on Patrick Pantel's data catalysis proposal, I wrote:

Our problem is not the lack of particle accelerators, but the lack of the organizational and funding processes associated with particle accelerators.

Mark Liberman disagrees:

However, I continue to believe that Patrick is addressing an important set of issues. As both Patrick and Fernando observe, the hardware that we need is not prohibitively expensive. But there remain significant problems with data access and with infrastructure design.

On the infrastructure side, let's suppose that we've got $X to spend on some combination of compute servers and file servers. What should we do? Should we buy X/5000 $5K machines, or X/2000 $2K machines, or what? Should the disks be local or shared? How much memory does each machine need? What's the right way to connect them up? Should we dedicate a cluster to Hadoop and map/reduce, and set aside some other machines for problems that don't factor appropriately? Or should we plan to use a single cluster in multiple ways? What's really required in the way of on-going software and hardware support for such a system?

These are issues that a mildly competent team with varied expertise can figure out if they have the resources to start with. The real problem is how to assemble and maintain such a team and infrastructure in an academic environment in which funding is unpredictable. I generally disagree with Field of Dreams "if you build it, they will come" projects. While we can all agree that some combination of compute and storage servers and distributed computing software can be very useful for large-scale machine learning and natural-language processing experiments, the best way to configure them depends on the specific projects that are attempted. Generic infrastructure hardware, software, and management efforts are not really good at anything in particular, and they have a way of sucking resources to perpetuate themselves independently of the science they claim to serve.

I'd rather see our field push for funding opportunities that pay attention in a balanced way to both the science and the needed infrastructure.

1 comment:

Chris Brew said...

I think the problem goes beyond lack of the social processes associated with particle accelerators to a more general lack of appropriately fine-grained social processes for allocating resources.

Large scale computation is a highly generic resource. There are more ways of using it and more ways of tuning its delivery than can easily be imagined. Google have clearly hit a seam within which particular patterns of usage gel well with the design decisions that they chose. But this seam is not the only one worth mining. Ideally, we want a data catalysis initiative that would allow everyone to explore the space of alternatives. An ideal to be aimed for is smooth access to scalable computation as needed.

Something like this exists, in Amazon's S3 and EC2, which give paying customers the freedom to expand and contract infrastructure in a more flexible way than if they had to build data centers. One can wonder whether Amazon will continue to provide these services, or whether they are a transitory effect of excess capacity. As a reviewer I would therefore look with disfavor on a grant proposal that simply assumed the availability of data center resources. A letter from Amazon promising availability would allay the concern. But why would we expect a commercial company to promise that?

So, how about a funding-agency-backed clone of EC2 instead of a clone of Google? The latter seems over-specific, perhaps because specificity is required under the current science funding model.

Is the opposite true for EC2? Is it really TOO generic? Maybe. The Amazon provision is pretty certain to be sub-optimal for any particular application. But the reliable availability of a known service is a good facilitator. Good teams will be good at adapting an EC2-like thing to their needs.

A parallel exists with the uniform provision of AC power infrastructure to consumers, even though 50Hz or 60Hz, 110V or 240V power isn't everyone's preference. Weird consumers can use transformers, and aluminum smelters can cut a separate deal.