Sunday, April 29, 2007

Weinberger's Miscellany

Weinberger's Miscellany: David Weinberger, one of the smartest of our many smart neighbors, has a new book about books and planets, Staples and Amazon, 20 questions and the periodic table, Carl Linnaeus and Melvil Dewey, data and metadata -- about everything, in other words: Everything is Miscellaneous. [...] It's hard to summarize his theory of everything in one sentence, but this is pretty close: "To get as good at browsing as we are at finding -- and to take full advantage of the digital opportunity -- we have to get rid of the idea that there's a best way of organizing the world." Weinberger is the first to admit this is a mighty tall order. We were organizing the world (and, implicitly, privileging our particular organizing principles) long before Linnaeus and Dewey. As Weinberger explains, we're basically hard-wired to organize all the atoms and planets we see: "We invest so much time in making sure our world isn't miscellaneous in part because disorder is inefficient -- 'Anybody see the gas bill?' -- but also because it feels bad."

I was listening to this podcast during my hard interval workout today, and I didn't even feel that I was working out, even though I was reaching the usual high intensity levels. I kept smiling and saying "yes!" to myself. Weinberger made, much better than I have, some of the points I have been making here against hierarchical organizations of information in natural-language processing, the "semantic web," and other contemporary attempts to squeeze networked digital information into a traditional hierarchical organization. One of Weinberger's best observations in the show is how traditional forms of organization derive from physical space: everything has a place, and a place cannot contain two things simultaneously. Two points that Weinberger did not make -- they may be in his book, which I'll be getting:

  • It is plausible that our cognitive organization is evolutionarily tuned to those properties of physical space, and thus categorization and hierarchy appear natural and inevitable to us;
  • these physical constraints also affect how information can be organized on paper.
Even though digital memories obey the same constraints at the bit level, efficient replication and indexing create powerful abstractions that effectively erase the constraints. Forcing digital information organization into the old structures will be as silly as it would have been to force paper to degrade what is written on it to simulate the limitations of our brain memories. There is no reason to believe that search will be most effective if it is forced to obey those old structures; in fact, there are reasons to suspect that categorization for the most part gets in the way of search, as we see from the repeated failure of supposedly superior approaches to search based on hierarchy or clustering.

The show discussed scientific information just briefly, mainly around the upheavals in biological classification as a result of evolution and genomics. The show did not discuss the fact that hierarchical classification through ontology development is the dominant paradigm in extremely expensive international efforts to organize biomedical information digitally. As an observer of several of those efforts, I can't avoid the feeling that they are misguided, in that biological knowledge advances much faster than those systematization efforts, and new discoveries constantly cross-cut existing categories. Just as one example, until recently "gene" referred to a portion of DNA that codes for a protein, but now the study of "miRNA genes", which do not code for proteins, is all the rage.

The alternative of developing search and distributed sharing of tags and other user-constructed metadata seems much more scalable, and also more likely to allow for approximate matches and ranked answers that may reveal unexpected associations that would just be forbidden by a fixed categorization scheme.
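
To make "approximate matches and ranked answers" a bit more concrete, here is a minimal sketch of ranking items by tag overlap. Everything in it is an assumption made up for illustration: the item names, the tags, and the choice of Jaccard overlap as the score. It is not how PennTags or any real system works.

```python
# Hypothetical items, each carrying free-form user tags instead of one fixed category.
ITEMS = {
    "paper-1": {"mirna", "gene", "regulation"},
    "paper-2": {"gene", "protein", "coding"},
    "paper-3": {"taxonomy", "linnaeus"},
}


def jaccard(a, b):
    """Overlap between two tag sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ranked_search(query_tags, items=ITEMS):
    """Return (item, score) pairs with nonzero tag overlap, best match first."""
    scored = [(name, jaccard(query_tags, tags)) for name, tags in items.items()]
    hits = [(name, score) for name, score in scored if score > 0]
    # Sort by descending score; break ties by name so the order is stable.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for name, score in ranked_search({"gene", "mirna"}):
        print(f"{name}: {score:.2f}")
```

A query for "gene" and "mirna" ranks the miRNA paper first but still returns the protein-coding paper with a lower score. A scheme that files each paper under exactly one category would have to pick a side.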

The show mentioned PennTags a couple of times. It's nice to hear the home team being recognized!
