Various friends have sent me, over the past few days, different takes on a recent paper which used the Google PageRank algorithm to identify the most “important” species in food webs, perhaps because they know I’m a sucker for examples of cross-pollination between disciplines. The BBC had its say, and also ScienceDaily, among others. I posted the ScienceDaily article on Facebook, as I am wont to do when I think something is interesting — maybe even have a gut feeling it might be relevant to agrobiodiversity conservation — but don’t know quite what to make of it. Sure enough, someone left a comment that he thought the algorithm was a secret, which was also my understanding: Google don’t want people to manipulate the rank of their web pages. But then someone else came in and said that the basics of how the thing works are in the public domain.
To prove it, he provided a link to an American Mathematical Society article entitled How Google Finds Your Needle in the Web’s Haystack. Which is why I love social networking, but that’s another story. Now, that article is definitely NSFW, unless you work at the American Mathematical Society, so think twice before clicking, but here’s the lede:
Imagine a library containing 25 billion documents but with no centralized organization and no librarians. In addition, anyone may add a document at any time without telling anyone. You may feel sure that one of the documents contained in the collection has a piece of information that is vitally important to you, and, being impatient like most of us, you’d like to find it in a matter of seconds. How would you go about doing it?
And I thought to myself: just change that 25 billion, which of course refers to the number of pages on the internet, to 6.5 million or 7.2 million or whatever, and the guy could just as easily be talking about accessions in the world’s genebanks.
Now, basically we search for the germplasm we need by starting with a big dataset and applying filters: wheat, awnless wheat, awnless wheat with such and such resistance, awnless wheat with such and such resistance from areas with less than x mm of rainfall per annum, and so on. Would it make any sense to rank the accessions in that initial big dataset? On what basis would one do that anyway? That is, what is the equivalent of hyperlinks for accessions? Because the essence of PageRank is that important pages receive lots of hyperlinks from other important pages. So, number of requests? Amount of data available on the accession? But wouldn't that just mean that the usual suspects would get picked all the time? Genetic uniqueness, perhaps, then? That would be turning the algorithm on its head, looking for a lack of connections rather than connections to other accessions. You could in fact have different ranking criteria for different purposes, I suppose.
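Just to make the hyperlink analogy concrete, here's a rough sketch of the PageRank idea (the standard power-iteration version) in Python. The accession IDs and the "links" between them are entirely made up, because, as I say, the open question is what the genebank equivalent of a hyperlink would even be; think of the links as standing in for whatever connection you eventually settle on, such as "requested in the same study" or "shares evaluation data with".

```python
# A toy PageRank via power iteration. The link structure below is purely
# hypothetical: it is a stand-in for whatever "hyperlink equivalent" one
# might define between genebank accessions.

DAMPING = 0.85      # standard PageRank damping factor
ITERATIONS = 50     # plenty for a toy graph to settle down

# Hypothetical directed links: accession -> accessions it "points to".
links = {
    "ACC-001": ["ACC-002", "ACC-003"],
    "ACC-002": ["ACC-003"],
    "ACC-003": ["ACC-001"],
    "ACC-004": ["ACC-001", "ACC-003"],
}

nodes = sorted(links)
rank = {n: 1.0 / len(nodes) for n in nodes}  # start with a uniform ranking

for _ in range(ITERATIONS):
    # Everyone gets the "teleport" share, then link shares are added on top.
    new_rank = {n: (1.0 - DAMPING) / len(nodes) for n in nodes}
    for source, targets in links.items():
        if not targets:
            # A node with no outgoing links spreads its rank evenly.
            for n in nodes:
                new_rank[n] += DAMPING * rank[source] / len(nodes)
        else:
            share = rank[source] / len(targets)
            for target in targets:
                new_rank[target] += DAMPING * share
    rank = new_rank

# Accessions with many links from other well-linked accessions float to the top.
for accession, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{accession}: {score:.3f}")
```

The interesting part isn't the arithmetic, which is trivial; it's deciding what counts as a link, and whether, for something like genetic uniqueness, you'd want to reward the absence of links instead.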
Ok, now my brain hurts. This cross-pollination stuff can be fun, but it is hard work.