Lost in genebank database hell

by Luigi Guarino on August 29, 2008

Navigating around germplasm databases can be a frustrating experience. A posting on the CropWildRelativesGroup alerted me to a Science Daily piece on tomato genomics which mentioned the wild relative Lycopersicon pennellii (or Solanum pennellii, but I’m not going there, at least not today). But how many accessions of this species are conserved ex situ? And where is it found in the wild?

OK, so SINGER first, as that’s been much on my mind, and on this blog, of late. SINGER shows 61 accessions of L. pennellii, all from the AVRDC collection. Most of them are from Peru, although 7 accessions have USA, Mexico, Poland (?) or “unknown” as source country. None of these accessions seems to have geo-references, so no nice map from SINGER this time. Pity. But SINGER does give very neat summaries for your query results. [1]

GRIN returns 51 accessions. I can’t find any easy way of working out the duplication between these and the AVRDC material, but I imagine it is significant. Again, most of the accessions are from Peru, but it’s kind of difficult to get summary information across all accessions in GRIN at the moment, though I know they are working on this. Now, tomato germplasm is conserved at the C.M. Rick Tomato Genetic Resources Center (GRIN tells you so), and they have a database of their own. Querying it results in 45 hits, but again there’s no easy way I can see of looking at summary information across all these. You have to look at each individual accession in turn to find out where they’re from, and if you do you get a little map too. The thing I don’t quite understand is why the accessions are geo-referenced in the Tomato Genetic Resources Center database, but not in GRIN. Maybe they’re upgrading the data gradually at the Centre and haven’t passed the latest version on to GRIN? That may also explain the discrepancy in accession numbers. It looks like they’re working on the geo-spatial part of the database, and it may well be possible to get a map of all the accessions of a particular species eventually.

You can of course do that in GBIF right now, but GBIF only has 8 geo-referenced L. pennellii records: from the Missouri Botanical Garden, the Dutch genebank and the European germplasm database, EURISCO. Too bad the Tomato Genetic Resources Center is not a GBIF data provider. And, indeed, that its geo-reference data is not included in GRIN, which is a GBIF provider.

So the answers to the questions I started with are: at least, and probably not much more than, 112, but that probably includes duplicates; and Peru. But I cannot produce a decent map of the distribution of L. pennellii online. I would have to mess around and download the data from the Tomato Genetic Resources Center database, and then map it myself. Which I may well do, just to show it can be done. But this little exercise does show that there’s a lot of work to be done to improve the data in existing agrobiodiversity databases, and to integrate them fully.
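The arithmetic behind that total can be made explicit. Here is a rough sketch in Python, using only the SINGER and GRIN counts quoted above. The overlap between the two collections is unknown, so the function (a name made up for this illustration) only brackets the number of distinct accessions, and it assumes no duplicates within any single source.

```python
def unique_accession_bounds(counts):
    """Return (lower, upper) bounds on distinct accessions across sources.

    upper: the sum of all counts (no accession is shared between sources).
    lower: the largest single count (every smaller holding is fully
           duplicated in the largest one).
    Assumes no source lists the same accession twice.
    """
    values = list(counts.values())
    return max(values), sum(values)


# Counts as reported in the post for L. pennellii.
counts = {"SINGER (AVRDC)": 61, "GRIN": 51}

low, high = unique_accession_bounds(counts)
print(f"Between {low} and {high} distinct accessions")
# prints "Between 61 and 112 distinct accessions"
```

Narrowing that range would need record-level matching between the two sources, which is exactly what the databases do not make easy.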

  1. Incidentally, AVRDC has its own Vegetable Genetic Resources Information System online, which has 65 records for L. pennellii.

Jose Iriondo August 29, 2008 at 9:46 am

Totally agree! I am experiencing similar situations with different species.


Eliseu Bettencourt September 4, 2008 at 10:48 am

Yes indeed!
Although data are more abundant and of better quality than ever before, they are not always recorded and maintained in a format that makes them easily, readily and universally available.
The proper management and availability of data are essential to promote the use of biodiversity for research and training for sustainable development and food security.
Having said that, we sometimes forget to check essential (at European level) sources of information, and that seems to be the case here.
Though briefly referred to, EURISCO is not mentioned as one of the sources searched. I searched it, and I can add 13 more accessions to the list above, of which 5 (Lycopersicon pennellii) are georeferenced and another 3 are identified as Solanum pennellii. The materials are maintained in Germany, the Netherlands and Poland.


Brigitte September 5, 2008 at 9:05 am

Excellent description of a typical search exercise for information on specific germplasm. I have experienced this frustration too many times and I have almost given up hope. There are lots of great germplasm information systems around that seem to have complementary information on the same material, so that the total could be way more than the sum of the parts. But very challenging to analyse!


Lisa September 5, 2008 at 7:56 pm

The GRIN database does have the ability to query over multiple databases, but the problem is that secondary identifiers are not consistently used or change over time, so the query results are very poor. Hopefully a system can be developed that can pull secondary identifiers out and also allow for querying on other parameters.
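To illustrate the kind of matching this would involve, here is a minimal sketch in Python. It assumes each database can export a mapping from its own accession ID to a list of secondary identifiers (donor or collector numbers). The function names, the normalization rules and the sample identifiers are all made up for illustration and are not taken from GRIN or any other real system.

```python
import re
from collections import defaultdict


def normalize_identifier(raw):
    """Reduce an identifier to a comparable form.

    Assumed rules: ignore case, spaces and punctuation, and drop
    leading zeros at the start of each run of digits.
    """
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    return re.sub(r"(?<![0-9])0+(?=[0-9])", "", compact)


def match_by_secondary_id(records_a, records_b):
    """Pair accession IDs from two sources that share a secondary identifier."""
    index = defaultdict(list)
    for accession_id, secondary_ids in records_b.items():
        for sid in secondary_ids:
            index[normalize_identifier(sid)].append(accession_id)

    matches = set()
    for accession_id, secondary_ids in records_a.items():
        for sid in secondary_ids:
            for other in index.get(normalize_identifier(sid), []):
                matches.add((accession_id, other))
    return sorted(matches)


# Hypothetical records: the same donor number written three different ways.
source_a = {"A-1": ["ABC-0042"], "A-2": ["XYZ 7"]}
source_b = {"B-9": ["abc 42", "Q 11"], "B-10": ["xyz-8"]}

print(match_by_secondary_id(source_a, source_b))
# prints [('A-1', 'B-9')]
```

Simple rules like these only catch differences in spelling and formatting. Identifiers that were renumbered or replaced along the way would still need a curated lookup table.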


Fred September 6, 2008 at 1:44 am

Your scenario about L. pennellii describes quite well the situation breeders and others face in trying to use many of the databases, an example being the GRIN system. However, some are more user friendly and informative than others. Connectivity is clearly lacking and would help a great deal.

As a breeder I often seek information about traits of practical importance, e.g., disease reaction, composition of fruit/grain, etc., and often that type of information cannot be found. The passport information is of course necessary, but those general categories are often not what is needed by a breeder looking for information about which accessions may provide genetic variability for traits of commercial interest.

With the amount of molecular information becoming available, it is even more critical to have good usable data systems where both phenotypic and genetic information can be combined for making decisions useful for crop improvement.

Clearly, much thought and action are needed to optimize the use of this critical natural resource for the betterment of humankind.



Ken Fafa Egbadzor September 8, 2008 at 12:33 pm

There are more problems in some other parts of the world. You get to the field and a plant is labeled “no label”. In other instances, you get an accession with the necessary information in the records of the genebank, but the plants exist neither in the field nor in the seed store. Still other plants exist, but the genebank has no records on them: total confusion!


Michael September 10, 2008 at 3:19 pm

So – can we develop a profile of what germplasm users actually want from online databases? Once it is known what is needed, maybe someone can work towards developing solutions.


Jeremy September 11, 2008 at 10:46 am

I think that’s an excellent idea. And let us know when you’ve developed a solution, so we can spread the good news.


Dirk January 16, 2009 at 3:06 am

Luigi is hitting the nail on the head with his description of database hell.

There are a lot of data on plant genetic resources available. Collectors and curators have spent long hours to add detail to their records.

Now, all these data appear to be buried in ever more fancy databases and it is getting near impossible to retrieve a complete record set.

Michael, to keep things simple, just provide a download button to allow users to download complete data sets for individual species, or, if you want to be generous, the whole database for offline perusal.

For example, try to get something meaningful out of EURISCO. The system allows the retrieval of fragmented information, i.e. long lists of partial records, BUT there is no identifier column to link them back together.
That makes the EURISCO database pretty useless from my point of view.

While you are at it, please start including evaluation and characterisation data as well.



Julian January 16, 2009 at 4:44 pm

@Michael – What if a webpage were set up with checkboxes to select which data each user/researcher needs? Could needs then be tracked and solutions focused?…

List of my own needs:

– Reviewed orthography to ensure the query returns all samples of the species/genus/whatever
– Sample status (wild, improved cultivar, landrace, etc)
– Full reviewed passport data (if lat/lon are not available then at least the locality and country)
– Date of collection (at least the year)
– Principal/most important traits evaluation data
– SSR markers data
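A wish list like this could be turned into a simple completeness check. The sketch below, in Python, is only an illustration: the class, the field names and the sample record are invented here and do not follow any existing database schema.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Fields that together describe where the sample came from.
LOCATION_FIELDS = {"latitude", "longitude", "locality", "country"}


@dataclass
class AccessionRecord:
    taxon: str                                # reviewed spelling of the name
    sample_status: Optional[str] = None       # wild, landrace, cultivar...
    country: Optional[str] = None
    locality: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    collection_year: Optional[int] = None
    trait_evaluations: Optional[dict] = None  # principal traits
    ssr_markers: Optional[dict] = None


def missing_items(record):
    """List the wished-for items that a record does not provide.

    Location counts as present if lat/lon are both given or, failing
    that, if locality and country are both given.
    """
    missing = [
        f.name
        for f in fields(record)
        if f.name not in LOCATION_FIELDS and getattr(record, f.name) is None
    ]
    has_coords = record.latitude is not None and record.longitude is not None
    has_place = record.locality is not None and record.country is not None
    if not (has_coords or has_place):
        missing.append("location")
    return missing


# Hypothetical record, for illustration only.
record = AccessionRecord(
    taxon="Lycopersicon pennellii",
    country="Peru",
    locality="(hypothetical locality)",
    collection_year=1970,
)
print(missing_items(record))
# prints ['sample_status', 'trait_evaluations', 'ssr_markers']
```

Run over a whole download, a check like this would show which needs a given database already meets and where the gaps are.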

We must find a way out of hell!!


Dirk January 17, 2009 at 2:51 am

Passport data are still a big jigsaw puzzle. The glue for sticking the bits and pieces together is the collector and donor numbers. These, unfortunately, tend to mutate and transform into often unrecognisable entities during their travels from one collection to another.

A lot of material also has reports associated with it providing useful background information and clues to solve part of the puzzle.

Farmers, breeders and collectors should also get some credit for their efforts. You could even think about giving credit to the compilers…

Check out the European barley database http://barley.ipk-gatersleben.de/ebdb.php3, i.e. the old search mask.

Try Stubbe for collector; this gives you [after increasing the output limit] all the barley material which could be linked to the 1941/42 Balkan expeditions.

Another interesting collection is the 1938/1939 SS expedition to Tibet.
Try a search for altitude > 4000 m

Most of this information originates from the paper by Brücher, H. and E. Åberg (1950) Die Primitivgersten des Hochlands von Tibet, ihre Bedeutung für die Züchtung und das Verständnis des Ursprungs und der Klassifizierung der Gersten. Kungl. Lantbrukshögskolans Annaler 17: 247-319. It describes the evaluation of the material collected by the German SS expedition to Sikkim and Tibet during 1938/1939. A total of 1230 accessions was collected. [Ti or Ot collection numbers, donated to IPK by Hoffmann, Halle; another set has come via Muencheberg, MPI Cologne, Braunschweig [H series]]

If you are looking for reproductive frost tolerance, high altitude material is a good bet…

Please, note that there is a search field MOST ORIGINAL NUMBER which could be traced. This is one of the most valuable sticky bits for glueing the puzzle together.

Now, why is this information just sitting there and has not been included in the documentation systems of all the collections which hold this material and have donated data?

I think what we need are
a) mechanisms, communication channels and time allocation to update local and international documentation systems
b) a reference apparatus which helps to decode the cryptic notation of collectors, breeders and curators, i.e. to decipher codes
c) link as much germplasm as possible to published and unpublished reports [digitise these reports and provide open access to them, please]

Perhaps, one way to make life easy for everyone involved is to agree on common formats for site codes and collecting numbers [Jan Konopka has done a fine job with the ICARDA database] and to develop an international database for site information to which local collections can link?

Another big effort is needed to document breeding material. We need databases for these as well to document breeders, pedigrees, attributes etc. Again, there is a plethora of literature and hence evaluation data for cultivars.

SPEND SOME MONEY ON DOCUMENTATION and have enough funds to do the nitty gritty stuff too.

Congratulations on the new look and functionality of SINGER. GBIF is also making great strides. The integration of Google Maps is excellent.

Now, a lot of hack work is needed to get more integrated data into these systems.

Interesting times ahead…

Dirk Enneking


