Germplasm documentation is a two way street

I don’t blame Pat Heslop-Harrison for not donning the fire-proof suit to find his Ug99-resistant lines, or whatever. I really don’t. Why would anyone venture into Genebank Database Hell, when they have such a friend in Google? I only do it myself because, well, I get paid to. But although the genebank community bears some of the responsibility for this situation, and is indeed trying to do something about it, part of the problem lies, I’m sorry to say, with the users of the data.

I’ll give you an example. Over at Vaviblog Jeremy has a post about the domestication of African rice. He describes some fancy DNA work from researchers in China who pinpointed the area of domestication of Oryza glaberrima by comparing the sequences of 14 genes in 20 accessions of the cultigen with 20 accessions of the progenitor, O. barthii. And where did they get that material?

All accessions were obtained from seeds provided by the Genetic Resources Center of the International Rice Research Institute at Los Baños, Philippines.

That’s great, of course. This kind of research use of the international collections is exactly what one wants to see. That’s what they’re there for. But when I asked IRRI about this, they said that “these authors aren’t included in the names of recipients in our database.” So they must have got the seed indirectly in some way. Again, fine. But when users use seed from a genebank like this, it would benefit everyone — the genebank itself, future users — if they fed back their results in some way, to add to the store of information the genebank has on its accessions.

Now, not all genebanks make this as easy to do as they could. That’s why we have on occasion suggested a social networking approach to germplasm documentation, to resounding silence. But users don’t go out of their way much either. Until the two communities come together more on this, the best way to find seeds will probably continue to be Google.

19 Replies to “Germplasm documentation is a two way street”

  1. Good points. But I think that in this case, the authors have done their part. They deposited the sequence data in GenBank, and the records include the IRRI identifier. Now it is up to the IRRI genebank, and for SINGER, to link to the data in GenBank.

    Here is a GenBank record that refers to this accession.

  2. I do not think that “exaggeration” is an apt way of describing an example that illustrates the opposite of what it is intended for, but I do hope that you keep bringing these issues to the attention of those gene-bank database devils.

  3. OK, let me step in with an apology, an explanation, some questions, a caution and a recommendation.

    First, apologies to Luigi for not searching carefully enough before answering his question.

    I just looked in our table of recipients for the last names of each of the authors, and found nothing. Silly me. For the Chinese the family name is the first name. If I’d looked under the first name or the institute I would have found it. I’ve now looked up the paper and traced every sample sent. We sent them in four separate shipments between 1997 and 2007. The first shipment was to the same laboratory, but to none of the authors. The subsequent three shipments were sent to the last author.

    Now the question. Robert says “it is up to the IRRI genebank, and for SINGER, to link to the data in GenBank”. Seriously? How? Should we google the literature, or search specific online databases such as GenBank?

    And the caution: it is a common mantra to call for recipients to return data “to add to the store of information the genebank has on its accessions”. Be careful you don’t encourage genebank curators to attach recipients’ data to their own accessions. That would be wrong. You have to keep the data associated with the recipient’s germplasm, and link their germplasm back to the genebank accession. You also need to know and record the nature of that link. For example, how did they choose the plant to genotype, and how likely is it to be typical of the original heterogeneous accession?

    That, to repeat the ICIS mantra, is why we use ICIS and not a traditional genebank database. Traditional genebank databases don’t allow you to define links between the genebank’s accession and the recipient’s sample, so you can’t do what you want. If you seriously want to connect recipients’ info back to the accessions, you’d better think seriously about incorporating germplasm tracking into GRIN and Genesys.
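To make the idea of germplasm tracking concrete, here is a minimal sketch of the kind of link described above. It is not ICIS's actual schema: the `Germplasm` class, its field names, and the `ACC-1` / `LAB-7` identifiers are all invented for illustration. The point is only that data stay attached to the recipient's sample, the sample points back to the accession, and the nature of that link is recorded (or flagged as unknown).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Germplasm:
    """A genebank accession or a sample derived from one (hypothetical model)."""
    gid: str                              # hypothetical germplasm identifier
    holder: str                           # who manages this germplasm
    source_gid: Optional[str] = None      # the record this one was derived from
    derivation: Optional[str] = None      # how the sample was chosen, if known
    observations: List[str] = field(default_factory=list)


def traces_to(records: Dict[str, Germplasm], gid: str, accession_gid: str) -> bool:
    """Follow source links upward and report whether they reach the accession."""
    seen = set()
    current = records.get(gid)
    while current is not None and current.gid not in seen:
        if current.gid == accession_gid:
            return True
        seen.add(current.gid)  # guard against accidental circular links
        current = records.get(current.source_gid) if current.source_gid else None
    return False


def derived_data(records: Dict[str, Germplasm], accession_gid: str) -> List[Tuple[str, str, str]]:
    """List data held on derived samples, each labelled with how it was derived."""
    rows = []
    for rec in records.values():
        if rec.gid != accession_gid and traces_to(records, rec.gid, accession_gid):
            how = rec.derivation or "derivation not recorded"
            rows.extend((rec.gid, how, obs) for obs in rec.observations)
    return rows


# Invented example: one accession and one recipient sample derived from it.
records = {
    "ACC-1": Germplasm("ACC-1", "genebank"),
    "LAB-7": Germplasm(
        "LAB-7", "recipient lab", source_gid="ACC-1",
        derivation="single plant, selection method unknown",
        observations=["sequence deposited in a public database"],
    ),
}
print(derived_data(records, "ACC-1"))
```

Someone querying the accession sees the recipient's data, but always alongside the sample it really belongs to and a statement of how (or whether we know how) that sample was chosen.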

    1. There’s a lot there to digest, but let me start with one thing. You ask Robert how he thinks genebanks should link up with GenBank (as it were). Can I turn that back to you? You seem to imply that it is not up to genebanks to do that. Or maybe that it’s just too big a job, I’m not sure. So are you saying that users should do it? You guys have been working on ICIS for a while. Have you come up with a little widget that allows a user to make the connection, so you don’t have to go looking? Is there no room to be a little proactive about it?

  4. No, I don’t mean to imply it is not up to genebanks to do that. What Robert said is that the authors have done everything they should do, and everything else has to be done by the genebanks. I do not think that is reasonable. Expecting us to link to data we know nothing about is unreasonable. Moreover, I would question the morality of trawling the internet searching for other people’s data to attach to the accessions under our management. And authors do not follow consistent standards in identifying how they selected the particular plants for genotyping out of our variable accessions. So even if we did do the unreasonable and morally questionable, we still wouldn’t be able to reliably make the correct connection between their sample and the accession.

    What we would like to happen is that all our partners use ICIS. If they did, then their very act of creating a new germplasm record to document the sample they receive from us would create the links back to our accession, and they would / should correctly and systematically record how they selected the plants. Whatever data they make public in the central database would be immediately visible to anyone querying our accessions, and the data would be correctly attached to their germplasm, documented as derived from ours. Database heaven. That’s what you get from ICIS’ completely comprehensive germplasm tracking system.

    And yet … should we be more proactive? For the benefit of the majority of our partners, who are not ICIS users, we could, at the time of shipping the material, already create new germplasm records in the database to correspond to their sample, correctly linked back to ours, and we could tell them that those records exist especially for them, and if they ever publish data electronically they should specify the germplasm record we created for them.

    It wouldn’t be quite right – the date would be the shipping date, not the date they received it; and the source database would be recorded as our database, which is not how it ought to be seen. But it could make it easier to make the data links.

    Technically perfectly feasible and very simple to do, but I’m not sure how wise it would be to do.

    1. The thing with GenBank is that you can automate the linking. GenBank is one of the rare cases where authors publish raw data in a central database, using IDs that link the data to a voucher (genebank accession). You do not need to force anyone to do this; the journals provide the stick & carrot. No GenBank, no paper. You may need to ask seed recipients to use the accession numbers exactly as you need them, and to enter them in the correct GenBank field, such that you get easy cross-linking as in this example: the collection and GenBank.

      That is a big catch. I would think it could be worth it to go fishing for more links to your genebank’s accessions. Assuming this would lead to useful information, and that you can afford it (it is cheap), I’d say it would be morally questionable to not trawl the Internet.
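For what it is worth, the trawling could start as simply as the sketch below, which scans the text of a GenBank-style flat-file record for anything that looks like a genebank accession number. Several things here are assumptions rather than facts from this discussion: the `IRGC` prefix stands in for whatever prefix a genebank actually uses, the function name is made up, and the sample record is invented for the example rather than copied from GenBank. `/specimen_voucher` and `/note` are real qualifier names in the GenBank feature table, but which qualifier authors actually use will vary, which is exactly why the pattern is deliberately tolerant.

```python
import re

# Qualifiers in a flat-file record look like: /name="value"
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')

# Assumed accession pattern: a prefix, optional separator, then digits.
# "IRGC" is a placeholder prefix for this sketch.
ACCESSION = re.compile(r'\bIRGC[\s:_-]*(\d+)\b', re.IGNORECASE)


def find_accession_links(record_text):
    """Return (qualifier, normalised accession ID) pairs found in one record."""
    links = []
    for name, value in QUALIFIER.findall(record_text):
        for number in ACCESSION.findall(value):
            # Normalise spelling variants such as "IRGC-00123" or "irgc 123".
            links.append((name, f"IRGC {int(number)}"))
    return links


# Invented record fragment, for illustration only.
sample = '''
     source          1..600
                     /organism="Oryza glaberrima"
                     /mol_type="genomic DNA"
                     /specimen_voucher="IRGC-00123"
                     /note="seed obtained from genebank, accession irgc 123"
'''
print(find_accession_links(sample))
```

A match only says that the record mentions the accession. It says nothing about how the sequenced plant was chosen from it, so anything found this way would still need to be recorded as a link to the authors' sample, not as a property of the accession.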

      Having separate sub-accession entries in your database is probably a good thing, if you say so. But it seems to be a rather fine point relative to having data or not. Perhaps I misunderstand you, but you seem to be saying that storing data for an accession is not relevant at all, because there is variation within accessions and in methods. Whether seeds are evaluated internally or externally, there is always going to be variation, and it seems to me that by storing who provided the data (the publication) you can capture much of that.

      Most organizations doing research with genetic resources do not have databases. You would be providing an important service to them, and to the international rice community, by creating the opportunity to systematically store the results for the long term. The USDA does that by storing such data. But you do not have to follow that model if you do want it in your database. There are many scientific data publishing initiatives (e.g. DataONE and Google’s data services). One must assume that there will be more and better systems like GenBank in the future. But while there is a gap, you could think about filling it. Not necessarily by curating data from others; but by help setting up a system such that people can deposit data in such a way that it can be easily discovered, integrated, and linked to a genebank database (e.g. SINGER). I know it all takes time & effort; but approaches like that could work for all genebanks, so it could be very efficient to tackle these things as a consortium of genebanks.

      1. Sorry, it seems again I am not being clear enough. Of course I am not saying storing data for an accession is not relevant. That would be crazy.

        But you must know the curator’s nightmare. A request comes for a variety, or, if you’re lucky, a particular accession ID. Then comes the complaint that it’s not the right material. You chase up the problem, and discover that the requestor found the reference in a publication and either failed to notice, or the author failed to report, or even got it from another source and didn’t know, that the experimental material is actually a rare variant deliberately selected out of an accession.

        I hope you can agree that it is misleading to treat data on a rare variant of an accession as if it is representative of the accession. But it happens, and it happens very often. This cannot be simply dismissed as a “fine point”.

        That’s what I want to avoid. If we uncritically go around attaching other people’s data to our accessions without knowing how they derived their sample from ours, we merely encourage bad practice, and it’ll come back at us in the form of more complaints.

        And of course, as highlighted in a recent discussion on this site, their sample may be totally different from the original through mislabelling, contamination etc., which just adds to the problem.

        Which, to repeat, is not to say we shouldn’t do it at all. I agree we could do more. But let’s do it carefully. Let’s document how the experimental material was derived from the accession. If we don’t know, at least avoid giving the impression that we know the results are representative of the accession.

        And I agree it should be done across all crops. So, Genesys – how about it? Genesys version 2 is already being conceptualised to handle genetic resources that are not accessions. With that generalisation, it could do what you want in the way that I think would be right.

  5. Thanks for clarifying; I do agree with your argument. Fine point/ major issue? As long as the perfect isn’t the enemy of the good. Cheers.

  6. After hearing about the Genebank Database Hell, we learn about the curator’s nightmare! :-)

    Reconciling the results obtained on a germplasm sample with the original accession ID from which the sample was derived was discussed by Ruaraidh last week in the ICIS workshop, and it is indeed not trivial!

    However, I would expect that the method and protocol used to produce the germplasm sample from the accession, and then any other sample created from that germplasm (e.g. a DNA sample), are described in the ‘Materials and methods’ section of a published paper. Isn’t that the case? The mapping of the sample ID(s) to the accession ID should be included as well.

    Regarding Ruaraidh’s comments on “help setting up a system such that people can deposit data in such a way that it can be easily discovered, integrated, and linked to a genebank database”: several databases managing ‘omics’ data do link published papers or stored data files back to the sequence by annotating or tagging the article content, the data file or photo metadata. Some journals, like Nature or Plant Physiology, have pilot projects with big databases like TAIR that ask authors to fill in a form and tag their own abstract and paper prior to submission. Such a registration form could also request that the accession IDs be entered, and the same form could be used when data files are deposited. The objective of this tagging is to make at least the published paper or data file ‘discoverable’ and potentially linked back to a database entry (in our case: accession ID or sample ID). This is aligned with Robert’s suggestion about having a GenBank-like system of data deposit bearing an obligation to document the accession ID.

  7. Doesn’t the SMTA warranty (Article 9.1) “protect” us from passport and evaluation uncertainty? Indications of putative inherent germplasm properties need to be interpreted by each user based on metadata and referenced citations, allowing germplasm hunters and users to make their own inferences as to the quality of the data reported.

  8. Thanks to Reinhard Simon for the following comment, sent by email, and slightly edited here.

    Overall I agree with Ruaraidh. Some additional thoughts on related aspects.

    In these days of the internet it is easier than ever before to set sail on the web and perhaps inadvertently become a data pirate; already years ago site owners would often not allow what was called ‘deep linking’ (directly linking to information bits like images and data and avoiding the proper attribution via the home page). As scientists or curators of public CG genebank databases, I agree with Ruaraidh that we should make it a point (for scientific and social responsibility) to not uncritically scavenge and republish data. We should rather filter out data where we understand how they relate to the materials. It should also be beyond question that we always fully acknowledge all resources used (materials but also data authors, institutions, tools and donors), do not prematurely publish other people’s work, and critically select the most trustworthy sources. YET the aspect of proper attribution and the wisdom of verifying data is all too often forgotten…

    We certainly have certain rights to use published information for documentation purposes. However, we must make sure that this is genuinely peer-reviewed published information (not just grey literature or something that someone put on the internet, perhaps without any right to do so), because only that implies the ‘prior informed consent’ of the owner or author to the re-use of that information. Citing that publication by ‘linking’ is one way, but we should keep in mind two aspects: a) this link is easily broken (unintentionally or intentionally; think about all the outdated links) and b) this must be done for each ‘observation’ or datapoint. Just as in GenBank each observation unit comes with a reference, so in our databases of passport, characterization and evaluation data (compiled from many sources) each observation or datapoint (or cell in a table) should be directly linked to a proper reference. And this should be very transparent and obvious to all data users: time and again there are cases where people ‘forget’ to properly cite all the original contributors… [C]ollectors and curators often spend their lifetimes building a collection and understandably get upset if all that data is re-published somewhere without acknowledgement. But there are established best practices. E.g. it is required in printed taxonomic treatments to reference each single specimen to the original sources in various ways to facilitate track-back from different entry points.

    So, the big challenge in our internet age for data curators of (electronic) databases is NOT how to scavenge the web or provide easy access to tabulated summaries – but rather to filter out verified data and make it as difficult as possible for everybody to forget about or break the reference to the original authors/contributors/resources for each single data point.
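One way to make the reference hard to forget or break is to make it impossible to store a data point without one. The sketch below is only an illustration of that idea, not a description of any existing genebank database: the `Observation` class, its fields, the `compile_table` helper and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One data point, inseparable from the reference it came from."""
    accession_id: str
    trait: str
    value: str
    reference: str  # citation for this single data point

    def __post_init__(self):
        # Refuse to create a data point that has no source attached.
        if not self.reference.strip():
            raise ValueError("every data point needs a reference to its source")


def compile_table(observations):
    """Group data points by (accession, trait), keeping each value with its reference."""
    table = {}
    for obs in observations:
        key = (obs.accession_id, obs.trait)
        table.setdefault(key, []).append((obs.value, obs.reference))
    return table


# Invented values, for illustration only.
table = compile_table([
    Observation("ACC-1", "grain colour", "red", "Collector field notes, 1975"),
    Observation("ACC-1", "grain colour", "red-brown", "Author et al., journal article"),
])
print(table)
```

Because the record is frozen and the reference travels inside each cell of the compiled table, a summary built from many sources still shows who contributed each value.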

  9. Where is the like button?
    I do like Luigi’s comment “So, the big challenge in our internet age for data curators of (electronic) databases is NOT how to scavenge the web or provide easy access to tabulated summaries – but rather to filter out verified data and make it as difficult as possible for everybody to forget about or break the reference to the original authors/contributors/resources for each single data point.”

    Acknowledge each data point. That is a good strategy to honour and encourage improvement of PGR documentation.

    1. We don’t have a like button, but you can G+ us.

      And that’s not actually Luigi’s comment about sources. It is Reinhard Simon’s. As Luigi said.

      Credit where credit’s due!
