Getting into the weeds on wheat genotyping

I don’t know about you, but I usually skip the methods section of genotyping studies. I know I shouldn’t, but life is way too short. Mostly, I just need the answer. However, two papers came across my desk last week which enticed me to bite that silver bullet. One study genotyped 460 bread wheat accessions from various genebanks in Europe and Australia, mainly elite lines from Europe, North America and Australia, but also some Chinese landraces; the other, 1,423 bread wheat landraces from West Asia and synthetics (artificial crosses between the putative original parents of bread wheat), from the CIMMYT and ICARDA genebanks. Quite distinct material from different genebanks, ((Actually, there’s a question about the ultimate source of that Chinese material, but that’s for another time, perhaps.)) you’ll notice, so naturally I wondered to what extent the results would be comparable.

Well, this is the relevant bit from the materials and methods of the first article, by German researchers, which is catchily entitled Subgenomic Diversity Patterns Caused by Directional Selection in Bread Wheat Gene Pools.

For genome-wide marker analysis, DNA samples of all lines were genotyped using the 90,000-SNP wheat genotyping array (Illumina Inc.) described by Wang et al. (2014), which carries 81,587 functional and valid SNPs. Genotyping was outsourced to TraitGenetics GmbH (Gatersleben, Germany) and automated SNP scoring used a cluster file based on worldwide material described by Wang et al. (2014). Raw marker data was processed by first excluding all markers with more than two called alleles, more than 10% missing data, or minor allele frequency (MAF) less than 10%. This resulted in a total of 22,377 high-quality, polymorphic SNPs in the 450 genotypes that were used for population-structure analyses. For all analyses requiring positional information, we used a set of 18,681 SNPs with MAF ≥5% and known map positions on the consensus map described by Wang et al. (2014).
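For anyone who prefers code to prose, the three exclusion rules in that passage (more than two called alleles, more than 10% missing data, MAF below 10%) can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function name `keep_marker`, the marker names, and the encoding of calls as single letters with `None` for missing data are all assumptions made for the example.

```python
from collections import Counter


def keep_marker(calls, max_missing=0.10, min_maf=0.10):
    """Return True if one SNP survives the three exclusion rules.

    calls: one allele call per line, e.g. "A" or "G"; None marks a
    missing call. (This encoding is an assumption for the example.)
    """
    total = len(calls)
    called = [c for c in calls if c is not None]
    if not called:
        return False
    counts = Counter(called)
    # Rule 1: exclude markers with more than two called alleles.
    if len(counts) > 2:
        return False
    # Rule 2: exclude markers with more than 10% missing data.
    if (total - len(called)) / total > max_missing:
        return False
    # Rule 3: exclude markers whose minor allele frequency is too low.
    # A monomorphic marker has no minor allele, so its MAF is zero.
    minor = min(counts.values()) if len(counts) == 2 else 0
    return minor / len(called) >= min_maf


# Hypothetical toy data: marker name -> calls across lines.
markers = {
    "snp_ok": ["A"] * 7 + ["G"] * 3,
    "snp_rare": ["A"] * 19 + ["G"] * 1,
    "snp_gappy": ["A"] * 5 + ["G"] * 3 + [None] * 2,
    "snp_triallelic": ["A"] * 4 + ["G"] * 3 + ["T"] * 3,
}
kept = [name for name, calls in markers.items() if keep_marker(calls)]
```

Calling `keep_marker(calls, min_maf=0.05)` instead would mimic the looser MAF ≥5% threshold the passage mentions for the analyses that need map positions.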

Phew. And this, for your sins, is the corresponding section from the thankfully more racy Exploring and Mobilizing the Gene Bank Biodiversity for Wheat Improvement, courtesy of CIMMYT and ICARDA scientists, mainly connected with the Seeds of Discovery (SeeD) project. ((Sehgal, D., Vikram, P., Sansaloni, C., Ortiz, C., Pierre, C., Payne, T., Ellis, M., Amri, A., Petroli, C., Wenzl, P., & Singh, S. (2015). Exploring and Mobilizing the Gene Bank Biodiversity for Wheat Improvement PLOS ONE, 10 (7) DOI: 10.1371/journal.pone.0132112))

For genotypic characterization, a next-generation sequencing technique called DArTseq was employed. A complexity reduction method including two enzymes was used to generate a genome representation of the set of samples. PstI-RE site specific adapter was tagged with 96 different barcodes enabling multiplexing a plate of DNA samples to run within a single lane on Illumina HiSeq2500 instrument (Illumina Inc., San Diego, CA). The successful amplified fragments were sequenced up to 77 bases, generating approximately 500,000 unique reads per sample. Thereafter the FASTQ files (full reads of 77bp) were quality filtered using a Phred quality score of 30, which represent a 90% of base call accuracy for at least 50% of the bases. More stringent filtering was also performed on barcode sequences using a Phred quality score of 10, which represent 99.9% of base call accuracy for at least 75% of the bases. A proprietary analytical pipeline developed by DArT P/L was used to generate allele calls for SNP and presence/absence variation (PAV) markers. Then, a set of filtering parameter was applied to select high quality markers for this specific study. One of the most important parameters is the average reproducibility of markers in technical replicates for a subset of samples, which in this specific study was set at 99.5%. Another critical quality parameter is call rate. This is the percentage of targets that could be scored as ‘0’ or ‘1’, the threshold was set at 50%. PAV’s markers were not used in this study.
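A short sketch may help decode the thresholds in that passage. Under the standard Phred definition, Q = -10 log10(P_error), so a score of 30 corresponds to 99.9% base-call accuracy and a score of 10 to 90%; the percentages in the passage as quoted appear to be attached to the opposite scores. The helper names below, and the use of `None` for an unscored target, are assumptions for illustration and have nothing to do with the proprietary DArT pipeline.

```python
def phred_accuracy(q):
    """Base-call accuracy implied by a Phred quality score q."""
    return 1 - 10 ** (-q / 10)


def read_passes(quals, min_q=30, min_fraction=0.5):
    """True if at least min_fraction of the bases reach min_q."""
    good = sum(1 for q in quals if q >= min_q)
    return good / len(quals) >= min_fraction


def call_rate(scores):
    """Fraction of targets scored as 0 or 1; None marks no call."""
    scored = sum(1 for s in scores if s in (0, 1))
    return scored / len(scores)


q30 = phred_accuracy(30)  # about 0.999
q10 = phred_accuracy(10)  # about 0.9

# Hypothetical marker scored across four samples, one of them missing.
rate = call_rate([0, 1, None, 1])
keep = rate >= 0.5  # the call-rate threshold quoted in the passage
```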

Double phew. But, cutting to the chase: they don’t sound that comparable, do they? I confess I needed help with this, but here’s the bottom line: quite different polymorphisms are being picked up by the two studies. The German work (call it method A) used a genotyping approach that is more expensive, but yields more complete data on a well-defined set of polymorphisms. The SeeD paper’s way (method B) is cheaper, much cheaper, and is better at finding new polymorphisms, but does result in more missing data. And that’s fine. Different research groups will always want to do things their own way, for a variety of both good and bad reasons.

But look at it from the point of view of the wheat community as a whole. One of the things other people who are interested in wheat (genebanks, breeders) will want to be able to do is to see how their material relates to other people’s material: whether it is more or less diverse, to what extent it overlaps in diversity, that kind of thing. So what is team C to do? Follow method A, or method B? Maybe method A and method B, just to be on the safe side? Or maybe it could use its own favourite method C, as long as at least a subset of the polymorphisms picked up by all three methods was something that everyone agreed was an adequate common denominator.

Well, that’s just the kind of decision that DivSeek is there to help team C (and D, and E…) make. The DivSeek steering committee met last month and a short report from Susan McCouch, the chair, is now available. She sees the committee’s main job in the next few months as coming up with specific ideas on how “many independent, stand-alone efforts … [can] work together under a common umbrella to apply state-of-the-art genomic, phenomic, molecular and bioinformatics tools and strategies to characterize crop diversity and to integrate and share data and information.” If that means I can skip methods sections with a clear conscience, it will be worth it.

One Reply to “Getting into the weeds on wheat genotyping”

Karl Schmid says:

July 29, 2015 at 5:52 pm

This issue is not unique to wheat but applies to many crops and other model and non-model species.

As long as high-throughput, high-quality sequencing of complete genomes of individual genotypes is not the standard, one has to compromise on genotyping and sequencing methods and also allow for a diversity of methods, because they all have pros and cons.

Nevertheless, group C can in principle use the data from both groups A and B.

How?

As soon as one is available, all marker data should be mapped to a reference genome of a single genotype (and provided as a resource by DivSeek?); the map positions then allow markers to be selected from both data sets.

The next step should be to use pan-genomes of multiple reference genotypes/varieties to facilitate the identification/mapping/annotation of presence-absence variants and rearrangements. The theory and software tools that allow these things are currently being developed, and they should be available in the not-too-distant future.

Hopefully DivSeek finds a way to allow groups who have started with the characterization of genebank material to share their experiences and contribute to the development of a robust and transparent system for data storage and utilization of PGR data.
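The map-position idea in the comment above can be sketched very simply, assuming the markers from both data sets have already been placed on the same reference coordinates. Everything here is hypothetical: the function name `shared_markers`, the marker names, and the positions are made up for illustration, and real data would need to handle strand, allele coding and near-but-not-identical positions.

```python
def shared_markers(positions_a, positions_b):
    """Pair up markers from two data sets that sit at the same position.

    positions_a, positions_b: marker name -> (chromosome, base position),
    both on the same reference genome (an assumption of this sketch).
    Returns a dict: position -> (name in set A, name in set B).
    """
    name_by_position_b = {pos: name for name, pos in positions_b.items()}
    return {
        pos: (name_a, name_by_position_b[pos])
        for name_a, pos in positions_a.items()
        if pos in name_by_position_b
    }


# Hypothetical markers from an array-based set (A) and a sequencing-based set (B).
array_snps = {"array_snp_1": ("1A", 1200), "array_snp_2": ("3B", 56000)}
dartseq_snps = {"dart_snp_9": ("1A", 1200), "dart_snp_4": ("7D", 880)}

common = shared_markers(array_snps, dartseq_snps)
```

Only the markers in `common` could serve as a shared yardstick for comparing material genotyped by the two methods.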
