The pan-genome is the new genome

Our friend Ruaraidh Sackville Hamilton has kindly taken a break during his retirement to answer a burning question. Thanks, Ruaraidh, you can get back to your G&Ts now.

“What earthly use is this?” asked a well-known mutual friend in response to the recent publication of a “platinum standard pan-genome resource” for rice.

I assume he wasn’t questioning the value of reference genomes. After all, everyone knows that the Nipponbare reference genome has enabled rice scientists to do things that are still a dream for other crops. So I assume “this” refers to sequencing 12 more reference genomes for Oryza sativa, to make a total of 16.

Where to start? Suppose you’re a pathologist studying a variety with a disease resistance gene that’s completely absent from Nipponbare. What earthly use is the Nipponbare reference genome to you? None.

Or suppose you’re a diversity scientist trying to quantify diversity in the genepool of Oryza sativa by comparison against Nipponbare. You find that the more different a variety is from Nipponbare, the more missing data you have, and the less you can tell about its genome. How useless is that?

Large indels and long-range structural variation in the genome present insurmountable problems when aligning short-read sequences to a single reference genome. To get some indication of the magnitude of the problem, look at an earlier paper, “Genomic variation in 3,010 diverse accessions of Asian cultivated rice.” Coverage of 453 of these genomes was sufficiently good to enable some sort of de novo assembly and thus overcome the problem of a single reference. The “core genome” (the part of the genome that is present in all varieties tested) contains little more than half the gene families that are present in at least one accession (figure 4c). And, on average, when you pair a japonica variety with an indica variety, you get 2,878 genes that are present in only one of the two (figure 4e). That’s an awful lot of uselessness in a single reference.
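For readers who like to see the bookkeeping, here is a minimal sketch of how "core genome", "pan-genome" and "genes present in only one of a pair" fall out of a simple presence/absence table. The accession names and gene-family labels are invented toy data, not taken from either paper, and the function names are my own; real analyses work from assembled genomes and gene-family clustering, which this does not attempt.

```python
from itertools import product

# Hypothetical toy data: accession -> set of gene families found in it.
# None of these names or memberships come from the papers discussed.
presence = {
    "japonica_A": {"g1", "g2", "g3", "g4"},
    "japonica_B": {"g1", "g2", "g3", "g5"},
    "indica_A": {"g1", "g2", "g6", "g7"},
    "indica_B": {"g1", "g2", "g6", "g8"},
}


def pan_genome(table):
    """Gene families present in at least one accession."""
    return set().union(*table.values())


def core_genome(table):
    """Gene families present in every accession."""
    return set.intersection(*table.values())


def present_in_only_one(a, b):
    """Gene families found in exactly one of two accessions."""
    return a ^ b


pan = pan_genome(presence)
core = core_genome(presence)
core_fraction = len(core) / len(pan)

# Average number of genes present in only one member of each
# japonica-indica pair.
japonica = [name for name in presence if name.startswith("japonica")]
indica = [name for name in presence if name.startswith("indica")]
pair_counts = [
    len(present_in_only_one(presence[j], presence[i]))
    for j, i in product(japonica, indica)
]
mean_pair_difference = sum(pair_counts) / len(pair_counts)

# What a single reference cannot see: everything in the pan-genome
# that is absent from the chosen reference accession.
invisible_to_reference = pan - presence["japonica_A"]

print(f"pan-genome size: {len(pan)}")
print(f"core genome size: {len(core)} ({core_fraction:.0%} of pan-genome)")
print(f"mean japonica-indica difference: {mean_pair_difference}")
print(f"invisible to a japonica_A reference: {sorted(invisible_to_reference)}")
```

The last line is the point of the whole argument: whatever sits in the pan-genome but not in your single reference simply cannot be found by aligning reads to that reference.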

And look where Nipponbare sits in the phylogenetic tree shown in figure 1 of the new paper. It’s way off at one end, highly unrepresentative of the species.

And look at the genome sizes in Table 3. Genomes of the japonica group (which includes Nipponbare) are on average around 12 million base pairs shorter than those of the indica group (which is the more important group in tropical agriculture). That converts to a lot of missing genes.

So, rather than ask “What earthly use is this?”, I’d turn it around and say “Why has it taken so long to get here?”. As long as we are constrained to short reads for low-cost high-throughput sequencing, we need multiple reference genomes for every crop, so that we can build a pan-genome per crop.

The pan-genome is the new genome
