In June 2000, Bill Clinton, the then US president, stood smiling next to the leaders of the Human Genome Project. “In genetic terms, all human beings, regardless of race, are more than 99.9% the same,” he declared. That was the message when the first draft of the human genome sequence was revealed at the White House.
The single string of As, Ts, Cs and Gs eventually became the first human reference genome. Since its publication in 2003, the reference has revolutionized genome sequencing and helped scientists find thousands of disease-causing mutations. Yet at its core is a somewhat ironic problem: the code meant to represent the human species is mostly based on just one man from Buffalo, New York.
Although humans are very similar, “One person is not representative of the world,” says Pui-Yan Kwok, a specialist in genome analysis based at the University of California, San Francisco and Academia Sinica in Taiwan. As a result, most genome sequencing is fundamentally biased.
This bias limits the kind of genetic variation that can be detected, leaving some patients without diagnoses and potentially without proper treatment. What is more, people who share less ancestry with the man from Buffalo will probably benefit less from the incoming era of precision medicine, which promises to tailor healthcare to individuals.
To combat this, researchers have started to assemble reference genomes for specific countries, including South Korea, Japan, Sweden, Denmark and the United Arab Emirates. They hope this will serve their populations better, but critics worry it could turn migrants into second-class citizens in their healthcare systems. Now, a huge new project is offering a different solution with the aim to represent global diversity: a human pangenome.
Precision medicine, also known as personalized medicine, has been a buzzword within the medical community for years and it undeniably sounds good. “Getting the right medicine to the right patient at the right time is the tagline,” says Neil Hanchard, a physician scientist at the US National Human Genome Research Institute.
But standard genome sequencing misses a lot of variation that could be connected to disease. In most cases, it works by chopping DNA into small bits known as “short reads”, before sequencing them and organizing them into a genome using the reference as a guide.
Single nucleotide variants (SNVs) – a change from a C to a T in the code of a gene, say – are mostly easy to spot this way, but larger chunks of variation known as structural variants (SVs) are trickier. New sections, sometimes hundreds or thousands of base pairs long, can go undetected, as can sections that are missing, reversed or moved somewhere else. In those cases, short reads cannot easily be mapped to the reference and “a whole bunch”, says Kwok, are thrown away.
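The way inserted sequence gets lost can be illustrated with a toy example. The sketch below is a deliberate simplification: the sequences are made up, the function names are hypothetical, and reads are matched to the reference exactly, whereas real aligners tolerate some mismatches. It only shows why reads that overlap an insertion absent from the reference have nowhere to go.

```python
# Made-up sequences for illustration only.
REFERENCE = "ACGTACTTGACAGGCTA"
# The sample carries a six-base insertion that the reference lacks.
SAMPLE = REFERENCE[:6] + "GGGTTT" + REFERENCE[6:]


def short_reads(seq, k):
    """Chop a sequence into every overlapping read of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def map_reads(reads, ref):
    """Split reads into those found verbatim in the reference and those not found."""
    mapped, unmapped = [], []
    for read in reads:
        pos = ref.find(read)  # -1 when the read does not occur in the reference
        if pos != -1:
            mapped.append((read, pos))
        else:
            unmapped.append(read)
    return mapped, unmapped


mapped, unmapped = map_reads(short_reads(SAMPLE, 5), REFERENCE)
print(len(mapped), "reads placed;", len(unmapped), "reads with no home")
```

In this toy case, every read that fails to map is one that touches the inserted bases, so a pipeline that discards unmapped reads would never report the insertion.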
This means that standard genome sequencing is biased towards the SVs already in the reference. If your SVs differ, you end up with a sequence that does not fully capture your personal variation. As it is these small differences between people that we hope will tell us, for example, why one person might respond well to a medicine but another person will not, that is bad news.
Kwok’s work hints at the number of SVs going undetected. In 2019, his team analyzed samples from 154 people around the world and found 60m base pairs-worth of SV genome content missing from the reference, with much more still out there. A follow-up of 338 people that looked only for extra inserted DNA found nearly 130,000 new sequences.
But SVs also appear to show different frequency patterns in different populations. By extension, says Kwok, if a person “is from a population quite different from the person from which the genome reference is derived, there will be more misalignment” when their short reads are mapped to the reference. Consequently, he says: “We may miss risk variants in those regions not represented in the reference.”
This lack of representation is a general problem in genomics. Even the more studied SNVs show large data gaps. Recently, for example, Hanchard and his colleagues sampled 426 individuals from 50 ethnolinguistic groups across Africa and found more than 3m new SNVs, mostly from populations that had never been sampled before. “We haven’t even touched [SVs],” says Hanchard, “but our preliminary data suggests it’s going to be more of the same.”
Such data disparities directly affect medical outcomes. For example, if a person with a rare variant has a rare disease, there is a good chance the variant is responsible. But often we do not know whether variants are genuinely rare, or just common in understudied populations. In those cases, doctors cannot give a diagnosis. “For persons with non-European ancestry, that occurs a lot more,” says Hanchard.
As we move into an era of precision medicine, that will only become more important. Kári Stefánsson, whose Reykjavik-based biotechnology company DeCode Genetics specializes in connecting the dots between genetic variants and disease, says that what keeps him up at night is that our understanding of diversity within populations of European descent is now so good that we can start to use it for precision medicine. But for other populations, “We do not have the same kind of data,” he says. “[This] is going to increase healthcare disparities above and beyond what they are today.”
While there are no genetic underpinnings that meaningfully group people into different races, some believe it makes sense to create references to capture the variation within specific populations, such as ethnic groups and nation states. One country that now has its own reference is Denmark.
“What we see is that there is a lot of variation that [has only been detected in] the Danish population,” says computational biologist Simon Rasmussen of Copenhagen University, who led the work. That is a strong argument for a local reference, and the appeal is obvious: a reference based on Danes is uniquely positioned to supercharge the Danish healthcare system.
But some criticize national genomes for focusing too much on differences between populations, rather than individuals. Medical anthropologist Emma Kowal of Deakin University in Victoria, Australia, worries that national genomes might “keep the idea of race alive”. And framing genomes in terms of nationality does inevitably lead to exclusion, says Jenny Reardon, a sociologist of the life sciences based at the University of California, Santa Cruz. “We are deciding, in effect, who is Danish and who is not.”
Rasmussen admits the reference would be less useful for the 15% of the Danish population who are migrants or their descendants. Samples from people with mixed ancestry were even removed during the selection for the reference. But because of consent problems the reference never made it to the clinic, so Rasmussen and his team want to create another. For that, he says: “We want to take a different [selection] approach.” Exactly how is yet to be determined.
There is an alternative to the national genomes, though. Instead of zooming in on different populations, the Human Pangenome Reference Consortium wants to zoom out, overlaying many genomes to create a reference that has variation built into it: a pangenome. The consortium recently published the first draft of such a reference in a preprint.
Made up of 47 exquisitely detailed genomes, the draft represents the first chunk of the 350 genomes it is planning to sequence to include the most common variation across the world. “This is not a standard that has ever been performed before,” says Karen Miga of the University of California, Santa Cruz, who is part of the consortium.
But the project is not just about sequencing more diverse data. “We need to come up with a better data structure to encode that information,” says Miga’s colleague Ting Wang of Washington University School of Medicine in St. Louis, Missouri.
That data structure is called a genome graph. In contrast to the current reference, which is just a long string of letters, the genome graph shows variation between genomes as detours on an otherwise shared path. That will enable researchers and doctors to map short reads to the version of the path that best fits their sample.
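A toy version of the idea can be sketched in a few lines. Everything here is an assumption made for illustration: the node names, the sequences and the exact-match rule are invented, and this is not the consortium's format or software. The graph has a shared path (n1 to n3) and one detour (n2) standing in for extra sequence that only some genomes carry.

```python
# Hypothetical toy graph: node ids and sequences are invented for illustration.
NODES = {
    "n1": "ACGTAC",
    "n2": "GGGTTT",  # detour: sequence present in some genomes only
    "n3": "TTGACA",
}
EDGES = {"n1": ["n2", "n3"], "n2": ["n3"], "n3": []}


def paths(node, end):
    """Yield every route from node to end as a list of node ids."""
    if node == end:
        yield [node]
        return
    for nxt in EDGES[node]:
        for rest in paths(nxt, end):
            yield [node] + rest


def spell(path):
    """Join the sequences of the nodes along a path."""
    return "".join(NODES[n] for n in path)


def best_path(read):
    """Return the first path whose sequence contains the read exactly, else None."""
    for path in paths("n1", "n3"):
        if read in spell(path):
            return path
    return None


print(best_path("TACGGGT"))  # read that crosses into the detour
print(best_path("TACTTG"))   # read that follows the shared path
```

A read that crosses into the detour has no match on the single linear string formed by n1 and n3 alone, but the graph offers it a path to sit on. Real graphs are far larger and use approximate matching, which this sketch leaves out.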
The natural question is: how does one choose who gets to represent the world? The first genomes qualified because of their high technical quality, but the consortium will need to choose new samples in the future. Since Africa is the cradle of humanity, Miga says: “The vast majority of the genomes that we are including are of African ancestry.”
From Reardon’s perspective, however, 350 people might do a better job of representing the world than one person, but “[the consortium] have made some choices about groups,” she says. “Who did they sample? Who did they not sample?” As long as the reference contains only a subset, arguably someone will not make the cut.
Miga does not deny that. “[We are] really trying to capture common variation at a global level, so things you would see quite frequently,” she says. Documenting common variation in this case leaves out uncommon variation. “If you’re looking for something extremely rare,” she says, “that is not our charge at the moment.”
In an ideal world, individuals would have their genomes sequenced without the use of a reference. This has long been held up as the ultimate, problem-free solution, but hardly anyone believes that it is on the cards. “It’s not a trivial undertaking and I don’t see it being [trivial] in 10 years’ time,” says Hanchard.
And rather than using a broad, global pangenome, countries might be swayed by a reference more tuned to their population, as well as maintained and controlled by themselves. “We don’t really expect anyone other than the Danes to make a Danish reference genome,” says Rasmussen, who hopes the next iteration will be run by Denmark’s state-controlled National Genome Centre, potentially as part of the EU’s Genome of Europe project.
Hanchard also sees the benefit of local or regional references. “[The pangenome] is not going to have all the variations represented,” he says. He is part of the H3Africa consortium, which aims to bring the benefits of genomics to Africa and is considering an Africa-specific genome graph. At the same time, he expects all these references will probably eventually coalesce.
When asked about his hopes for the future of genomics, he speaks of knowing and understanding the variation as it relates to himself, or anyone else with Jamaican ancestry. “I would love to get to a point where everyone feels represented and that this is for them, as much as it is for any particular group,” he says. “We are from one humanity, that’s the important part.”