by qyph 1790 days ago
Is this for some subset of our DNA? A quick skim of the article didn't tell me. But there is a popular factoid that human DNA is 98.8% similar to chimpanzee DNA on average. Is that factoid false?

The popular factoid is correct, but the confusion here is that these are different measurements. Human and chimp genomes are similar in that if you align all the bases [A,C,T,G] that can be unambiguously aligned between the two genomes, 98.8% of the bases are identical. For modern humans versus Neanderthals, that number is 99.7%, and between two random modern humans, it would be ~99.9% on average.
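
To make "align the bases and count the identical ones" concrete, here is a minimal sketch of that kind of measurement. It assumes the two sequences are already aligned to the same length with '-' marking gaps; the function name and the toy sequences are made up for illustration and are not from the paper.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of identical bases over columns where both sequences have a base.

    Assumption: seq_a and seq_b are pre-aligned (equal length), with '-' for gaps.
    Columns containing a gap or a non-ACGT symbol are skipped, which stands in
    for "bases that can be unambiguously aligned".
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a == b:
                identical += 1
    if compared == 0:
        raise ValueError("no alignable bases")
    return 100.0 * identical / compared


# Toy, invented sequences: 11 comparable columns, 10 of them identical.
toy_human = "ACGTACGT-ACGT"
toy_chimp = "ACGTTCGTAAC-T"
print(f"{percent_identity(toy_human, toy_chimp):.1f}%")  # prints 90.9%
```

Note that the gap columns drop out of the denominator entirely, which is why insertions and deletions don't show up in a number like 98.8%.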

This paper is asking a subtly different question - how much of the modern human genome is strictly human, not by simply lining up bases and running a diff, but by looking at the inheritance of chunks of DNA ("haplotype blocks", size determined by processes of recombination, etc.) to try to understand how much and which regions of the modern human genome came from interbreeding with Neanderthals or Denisovans. There was variation in the pre-human population before the human/Neanderthal split, which means that if you compare just a single human to a single Neanderthal, you'll find variants unique to each. However, most of those variants will have existed in both the human and Neanderthal populations, so they should count neither as uniquely human nor Neanderthal (known as Incomplete Lineage Sorting, or ILS).

The chunks in modern humans that derive from Neanderthals or Denisovans are different in different people and broadly across population groups (e.g. highest percent introgressed in Melanesians, lowest in Africans). But across all the modern humans in the study, there are regions where Neanderthal/Denisovan inheritance or shared variation (ILS) was never seen - that's 7% of the genome ("deserts"). And just 1.5% of the genome was in chunks where modern humans commonly have a unique mutation compared to Denisovans/Neanderthals.

This comment is better quality content than the article. I just wish I understood more of it.
Human genetics/genomics papers assume you understand an enormous amount of detail about human genetics, the knowledge of which has been hard-won over a century. I worked in this field for decades, know a ton, and still have to read every sentence in this paper carefully, multiple times, and check various sources to remind myself of technical concepts. There's a reason I switched to working on computing full time - a lot of these state of the art arguments can only be contributed to by people who are, well, state of the art, thick-skinned, and well-funded.
If you’re a layperson wrt genetics, check out ‘The Gene’ by Siddhartha Mukherjee - it’s not too dense, it's interspersed with personal stories he relates to genetics, and it's a great read. His book on cancer, ‘The Emperor of All Maladies’, is also amazing.
In my limited experience, these are two of the best nonfiction books I’ve read. The author makes these subjects utterly fascinating. If he released a book about watching paint dry, I’d buy it in a heartbeat and read it.
1.5% of the genome sounds like a crazy big number. If that much of the genome went the wrong way, we could be dragons.
Would it be safe to say it's not the percentage of difference that matters, but the role of DNA that's different?

The point being, not all DNA is equal so to speak. That a couple of changes can have massive impact?

Yes, a single base change can give you a disease, cause intellectual disability, or mean you are never born. There are mutations we never see in adults, as they are embryonically lethal.

Generally, you can tell what matters by seeing if it's been under selection - i.e., the frequency of that version in the population changes more than it would by chance.

Thanks, your comment makes the basic differences clear to me.
> However, most of those variants will have existed in both the human and neanderthal populations, so they should count neither as uniquely human nor neanderthal (known as Incomplete Lineage Sorting, or ILS)

This makes a hash of the headline "Thanks to interbreeding, just 7% of our DNA is unique to modern humans" -- this would be just as true if there had never been any interbreeding between "modern humans" and their various sister lineages.

From the paper [0] they say in the abstract that the differences are 1.5% - 7%. So the lower end of that lines up with the 98.x% similarity.

Then in the paper there is this paragraph:

> Our ARG strategy allows us to bin the human genome into regions containing archaic admixture in at least some humans, regions of ILS, and regions free of both archaic admixture and ILS in all humans (hereafter archaic “deserts”). We find that approximately 7% of the human autosomal genome is human-unique and free of both admixture and ILS. Roughly 50% of the human genome contains regions where one or more humans has archaic ancestry obtained through admixture. If deserts are further restricted to regions that contain a high-frequency, human-specific derived allele, i.e., a substitution that can be assigned to the human lineage (hereafter “human-specific regions”), then these comprise only 1.5% of the assayed genome (Fig. 4A).

Maybe someone here understands what these words mean and can clarify?

[0]: https://advances.sciencemag.org/content/7/29/eabc0776

I asked the author (I was a postdoc at the lab he was a grad student in about 20 years ago).

My question: """Are these differences evaluated/inferred using data from all regions of the genome (intergenic, viral repeats, etc) or just genes? I recall that the early reports that compared primates to humans just used genes (or maybe just the easily aligned regions, but out of order) which seemed like a big omission."""

His answer: """We used the Simons Genome Diversity panel (full phased genomes for ~300 people), along with Neanderthal and Denisovan genomes to make an ancestral recombination graph (ARG). The ARG is a sequence of trees describing relationships between everyone all along the genome. It's really just a sequence of trees at each variable site. Then, you can look at these trees and find segments where the archaics fall outside the variation of the humans. These are regions where no human shares ancestry with archaic either by recent admixture or by incomplete-lineage sorting. Turns out that's about 7% of the genome. What's in that 7%? It's a lot of genes and specifically a lot of genes involved in neural development and neural function! The method itself is blind to what is genic or nongenic. But this method is about the genealogy of genes across the genome, i.e., from whom they were inherited and not necessarily how different the versions were. In other words, it's about the topology of the trees across the genome, not their branch lengths."""

Beyond that, things start to get really complicated: you need to understand concepts like haplotype blocks, how new genes arise, etc.
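
For what it's worth, the "topology, not branch lengths" idea in his answer can be sketched in a few lines. This is a toy simplification and not the paper's actual method: it assumes each local tree is a nested tuple of leaf names covering a genomic interval, and it treats "archaics fall outside the variation of the humans" as "the human samples form a clade that contains no archaic sample". The sample names, intervals, and function names are all invented.

```python
def leaves(tree):
    """Set of leaf names under a tree given as a leaf string or a tuple of subtrees."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= leaves(child)
    return result


def is_clade(tree, group):
    """True if some subtree has exactly `group` as its leaf set."""
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_clade(child, group) for child in tree)


def desert_fraction(local_trees, humans):
    """Fraction of covered bases whose local tree keeps all humans in one clade."""
    total = sum(end - start for start, end, _ in local_trees)
    desert = sum(end - start for start, end, tree in local_trees
                 if is_clade(tree, humans))
    return desert / total


# Toy data (hypothetical): (start, end, local tree) for three intervals.
humans = frozenset({"H1", "H2", "H3"})
toy_trees = [
    (0, 700, ((("H1", "H2"), "H3"), ("Nea", "Den"))),     # humans form a clade
    (700, 900, ((("H1", "Nea"), "H2"), ("H3", "Den"))),   # H1 groups with Neanderthal
    (900, 1000, (("H1", ("H2", "H3")), ("Nea", "Den"))),  # humans form a clade
]
print(desert_fraction(toy_trees, humans))  # prints 0.8
```

Nothing here looks at how many bases differ; only the shape of each tree matters, which is why this number can't be compared directly with the 98.8% figure.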

The chimpanzee factoid was distorted in a way that made the number look much larger than it really was. It only compared highly conserved regions in common between both organisms.
So what would be the 'real' number?
dunno how you would measure it, for example if humans have an extra chromosome with a bunch of novel genes, how does that count? Or if the genes are identical but scrambled in order?

It's sort of not even a really useful metric unless defined carefully.

I’d heard that as well and thought that 7% unique seemed extraordinarily high compared to the factoids I’d integrated over time.
I just spent some time looking for the source, but I seem to have lost it, so take this all with a grain of salt, but based on some article I read:

My understanding of the "shared DNA" factoids is that they are based on a very rudimentary analysis of DNA sequences. The human-chimp 98.8% thing is "true" in that 98.8% of sequences that appear in the human genome also appear in the chimp genome. But this ignores other axes of differences: Humans may have multiple copies of certain sequences, while chimps only have one. Or certain sequences may appear in completely different parts of the genome in one or the other. And all of these differences are relevant.

TL;DR: 98.8% is true, but not really relevant for determining how "different" we are from chimps.
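
A rough sketch of why a presence-based number can hide copy-number and ordering differences. It assumes each genome is reduced to an ordered list of labeled segments, which is a big simplification; the labels, toy genomes, and function names are all made up.

```python
from collections import Counter


def presence_similarity(a, b):
    """Fraction of distinct segments in `a` that also occur anywhere in `b`."""
    distinct = set(a)
    return len(distinct & set(b)) / len(distinct)


def copy_number_changes(a, b):
    """Segments whose copy counts differ, mapped to (count in a, count in b)."""
    count_a, count_b = Counter(a), Counter(b)
    return {seg: (count_a[seg], count_b[seg])
            for seg in sorted(set(a) | set(b))
            if count_a[seg] != count_b[seg]}


def first_occurrence_order(genome, keep):
    """Order in which the segments in `keep` first appear along `genome`."""
    seen = []
    for seg in genome:
        if seg in keep and seg not in seen:
            seen.append(seg)
    return seen


# Toy genomes (hypothetical): same segments, one duplication, different order.
toy_human = ["A", "B", "B", "C"]
toy_chimp = ["C", "A", "B"]
shared = set(toy_human) & set(toy_chimp)

print(presence_similarity(toy_human, toy_chimp))  # prints 1.0
print(copy_number_changes(toy_human, toy_chimp))  # prints {'B': (2, 1)}
print(first_occurrence_order(toy_human, shared)
      == first_occurrence_order(toy_chimp, shared))  # prints False
```

So the two toy genomes score 100% "shared" while still differing in copy number and arrangement, which is the sense in which a single percentage needs a careful definition.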

Nitpick: factoids are unverified and mostly inaccurate and false. Facts are true.

Edit: corrected factoid definition to be a bit more accurate.

Isn’t what you just said technically a factoid?
How is this a nitpick? He used the word factoid correctly.