by jshorty 1031 days ago
Could someone explain exactly what it means to "completely sequence" the human genome when all humans have distinct genetic makeup (i.e., different sequences of nucleobases in their DNA/RNA)?
12 comments

The public Human Genome Project used a group of people, but most of the sequence library was derived from a single individual in Buffalo, NY. The Celera project also used a group of people, but it was mostly Venter's genome.

https://www.nytimes.com/2002/04/27/us/scientist-reveals-secr...

I believe more recent sequencing projects have used a wider pool of individuals. I think some projects pool all the individuals and sequence them together, while others sequence each individual separately. This isn't really much of a problem, since the large-scale structure is highly similar across all humans and we have developed sophisticated approaches to model the variations in individuals. See https://www.biomedcentral.com/collections/graphgenomes for an explanation of the "graph structure" used to represent alternatives in the reference, which can include individual single-nucleobase differences as well as more complex ones such as large deletions in one individual, rearrangements, and even inversions.

We really should say "a human genome". Reference genomes serve as a Rosetta Stone of genomics. So we can take DNA/RNA sequences from other individuals and align (pattern match) them to the reference as a way of understanding and comparing individuals.

It is not perfect, as a reference can be missing DNA regions, or those regions can be highly variable between individuals. The goal of the Human Pangenome Reference Consortium (HPRC) https://humanpangenome.org/ is to sequence individuals from different populations to address this issue. We are also working to develop new computational models to support analysis of data across populations.

They mean they have obtained the complete sequence for a particular Y chromosome that is considered to be a "reference" chromosome. This is similar to what was done for all the other chromosomes.
I've never understood this either. I assume the genome is many megabytes of [ATCG]+. If we have that sequence, what does it tell us? Do we look at it and say "Ah, yes, ...ATGCTACGACTACGACTAGCG... very interesting?"
Many genes are highly conserved or consistent enough. E.g.: if there's a 1% difference between two people, then it's a bit like two very unique sentences that have a couple of small typos. They're still recognisable, and it's also still pretty obvious that they're the "same".

A gene sequence allows researchers to determine the amino acids that are coded for, and from those, which proteins match which genes.

This can be matched up with genetic diseases. If you know that damage to a certain location in a chromosome causes a problem with a certain biological process, then ergo, the associated protein is needed for that process!

So: genetic illness -> gene sequence -> protein -> role in the body

Without sequencing, that chain can't be built.

But you can only know that by having a large sample of very “stable” (have few genetic irregularities) gene samples compared to a large pool of samples from people with very narrow and pronounced gene irregularities, right?

Is this why it’s so hard? This feels more like a healthcare record-keeping problem and less like an “actually reading the data” problem.

I can’t help but feel like some form of single payer healthcare is truly the way out of this problem. One where all disease record keeping is uniform and complete.

Single payer healthcare here (UK) is still subject to privacy controls in a way which would make it very difficult to do that.

(Also our health system's IT is a hellscape, but one reason for that is that people would literally rather not have a working system at all, than one with less than impeccable privacy controls.

Personally I'd gladly sacrifice a fair bit of medical privacy in return for giving scientists greater insight into disease processes, but the average citizen here wants advanced healthcare without giving their data to research scientists. /facepalm )

I trust the scientists, the problem isn't them. Look at the whole abortion data scandal in the U.S.
The problem there is the US' insane theocrat-conservatives (or just misogynist assholes hiding behind a thin veneer of religious justification, as the case may be).

I'm not saying a health IT system should have no privacy controls either. But the requirements for such controls need to be balanced against having a system that actually works, and that means having some people who actually understand the tech, and the workings of hospitals, having a role in requirements conversations. Instead it was dominated by MPs, "patient advocacy" groups and privacy campaigners, none of whom know or care anything about how to build a workable system.

> But you can only know that by having a large sample of very “stable” (have few genetic irregularities) gene samples compared to a large pool of samples from people with very narrow and pronounced gene irregularities, right?

No- A "gene" isn't an A/G/C/T- it's a sequence of 1000-1000000 base pairs. Each gene has a well-defined start/stop sequence called a start/stop codon. When people have genetic differences, typically one of the tens of thousands of base pairs in that gene is different (an SNP- single nucleotide polymorphism). Even for genes that are entirely "missing" in some people, they're really just different in a way that makes them nonfunctional.

Does that make it obvious how sequencing all those genes is useful, even if everyone has different genes? It tells us 99.999% of how proteins are coded, even if individual variation is the other .001%.

It’s actually about 3 gigabases (ATCG). There are some recurrent features of the genome whose function we’ve worked out. For example the TATA box is a classic sequence that typically indicates the start of a part of the genome that codes for a protein. The vast majority of the genome doesn’t code for proteins. The function of these genome regions is much more murky. Some of these regions function like scaffolds for proteins to assemble into complexes. These protein complexes then start transcribing the genome into mRNA. So the genome regulates its own expression, in a sense. Many of the sequences that function in this way are known. There are also just a bunch of parts of the genome that probably don’t do anything. There are also many regions of the genome that are basically self-replicating sequences. They code for proteins that are capable of inserting their own genetic sequence back into the genome. These are transposons.

In short, a lot of very painstaking genetics and molecular biology work has gone into characterizing the function of certain sequences.

Also interesting are HERVs - human endogenous retroviruses which integrated into the human or our ancestor species’ genomes. They have degraded over time, so none of the human HERVs seem to be capable of activating, but there are some in other mammals that can fully reactivate.

In humans, even though HERVs don’t reactivate into infectious viruses, they have been implicated in both harmful (senescence during aging[0]) and beneficial (protection from modern retroviruses[1]) activities in the body.

They might be up to 8% of the human genome.

0: https://www.cell.com/cell/pdf/S0092-8674(22)01530-6.pdf

1: https://www.microbe.tv/twiv/twiv-956/

For the same reason Monsanto sequences basically anything: Because we can tell what proteins are encoded in there, and what is near them, and we can have good ideas of what proteins are expressed together. When dealing with genetic modification, we get to see whether our modification went in, and where it landed: Having a protein in a genome isn't enough. Its expression might be having an effect on other things, depending on where it is.

When we have baselines, we can compare different individuals, and eventually make predictions of how they are going to be based solely on the genetic code. If I know that a certain polymorphism is tied to some trait I want, I might not have to even bother spending the time growing a plant: I know that it's not what I want, and discard it as a seed.

With humans we are probably not going to see much modification soon, but just being able to detect genetic diseases, risk factors for other diseases that have genetic components, or allow for selection of embryos in cases of artificial insemination is already quite valuable.

It's not source code that we are all that good at understanding just yet, but there's already some applications, and we have good reason to think there's a lot more to come

It's just about 3 gigabytes (each byte a letter). Pretty mind-blowing, if you ask me.
It's a slight exaggeration of the information content to report the data size using an ASCII encoding. Since there are 4 bases, each can be encoded using 2 bits, rather than 8. So we're really talking 750 megabytes. But still mind-blowing.
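
For anyone who wants to see the arithmetic, here's a minimal sketch of 2-bit packing. It assumes the sequence contains only A, C, G and T (real assemblies also contain N and other ambiguity codes, which this toy ignores), and the names `pack` and `unpack` are just illustrative.

```python
# Sketch: store each base in 2 bits, so 4 bases fit in one byte.
# Assumption: the input contains only A, C, G, T.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, 4 bases per byte (low bits first)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed bytes."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n))


seq = "ATGCTACGACTACGACTAGCG"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(seq), "bases ->", len(packed), "bytes")

# Back-of-the-envelope size for ~3 billion bases at 2 bits each.
genome_bases = 3_000_000_000
print(genome_bases * 2 // 8 // 10**6, "MB")
```

The length has to be stored separately, because the last byte may be only partly used.
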
And since the data is highly redundant, the 750MB can be compressed down even further using standard approaches (DEFLATE works well; it uses both Huffman coding and dictionary backreferences).

Or, you could build an embedding with far fewer parameters that could explain the vast majority of phenotypic differences. the genome is a hierarchical palimpsest of low entropy.

My standard interview question- because I hate leetcode- walks the interviewee through compressing DNA using bit encoding, then using that to implement a rolling hash to do fast frequency counting. Some folks get stuck at "how many bits in a byte", others at "if you have 4 symbols, how many bits are required to encode a symbol?", and other candidates jump straight to bloom filters and other probabilistic approaches (https://github.com/bcgsc/ntHash and https://github.com/dib-lab/khmer are good places to start if you are interested).
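
A rough sketch of where that exercise can end up, assuming an A/C/G/T-only input: with 2 bits per base, a k-mer is just an integer, and sliding the window by one base is a shift, an OR and a mask. This is a plain exact encoding, not the ntHash algorithm from the linked repo, and `count_kmers` and `decode` are made-up names.

```python
from collections import Counter

# Assumption: input contains only A, C, G, T.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq: str, k: int) -> Counter:
    """Count k-mers using a rolling 2-bit encoding as the key."""
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases
    counts: Counter = Counter()
    h = 0
    for i, base in enumerate(seq):
        # Shift the previous window left by one base and append the new one.
        h = ((h << 2) | CODE[base]) & mask
        if i >= k - 1:  # the window is full from the k-th base onwards
            counts[h] += 1
    return counts


def decode(h: int, k: int) -> str:
    """Turn an integer key back into its k-mer string."""
    return "".join("ACGT"[(h >> (2 * (k - 1 - j))) & 3] for j in range(k))


counts = count_kmers("ACGACGT", 3)
for key, n in counts.most_common():
    print(decode(key, 3), n)
```

Each step costs O(1) regardless of k, which is the point of the rolling update; the probabilistic structures mentioned above come in when the table of exact counts gets too large to hold in memory.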

I'm curious if these 750MB + the DNA of mitochondria + the protein metagenomics contain all the information needed to build a human, or if there's extra info stored in the machinery of the first cell.

That is if we transfer the DNA to an advanced alien civilization - would they be able to make a human.

This is a complex question. The cocktail soup in a gamete (sperm or egg) and the resulting zygote contains an awful lot of stuff that would be extremely hard to replace. I could imagine that if the receiving civilization was sufficiently advanced and had a model of what those cells contained (beyond the genomic information) they could build some sort of artificial cell that could bootstrap the genome to the point of being able to start the development process. It would be quite an accomplishment.

If they just received the DNA without some information about the zygote, I don't think it would be practical for even an advanced alien civilization (LR5 or LR6), but probably an LR7 and definitely an LR8 could.

I’m just pondering this, and it’s not clear to me that there is anything intrinsic in the genome itself that explicitly “says” “this sequence of DNA bases encodes a protein” or even “these three base-pairs equate to this amino acid”.

I wonder if that information could ever really be untangled by a civilisation starting entirely from scratch without access to a cell

What do you mean by "LR"? I queried an LLM but no results there either.
The code how to build a sperm and an egg is inside the human DNA, isn't it?
Imagine a machine shop that has blueprints of components of the machines they use in the shop, and processes to assemble machines from the components. When a machine shop grows large and splits in two, each inherits a half of shop with the ongoing processes and a copy of the blueprints. https://m.youtube.com/watch?v=B7PMf7bBczQ&pp=QAFIAQ%3D%3D

DNA is the blueprints. There are infinite possibilities for what to do with them. The advanced civilization would need additional information, like that they are supposed to create a cell from the components to begin with, and a lot of detailed information on how exactly to do that.

Edit: improved clarity

"if we transfer the DNA to an advanced alien civilization - would they be able to make a human."

I'm really surprised that in all these responses to your question no one's mentioned the womb or the mother, who (at least with current technology) is still necessary for making a human.

That's not to mention the necessity of the egg.

We're not just DNA.

This is a question about theoretical possibilities and what you're saying seems to be a rigid belief in an answer "no". But you provided no evidence or justification, except for "with current technology", which answers nothing about the theoretical question.
Artificial wombs have come quite a long way! It is not inconceivable that you could bring a zygote to term in an artificial womb.
Instructions on how to make a womb and an "egg" are contained within the human DNA.
> That is if we transfer the DNA to an advanced alien civilization - would they be able to make a human.

You'd need a cell to start the process, with the various nucleic acids distributed correctly and proteins/energy with which to create further proteins using the information encoded by the DNA. Thus the civilization would need information about cells and a set of building blocks before being able to use the DNA.

The DNA contains all the code that creates and regulates the proteins.
I can’t wait until we can bootstrap a human from a stage 3 tarball.
Yes, there is extra information in the first cells, in particular regulatory elements such as miRNAs. The headline here is epigenetics.
There's also some interesting work on understanding the role of loops in the physical structure of the DNA storage on gene expression. [0] The base sequence of the DNA isn't everything; it may also matter how the DNA gets laid out in space, a feature which can be inherited.

[0] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2638769/

Our DNA does not contain the mitochondria nor the gut bacteria so the raw data would most certainly not be enough to build a working copy
It's a bit like: if I have the source code of Linux (think DNA), can I build a machine running Linux (think cell)? No, you can't; you need to have a machine that can run the code.

I.e., "software" without a "machine" to run it on is kind of useless.

Yes, and if you gzip it it's even smaller. But the big takeaway is that the amount of info that fully defines a human, is what we consider "not much data," even in its plainest encoding.
We don't know that it fully defines a human until we can create one without the starting condition of being inside another human. It's prototype-based inheritance.
Some of the research about being able to make simple animals grow structures from other animals in their evolutionary “tree” by changing chemical signaling—among other wild things like finding that memories may be stored outside the brain, at least in some animals—makes me think you need more than just the “code” to get the animal that would have been produced if that “code” were in its full context (of a reproductive cell doing all sorts of other stuff). Even if the dna contains the instructions for that reproductive cell, too, in some sense… which instructions do you “run”? There might be multiple possible variants, some of which don’t actually reproduce the animal you took the dna from.
bzip2 is marginally better, and then genome-specific compressors were developed, and then finally, people started storing individual genomes as diffs from a single reference, https://en.wikipedia.org/wiki/CRAM_(file_format)
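
To make the "diff from a reference" idea concrete, here's a toy sketch. It only handles single-base substitutions between two equal-length sequences; real formats like CRAM store aligned reads and also deal with insertions, deletions and quality scores, none of which is modelled here. The function names are illustrative.

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, sample_base) for every position where sample differs."""
    if len(reference) != len(sample):
        raise ValueError("this toy only handles substitutions (equal lengths)")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference: str, diff: list[tuple[int, str]]) -> str:
    """Rebuild the sample sequence from the reference plus its diff."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
diff = diff_against_reference(reference, sample)
print(diff)
assert apply_diff(reference, diff) == sample
```

If two people differ at well under 1% of positions, the diff is far smaller than the sequence itself, which is why this beats general-purpose compression.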

Since genome files contain more data than just ATGC (typically a comment line, then a DNA line, then a quality score line), and each of those draws from a different distribution, DEFLATE on a FASTQ file doesn't reach the full potential of the compressor because the Huffman table ends up having to hold all three distributions, and the dictionary backlookups aren't as efficient either. It turns out you can split the file into multiple streams, one per line type, and then compress those independently, with slightly better compression ratios, but it's still not great.
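
A minimal sketch of the stream-splitting idea, using Python's `zlib` (which implements DEFLATE). The record layout described above (header, bases, quality) is what FASTQ files use, so this assumes four-line FASTQ records with a bare `+` separator line; the helper names are made up, and whether the split actually wins depends on the data.

```python
import zlib


def split_streams(fastq_text: str) -> dict[str, str]:
    """Split four-line FASTQ records into one text stream per line type."""
    lines = fastq_text.strip().split("\n")
    return {
        "headers": "\n".join(lines[0::4]),
        "bases": "\n".join(lines[1::4]),
        "qualities": "\n".join(lines[3::4]),  # lines[2::4] is the bare "+"
    }


def compress_streams(streams: dict[str, str]) -> dict[str, bytes]:
    """DEFLATE each stream on its own, so each gets its own Huffman tables."""
    return {name: zlib.compress(text.encode("ascii"), 9) for name, text in streams.items()}


def join_streams(streams: dict[str, str]) -> str:
    """Interleave the streams back into FASTQ text."""
    records = []
    for header, seq, qual in zip(
        streams["headers"].split("\n"),
        streams["bases"].split("\n"),
        streams["qualities"].split("\n"),
    ):
        records += [header, seq, "+", qual]
    return "\n".join(records) + "\n"


fastq = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nFFFF\n"
streams = split_streams(fastq)
compressed = compress_streams(streams)
restored = {name: zlib.decompress(blob).decode("ascii") for name, blob in compressed.items()}
assert join_streams(restored) == fastq
```

On a file this tiny the per-stream overhead dominates; the comparison against compressing the whole file in one go only becomes meaningful on real data.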

You could say exactly the same of all data; it's just 1s and 0s, but when I look I just see blonde, brunette.
If you think like an ML engineer, the genome is a feature vector 3B bases (or 6B binary bits) long that is highly redundant (many sections contain repeats and other regions that are correlated to other regions), and the mapping between that feature vector and an individual's specific properties (their "phenotype", which could be their height at full maturity, or their eye color, or hair properties, or propensity to diseases, etc) is highly nonlinear.

If you had a list of all the genomes of all the people in the world, and all their phenotypes (height, eye color, hair type, etc), you could take all their genomes as input variables and treat all their phenotypes as output variables, and make embeddings or other models that mapped from genomes to phenotypes. The result would be a predictive model that could take a human genome, and spit out a prediction of what that person looks like and other details around them (up to the limits of heritability).

A good example is height. If you take a very large diverse sample of people, and sequence them, you will find that about 50% of the variance in height can be traced to the genomic sequence of that individual (other things, such as socioeconomic status, access to health care, pollution, etc, which are non-genomic, contribute as well). Originally many geneticists believed that a small number of genes- tiny parts of the feature vector- would be the important features in the genome that explained height.

But it didn't turn out that way. Instead, height is a nonlinear function of thousands of different locations (either individual bases, entire genes, or other structures that vary between individuals) in the genome. This was less surprising to folks who are molecular biologists (mainly based on the mental models geneticists and MBers use to think about the mapping of genotype to phenotype), and we still don't have great mechanistic explanations of how each individual difference works in concert with all the others to lead to specific heights.

When I started out studying this some 35 years ago, the problem sounded fairly simple: I assumed it would be easy to find the place in my genome that led to my funny shaped (inherited) nose. But the more I learn about genomics and phenotypes, the more I appreciate that the problem is unbelievably complex, and really well suited to large datasets and machine learning. All the pharma companies have petabytes of genome sequences in the cloud that they try hard to analyze, but the results are mixed.

I spent my entire thesis working on ATGCAAAT, by the way. https://en.wikipedia.org/wiki/Octamer_transcription_factor is a family of proteins that are incredibly important during growth and development. Your genome is sprinkled with locations that contain that sequence- or ones like it- that are used to regulate the expression of proteins to carry out the development plan.
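
As a rough illustration of finding "that sequence, or ones like it", here's a naive scan that reports windows within a small Hamming distance of the octamer. It is a sketch only: it checks one strand, treats every mismatch as equally bad, and `find_motif` is an invented name; real binding-site searches usually use position weight matrices instead.

```python
def find_motif(seq: str, motif: str = "ATGCAAAT", max_mismatches: int = 1):
    """Return (position, window, mismatches) for near-matches of motif in seq."""
    hits = []
    m = len(motif)
    for i in range(len(seq) - m + 1):
        window = seq[i:i + m]
        # Hamming distance between the window and the motif.
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


seq = "GGATGCAAATCCATGCTAATGG"
for pos, window, mismatches in find_motif(seq):
    print(pos, window, mismatches)
```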

> If you had a list of all the genomes of all the people in the world, and all their phenotypes (height, eye color, hair type, etc), you could take all their genomes as input variables and treat all their phenotypes as output variables, and make embeddings or other models that mapped from genomes to phenotypes. The result would be a predictive model that could take a human genome, and spit out a prediction of what that person looks like and other details around them (up to the limits of heritability).

Would such a predictive model really be possible? As far as I'm aware there is contradictory research on whether a specific phenotype distinctly originates from an SNP/genotype.

I can't technically say with 100% confidence that it would be possible. It does seem extremely likely based on all the evidence I've seen over the past 30 years.

The model would be highly nonlinear and nonlocal, at the very least.

Fascinating, are there lots of people looking at genetics with this ML kind of lens?
Sure, although I'm not aware of anybody who is contemplating quite the level I believe is necessary to really nail the problem into the ground. When I worked at Google, I proposed that Google build a datacenter-sized sequencing center in Iowa or Nebraska near its data centers, buy thousands of sequencers, and run industrial-scale sequencing, push the data straight to the cloud over fat fiber, followed by machine learning, for health research. I don't think Google wants to get involved in the physical sequencing part, but they did listen to my ideas and they have several teams working on applying ML to genomics as well as other health research problems, and part of my job today (working at a biotech) is to manage the flows of petabytes of genomic data into the cloud and make it accessible to our machine learning engineers.

The really interesting approaches these days, IMHO, combine genomics and microscopic imaging of organoids, and many folks are trying to set up a "lab in the loop", in which large-scale experiments run autonomously by sophisticated ML systems could accelerate discovery. It's a fractally complex and challenging problem.

Statistics has been key to understanding genetics from the beginning (see Mendel, Fisher) and so at a big pharma you will see everything from Bayesian bootstrappers using R to deep learners using pytorch.

Guys at Verily are working on Terra.bio with the Broad Institute and others. Genomics England in the UK is also experimenting with multimodal and machine learning approaches applied to whole genome sequences [1].

[1] https://www.genomicsengland.co.uk/blog/data-representations-...

But why Google? This is what big pharma are doing. Also you can outsource the data collection part. See for example UK Biobank. Their data are available to multiple companies after some period so it makes it more cost efficient.
Why Google? Because this is a big data problem and Google mastered big data and ML on big data a long time ago. Most big pharma hasn't completely internalized the mindset required to do truly large-scale data analysis.
I have spent the better part of the past year looking obsessively over genomics papers for cancer and I've grown very fond of the field.

Are there any positions at Google/companies you would suggest I look into? I'm coming from algo trading/ML research with an ML MSc.

You could try Calico. They are an Alphabet company that specifically studies aging. They have a good number of machine learning roles. However, biotech typically pays less than finance or software.

https://calicolabs.com/careers/

Yes. For example when word2vec came out, immediately there were people trying similar approaches to protein sequences. Transformers work better.
The genetic code maps nucleotide sequences (DNA) to amino acid sequences (proteins). Every three bases (say AGT) maps to one amino acid. So you can literally read a sequence of ACGTs and decode it into a protein. A sequence that encodes a protein is called a gene.
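
Here's a minimal sketch of that decoding using the standard genetic code. It makes simplifying assumptions: it reads the coding strand directly as DNA, starts in frame at position 0, stops at the first stop codon, and ignores introns and everything else that happens between a real gene and a real protein. `translate` is an illustrative name.

```python
# Standard genetic code, written in the conventional T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate DNA to one-letter amino acids, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):  # trailing 1-2 bases are ignored
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(CODON_TABLE["AGT"])          # the example codon from above: serine (S)
print(translate("ATGAGTTGGTGA"))   # Met-Ser-Trp, then a stop codon
```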

Almost all variations that humans have in their genomes (compared to each other or a reference genome) are tiny, mostly one-base differences called single nucleotide polymorphisms (SNPs). These tiny changes encode who you are. The rest of it just makes you a carbon-based organism, a eukaryote, an animal, a mammal etc, just like a whole load of other organisms.

I always make this mistake too as a computational biologist; when talking about DNA it’s megabases not megabytes.
What you are getting at is now called a “pangenome assembly”. There were several high-profile papers earlier this year, including one by Guarracino and Garrison in Nature.

A pangenome is a complex graph model that weaves together hundreds or more genomes/haplotypes—usually of one species, but the idea can extend across species too, or even cells within one individual (think cancer pangenomes).

On the idealized human pangenome graph each human is represented by two threads along each autosome, plus threads through Chr X, Y, and the mitochondrial genome.

While you are correct, the difference between different people's DNA is tiny, less than 1% at most. So this information is still very valuable. This article is about the first complete sequencing of one person's Y chromosome.
> While you are correct, the difference between different people's DNA is tiny, less than 1% at most

How do we know this, if we have only sequenced the chromosome of one individual?

We have sequenced the genome using different sampling and statistical models for a long time.
Traditional sequences of the Y chromosome (and other chromosomes) were missing parts, particularly the highly repetitive regions called "telomeres". This is different from the issue of individual variation (although the authors do provide a map of known variations as well).
The title of the original article and as submitted here seems quite clear that it's of a specific individual: "The complete sequence of *a* human Y chromosome"
Thank you for putting into words exactly the thing I wanted to understand but couldn’t figure out how to ask.
Good question. Practically they call their complete sequence a 'reference sequence' which can be thought of as a baseline for comparison to the complete spectrum of human Y chromosome genetic variation, so at least people have something to use as a standard for comparison. The line in the abstract "mapped available population variation, clinical variants" is about the only mention of the issue.

Ideally we'd have hundreds if not thousands of complete genomes which in total would reveal the population diversity of the human species as it currently exists, but this is a big ask. "Clinical variants" are of particular interest as those are regions of the genome associated with certain inherited diseases, although the promises of individual genomic knowledge leading to a medical revolution have turned out to be wildly overblown.

Since the paper is paywalled, there's not much else to say than that they have a (fairly arbitrary in origin, i.e. it could have been from any one individual or possibly even a chimera of several individuals) reference sequence to which other specific human Y chromosomes can be compared, eventually leading to a larger dataset from many individuals which will reveal the highly conserved and highly variable regions of the chromosome, population-wise.

It means that they are trying to find a baseline from which they can eventually clone a human being.

Let's not pretend that this is not an end goal. It always was.

I had the same question. Perhaps this will help you.

https://en.m.wikipedia.org/wiki/DNA_sequencing

That page describes the human genome as having been sequenced back in 2003. *confusion intensifies*
The project started around 1990, they announced a draft completed in 2000, and "completion" in 2003 (this was more a token announcement based on a threshold than a true milestone). Even then the scientists knew that major parts of the centromere, telomere, and highly repetitive regions were not fully resolved, and that was fully admitted. The work by Karen Miga at UCSC and others is more of a mop-up now that genome sequencing is a mature technology and we have much better ways of getting at those tricky regions.

another "completion" happened 3 years ago, before this announcement. but this is the last one. I promise.