by chihuahua 1031 days ago
I've never understood this either. I assume the genome is many megabytes of [ATCG]+. If we have that sequence, what does it tell us? Do we look at it and say "Ah, yes, ...ATGCTACGACTACGACTAGCG... very interesting?"
9 comments

Many genes are highly conserved or consistent enough. E.g.: if there's a 1% difference between two people, then it's a bit like two very unique sentences that have a couple of small typos. They're still recognisable, and it's also still pretty obvious that they're the "same".

A gene sequence allows researchers to determine the amino acids that are coded for, and from those, which proteins match which genes.

This can be matched up with genetic diseases. If you know that damage to a certain location in a chromosome causes a problem with a certain biological process, then ergo, the associated protein is needed for that process!

So: genetic illness -> gene sequence -> protein -> role in the body

Without sequencing, that chain can't be built.

But you can only know that by having a large sample of very “stable” (have few genetic irregularities) gene samples compared to a large pool of samples from people with very narrow and pronounced gene irregularities, right?

Is this why it’s so hard? This feels more like a healthcare record-keeping problem and less like an “actually reading the data” problem.

I can’t help but feel like some form of single payer healthcare is truly the way out of this problem. One where all disease record keeping is uniform and complete.

Single payer healthcare here (UK) is still subject to privacy controls in a way which would make it very difficult to do that.

(Also our health system's IT is a hellscape, but one reason for that is that people would literally rather not have a working system at all, than one with less than impeccable privacy controls.

Personally I'd gladly sacrifice a fair bit of medical privacy in return for giving scientists greater insight into disease processes, but the average citizen here wants advanced healthcare without giving their data to research scientists. /facepalm )

I trust the scientists, the problem isn't them. Look at the whole abortion data scandal in the U.S.
The problem there is the US' insane theocrat-conservatives (or just misogynist assholes hiding behind a thin veneer of religious justification, as the case may be).

I'm not saying a health IT system should have no privacy controls either. But the requirements for such controls need to be balanced against having a system that actually works, and that means having some people who actually understand the tech, and the workings of hospitals, having a role in requirements conversations. Instead it was dominated by MPs, "patient advocacy" groups and privacy campaigners, none of whom know or care anything about how to build a workable system.

> But you can only know that by having a large sample of very “stable” (have few genetic irregularities) gene samples compared to a large pool of samples from people with very narrow and pronounced gene irregularities, right?

No. A "gene" isn't an A/G/C/T; it's a sequence of 1000-1000000 base pairs. Each gene's protein-coding sequence has a well-defined start and stop, marked by a start codon and a stop codon. When people have genetic differences, typically one of the tens of thousands of base pairs in that gene is different (an SNP, or single nucleotide polymorphism). Even for genes that are entirely "missing" in some people, they're really just different in a way that makes them nonfunctional.

Does that make it obvious how sequencing all those genes is useful, even if everyone has different genes? It tells us 99.999% of how proteins are coded, even if individual variation is the other .001%.
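
To make the start/stop codon idea concrete, here is a minimal sketch that scans the forward strand for stretches running from an ATG start codon to the next in-frame stop codon. The function name `find_orfs` and the `min_codons` parameter are made up for illustration, and the sketch ignores introns and the reverse strand, so it is much closer to a bacterial gene finder than to anything that would work on human genes.

```python
# Illustrative sketch only: find_orfs and min_codons are hypothetical names.
STOPS = {"TAA", "TAG", "TGA"}  # the three stop codons of the standard genetic code

def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) spans of ATG...stop stretches on the forward strand."""
    orfs = []
    for frame in range(3):  # three possible reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOPS:
                # keep the span only if it holds enough codons before the stop
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

seq = "CCCATGGCTAAATGACC"
for begin, end in find_orfs(seq):
    print(begin, end, seq[begin:end])
```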

It’s actually about 3 gigabases (ATCG). There are some recurrent features of the genome whose function we’ve worked out. For example the TATA box is a classic sequence that typically indicates the start of a part of the genome that codes for a protein. The vast majority of the genome doesn’t code for proteins. The function of these genome regions is much more murky. Some of these regions function like scaffolds for proteins to assemble into complexes. These protein complexes then start transcribing the genome into mRNA. So the genome regulates its own expression, in a sense. Many of the sequences that function in this way are known. There are also just a bunch of parts of the genome that probably don’t do anything. There are also many regions of the genome that are basically self replicating sequences. They code for proteins that are capable of inserting their own genetic sequence back into the genome. These are transposons.

In short, a lot of very painstaking genetics and molecular biology work has gone into characterizing the function of certain sequences.

Also interesting are HERVs - human endogenous retroviruses which integrated into the human or our ancestor species’ genomes. They have degraded over time so none of the human HERVs seem to be capable of activating, but there are some in other mammals that can fully reactivate.

In humans, even though HERVs don’t reactivate into infectious viruses, they have been implicated in both harmful (senescence during aging[0]) and beneficial (protection from modern retroviruses[1]) activities in the body.

They might be up to 8% of the human genome.

0: https://www.cell.com/cell/pdf/S0092-8674(22)01530-6.pdf

1: https://www.microbe.tv/twiv/twiv-956/

For the same reason Monsanto sequences basically anything: Because we can tell what proteins are encoded in there, and what is near them, and we can have good ideas of what proteins are expressed together. When dealing with genetic modification, we get to see whether our modification went in, and where it landed: Having a protein in a genome isn't enough. Its expression might be having an effect on other things, depending on where it is.

When we have baselines, we can compare different individuals, and eventually make predictions of how they are going to be based solely on the genetic code. If I know that a certain polymorphism is tied to some trait I want, I might not have to even bother spending the time growing a plant: I know that it's not what I want, and discard it as a seed.

With humans we are probably not going to see much modification soon, but just being able to detect genetic diseases and risk factors for other diseases that have genetic components, or to select embryos in cases of in vitro fertilization, is already quite valuable.

It's not source code that we are all that good at understanding just yet, but there are already some applications, and we have good reason to think there's a lot more to come.

It's just about 3 gigabytes (each byte a letter). Pretty mind-blowing, if you ask me.
It's a slight exaggeration of the information content to report the data size using an ASCII encoding. Since there are 4 bases, each can be encoded using 2 bits, rather than 8. So we're really talking 750 megabytes. But still mind-blowing.
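
As a rough sketch of that arithmetic, the snippet below packs bases at 2 bits each (4 bases per byte) and computes the packed size for 3 billion bases. The names `pack`, `unpack`, and `packed_megabytes` are invented for illustration, and the sketch assumes the input contains only A, C, G, and T (real genome files also contain symbols such as N for unknown bases).

```python
# Illustrative sketch: the helper names are hypothetical, not a standard format.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base, 4 bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed bytes."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))

def packed_megabytes(n_bases: int) -> float:
    """Size in megabytes (10**6 bytes) at 2 bits per base."""
    return n_bases * 2 / 8 / 1e6

print(packed_megabytes(3_000_000_000))  # 3 gigabases at 2 bits each
print(unpack(pack("ATGCTACGA"), 9))
```

The length has to be stored separately because the last byte may be only partly filled.
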
And since the data is highly redundant, the 750MB can be compressed down even further using standard approaches (DEFLATE works well; it uses both Huffman coding and dictionary backreferences).
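
A small experiment with Python's `zlib` module (which implements DEFLATE) shows the two effects separately. The sequences here are synthetic assumptions, not real genomic data: one is uniformly random, so only the Huffman stage can help, and the other is a 500-base unit repeated many times, so the dictionary backreferences do most of the work. How a real genome behaves depends on how much of it looks like each case.

```python
import random
import zlib

def compressed_ratio(seq: str, level: int = 9) -> float:
    """Compressed size divided by the ASCII size of the sequence."""
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, level)) / len(raw)

rng = random.Random(0)  # fixed seed so the toy data is reproducible

# Synthetic data: no repeats, so only the entropy coding stage helps.
random_seq = "".join(rng.choice("ACGT") for _ in range(100_000))

# Synthetic data: one 500-base unit repeated, so backreferences dominate.
unit = "".join(rng.choice("ACGT") for _ in range(500))
repetitive_seq = unit * 200

print(f"random:     {compressed_ratio(random_seq):.3f}")
print(f"repetitive: {compressed_ratio(repetitive_seq):.3f}")
```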

Or, you could build an embedding with far fewer parameters that could explain the vast majority of phenotypic differences. The genome is a hierarchical palimpsest of low entropy.

My standard interview question- because I hate leetcode- walks the interviewee through compressing DNA using bit encoding, then using that to implement a rolling hash to do fast frequency counting. Some folks get stuck at "how many bits in a byte", others at "if you have 4 symbols, how many bits are required to encode a symbol?", and other candidates jump straight to bloom filters and other probabilistic approaches (https://github.com/bcgsc/ntHash and https://github.com/dib-lab/khmer are good places to start if you are interested).
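
One possible shape for the answer, as a sketch rather than the commenter's actual reference solution: encode each base in 2 bits and keep a rolling integer that always holds the last k bases, so each step is a shift, an OR, and a mask. Because the integer is the k-mer itself, this variant is exact; ntHash and khmer use different, more sophisticated hashing. The names `kmer_counts` and `decode` are made up for illustration.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_counts(seq: str, k: int) -> Counter:
    """Count k-mers, keyed by their rolling 2-bit integer encoding."""
    counts = Counter()
    mask = (1 << (2 * k)) - 1  # keeps only the low 2*k bits
    h = 0
    for i, base in enumerate(seq):
        # shift the oldest base out of the window and the new base in
        h = ((h << 2) | CODE[base]) & mask
        if i >= k - 1:  # window is full from the k-th base onward
            counts[h] += 1
    return counts

def decode(h: int, k: int) -> str:
    """Turn an encoded k-mer back into its string form."""
    return "".join("ACGT"[(h >> (2 * (k - 1 - j))) & 3] for j in range(k))

counts = kmer_counts("ATATAT", 2)
print({decode(h, 2): c for h, c in counts.items()})
```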

I'm curious if these 750MB + the DNA of mitochondria + the protein metagenomics contain all the information needed to build a human, or if there's extra info stored in the machinery of the first cell.

That is if we transfer the DNA to an advanced alien civilization - would they be able to make a human.

This is a complex question. The cocktail soup in a gamete (sperm or egg) and the resulting zygote contains an awful lot of stuff that would be extremely hard to replace. I could imagine that if the receiving civilization was sufficiently advanced and had a model of what those cells contained (beyond the genomic information) they could build some sort of artificial cell that could bootstrap the genome to the point of being able to start the development process. It would be quite an accomplishment.

If they just received the DNA without some information about the zygote, I don't think it would be practical for even an advanced alien civilization (LR5 or LR6), but probably an LR7 and definitely an LR8 could.

I’m just pondering this, and it’s not clear to me that there is anything intrinsic in the genome itself that explicitly ‘says’ “this sequence of DNA bases encodes a protein” or even “these three base-pairs equate to this amino acid”.

I wonder if that information could ever really be untangled by a civilisation starting entirely from scratch without access to a cell.

If you knew what DNA was and had seen a protein you could easily figure out start/stop codons. If you had only seen something similar it would be harder. If you had nothing similar, I don't know.

Coding DNA and non-coding DNA look very different. Proteins are full of short repetitive sequences that form structural elements like alpha helices: https://en.wikipedia.org/wiki/Alpha_helix

Once you've identified roughly where the protein-coding genes are it would be trivial to identify 3'/5' as being common to all those regions. You could pretty easily imagine a much more complicated system with different transcription mechanisms and codon categories, but Earth genomes are super simple in that respect. Once you have those you just have the (incredibly complex) problem of creating a polymerase and bam, you'll be able to print every single gene in the body.

Without the right balance of promoters/factors/polymerase you probably won't get anything close to a human cell, but you'd be able to at least work closer to what the natural balance should be, and once you get closer to building a correct ribosome etc the cell would start to self-correct.

It’s an interesting question. Naively, I would expect it to be about like reverse engineering a CPU from a binary program. Which sounds daunting but maybe not impossible if you understand the fundamentals of registers, memory, opcodes, etc.

But… doing so from first principles without a mental model of how all (human) CPUs work? I guess it comes down to whether the recipients had enough context to know what they’re looking at.

Yes, it's intrinsic in the genome but implemented through such a complicated mechanism that attempting to understand these things from first principles is impractical, not impossible.

In genomic science we nearly always use more cheaply available information rather than attempt to solve the hard problem directly. For example, for decades, a lot of sequencing only focused on the transcribed parts of the genome (which typically encode for protein), letting biology do the work for determining which parts are protein.

If you look at the process biophysically, you will see there are actual proteins that bind to the regions just before a protein, because the DNA sequences there match some pattern the protein recognizes. If you move that signal in front of a non-coding region, the apparatus will happily transcribe and even attempt to translate the non-coding region, making a garbage protein.

What do you mean by "LR"? I queried an LLM but no results there either.
It's likely just a typo. LR5 "civilisation"/"civilization" brings up nothing on Google. I don't know why you would expect an LLM to know more.

Based on the way the person is using it, it does not seem to equate to the Kardashev scale, as my peer stated

oops i've said too much
The code for how to build a sperm and an egg is inside the human DNA, isn't it?
Yes, but it currently requires developmentally mature individuals to build the gametes, and the "code" is so complex you couldn't really decipher it from first principles.
The code to build mitochondria is not.
Given code written for unknown hardware... can you execute it?
Imagine a machine shop that has blueprints of components of the machines they use in the shop, and processes to assemble machines from the components. When a machine shop grows large and splits in two, each inherits a half of shop with the ongoing processes and a copy of the blueprints. https://m.youtube.com/watch?v=B7PMf7bBczQ&pp=QAFIAQ%3D%3D

DNA is the blueprints. There are infinite possibilities for what to do with them. The advanced civilization would need additional information, like that they are supposed to create a cell from the components to begin with, and a lot of detailed information on how exactly to do that.

Edit: improved clarity

"if we transfer the DNA to an advanced alien civilization - would they be able to make a human."

I'm really surprised that in all these responses to your question no one's mentioned the womb or the mother, who (at least with current technology) is still necessary for making a human.

That's not to mention the necessity of the egg.

We're not just DNA.

This is a question about theoretical possibilities and what you're saying seems to be a rigid belief in an answer "no". But you provided no evidence or justification, except for "with current technology", which answers nothing about the theoretical question.
Artificial wombs have come quite a long way! It is not inconceivable to imagine that you could bring a zygote to term in an artificial womb.
Instructions on how to make a womb and an "egg" are contained within the human DNA.
It is known that that is not true, due to the distinct genetic code of mitochondria and known epigenetic influences of mothers on their children in utero.

You could say “well that's the last 10% of the details, maybe 90% is in the DNA,” but I think I would be suspicious that it's that high, because one of the things we know about humans is that we are born with all of the ova that we will ever have, rather than deferring the process until puberty. I should think that if it could be deferred it would have been, “you will spend the energy to make these 15 years before you need to for no real reason” seems very unlike evolution whereas “my body is going to teach you how to make these eggs, just the same as my mother's body taught me,” sounds quite evolutionarily reasonable.

But maybe you needed a pre-human womb to bootstrap the first human, and we no longer have the blueprint for that...
> That is if we transfer the DNA to an advanced alien civilization - would they be able to make a human.

You'd need a cell to start the process, with the various nucleic acids distributed correctly and proteins/energy with which to create further proteins using the information encoded by the DNA. Thus the civilization would need information about cells and a set of building blocks before being able to use the DNA.

The DNA contains all the code that creates and regulates the proteins.
Including code for the proteins that read DNA to produce proteins. You might hit similar problems trying to understand C given the source code for a C compiler - a non-standard environment could reproduce itself given the source code, meaning the code alone doesn't strictly determine the output.
And how will you decode it?
I can’t wait until we can bootstrap a human from a stage 3 tarball.
Yes, there is extra information in the first cells, in particular regulatory elements such as miRNAs. The headline here is epigenetics.
There's also some interesting work on understanding the role of loops in the physical structure of the DNA storage on gene expression. [0] The base sequence of the DNA isn't everything; it may also matter how the DNA gets laid out in space---a feature which can be inherited.

[0] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2638769/

Our DNA does not contain the mitochondria nor the gut bacteria so the raw data would most certainly not be enough to build a working copy
It's a bit like: if I have the source code of Linux (think DNA), can I build a machine running Linux (think cell)? No, you can't; you need to have a machine that can run the code.

I.e. "software" without a "machine" to run it on is kind of useless.

Yes, and if you gzip it it's even smaller. But the big takeaway is that the amount of info that fully defines a human, is what we consider "not much data," even in its plainest encoding.
We don't know that it fully defines a human until we can create one without the starting condition of being inside another human. It's prototype-based inheritance.
Some of the research about being able to make simple animals grow structures from other animals in their evolutionary “tree” by changing chemical signaling—among other wild things like finding that memories may be stored outside the brain, at least in some animals—makes me think you need more than just the “code” to get the animal that would have been produced if that “code” were in its full context (of a reproductive cell doing all sorts of other stuff). Even if the DNA contains the instructions for that reproductive cell, too, in some sense… which instructions do you “run”? There might be multiple possible variants, some of which don’t actually reproduce the animal you took the DNA from.
My favorite trivia here is that flamingos aren't actually "genetically" pink but "environmentally" pink because they pick up the color from eating algae.

Except of course "genetics" and "environment" aren't actually separate things; sure, people's skin color isn't usually affected by their food, but only because most people don't eat colloidal silver.

https://en.wikipedia.org/wiki/Paul_Karason

bzip2 is marginally better, and then genome-specific compressors were developed, and then finally, people started storing individual genomes as diffs from a single reference, https://en.wikipedia.org/wiki/CRAM_(file_format)
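
The diff idea can be sketched in a few lines. This is a toy under strong assumptions, not how CRAM actually works: it assumes the sample and reference have the same length and differ only by substitutions, whereas real formats deal with sequencing reads, insertions, deletions, and quality scores. The names `diff_against_reference` and `apply_diff` are hypothetical.

```python
# Toy sketch: substitutions only, equal-length sequences assumed.
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Record only the positions where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

def apply_diff(reference: str, diff: list[tuple[int, str]]) -> str:
    """Rebuild the sample from the reference plus its recorded differences."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)

reference = "ATGCTACGACTACGACTAGCG"
sample    = "ATGCTACGATTACGACTAGCG"  # one substitution relative to the reference
diff = diff_against_reference(reference, sample)
print(diff)
print(apply_diff(reference, diff) == sample)
```

When two genomes differ at only a small fraction of positions, the diff list is far shorter than the sequence itself, which is where the savings come from.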

Since genome files contain more data than just ATGC (typically a comment line, then a DNA line, then a quality score line), and each of those draws from a different distribution, DEFLATE on a FASTQ file doesn't reach the full potential of the compressor because the Huffman table ends up having to hold all three distributions, and the dictionary backlookups aren't as efficient either. It turns out you can split the file into multiple streams, one per line type, and then compress those independently, with slightly better compression ratios, but it's still not great.
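
A minimal sketch of the stream-splitting idea, assuming the common four-line FASTQ record layout (header, bases, a bare `+` separator, qualities), which matches the three kinds of line described above. The function names are invented for illustration, and the sketch makes no claim about how much the split helps on real data.

```python
import zlib

def split_streams(fastq_text: str) -> tuple[list[str], list[str], list[str]]:
    """Split four-line FASTQ records into header, sequence, and quality streams."""
    lines = fastq_text.strip().split("\n")
    headers = lines[0::4]
    seqs = lines[1::4]
    quals = lines[3::4]  # lines[2::4] are the '+' separators, dropped here
    return headers, seqs, quals

def compress_streams(fastq_text: str) -> list[bytes]:
    """Compress each stream separately so each gets its own Huffman tables."""
    return [
        zlib.compress("\n".join(stream).encode("ascii"), 9)
        for stream in split_streams(fastq_text)
    ]

fastq = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n"
blobs = compress_streams(fastq)
print([len(b) for b in blobs])
```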

You could say exactly the same of all data; it's just 1s and 0s, but when I look I just see blonde, brunette.
If you think like an ML engineer, the genome is a feature vector 3B bases (or 6B binary bits) long that is highly redundant (many sections contain repeats and other regions that are correlated to other regions), and the mapping between that feature vector and an individual's specific properties (their "phenotype", which could be their height at full maturity, or their eye color, or hair properties, or propensity to diseases, etc) is highly nonlinear.

If you had a list of all the genomes of all the people in the world, and all their phenotypes (height, eye color, hair type, etc), you could take all their genomes as input variables and treat all their phenotypes as output variables, and make embeddings or other models that mapped from genomes to phenotypes. The result would be a predictive model that could take a human genome, and spit out a prediction of what that person looks like and other details around them (up to the limits of heritability).

A good example is height. If you take a very large diverse sample of people, and sequence them, you will find that about 50% of the variance in height can be traced to the genomic sequence of that individual (other things, such as socioeconomic status, access to health care, pollution, etc, which are non-genomic, contribute as well). Originally many geneticists believed that a small number of genes- tiny parts of the feature vector- would be the important features in the genome that explained height.

But it didn't turn out that way. Instead, height is a nonlinear function of thousands of different locations (either individual bases, entire genes, or other structures that vary between individuals) in the genome. This was less surprising to folks who are molecular biologists (mainly based on the mental models geneticists and MBers use to think about the mapping of genotype to phenotype), and we still don't have great mechanistic explanations of how each individual difference works in concert with all the others to lead to specific heights.
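
To illustrate what "a fraction of the variance traced to the genome, spread over many locations" means, here is a deliberately oversimplified simulation. It assumes a purely additive model (each location contributes a small independent effect), which is exactly the simplification the comment above says does not hold in reality, and every number in it (500 people, 200 loci, the 0.5 target) is an arbitrary assumption chosen for illustration.

```python
import random
import statistics

def simulate_trait(n_people: int = 500, n_loci: int = 200,
                   heritability: float = 0.5, seed: int = 0) -> float:
    """Return the realized fraction of trait variance due to the genetic part."""
    rng = random.Random(seed)
    effects = [rng.gauss(0, 1) for _ in range(n_loci)]  # small per-locus effects
    genetic = []
    for _ in range(n_people):
        # genotype at each locus: 0, 1, or 2 copies of the variant allele
        g = sum(e * (rng.getrandbits(1) + rng.getrandbits(1)) for e in effects)
        genetic.append(g)
    var_g = statistics.pvariance(genetic)
    # choose the non-genetic noise so the target heritability holds on average
    env_sd = (var_g * (1 - heritability) / heritability) ** 0.5
    trait = [g + rng.gauss(0, env_sd) for g in genetic]
    return var_g / statistics.pvariance(trait)

print(round(simulate_trait(), 2))
```

In this toy no single locus dominates: the genetic variance is the sum of 200 small contributions, which is the qualitative picture for height, just without the nonlinearity.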

When I started out studying this some 35 years ago the problem sounded fairly simple. I assumed it would be easy to find the place in my genome that led to my funny shaped (inherited) nose, but the more I learn about genomics and phenotypes, the more I appreciate that the problem is unbelievably complex, and really well suited to large datasets and machine learning. All the pharma companies have petabytes of genome sequences in the cloud that they try hard to analyze, but the results are mixed.

I spent my entire thesis working on ATGCAAAT, by the way. https://en.wikipedia.org/wiki/Octamer_transcription_factor is a family of proteins that are incredibly important during growth and development. Your genome is sprinkled with locations that contain that sequence- or ones like it- that are used to regulate the expression of proteins to carry out the development plan.
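
Finding exact copies of that octamer is a simple string scan, sketched below. Because DNA is double-stranded, the sketch also reports the reverse complement (ATTTGCAT), which is the same site read from the other strand. The function names are made up for illustration, and the "ones like it" part (near matches) is not handled; real motif searches usually score approximate matches rather than requiring exact ones.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_motif(seq: str, motif: str = "ATGCAAAT") -> list[tuple[int, str]]:
    """Return (position, strand) for exact motif hits on either strand."""
    rc = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:
            hits.append((i, "-"))
    return hits

print(find_motif("GGATGCAAATCCATTTGCATGG"))
```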

> If you had a list of all the genomes of all the people in the world, and all their phenotypes (height, eye color, hair type, etc), you could take all their genomes as input variables and treat all their phenotypes as output variables, and make embeddings or other models that mapped from genomes to phenotypes. The result would be a predictive model that could take a human genome, and spit out a prediction of what that person looks like and other details around them (up to the limits of heritability).

Would such a predictive model really be possible? As far as I'm aware there is conflicting research on whether a specific phenotype distinctly originates from a SNP/genotype.

I can't technically say with 100% confidence that it would be possible. It does seem extremely likely based on all the evidence I've seen over the past 30 years.

The model would be highly nonlinear and nonlocal, at the very least.

Fascinating, are there lots of people looking at genetics with this ML kind of lens?
Sure, although I'm not aware of anybody who is contemplating quite the level I believe is necessary to really nail the problem into the ground. When I worked at Google, I proposed that Google build a datacenter-sized sequencing center in Iowa or Nebraska near its data centers, buy thousands of sequencers, and run industrial-scale sequencing, pushing the data straight to the cloud over fat fiber, followed by machine learning, for health research. I don't think Google wants to get involved in the physical sequencing part, but they did listen to my ideas and they have several teams working on applying ML to genomics as well as other health research problems. Part of my job today (working at a biotech) is to manage the flows of petabytes of genomic data into the cloud and make it accessible to our machine learning engineers.

The really interesting approaches these days, IMHO, combine genomics and microscopic imaging of organoids, and many folks are trying to set up a "lab in the loop", in which large-scale experiments run autonomously by sophisticated ML systems could accelerate discovery. It's a fractally complex and challenging problem.

Statistics has been key to understanding genetics from the beginning (see Mendel, Fisher) and so at a big pharma you will see everything from Bayesian bootstrappers using R to deep learners using pytorch.

Guys at Verily are working on Terra.bio with the Broad Institute and others. Genomics England in the UK is also experimenting with multimodal and machine learning approaches applied to whole genome sequences [1].

[1] https://www.genomicsengland.co.uk/blog/data-representations-...

But why Google? This is what big pharma are doing. Also you can outsource the data collection part. See for example UK Biobank. Their data are available to multiple companies after some period so it makes it more cost efficient.
Why Google? Because this is a big data problem and Google mastered big data and ML on big data a long time ago. Most big pharma hasn't completely internalized the mindset required to do truly large-scale data analysis.
I have spent the better part of the past year looking obsessively over genomics papers for cancer and I've grown very fond of the field.

Are there any positions at Google or other companies you would suggest I look into? I'm coming from algo trading/ML research with an ML MSc.

You could try Calico. They are an Alphabet company that specifically studies aging. They have a good number of machine learning roles. However, biotech typically pays less than finance or software.

https://calicolabs.com/careers/

Thanks!
Yes. For example when word2vec came out, immediately there were people trying similar approaches to protein sequences. Transformers work better.
The genetic code maps nucleotide sequences (DNA) to amino acid sequences (proteins). Every three bases (say AGT) maps to one amino acid. So you can literally read a sequence of ACGTs and decode it into a protein. A sequence that encodes a protein is called a gene.
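
Reading a sequence three bases at a time can be sketched directly. The table below is the standard genetic code written in the conventional TCAG ordering, with `*` marking stop codons. The function name `translate` is made up for illustration, and the sketch assumes the input is already a coding sequence in the correct reading frame, with no introns.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (TCAG order).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acid codes."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)

print(CODON_TABLE["AGT"])         # the example codon from the text (serine)
print(translate("ATGGCTAAATGA"))  # ATG GCT AAA TGA
```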

Almost all variations that humans have in their genomes (compared to each other or a reference genome) are tiny, mostly one-base differences called single nucleotide polymorphisms (SNPs). These tiny changes encode who you are. The rest of it just makes you a carbon-based organism, a eukaryote, an animal, a mammal etc, just like a whole load of other organisms.

I always make this mistake too as a computational biologist; when talking about DNA it’s megabases not megabytes.