by dekhn 1034 days ago
If you think like an ML engineer, the genome is a feature vector 3B bases (or 6B bits, at 2 bits per base) long that is highly redundant (many sections contain repeats, and many regions are correlated with other regions), and the mapping between that feature vector and an individual's specific properties (their "phenotype", which could be their height at full maturity, their eye color, hair properties, propensity to diseases, etc.) is highly nonlinear.
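
To make the "3B bases or 6B bits" arithmetic concrete, here is a minimal sketch of a 2-bit encoding. The particular bit assignment (A=00, C=01, G=10, T=11) and the function names are my own choices for illustration, not a standard, and real genome formats also have to handle unknown bases like N.

```python
# Toy 2-bit encoding of DNA. The bit assignment below is an arbitrary
# illustrative choice, not a standard file format.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack(seq):
    """Pack a DNA string into a single integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack(value, length):
    """Inverse of pack. The length is required because leading A's are zero bits."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


GENOME_BASES = 3_000_000_000                    # rough figure used above
genome_bits = 2 * GENOME_BASES                  # 6 billion bits
genome_megabytes = genome_bits / 8 / 1_000_000  # 750 MB if stored this densely
```

So one copy of the sequence fits in well under a gigabyte; the hard part is the mapping to phenotype, not the storage of a single genome.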

If you had a list of all the genomes of all the people in the world, and all their phenotypes (height, eye color, hair type, etc), you could take all their genomes as input variables and treat all their phenotypes as output variables, and make embeddings or other models that mapped from genomes to phenotypes. The result would be a predictive model that could take a human genome, and spit out a prediction of what that person looks like and other details around them (up to the limits of heritability).

A good example is height. If you take a very large, diverse sample of people and sequence them, you will find that about 50% of the variance in height can be traced to the genomic sequence of that individual (other things, such as socioeconomic status, access to health care, pollution, etc., which are non-genomic, contribute as well). Originally, many geneticists believed that a small number of genes (tiny parts of the feature vector) would be the important features in the genome that explained height.

But it didn't turn out that way. Instead, height is a nonlinear function of thousands of different locations (either individual bases, entire genes, or other structures that vary between individuals) in the genome. This was less surprising to folks who are molecular biologists (mainly based on the mental models geneticists and MBers use to think about the mapping of genotype to phenotype), and we still don't have great mechanistic explanations of how each individual difference works in concert with all the others to lead to specific heights.
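
Here is a toy simulation of the "many small contributions" picture. Big caveat: it assumes a purely additive model, which is a simplification of the nonlinear mapping described above, and the sample size, number of loci, effect sizes, and allele frequencies are all made up. It only illustrates what "about 50% of the variance" means when no single location dominates.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

N_PEOPLE = 1000   # hypothetical sample size
N_LOCI = 500      # hypothetical number of contributing locations
TARGET_H2 = 0.5   # the "about 50%" figure from the height example

# Each locus gets a small random effect size and an allele frequency.
effects = [random.gauss(0, 1) for _ in range(N_LOCI)]
freqs = [random.uniform(0.1, 0.9) for _ in range(N_LOCI)]


def genetic_score():
    """Sum of effect * allele count (0, 1, or 2 copies) over all loci."""
    total = 0.0
    for effect, freq in zip(effects, freqs):
        copies = (random.random() < freq) + (random.random() < freq)
        total += effect * copies
    return total


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)


genetic = [genetic_score() for _ in range(N_PEOPLE)]
var_g = statistics.pvariance(genetic)

# Non-genomic noise, sized so that var_g / (var_g + var_e) equals TARGET_H2.
var_e = var_g * (1 - TARGET_H2) / TARGET_H2
phenotype = [g + random.gauss(0, var_e ** 0.5) for g in genetic]

# Fraction of phenotype variance accounted for by the genetic score.
explained = r_squared(genetic, phenotype)

# Expected variance contributed by each locus in the additive model:
# effect^2 * 2p(1-p). The largest single share stays small.
per_locus_var = [e * e * 2 * p * (1 - p) for e, p in zip(effects, freqs)]
top_share = max(per_locus_var) / sum(per_locus_var)
```

In this setup `explained` lands near 0.5 while `top_share` is only a few percent, which is the opposite of the "small number of genes" picture people originally expected.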

When I started out studying this some 35 years ago, the problem sounded fairly simple. I assumed it would be easy to find the place in my genome that led to my funny-shaped (inherited) nose, but the more I learn about genomics and phenotypes, the more I appreciate that the problem is unbelievably complex and really well suited to large datasets and machine learning. All the big pharma companies have petabytes of genome sequences in the cloud that they try hard to analyze, but the results are mixed.

I spent my entire thesis working on ATGCAAAT, by the way. The octamer transcription factors (https://en.wikipedia.org/wiki/Octamer_transcription_factor), which bind that sequence, are a family of proteins that are incredibly important during growth and development. Your genome is sprinkled with locations that contain that sequence, or ones like it, that are used to regulate the expression of proteins to carry out the development plan.
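
A minimal sketch of what scanning for "that sequence, or ones like it" could look like: slide a window along the sequence and keep anything within a small Hamming distance of the octamer on either strand. The function names, the one-mismatch cutoff, and the toy sequence are assumptions for illustration. Real binding-site prediction usually uses position weight matrices rather than a flat mismatch count.

```python
OCTAMER = "ATGCAAAT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sites(genome, motif=OCTAMER, max_mismatches=1):
    """Return (position, strand, mismatches) for every window that is
    within max_mismatches of the motif on either strand."""
    rc = reverse_complement(motif)
    k = len(motif)
    hits = []
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        for strand, target in (("+", motif), ("-", rc)):
            distance = hamming(window, target)
            if distance <= max_mismatches:
                hits.append((i, strand, distance))
    return hits


# Made-up 32-base sequence, not real genomic data. It contains an exact
# octamer, its reverse complement, and a one-mismatch variant.
toy = "GGATGCAAATCCATTTGCATGGATGCTAATCC"
sites = find_sites(toy)
# sites == [(2, "+", 0), (12, "-", 0), (22, "+", 1)]
```

Loosening `max_mismatches` quickly floods you with hits on a real genome, which is part of why deciding which of the "ones like it" actually matter is hard.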


> If you had a list of all the genomes of all the people in the world, and all their phenotypes (height, eye color, hair type, etc), you could take all their genomes as input variables and treat all their phenotypes as output variables, and make embeddings or other models that mapped from genomes to phenotypes. The result would be a predictive model that could take a human genome, and spit out a prediction of what that person looks like and other details around them (up to the limits of heritability).

Would such a predictive model really be possible? As far as I'm aware, there is conflicting research on whether a specific phenotype can be traced unambiguously to a particular SNP/genotype.

I can't technically say with 100% confidence that it would be possible. It does seem extremely likely based on all the evidence I've seen over the past 30 years.

The model would be highly nonlinear and nonlocal, at the very least.

Fascinating, are there lots of people looking at genetics with this ML kind of lens?

Sure, although I'm not aware of anybody who is contemplating quite the level I believe is necessary to really nail the problem into the ground. When I worked at Google, I proposed that Google build a datacenter-sized sequencing center in Iowa or Nebraska near its data centers, buy thousands of sequencers, run industrial-scale sequencing, and push the data straight to the cloud over fat fiber, followed by machine learning, for health research. I don't think Google wants to get involved in the physical sequencing part, but they did listen to my ideas, and they have several teams working on applying ML to genomics as well as other health research problems. Part of my job today (working at a biotech) is to manage the flows of petabytes of genomic data into the cloud and make it accessible to our machine learning engineers.

The really interesting approaches these days, IMHO, combine genomics and microscopic imaging of organoids, and many folks are trying to set up a "lab in the loop", in which large-scale experiments run autonomously by sophisticated ML systems could accelerate discovery. It's a fractally complex and challenging problem.

Statistics has been key to understanding genetics from the beginning (see Mendel, Fisher), so at a big pharma you will see everything from Bayesian bootstrappers using R to deep learners using PyTorch.

Guys at Verily are working on Terra.bio with the Broad Institute and others. Genomics England in the UK is also experimenting with multimodal data and machine learning applied to whole genome sequences [1].

[1] https://www.genomicsengland.co.uk/blog/data-representations-...

But why Google? This is what big pharma are doing. Also, you can outsource the data collection part. See for example UK Biobank. Their data are available to multiple companies after some period, which makes it more cost-efficient.

Why Google? Because this is a big data problem and Google mastered big data and ML on big data a long time ago. Most big pharma hasn't completely internalized the mindset required to do truly large-scale data analysis.

I have spent the better part of the past year looking obsessively over genomics papers for cancer and I've grown very fond of the field.

Are there any positions at Google or other companies you would suggest I look into? I'm coming from algo trading / ML research with an ML MSc.

You could try Calico. They are an Alphabet company that specifically studies aging. They have a good number of machine learning roles. However, biotech typically pays less than finance or software.

https://calicolabs.com/careers/

Thanks!

Yes. For example when word2vec came out, immediately there were people trying similar approaches to protein sequences. Transformers work better.
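
For anyone wondering how a word2vec-style approach even applies to proteins: a sketch, assuming the common trick of treating overlapping k-mers as "words" and generating (center, context) pairs from them. The choice of k=3, the window size, the function names, and the toy sequence are all assumptions for illustration. The actual embedding training step is left out.

```python
def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping k-mers, treated as 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs as in skip-gram training data."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


toy_protein = "MKTAYIAK"  # made-up amino acid string, not a real protein
tokens = kmer_tokens(toy_protein)
# tokens == ["MKT", "KTA", "TAY", "AYI", "YIA", "IAK"]
pairs = skipgram_pairs(tokens, window=1)
# pairs starts with ("MKT", "KTA"), ("KTA", "MKT"), ("KTA", "TAY"), ...
```

A transformer skips the fixed-window pairing and lets attention decide which positions are relevant to each other, which fits the "nonlocal" nature of the problem mentioned upthread.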