|
If you think like an ML engineer, the genome is a feature vector 3B bases (or 6B binary bits, at 2 bits per base) long that is highly redundant (many sections contain repeats, and many regions are correlated with other regions), and the mapping between that feature vector and an individual's specific properties (their "phenotype", which could be their height at full maturity, eye color, hair properties, propensity to diseases, etc.) is highly nonlinear. If you had a list of the genomes of all the people in the world, along with all their phenotypes (height, eye color, hair type, etc.), you could treat the genomes as input variables and the phenotypes as output variables, and build embeddings or other models that map from genomes to phenotypes. The result would be a predictive model that could take a human genome and spit out a prediction of what that person looks like and other details about them (up to the limits of heritability). A good example is height. If you take a very large, diverse sample of people and sequence them, you will find that about 50% of the variance in height can be traced to the genomic sequence of the individual (non-genomic factors such as socioeconomic status, access to health care, pollution, etc. contribute as well). Originally many geneticists believed that a small number of genes, tiny parts of the feature vector, would be the important features in the genome that explained height. But it didn't turn out that way. Instead, height is a nonlinear function of thousands of different locations (individual bases, entire genes, or other structures that vary between individuals) in the genome.
This was less surprising to folks who are molecular biologists (mainly because of the mental models geneticists and molecular biologists use to think about the mapping of genotype to phenotype), and we still don't have great mechanistic explanations of how each individual difference works in concert with all the others to lead to specific heights. When I started out studying this some 35 years ago, the problem sounded fairly simple. I assumed it would be easy to find the place in my genome that led to my funny-shaped (inherited) nose, but the more I learn about genomics and phenotypes, the more I appreciate that the problem is unbelievably complex, and really well suited to large datasets and machine learning. All the pharma companies have petabytes of genome sequences in the cloud that they try hard to analyze, but the results are mixed. I spent my entire thesis working on ATGCAAAT, by the way. That sequence is the motif bound by the octamer transcription factors (https://en.wikipedia.org/wiki/Octamer_transcription_factor), a family of proteins that are incredibly important during growth and development. Your genome is sprinkled with locations that contain that sequence, or ones like it, that are used to regulate the expression of proteins to carry out the development plan. |
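For anyone who wants to see what "locations that contain ATGCAAAT or something close to it" looks like in practice, here is a minimal sketch. It makes several assumptions that are not from the comment above: "close" is taken to mean within one mismatch (Hamming distance), only the forward strand is scanned, and the toy sequence and the function names are made up for illustration. Real motif finding usually relies on position weight matrices rather than a fixed mismatch cutoff.

```python
MOTIF = "ATGCAAAT"  # the octamer sequence mentioned in the comment above


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_motif_sites(sequence: str, motif: str = MOTIF, max_mismatches: int = 1):
    """Return (position, matched_text, mismatches) for every window of the
    forward strand that is within max_mismatches of the motif.

    The mismatch cutoff is an illustrative assumption, not a biological rule.
    """
    sequence = sequence.upper()
    k = len(motif)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        mismatches = hamming(window, motif)
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


# Hypothetical toy sequence: one exact site and one single-mismatch variant.
TOY_SEQUENCE = "GGCATGCAAATCCTTATGCTAATGG"

for position, site, mismatches in find_motif_sites(TOY_SEQUENCE):
    print(position, site, mismatches)
```

On the toy sequence this reports the exact match at position 3 and the one-mismatch variant ATGCTAAT at position 15. Scanning a real genome the same way would also need the reverse complement strand and a much faster search than this character-by-character loop.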
Would such a predictive model really be possible? As far as I'm aware, there is conflicting research on whether a specific phenotype can be distinctly traced to a SNP/genotype.