by 2bitencryption 1874 days ago
> Biotechnology sounds to me much like computing in the 60’s.

One thing I've always wondered about biotech... I imagine there are many non-obvious correlations and interactions in medicine, which would be easily detected using nothing more advanced than Excel-spreadsheet level data analysis.

Making up an example: people with a certain DNA trait/allele who also have a diet with a high amount of XYZ tend to not develop disease ABC as frequently as most people. Even if we don't know the pharmacological reason why that is, it would still massively benefit lots of people, right?

So it always seems to me like tech from 2007 was ready to tackle this problem. Dump in a bunch of anonymized data, find correlations, repeat.

But I feel like I never hear anything about this type of work. Is it happening, but not publicized much? Is it actually not as simple as it sounds? Does nature simply not work in this way?

Even if 95% of diseases are just "bad luck", I assume that other 5% is made up of environmental factors we don't yet understand, but could easily learn using well-known data processing techniques?

4 comments

This has been going on for decades. It's called GWAS [1], and it has had a few successes, but it basically hasn't worked as well as everyone hoped it would in the 90s. The reason is that the human genome has ~3 billion letters and human physiology is complex, so establishing statistically significant correlations between genome variations and human physiology is hard and requires more than Excel. In fact, the computational tools that have been applied to this are incredibly sophisticated and are not the limiting factor. The limiting factor is that you probably need millions or billions of genomes to make it work, and we don't have that yet. Also, people are beginning to realize that many disease-relevant traits are caused by rare variants (rather than obvious statistically significant correlations), which are quite hard to detect this way.

So... anyway, you're right that this is a natural way to approach the question of understanding the genetic basis of disease and physiology. But it's been beaten to death and found to yield fewer insights than were hoped.

[1] https://en.wikipedia.org/wiki/Genome-wide_association_study
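To make the "hard to establish statistically significant correlations" point concrete, here is a rough sketch of the multiple-testing problem. Everything in it is synthetic and assumed for illustration: the cohort size, the number of variants, the 20% disease rate and the 30% carrier rate are made up, and the helper names `two_proportion_p_value` and `scan_noise_variants` are hypothetical. No variant has any real effect, yet a naive p < 0.05 cutoff should still flag roughly 5% of them. A Bonferroni correction (0.05 divided by the number of tests) removes nearly all of those false hits; at about a million independent tests it gives the conventional genome-wide threshold of 5e-8, which is why such large cohorts are needed.

```python
import math
import random


def two_proportion_p_value(cases_a, n_a, cases_b, n_b):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    if n_a == 0 or n_b == 0:
        return 1.0
    pooled = (cases_a + cases_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (cases_a / n_a - cases_b / n_b) / se
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


def scan_noise_variants(n_people=500, n_variants=2000, seed=0):
    """Test many purely random 'variants' against a purely random disease label."""
    rng = random.Random(seed)
    has_disease = [rng.random() < 0.2 for _ in range(n_people)]
    total_cases = sum(has_disease)
    p_values = []
    for _ in range(n_variants):
        # Carrier status is drawn independently of disease: no true effect exists.
        is_carrier = [rng.random() < 0.3 for _ in range(n_people)]
        n_carriers = sum(is_carrier)
        cases_carriers = sum(d for d, c in zip(has_disease, is_carrier) if c)
        p_values.append(
            two_proportion_p_value(
                cases_carriers,
                n_carriers,
                total_cases - cases_carriers,
                n_people - n_carriers,
            )
        )
    return p_values


p_values = scan_noise_variants()
nominal_hits = sum(p < 0.05 for p in p_values)
bonferroni_hits = sum(p < 0.05 / len(p_values) for p in p_values)
print("variants tested:", len(p_values))
print("hits at p < 0.05:", nominal_hits)
print("hits after Bonferroni correction:", bonferroni_hits)
```

Real GWAS pipelines use more careful tests and correct for things like population structure, so treat this only as an illustration of why the significance bar rises with the number of variants tested.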

Lasso (L1 regularised) regression was actually invented to solve these kinds of problems. To the best of my knowledge, this has not yet appeared in Excel.
As other commenters have mentioned, it is common practice to use such an approach to find genes that are associated with a disease (GWAS).
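
For anyone curious what Lasso does mechanically, below is a minimal sketch in plain Python, not how any GWAS tool actually implements it. Lasso penalises the sum of absolute coefficient values (the L1 norm), which pushes the weights of uninformative features to exactly zero. The sketch uses coordinate descent with soft-thresholding on synthetic data. The data generator, the penalty `alpha=0.2`, the feature counts and all function names are assumptions for illustration. It also assumes roughly zero-mean features and fits no intercept.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return exactly 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 one coordinate at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iters):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of feature j with the residual, with j's own
            # current contribution added back in.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def make_synthetic_data(n_samples=200, n_features=30, seed=0):
    """Only features 0 and 1 influence y; the other features are pure noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.5) for row in X]
    return X, y


X, y = make_synthetic_data()
weights = lasso_coordinate_descent(X, y, alpha=0.2)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("features with non-zero weight:", selected)
```

The surviving weights are shrunk toward zero compared with the true effects, which is the price of the penalty. In practice you would reach for a library implementation and choose `alpha` by cross-validation rather than hand-rolling this.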

But finding a correlated gene is only the first step. One issue is that a single protein can participate in hundreds of seemingly unrelated chemical reactions throughout the body, depending on the cell type and environment. So simply tweaking the gene's expression will have many unintended consequences.

For instance, each cell is constantly maintaining a bafflingly complex balance between growing and dying. Any external perturbation has a good chance of either killing the cell or causing cancer.

I've found myself thinking the same. Maybe researchers don't have the data and/or the platform? Not sure who records what they eat, and if they do they don't share it?
This work has in fact been happening for decades. You are right that a lot of bioinformatics is analysis of tabular data using statistical models.