by jinto36 1310 days ago

Protein structure prediction was a huge deal, which is why AlphaFold received so much fanfare, and it actually is pretty good. The next step is to predict how and where proteins interact within multi-protein complexes, which is not as simple as predicting the structures of two proteins independently and then trying to fit them together like a puzzle, because the interactions can themselves change the structure. While it's not as hard as it used to be to experimentally determine the protein targets of, for example, a protein kinase, it's still not a trivial or cheap experiment, and doing that for the many thousands of such proteins, across different conditions (stress, presence of co-factors, etc.) and in different organisms, would be rather a lot of work. Something like AlphaFold that makes reasonable predictions, and that can help you focus on what's most likely to be relevant to your disease or process of interest, helps quite a bit.

There's also more need for integrating "multi-omics" data, where you have data from multiple assays (gene expression, phospho-proteomics, lipidomics, epigenetics, small RNA expression, etc.) and the goal is to somehow combine the results from these various levels of gene regulation to get closer to figuring out the actual mechanism behind complex processes. Building on that, we can also do single-cell multi-omics to some extent, where you have results from different sequencing-based assays on the same individual cell. This is still pretty limited, but it's exciting and advancing pretty quickly. It will eventually be combined with things like spatial transcriptomics, which is useful for mapping out what's going on in heterogeneous tissue samples such as tumors, so we'll end up with spatial single-cell multi-omics. At that point you're looking at 1) some quantitative trait for multiple genes/loci/molecules, often 10k+ such features at the same time per assay, 2) multiple assays, such as DNA accessibility and gene expression, in 3) single cells, of which you might have 10k in a single sample, 4) across a physical tissue sample where individual cells are spatially mapped, and where you probably want to figure out how cells might influence the state of those around them, and 5) in multiple different samples, where you might want to compare disease vs. control, or look for correlations with the heterogeneity of results within one group.
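To get a feel for the shape of the data in points 1) through 5), here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Cell` class, the feature names (`geneA`, `peak1`), the coordinates, and the values are made up, and real tooling stores these as sparse matrices rather than nested dicts.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One spatially mapped cell with per-assay feature measurements."""
    cell_id: str
    x: float  # position in the tissue section (arbitrary units)
    y: float
    assays: dict  # assay name -> {feature name: measured value}


def dense_value_count(n_features, n_assays, n_cells, n_samples):
    """Number of values if every feature were stored for every cell."""
    return n_features * n_assays * n_cells * n_samples


def neighbors(cell, cells, radius):
    """Cells within `radius` of `cell`, by straight-line distance."""
    return [
        other for other in cells
        if other.cell_id != cell.cell_id
        and math.dist((cell.x, cell.y), (other.x, other.y)) <= radius
    ]


# Toy data: hypothetical feature names and made-up values.
cells = [
    Cell("c1", 0.0, 0.0, {"expression": {"geneA": 5.0}, "accessibility": {"peak1": 1.0}}),
    Cell("c2", 1.0, 0.0, {"expression": {"geneA": 0.0}, "accessibility": {"peak1": 0.0}}),
    Cell("c3", 10.0, 10.0, {"expression": {"geneA": 2.0}, "accessibility": {"peak1": 1.0}}),
]

# 10k features x 2 assays x 10k cells x 2 samples (disease vs. control)
print(dense_value_count(10_000, 2, 10_000, 2))  # 400000000
print([c.cell_id for c in neighbors(cells[0], cells, radius=2.0)])  # ['c2']
```

Even this small case of two assays and two samples comes to 400 million values if stored densely, before you get to any question about how neighboring cells affect each other.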

There's a lot of public data already available for single-cell gene expression projects if you want to get a feel for how these things are structured and how passable (but not amazing) the existing tooling is. One of the main repositories for this data is the NCBI's SRA https://www.ncbi.nlm.nih.gov/sra but you'll quickly note that searching and browsing is not as easy as you might think it would be, because one of the main limiting factors in bioinformatics is how bad everyone is at keeping terminology consistent. For many bioinformaticians, a majority of time is spent in the data cleaning phase. It's awful. Sometimes the experimental parameters make it into SRA or GEO, but sometimes you have to read through the associated paper to pull them out. Often it's only large consortium projects like The Cancer Genome Atlas (TCGA) or the Genotype-Tissue Expression project (GTEx), which have enough funding for staff dedicated to data management, that end up publishing datasets that are easy to "consume" without having to jump through a whole bunch of hoops to figure out how the data was produced.
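As a rough illustration of what that cleaning looks like, here is a small Python sketch that maps free-text sample labels onto a consistent vocabulary. The synonym table and the example labels are hypothetical, not taken from SRA or GEO; real metadata needs a far larger curated vocabulary or an ontology, plus manual review of whatever fails to match.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "tumor": ["tumor", "tumour", "cancer tissue"],
    "control": ["control", "ctrl", "normal", "healthy"],
}

# Invert the table so each variant spelling points at its canonical term.
LOOKUP = {
    variant: canonical
    for canonical, variants in SYNONYMS.items()
    for variant in variants
}


def normalize_label(raw):
    """Return the canonical term for a free-text label, or None if unknown."""
    # Lowercase, treat "_" and "-" as spaces, and collapse repeated whitespace.
    key = " ".join(raw.lower().replace("_", " ").replace("-", " ").split())
    return LOOKUP.get(key)


labels = ["Tumour", "  CTRL ", "cancer_tissue", "adjacent"]
print([normalize_label(label) for label in labels])
# ['tumor', 'control', 'tumor', None]
```

Returning `None` for unknown labels, rather than guessing, keeps the unmatched ones visible so someone can go check the associated paper.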

I have a BS/MS in bioinformatics and I'm presently a PhD candidate in genetics and computational biology defending in February.

1 comment

pengwing 1309 days ago

So, if I understood you correctly, further lowering the cost of experimentally determining protein targets could be a viable way forward that is completely orthogonal to computational methods?
