Hacker News new | ask | show | jobs
by Inufu 4889 days ago
Out of curiosity, what other computationally difficult problems are there?

I'm very interested in bioinformatics, but sadly don't know as much about the field as I'd like.

5 comments

1. gene networks is a big one: some proteins turn genes on or off. Some of those genes get translated into other proteins that turn genes on or off. How can you infer the interactions from experimental data? How can you figure out what these complex networks DO? 2. Predicting gene expression: where do proteins bind to the DNA? How can you predict what these proteins do once they are bound ( add chemical tags to structural proteins, knock off structural proteins by bending DNA, etc)? How can you predict how frequently the gene will be transcribed? How does the 3D shape of the DNA effect this?

These are just two of many questions ( biased towards my research interests of course ). It is really funny that he mentions sequence alignment and phylogenetically as the two big problems, because people generally consider these to be boring, uncool, solved-well-enough-for-our-purposes problems nowadays and just trust the algorithms described by Durbin decades ago. It sounds like the writer really doesn't know bioinformatics that well...

One that comes immediately to mind is genome assembly, which is a hugely complex problem, and essential to a variety of fields that rely on re-piecing together the genome without a reference (or with a reference that is highly divergent from the sequence data).
Genome assembly relies heavily on sequence alignment. So: Is genome assembly hard just because sequence alignment is hard? Or would genome assembly present separate algorithmic problems even if there was a super-efficient solution to sequence alignment?
It is far more difficult than sequence alignment. Sequence alignment has quadratic complexity, while fragment assembly is NP-hard. Se for example

http://scholar.google.com/scholar?cluster=131745416915434219...

Yes, for pairwise sequence alignment. The globally optimized multiple sequence alignment problem is NP-complete.
These are different sorts of alignments, with different sorts of math behind them.

Genome assembly is the shortest common super sequence problem. It involves finding the best rearrangement and overlap of reads which minimize the overall sequence, given the expected errors in the read technology. It would still be hard even if all of the reads were perfect.

Sequence alignment looks at two or more sequences in their entirety, and does a best fit alignment using a given model of how substitutions and gaps can occur. This model may be based on chemical or evolutionary knowledge.

A "super-efficient solution to sequence alignment" doesn't lead to a way to tell how the reads should be assembled into a single large sequence, even ignoring possible read errors.

An extra difficulty with genome assembly is that DNA often has lots and lots of repeated junk sequences that can confuse the algorithms. I don't work with bioinformatics to know how they usually get around this though.
Repeats aren't necessarily junk (e.g. TAL Effectors http://en.wikipedia.org/wiki/TAL_effector#DNA_recognition). Resolving them requires long reads. PacBio is currently of interest as an alternative to Sanger sequencing for this, although the error rate of PacBio reads is a bit of an issue.
pacbio is dead, they just don't know it yet. BGI (or somebody, doesn't matter, BGI is just the obvious candidate) would need to buy 50 SMART sequencers a year just for PacBio to stay in business. That seems unlikely given the lower cost and complexity of Illumina and Life sequencers
I do PhD research in metabolomics -- one of the latest omics in bioinfo-- with the CS department in my university. At the moment, we're working on alignment and identification of metabolite data. The data is not big in the sense of genomics data, but messy and complex due to the nature of the instruments (mass spectrometer), which will not get better THAT much in the foreseeable future.

Definitely a computationally difficult problem because while naive approaches work, they produce crappy results, wasting the result of tens of thousands of dollars of experiments. I see a big move towards applying statistical/machine learning methods, and graph theory stuffs in our field.

A lot of the rants in the original article are correct, with regards to prototyping and throwaway codes. That's because researchers are rushing to get an MVP out. The truly good ones got turned into (usually open-source) products, where the code quality hopefully improves a fair bit.

If you're a CS person who's interested or considering a move into bioinfo, I wrote a blog post about it recently: http://www.joewandy.com/2013/01/getting-into-bioinformatics....

Protein folding is an interesting and computational challenging task. So challenging that some groups have sort of given up on it and move to other fields. Look up David Baker and Rosetta for more info. This is just an example, there are many many problems to work on. I feel sorry for the author of the post, bioinformatics is only getting more interesting as our capacity to make experimental measurements grow. There have been so many interesting findings that are just the product of bioinformaticians digging into existing databases and analyzing them to come up with new theories that have since then been experimentally validated.
any type of network reconstruction - gene - gene / protein - protein , gene - protein , interaction network are all very challenging and important computational problems in biology