by chrisamiller 4889 days ago
Some thoughts on this article:

- This guy clearly has a limited understanding of the field. This quote is laughable: "There are only two computationally difficult problems in bioinformatics, sequence alignment and phylogenetic tree construction."

- As a bioinformatician, I feel sorry for this guy. Just like any other field, there are shitty places to work. If I was stuck in a lab where a demanding PI with no computer skills kept throwing the results of poorly designed experiments at me and asking for miracles, I'd be a little bitter too.

- Just like any other field, there are also lots of places that are great places to work and are churning out some pretty goddamn amazing code and science. I'm working in cancer genomics, and we've already done work where the results of our bioinformatic analyses have saved people's lives. Here's one high-profile example that got a lot of good press. (http://www.nytimes.com/2012/07/08/health/in-gene-sequencing-...)

- I'm in the field of bioinformatics to improve human health and understand deep biological questions. I care about reproducibility and accuracy in my code, but 90% of the time, I could give a rat's ass about performance. I'm trying to find the answer to a question, and if I can get that answer in a reasonable amount of time, then the code is good enough. This is especially true when you consider that 3/4 of the things I do are one-off analyses with code that will never be used again. (largely because 3/4 of experiments fail - science is messy and hard like that). If given a choice between dicking around for two weeks to make my code perfect, or cranking out something that works in 2 hours, I'll pretty much always choose the latter. ("Premature optimization is the root of all evil (or at least most of it) in programming." --Donald Knuth)

- That said, when we do come up with some useful and widely applicable code, we do our best to optimize it, put it into pipelines with robust testing, and open-source it, so that the community can use it. If his lab never did that, they're rapidly falling behind the rest of the field.

- As for his assertion that bad code and obscure file formats are job security through obscurity, I'm going to call bullshit. For many years, the field lacked people with real CS training, so you got a lot of biologists reading a perl book in their spare time and hacking together some ugly, but functional solutions. Sure, in some ways that was less than optimal, but hell, it got us the human genome. The field is beginning to mature, and you're starting to see better code and standard formats as more computationally-savvy people move in. No one will argue that things couldn't be improved, but attributing it to unethical behavior or malice is just ridiculous.

tl;dr: Bitter guy with some kind of bone to pick doesn't really understand or accurately depict the state of the field.

4 comments

"I could give a rat's ass about performance. I'm trying to find the answer to a question, and if I can get that answer in a reasonable amount of time, then the code is good enough"

This is the one bad point that a lot of people agree with.

The more time a program needs to finish, the more time you will need to run it again with some other dataset, and in turn - more time to find the right answer.

I really feel that people with scientific and mathematical backgrounds should learn proper programming (not just take a course in some language, but gain actual experience). Design patterns, data structures, best practices, and memory consumption are all things that should be known before a person starts submitting code for these kinds of projects.

Spending time optimizing a program that you will use once is a waste. Sitting idle while waiting for a program to finish is also a waste. So I think it's reasonable to optimize for programmer time the first time, and then re-visit the design if you discover the code is getting reused and fed larger data sets.
Want to teach us? A bunch of us work right near AT&T Park in Mission Bay and would love to learn. Even a long day or two from you guys would be awesome. But as was alluded to, we can't pay you - we're poor as shit - especially when compared with you all.
What's the backstory on the author's tangent about the human genome? It sounded like the human genome project didn't actually do what the name implies.
Tell that to the tens of thousands of researchers who make use of the human reference genome daily. I don't even know what the guy is talking about there - imagining modern genetics or genomics without it is pretty much impossible.
The problem with bioinformatics is not "premature optimization", but rather no optimization at all.
Out of curiosity, what other computationally difficult problems are there?

I'm very interested in bioinformatics, but sadly don't know as much about the field as I'd like.

1. Gene networks are a big one: some proteins turn genes on or off. Some of those genes get translated into other proteins that turn genes on or off. How can you infer the interactions from experimental data? How can you figure out what these complex networks DO?

2. Predicting gene expression: where do proteins bind to the DNA? How can you predict what these proteins do once they are bound (add chemical tags to structural proteins, knock off structural proteins by bending DNA, etc.)? How can you predict how frequently the gene will be transcribed? How does the 3D shape of the DNA affect this?

These are just two of many questions (biased towards my research interests, of course). It is really funny that he mentions sequence alignment and phylogenetic tree construction as the two big problems, because people generally consider these to be boring, uncool, solved-well-enough-for-our-purposes problems nowadays and just trust the algorithms described by Durbin decades ago. It sounds like the writer really doesn't know bioinformatics that well...

One that comes immediately to mind is genome assembly, which is a hugely complex problem, and essential to a variety of fields that rely on re-piecing together the genome without a reference (or with a reference that is highly divergent from the sequence data).
Genome assembly relies heavily on sequence alignment. So: Is genome assembly hard just because sequence alignment is hard? Or would genome assembly present separate algorithmic problems even if there was a super-efficient solution to sequence alignment?
It is far more difficult than sequence alignment. Sequence alignment has quadratic complexity, while fragment assembly is NP-hard. See, for example:

http://scholar.google.com/scholar?cluster=131745416915434219...

Yes, for pairwise sequence alignment. The globally optimized multiple sequence alignment problem is NP-complete.
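
To make the quadratic claim for the pairwise case concrete, here is a minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). The function name and the +1/-1/-1 scoring values are assumptions chosen for illustration, not anything from the thread; real tools typically use substitution matrices and affine gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score the best global alignment of strings a and b.

    Fills a (len(a)+1) x (len(b)+1) table one row at a time, so the
    running time is O(len(a) * len(b)) and the memory is O(len(b)).
    The scoring parameters are illustrative assumptions.
    """
    # Row 0: aligning the empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The two nested loops are where the quadratic cost comes from. Extending the same table to k sequences gives a k-dimensional table, which is one way to see why exact multiple alignment blows up.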
These are different sorts of alignments, with different sorts of math behind them.

Genome assembly is the shortest common superstring problem. It involves finding the best arrangement and overlap of reads that minimizes the overall sequence length, given the expected errors in the read technology. It would still be hard even if all of the reads were perfect.

Sequence alignment looks at two or more sequences in their entirety, and does a best fit alignment using a given model of how substitutions and gaps can occur. This model may be based on chemical or evolutionary knowledge.

A "super-efficient solution to sequence alignment" doesn't lead to a way to tell how the reads should be assembled into a single large sequence, even ignoring possible read errors.
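
As a rough sketch of the assembly framing described above, here is a greedy overlap-merge heuristic for error-free reads. It is an illustration under stated assumptions, not how production assemblers work (those generally build overlap or de Bruijn graphs), and greedy merging does not guarantee the shortest superstring. The names `overlap` and `greedy_superstring` are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedily merge the pair of reads with the largest overlap.

    Assumes error-free reads. Always returns a string containing every
    read, but it is a heuristic: the result may not be the shortest one.
    """
    # Drop exact duplicates, then reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the best pair, keeping the overlapping part only once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0] if reads else ""


if __name__ == "__main__":
    print(greedy_superstring(["ATTAGACC", "GACCTGCC", "TGCCGGAA"]))
```

Note that the overlap check here is exact string matching, not alignment. Even with that step made trivial, picking which reads to merge and in what order is where the difficulty lies.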

An extra difficulty with genome assembly is that DNA often has lots and lots of repeated junk sequences that can confuse the algorithms. I don't work in bioinformatics, though, so I don't know how they usually get around this.
Repeats aren't necessarily junk (e.g. TAL Effectors http://en.wikipedia.org/wiki/TAL_effector#DNA_recognition). Resolving them requires long reads. PacBio is currently of interest as an alternative to Sanger sequencing for this, although the error rate of PacBio reads is a bit of an issue.
PacBio is dead, they just don't know it yet. BGI (or somebody, doesn't matter, BGI is just the obvious candidate) would need to buy 50 SMRT sequencers a year just for PacBio to stay in business. That seems unlikely given the lower cost and complexity of Illumina and Life sequencers.
I do PhD research in metabolomics -- one of the latest omics in bioinfo -- with the CS department at my university. At the moment, we're working on alignment and identification of metabolite data. The data is not big in the sense of genomics data, but messy and complex due to the nature of the instruments (mass spectrometers), which will not get THAT much better in the foreseeable future.

It's definitely a computationally difficult problem because, while naive approaches work, they produce crappy results, wasting the output of tens of thousands of dollars of experiments. I see a big move towards applying statistical/machine learning methods and graph theory in our field.

A lot of the rants in the original article are correct with regard to prototyping and throwaway code. That's because researchers are rushing to get an MVP out. The truly good ones get turned into (usually open-source) products, where the code quality hopefully improves a fair bit.

If you're a CS person who's interested or considering a move into bioinfo, I wrote a blog post about it recently: http://www.joewandy.com/2013/01/getting-into-bioinformatics....

Protein folding is an interesting and computationally challenging task. So challenging that some groups have sort of given up on it and moved to other fields. Look up David Baker and Rosetta for more info. This is just one example; there are many, many problems to work on. I feel sorry for the author of the post; bioinformatics is only getting more interesting as our capacity to make experimental measurements grows. There have been so many interesting findings that are just the product of bioinformaticians digging into existing databases and analyzing them to come up with new theories that have since been experimentally validated.
Any type of network reconstruction (gene-gene, protein-protein, or gene-protein interaction networks) is a very challenging and important computational problem in biology.