Hacker News
by epistasis 4889 days ago
>There can be a big opportunity cost in trying to rework a workflow so that it is more efficient and then test it thoroughly to ensure correctness.

Hi, I recognize your name as a legit bioinformatician, am a huge fan of the lab that you're currently in, and others should listen to you.

I'd like to add that for many projects, general reusable software engineering is not necessarily a huge advantage. Instead of verifying a single implementation, it's often better for somebody to reimplement the idea from scratch; if a second implementation in a different language written by a different programmer gets the same results, this is a much more thorough validation of the software than going over prototype software line by line.
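To make the "second implementation" idea concrete, here is a minimal sketch in Python. The task (reverse complement of a DNA string), the function names, and the random-input harness are all assumptions chosen for illustration; none of them come from a real project. Two versions written in deliberately different styles are run on the same random inputs, and any input where they disagree is reported.

```python
import random

def revcomp_table(seq: str) -> str:
    """Implementation A: translation table plus slice reversal."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def revcomp_loop(seq: str) -> str:
    """Implementation B: explicit loop walking the string from its last base."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    out = []
    for i in range(len(seq) - 1, -1, -1):
        out.append(pairs[seq[i]])
    return "".join(out)

def cross_check(impl_a, impl_b, trials: int = 1000, seed: int = 0) -> list:
    """Run both implementations on random DNA strings.

    Returns the list of inputs on which the two outputs differ.
    """
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        length = rng.randint(0, 50)
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        if impl_a(seq) != impl_b(seq):
            mismatches.append(seq)
    return mismatches

if __name__ == "__main__":
    print(len(cross_check(revcomp_table, revcomp_loop)), "disagreements")
```

An empty mismatch list only shows that the two versions agree on the inputs tried. Both could still share a misunderstanding of the problem, for example about lowercase or ambiguous bases, which this harness never generates.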

Also, I've seen way too many software engineers come in with an enterprisey attitude of establishing all sorts of crazy infrastructure and get absolutely no work done. If Java is your idea of a good time, it's unlikely that you'll be an effective researcher (though it's not unheard of), because it's not good at maximizing single-programmer output, and not good at maximizing I/O or CPU or string processing. In research it's best to get results, fail fast fast fast, and move on to the next idea. If you're lucky, 1 in 20 will work out. Publish your crap, and if it's a good idea, it will be worth polishing the turd later, but it's better to explore the field than to spend too much time on an uninteresting area.

The only time you worry about efficiency is when it enables a whole other level of analysis. So, for example, UCSC does most of their work in C, including an entire web app and framework written in C, because when they were doing the draft assembly of the human genome a decade ago on a small cluster of computers that they scrounged from secretaries' desks over the summer, Perl wouldn't cut it.


Software engineering is important for bioinformatics, in my opinion. But it's important to identify which things matter and which don't:

Reproducible code: extremely important. Correct code: extremely important. Readable code: very important. Efficient code: often not as important.

Even today, the UCSC Genome Browser is an example where efficient code is important. It is interactive software with many human users, who can work much more efficiently when the browser is responsive. And with projects like ENCODE, there are now incredible amounts of data available from the browser that would not easily be possible with a less efficient system.

Very different from an analysis system that will be run a handful of times in batch mode.

>Reproducible code: extremely important. Correct code: extremely important. Readable code: very important. Efficient code: often not as important.

You want Haskell. :)

>If Java is your idea of a good time, it's unlikely that you'll be an effective researcher (though it's not unheard of), because it's not good at maximizing single-programmer output, and not good at maximizing I/O or CPU or string processing.

FWIW, I have in the past gotten good results out of Java and C# (it's a lot easier in C#) by writing programs that generate bytecode at runtime, so they can use the JIT to further optimize performance. Getting the same results out of C would require a lot more work. This includes string processing - I wrote a regex compiler for Java at one point, easily outperforming java.util.regex.

And such things are not difficult - or at least, not difficult to me now, knowing all I know - perhaps 10 hours' work for a simple regex compiler. And that is how I would use tools like Java to optimize my own performance: adapt them to interpret or compile a language that is close to the problem domain. A slightly higher constant cost, with the aim of a much lower per-idea cost.
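As a rough illustration of compiling a small problem-domain language at runtime, here is a Python sketch. It is an analog of the idea only, not the Java bytecode generator described above. The motif language (literal bases plus `N` as a wildcard) and the name `compile_motif` are assumptions made up for this example. The function generates Python source specialized for one motif, compiles it with the built-in `compile`, and returns the resulting function.

```python
def compile_motif(motif: str):
    """Generate and compile a search function specialized for one motif.

    'N' matches any character; every other motif character must match exactly.
    The returned function maps a sequence to the list of match start positions.
    """
    # One unrolled comparison per non-wildcard position in the motif.
    checks = [
        f"s[i + {k}] == {ch!r}"
        for k, ch in enumerate(motif)
        if ch != "N"
    ]
    condition = " and ".join(checks) or "True"
    source = (
        "def find(s):\n"
        f"    n = len(s) - {len(motif)} + 1\n"
        f"    return [i for i in range(max(n, 0)) if {condition}]\n"
    )
    namespace = {}
    exec(compile(source, f"<motif {motif}>", "exec"), namespace)
    return namespace["find"]

if __name__ == "__main__":
    find_gnt = compile_motif("GNT")
    print(find_gnt("GATGCTGG"))  # → [0, 3]
```

The generated function has no inner loop over the motif and no wildcard test at match time, since both were resolved when the source was built. Whether that beats a general-purpose regex engine for a given workload is something to measure, not assume.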

You describe N-version programming (though not by name). In practice, two different from-scratch implementations are likely to repeat the same mistakes; see the following: http://scholar.google.com/scholar?q=An+Experimental+Evaluati...

> and if it's a good idea, it will be worth polishing the turd later,

Which of the released turds do you consider to be polished?

Pretty much anything that gets used by many people ends up getting polished (the exception being the RNA-seq field, it's still pretty rough out there, but the research is still taking quite a while). And if you're writing software, your tool isn't going to get used until it's somewhat polished, or is so unique and essential in its purpose that people have to use it.

In terms of next-generation sequence analysis, Heng Li's BWA mapper and Samtools libraries are fairly good. His coding style is a bit terse for my tastes, but it keeps out people who don't know what they're doing, it's very clear code for semi-complicated algorithms, and BWA is some of the most reliable software I use every day.

On the infrastructure side, Galaxy [https://main.g2.bx.psu.edu] is getting fairly good.

The Bioconductor repository of R packages is extremely mixed in quality. I don't like some of their architectural choices, but it's ended up working out OK.

I still use Michael Eisen's Cluster from a decade ago, along with Java TreeView.

Regarding samtools, it doesn't sound very good from what I'm hearing:

"Look at the disgusting state of the samtools code base. Many more cycles are being used because people write garbage. For a tool that is intimately tied to research, the absence of associated code commentary and meaningful commit messages is very poor. The code itself is not well self documenting either."

commit log:

http://samtools.svn.sourceforge.net/viewvc/samtools/trunk/sa...

code:

http://samtools.svn.sourceforge.net/viewvc/samtools/trunk/sa...

I can't find that critique with Google.

As I said, the style is very terse, and I have my suspicions that this is by design to minimize the number of less-qualified programmers trying to submit sub-standard code back to the project. (Edit: since it's been 10 minutes and I still can't reply to tomalsky's comment, I should point out that my "suspicions" are a joke; read the linked code sample and judge its quality for yourself.)

I have dived deep into the samtools code, rewritten chunks of file I/O inside it, messed with alternate formats, and my personal experience is that it's been easier for me to change, adapt, and understand it than any other open-source C project I've tried to dive into, such as, say, GNU join.

If anybody can point to where samtools is using many more cycles than it has to, please let me know! The worst part about it is that compression and decompression are not multithreaded, but that is being worked out, I believe.
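For anyone wondering why multithreaded compression is plausible here at all: BAM's BGZF container is a series of concatenated gzip blocks, so blocks can in principle be compressed independently. The Python sketch below shows only that general idea. It is not samtools' implementation, it does not write BGZF headers, and `BLOCK_SIZE` and `compress_blocks` are hypothetical names picked for this example. It also assumes the interpreter's zlib bindings release the GIL while compressing, so that threads give a real speedup.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # hypothetical block size chosen for this sketch

def compress_blocks(data: bytes, workers: int = 4) -> bytes:
    """Compress fixed-size blocks in parallel and concatenate the gzip members.

    A sequence of gzip members is itself a valid gzip stream, so the result
    can be read back by an ordinary single-threaded gzip reader.
    """
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    if not blocks:
        blocks = [b""]  # still emit one (empty) member for empty input
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so block order is preserved.
        members = list(pool.map(gzip.compress, blocks))
    return b"".join(members)

if __name__ == "__main__":
    payload = b"ACGT" * 500_000
    packed = compress_blocks(payload)
    assert gzip.decompress(packed) == payload
    print(len(payload), "bytes ->", len(packed), "bytes")
```

Compressing per block costs some compression ratio, because each block starts with an empty dictionary, in exchange for independence between blocks.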

> I can't find that critique with Google.

I didn't link the source because I am ashamed to admit that I clicked on the reddit link that was posted in this thread.

http://www.reddit.com/r/bioinformatics/comments/179e9k/a_far...

samtools is among the better software in sequencing-data analysis. It is also great at formally defining the file format it uses. The same cannot be said of many other tools. For my recent work in RNA-Seq, samtools is the most robust and trustworthy tool I looked at and used, and I looked at almost every popular tool, the major exception being RSEM. If only all bioinformatics tools were more like samtools.

It works just fine. And iterating through files is not rocket science, anyway; it's not hard to follow what is going on.