by zerohp 4889 days ago
> the software is written to be inefficient, to use memory poorly, and the cry goes up for bigger, faster machines! When the machines are procured, even larger hunks of data are indiscriminately shoved through black box implementations of algorithms in hopes that meaning will emerge on the far side. It never does, but maybe with a bigger machine…

I spent five years working in bioinformatics, and this is exactly the attitude of both the researchers and the other developers on the projects I worked on. It was very frustrating.


Hi, I'm a bioinformatics researcher. Apparently I work for this guy's ex(?)-employer although I have never heard of him before.

My single most limited resource is programmer time. My time and the time of other people who work with me. I have access to loads of computers that sit idle all the time, even if it is on nights and weekends. There is zero opportunity cost to me in using these computers more fully. I have enough human work to do that I can wait for the results without having any wait states.

There can be a big opportunity cost in trying to rework a workflow so that it is more efficient and then test it thoroughly to ensure correctness. Doing this may seem more appealing to someone who is interested primarily in computational efficiency. But I am more interested in research efficiency, and so are my employers and funders.

>There can be a big opportunity cost in trying to rework a workflow so that it is more efficient and then test it thoroughly to ensure correctness.

Hi, I recognize your name as a legit bioinformatician, am a huge fan of the lab that you're currently in, and others should listen to you.

I'd like to add that for many projects, general reusable software engineering is not necessarily a huge advantage. Instead of verifying a single implementation, it's often better for somebody to reimplement the idea from scratch; if a second implementation in a different language written by a different programmer gets the same results, this is a much more thorough validation of the software than going over prototype software line by line.
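To make the "second implementation" idea concrete, here is a minimal sketch of comparing two independently written versions of the same small computation on random inputs. The GC-fraction task and every function name here are hypothetical stand-ins chosen for illustration, not anything from a real project.

```python
import random


def gc_fraction_a(seq: str) -> float:
    """First implementation: count G and C with str.count."""
    if not seq:
        return 0.0
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def gc_fraction_b(seq: str) -> float:
    """Second, independently written implementation: explicit loop."""
    total = 0
    gc = 0
    for base in seq:
        total += 1
        if base in "GCgc":
            gc += 1
    return gc / total if total else 0.0


def compare_on_random_inputs(trials: int = 1000, seed: int = 0) -> int:
    """Return the number of random sequences on which the two versions disagree."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        length = rng.randint(0, 50)
        seq = "".join(rng.choice("ACGTNacgtn") for _ in range(length))
        if abs(gc_fraction_a(seq) - gc_fraction_b(seq)) > 1e-12:
            mismatches += 1
    return mismatches
```

Zero mismatches only shows that the two versions agree; both could still share the same misunderstanding of the problem.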

Also, I've seen way too many software engineers come in with an enterprisey attitude of establishing all sorts of crazy infrastructure and get absolutely no work done. If Java is your idea of a good time, it's unlikely that you'll be an effective researcher (though it's not unheard of), because it's not good at maximizing single-programmer output, and not good at maximizing I/O or CPU or string processing. In research it's best to get results, fail fast fast fast, and move on to the next idea. If you're lucky, 1 in 20 will work out. Publish your crap, and if it's a good idea, it will be worth polishing the turd later, but it's better to explore the field than to spend too much time on an uninteresting area.

The only time you worry about efficiency is when it enables a whole other level of analysis. So, for example, UCSC does most of their work in C, including an entire web app and framework written in C, because when they were doing the draft assembly of the human genome a decade ago on a small cluster of computers that they scrounged from secretaries' desks over the summer, Perl wouldn't cut it.

Software engineering is important for bioinformatics, in my opinion. But it's important to identify the things that are important and aren't:

Reproducible code: extremely important. Correct code: extremely important. Readable code: very important. Efficient code: often not as important.

Even today, the UCSC Genome Browser is an example where efficient code is important. It is interactive software with many human users, who can work much more efficiently when the browser is responsive. And with projects like ENCODE, there are now incredible amounts of data available from the browser that would not be easily possible with a less efficient system.

Very different from an analysis system that will be run a handful of times in batch mode.

>Reproducible code: extremely important. Correct code: extremely important. Readable code: very important. Efficient code: often not as important.

You want Haskell. :)

>If Java is your idea of a good time, it's unlikely that you'll be an effective researcher (though it's not unheard of), because it's not good at maximizing single-programmer output, and not good at maximizing I/O or CPU or string processing.

FWIW, I have in the past gotten good results out of Java and C# (it's a lot easier in C#) by writing programs that generate bytecode at runtime, so they can use the JIT to further optimize performance. Getting the same results out of C would require a lot more work. This includes string processing - I wrote a regex compiler for Java at one point, easily outperforming java.util.regex.

And such things are not difficult - or at least, not difficult to me now, knowing all I know - perhaps 10 hours' work for a simple regex compiler. And that is how I would use tools like Java to optimize my own performance: adapt them to interpret or compile a language that is close to the problem domain. A slightly higher constant cost, with the aim of a much lower per-idea cost.

You describe N-version programming (though not by name). In actuality two different from-scratch implementations are likely to re-make the same mistakes, see the following. http://scholar.google.com/scholar?q=An+Experimental+Evaluati...
> and if it's a good idea, it will be worth polishing the turd later,

Which of the released turds do you consider to be polished?

Pretty much anything that gets used by many people ends up getting polished (the exception being the RNA-seq field, it's still pretty rough out there, but the research is still taking quite a while). And if you're writing software, your tool isn't going to get used until it's somewhat polished, or is so unique and essential in its purpose that people have to use it.

In terms of next-generation sequence analysis, Heng Li's BWA mapper and Samtools libraries are fairly good. His coding style is a bit terse for my tastes, but it keeps out people who don't know what they're doing, it's very clear code for semi-complicated algorithms, and BWA is some of the most reliable software I use every day.

On the infrastructure side, Galaxy [https://main.g2.bx.psu.edu] is getting fairly good.

The BioConductor repository of R packages is extremely mixed. I don't like some of their architectural choices, but it's ended up working out OK.

I still use Michael Eisen's Cluster from a decade ago, along with Java TreeView.

Regarding samtools, it doesn't sound very good from what I'm hearing:

"Look at the disgusting state of the samtools code base. Many more cycles are being used because people write garbage. For a tool that is intimately tied to research, the absence of associated code commentary and meaningful commit messages is very poor. The code itself is not well self documenting either."

commit log:

http://samtools.svn.sourceforge.net/viewvc/samtools/trunk/sa...

code:

http://samtools.svn.sourceforge.net/viewvc/samtools/trunk/sa...

I can't find that critique with Google.

As I said, the style is very terse, and I have my suspicions that this is by design to minimize the number of less-qualified programmers trying to submit sub-standard code back to the project. (Edit: since it's been 10 minutes and I still can't reply to tomalsky's comment, I should point out that my "suspicions" are a joke; read the linked code sample and judge its quality for yourself.)

I have dived deep into the samtools code, rewritten chunks of file I/O inside it, messed with alternate formats, and my personal experience is that it's been easier for me to change, adapt, and understand it than any other open-source C project I've tried to dive into, such as, say, GNU join.

If anybody can point to where samtools is using many more cycles than it has to, please let me know! The worst part about it is that the compression and decompression is not multithreaded, but that is being worked out, I believe.

samtools is among the better software in sequencing-data analysis. It is also great in formally defining the file format it uses. The same cannot be said of many other tools. For my recent work in RNA-Seq, samtools is the most robust and trustworthy tool I looked at and used, and I looked at almost every popular tool, the major exception being RSEM. If only all bioinformatics tools were more like samtools.
It works just fine. And iterating through files is not rocket science, anyway; it's not hard to follow what is going on.
Well, I can call myself a bioinformatics researcher, I guess, as I have a CS Ph.D. and work in genetics/genomics. I see your point that throwing computers at simple solutions is cheaper than throwing good programmers at them. I do that too. We are very fortunate in that we write run-once programs that only have to work in one environment using one input. However, bad programmers write incorrect programs, which give wrong conclusions that lead to faulty clinical trials (look up Duke University facing a class-action lawsuit). I have seen people parsing gigabyte-sized files with one line of Awk. People seem to forget that good engineering practice is learned with blood. Is it any wonder academic research is looked at with suspicion by the pharmaceutical companies?
>I have seen people parsing gigabyte-sized files with one line of Awk

I feel exactly the opposite. I'm suspicious of anyone who does not use AWK (or other Unix text utilities) as a standard tool for checking the integrity of multi-gigabyte files, or generating summaries. AWK is super-fast, allows highly flexible checks, and allows quick and reliable interaction with huge amounts of data in a way that a script cannot.
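As an illustration of the kind of integrity check meant here, the following is a minimal sketch that tallies how many lines have each field count while reading one line at a time, so memory use stays flat however large the file is. It is roughly what `awk -F'\t' '{n[NF]++} END {for (k in n) print k, n[k]}'` does; the function name and the sample data are made up for the example.

```python
import io
from collections import Counter
from typing import Iterable


def field_count_summary(lines: Iterable[str], sep: str = "\t") -> Counter:
    """Count how many lines have each number of fields, one line at a time."""
    counts: Counter = Counter()
    for line in lines:
        counts[len(line.rstrip("\n").split(sep))] += 1
    return counts


# Tiny in-memory stand-in for a huge file; a real run would pass an open file handle.
sample = io.StringIO("chr1\t100\t200\nchr1\t300\t400\nchr2\t50\n")
summary = field_count_summary(sample)
```

More than one key in `summary` means some lines do not have the expected number of columns, which is usually the first thing worth knowing about a large tabular file.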

I love awk. I once had to search a multi-megabyte hunk of data that was made up of 25-bit data items packed into 32-bit words. Instead of doing bit packing and unpacking, I converted the words into 32-character strings of 1's and 0's. I ended up with a string 300,000,000 (that's three hundred million) characters long!!! Awk had no problems handling it.

To build the string, I had to concatenate 1024 of the 32-character strings into an intermediate string, and then concatenate these into the final monster, because concatenating just the 32-character strings took too long - a reallocation after every concatenation.

That was fun.

I believe this is an example of the sort of thing that the essay author complained about.

Bit-packing is simple. You spent a lot of time working around problems that shouldn't have existed in the first place. Even when using the approach you described, here is Python code which does what you described:

    >>> byte_to_bits = dict((chr(i), bin(i)[2:].zfill(8)) for i in range(256))
    >>> byte_to_bits["A"]
    '01000001'
    >>> as_bits = "".join(byte_to_bits[c] for c in open("benzotriazole.sdf").read())
    >>> as_bits[:16]
    '0000110100001010'
    >>> chr(int(as_bits[:8], 2))
    '\r'
    >>> chr(int(as_bits[8:16], 2))
    '\n'
    >>> open("benzotriazole.sdf").read(2)
    '\r\n'
This keeps everything in memory, since 300MB is not a lot of memory. If it was in the GB range then I would have written to a file instead of building an in-memory string.

The run-time was small enough that I didn't notice it.

The thing is, you succeeded in solving the problem, and are justly proud of your success. This is how a lot of scientists feel. But a lot of CS people look at the wasted work when there are simpler, better, more maintainable ways.
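For comparison, here is a minimal sketch of the direct bit-arithmetic route, with no intermediate string of 1's and 0's. It assumes the 25-bit items are laid out back to back, most significant bit first, across big-endian words; the original data layout is not described in the thread, so that is a guess. The function names are hypothetical, and `pack_items` exists only to build test data.

```python
from typing import Iterable, Iterator


def unpack_items(data: bytes, item_bits: int = 25) -> Iterator[int]:
    """Yield fixed-width unsigned items from a byte stream, most significant bit first."""
    mask = (1 << item_bits) - 1
    buffer = 0      # bits read but not yet emitted
    available = 0   # number of valid bits currently held in buffer
    for byte in data:
        buffer = (buffer << 8) | byte
        available += 8
        while available >= item_bits:
            available -= item_bits
            yield (buffer >> available) & mask
        buffer &= (1 << available) - 1  # drop bits that were already emitted


def pack_items(items: Iterable[int], item_bits: int = 25) -> bytes:
    """Inverse helper for testing: pack items MSB-first, zero-padding the last byte."""
    bits = "".join(format(x, f"0{item_bits}b") for x in items)
    bits += "0" * (-len(bits) % 8)
    return int(bits, 2).to_bytes(len(bits) // 8, "big") if bits else b""
```

Because `unpack_items` is a generator, a search can stop at the first match without ever holding the whole decoded data set in memory.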

It wasn't the AWK language that appalled me but the blind switching of two lines that were assumed to be paired reads, when many reads are not paired. It is precisely the lack of checking that is the problem. I have nothing against AWK, although personally I use Python for massaging data.
The Duke situation to which you refer was fraud, not just a result of programmer error or poor engineering practice.
It was initially a programming error, but the Duke researchers refused to acknowledge it and reanalyze their data, because that might mean retracting their prominent paper. From there it just snowballed. It was certainly fraud after the error was pointed out to them. There might also be other elements of fraud in their paper. I watched a presentation by the MD Anderson researchers who spotted the error and spent more than two years trying to call attention to it.
The smoking gun was an error, but there were something like 9 Potti papers that ended up getting retracted. There's no way that someone could have accidentally made that many mistakes...
Interestingly, the fraudsters were caught because of a false claim on a CV, and that finally destroyed their credibility.

It is intentional fraud, no doubt about it; they restarted halted clinical trials. I was just pointing out they did sloppy work too.

Why is that?
>We are very fortunate in that we write run-once programs that only have to work in one environment using one input.

If you work with that mentality, you're asking for trouble. Well, not so much asking for trouble, but sending Trouble a voicemail that says "We're over here, you lazy bastard, just see if you can mess something up!"

That was a tongue-in-cheek thingy. I write extensive tests for all my code. But when I look for a job, people count papers rather than weigh software quality. It is not easy.
I am a bit clueless here. What is bad about parsing large files with Awk?
Nothing. What is bad is changing the content without actually parsing the file, because you assume the file format mandates something when it doesn't.
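A minimal sketch of the missing check, assuming the older convention where mates carry `/1` and `/2` name suffixes and are expected to sit on adjacent records. Real files may use other naming schemes, and the function names here are hypothetical. The point is only that the pairing is verified before anything is swapped or rewritten.

```python
from typing import Iterable, List, Tuple


def split_read_name(read_name: str) -> Tuple[str, str]:
    """Split a read name into (fragment id, mate tag), assuming a /1 or /2 suffix."""
    if read_name.endswith("/1") or read_name.endswith("/2"):
        return read_name[:-2], read_name[-1]
    return read_name, ""


def check_pairs(names: Iterable[str]) -> List[Tuple[str, str]]:
    """Return verified (mate1, mate2) pairs; raise ValueError on anything unpaired."""
    names = list(names)
    if len(names) % 2:
        raise ValueError("odd number of reads; they cannot all be paired")
    pairs = []
    for first, second in zip(names[0::2], names[1::2]):
        id1, tag1 = split_read_name(first)
        id2, tag2 = split_read_name(second)
        if id1 != id2 or (tag1, tag2) != ("1", "2"):
            raise ValueError(f"not a mate pair: {first!r}, {second!r}")
        pairs.append((first, second))
    return pairs
```

Failing loudly on the first unpaired record is a deliberate choice: a silent swap of unrelated reads produces output that looks fine and is wrong.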
How can you leverage all of us really good programmers with tons of time, who are dying to work on something "important" and meaningful?
If you want to improve the software engineering quality of bioinformatics software, then find an open source project you are interested in, and offer to submit patches to improve really unsexy but important stuff for bioinformatics user experience. Things like documentation, deployment, user interface, and testing. Some of these things require little domain knowledge but no one wants to do them.

Edited to add: some projects might even have a bug tracker that will already have problems you can tackle.

Where to start? Is there a list of these projects?

I don't do programming for fun, but I'll be visiting a local university soon, and could share this with students.

Our Genomedata[1] storage format/API should be readily comprehensible, and has a Google Code tracker[2]:

[1] http://noble.gs.washington.edu/proj/genomedata/

[2] http://code.google.com/p/genomedata/issues/list

Got any examples?
Thank you!!
Part of the problem is grant money. Sometimes it's faster to buy more machines and get more results as opposed to rewriting entire algorithms. But the author does correctly identify, I think, some tendencies of some academic bioinformaticists.
I have enough experience to know if this is true or not. Many times it was faster to buy more machines, but often it was not. We already had 10000 cores.

I proposed, implemented, and tested an 8 line change to our alignment tool that saved 6% cpu time. It took me two days, most of which was my spare time at home. This one program was using 15 cpu years every month. Nobody cared. It never went into production. I started interviewing for a new job and left shortly after that.
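As a back-of-the-envelope check on those numbers, here is the arithmetic as a short sketch. It assumes the 6% applies uniformly to the whole 15 CPU-years per month, which the comment does not state explicitly.

```python
cpu_years_per_month = 15   # usage figure quoted above
saving_fraction = 0.06     # 6% reduction in CPU time (assumed to apply uniformly)

# CPU time no longer needed each month, and over a full year.
saved_per_month = cpu_years_per_month * saving_fraction
saved_per_year = saved_per_month * 12
```

That comes to roughly 0.9 CPU-years per month, or close to 11 CPU-years per year, against two days of developer time.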

How complicated was the bureaucracy that you couldn't push the change into production yourself after verifying that it is a strict speed-up and doesn't break anything? I think such barriers are incompatible with the word 'research', where the first thing you need is freedom.
Research is a highly competitive business mixed with industry involvement (or government involvement). You have to publish and fast. You have to develop your discoveries into something that can be monetized. You have to collaborate with industry to get funded. You have to cut costs to keep doing what you want to do. And so on. The idea of freedom in (fundamental) research seems long dead. How I long for the freedom in the research labs in the first half of the 20th century. To really explore an idea without regard for cost, returns, (publishable) results. A researcher can dream :-(
> The idea of freedom in (fundamental) research seems long dead

Is it so in the US? Or where? Here in Russia it is far from true, at least in the top institutes. As long as you produce publishable results, you may do virtually whatever you want, and nowadays pretty much anything is publishable. And this way you get funding, too, because the funding agency doesn't seem to want you to solve some particular problem, it just wants to be sure your science keeps up with the world.

The downside here is that the academy usually pays badly. Thus it seems most successful labs split their work something like 70/30 between commercial projects and "pure science". Anyway, when you work on commercial projects you usually get much more interesting results than you'd care to publish.

Here in the Netherlands it is. I assumed it to be the same in the Western world, but those kinds of generalizations often turn around to bite me in the ass. We researchers in the Netherlands have to produce, as funding depends on it. Furthermore, as the government funds less and less, we have to get more funding from industry. And finally we have to try to market our research more. This all means that we cannot afford to just do whatever we think is best for the sole purpose of extending our knowledge. We have to think about our career and the sustainability of our research (strand) in the long run.

That does not mean that we're just lapdogs for industry or Mammon, but it does mean that we're selective in what we do and how we do it.

Some labs are conservative because they are worried that they will not be able to reproduce the same analysis. For example, imagine that a lab had been collecting samples and executing the current code as they came in. Now imagine that, two years later, someone starts drawing some conclusions based on an aggregation of the results over a petabyte of that data. On one hand, you could just say: nope, we can't reproduce the same analysis, but we can use all of our computational power for a month and reanalyze all of the data using the current packages/code. On the other hand, a more conservative idea might be to try to record the entire state of the environment when particular samples are recorded, so that in theory you could replay all that analysis: fire up the VM from 2 years ago, install the same version of all the packages, install the code with the same tags, and analyze that data set, then do the same thing for every other data set. Smaller labs, I think, are just hoping that no one tries to replicate their studies or asks them if they can reproduce their results.
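A minimal sketch of the "record the state of the environment" idea, limited to what the Python standard library can see: interpreter version, platform string, and installed package versions. It does not capture VM images, system libraries, or the code's own version tag, which a real setup would also need to record. The function name is hypothetical.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect interpreter, OS, and installed-package versions as a plain dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata is missing or broken
            packages[name] = dist.metadata.get("Version")
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = environment_snapshot()
snapshot_json = json.dumps(snapshot, indent=2)  # store this next to the results
```

Writing `snapshot_json` alongside each batch of results at least records which package versions produced them, even if rebuilding that exact environment two years later is still work.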
I don't think the problem is the people (researchers, developers) but the infrastructure for research. Researchers are constantly thinking about getting new grants and renewing old ones, the way politicians are constantly worried about their corporate sponsors and getting reelected. The result is that we only get a little science and we only get a little good governance. The internal organizations that form as a result of this environment are artificial. In the lean times researchers make short-term decisions aimed at generating marketing and taking mindshare. In the fat times researchers ensure that all computational and lab space is used and come up with new reasons for growth.

A friend working in a large research institution once suggested a refactoring that would greatly improve the efficiency of an application. Instead, she was handed back down a recommendation that would make the application less efficient with the same functionality. The reason was that the computational usage was about to be audited, and the rule was that there would be no improvements in efficiency until after the audit was complete.

The system is an old house with hundred-year-old plumbing. The people you pour through the system are going to flow through the pipes abiding by the laws of physics. Blaming them for a leak is about as useful as blaming water: while you may win the moral argument, you will not solve the problem. The best you can do is replace them with new people who will react largely in the same manner.