Hacker News
by east2west 4889 days ago
Well, I can call myself a bioinformatics researcher, I guess, as I have a CS Ph.D. and work in genetics/genomics. I see your point that throwing computers at simple solutions is cheaper than throwing good programmers at them. I do that too. We are very fortunate in that we write run-once programs that only have to work in one environment using one input. However, bad programmers write incorrect programs, which give wrong conclusions that lead to faulty clinical trials (look up Duke University facing a class-action lawsuit). I have seen people parsing gigabyte-sized files with one line of Awk. People seem to forget that good engineering practice is learned with blood. Is it any wonder academic research is looked on with suspicion by the pharmaceutical companies?

>I have seen people parsing gigabyte-sized files with one line of Awk

I feel exactly the opposite. I'm suspicious of anyone who does not use AWK (or other Unix text utilities) as a standard tool for checking the integrity of multi-gigabyte files, or generating summaries. AWK is super-fast, allows highly flexible checks, and allows quick and reliable interaction with huge amounts of data in a way that a script cannot.
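For readers who don't use AWK, here is a rough sketch of the kind of integrity check meant above: a streaming count of how many lines have each number of fields. It is roughly what `awk -F'\t' '{n[NF]++} END {for (k in n) print k, n[k]}'` does. The function name `field_count_summary`, the tab separator, and the sample data are assumptions for illustration only.

```python
import io
from collections import Counter


def field_count_summary(lines, sep="\t"):
    """Count how many lines have each number of fields.

    Reads one line at a time, so memory use stays flat even for
    multi-gigabyte inputs.
    """
    counts = Counter()
    for line in lines:
        counts[len(line.rstrip("\r\n").split(sep))] += 1
    return counts


# Tiny in-memory stand-in for a large tab-separated file (made-up data).
sample = io.StringIO("chr1\t100\tA\nchr1\t200\tC\nchr2\t300\n")
summary = field_count_summary(sample)
print(dict(summary))
```

If the file is supposed to have three columns everywhere, any key other than 3 in the summary points at rows worth looking at before running the real analysis.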

I love awk. I once had to search a multi-megabyte hunk of data that was made up of 25-bit data items packed into 32-bit words. Instead of doing bit packing and unpacking, I converted the words into 32-character strings of 1's and 0's. I ended up with a string 300,000,000 (that's three hundred million) characters long!!! Awk had no problems handling it.

To build the string, I had to concatenate 1024 of the 32-character strings into an intermediate string, and then concatenate these into the final monster, because concatenating just the 32-character strings one at a time took too long: a reallocation after every concatenation.

That was fun.

I believe this is an example of the sort of thing that the essay author complained about.

Bit-packing is simple. You spent a lot of time working around problems that shouldn't have existed in the first place. Even when using the approach you described, here is Python code which does what you described:

    >>> byte_to_bits = dict((chr(i), bin(i)[2:].zfill(8)) for i in range(256))
    >>> byte_to_bits["A"]
    '01000001'
    >>> as_bits = "".join(byte_to_bits[c] for c in open("benzotriazole.sdf").read())
    >>> as_bits[:16]
    '0000110100001010'
    >>> chr(int(as_bits[:8], 2))
    '\r'
    >>> chr(int(as_bits[8:16], 2))
    '\n'
    >>> open("benzotriazole.sdf").read(2)
    '\r\n'
This keeps everything in memory, since 300MB is not a lot of memory. If it was in the GB range then I would have written to a file instead of building an in-memory string.

The run-time was small enough that I didn't notice it.
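To back up the "bit-packing is simple" claim, here is a minimal sketch of pulling the 25-bit items straight out of the 32-bit words with shifts and masks, with no intermediate string of 1's and 0's. The original data layout isn't described, so this assumes the items are packed back to back, most significant bit first, in big-endian words, and that items may straddle word boundaries. The name `unpack_items` and the sample bytes are made up for illustration.

```python
import struct


def unpack_items(words, item_bits=25, word_bits=32):
    """Yield item_bits-wide integers from an iterable of word_bits-wide integers.

    Assumes items are packed contiguously, most significant bit first.
    Trailing bits that do not fill a whole item are discarded.
    """
    mask = (1 << item_bits) - 1
    buffer = 0      # bits read but not yet emitted
    available = 0   # number of valid bits currently in buffer
    for word in words:
        buffer = (buffer << word_bits) | word
        available += word_bits
        while available >= item_bits:
            available -= item_bits
            yield (buffer >> available) & mask
        buffer &= (1 << available) - 1  # keep only the leftover bits


# Two 32-bit words: 32 one-bits followed by 32 zero-bits (made-up data).
raw = bytes.fromhex("ffffffff00000000")
words = (w for (w,) in struct.iter_unpack(">I", raw))  # ">I" = big-endian, assumed
items = list(unpack_items(words))
print(items)
```

The first item is 25 one-bits; the second is the remaining 7 one-bits followed by 18 zero-bits. The last 14 bits don't make a full item and are dropped. Since the buffer never holds more than 56 bits here, memory use is constant regardless of input size.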

The thing is, you succeeded in solving the problem, and are justly proud of your success. This is how a lot of scientists feel. But a lot of CS people look at the wasted work when there are simpler, better, more maintainable ways.

I wasn't appalled by the AWK language but by the blind swapping of two lines that were assumed to be paired reads, when many reads are not paired. It is precisely the lack of checking that is the problem. I have nothing against AWK, although personally I use Python for massaging data.
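To illustrate the kind of check that was missing, here is a rough sketch that confirms two adjacent reads really are mates before treating them as a pair. The naming convention (`X/1` and `X/2` on adjacent records) and the function name `split_pairs` are assumptions for illustration; the actual file format in that case may have differed.

```python
def split_pairs(names):
    """Separate properly paired adjacent reads from unpaired ones.

    Assumes (hypothetically) that mates are adjacent and are named
    'X/1' and 'X/2'. Anything else is reported as unpaired instead of
    being silently treated as half of a pair.
    """
    paired, unpaired = [], []
    i = 0
    while i < len(names):
        first = names[i]
        second = names[i + 1] if i + 1 < len(names) else None
        if (second is not None
                and first.endswith("/1") and second.endswith("/2")
                and first[:-2] == second[:-2]):
            paired.append((first, second))
            i += 2
        else:
            unpaired.append(first)
            i += 1
    return paired, unpaired


# Made-up read names: r2 and r4 have lost their mates.
names = ["r1/1", "r1/2", "r2/1", "r3/1", "r3/2", "r4/2"]
paired, unpaired = split_pairs(names)
print(paired)
print(unpaired)
```

A script that swaps every two lines would have paired `r2/1` with `r3/1` here and shifted everything after it.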
The Duke situation to which you refer was fraud, not just a result of programmer error or poor engineering practice.
It was initially a programming error, but the Duke researchers refused to acknowledge it and reanalyze their data, because that might mean retracting their prominent paper. From there it just snowballed. It was certainly fraud after the error was pointed out to them. There might also be other elements of fraud in their paper. I watched a presentation by the MD Anderson researchers who spotted the error and spent more than two years trying to call attention to it.
The smoking gun was an error, but there were something like 9 Potti papers that ended up getting retracted. There's no way that someone could have accidentally made that many mistakes...
Interestingly, the fraudsters were caught because of a false claim on a CV, and that finally destroyed their credibility.

It is intentional fraud, no doubt about it; they restarted halted clinical trials. I was just pointing out they did sloppy work too.

They were caught due to their bad behavior in the case you listed earlier. But Duke refused to do anything about it until Anil Potti's false claim of a Rhodes Scholarship came to light.
Wow, that's messed up.

Fundamental methodological error -> "Come on, these are competent people, you have to trust that whatever error they made didn't affect the final result."

False claim of accolade -> "How dare you fucking try to pass off this garbage as legitimate science?!?!?"

Welcome to academia
Why is that?
>We are very fortunate in that we write run-once programs that only have to work in one environment using one input.

If you work with that mentality, you're asking for trouble. Well, not so much asking for trouble, but sending Trouble a voicemail that says "We're over here, you lazy bastard, just see if you can mess something up!"

That was a tongue-in-cheek thingy. I write extensive tests for all my code. But when I look for a job, people count papers rather than weigh software quality. It is not easy.
I am a bit clueless here. What is bad about parsing large files with Awk?
Nothing. What is bad is changing a file's content without actually parsing it, because you assume the file format mandates something when it doesn't.