by mattheww 4424 days ago
I am a scientist, and I have seen a lot of terrible code. Most scientists have no formal training in computer science or coding. Many advisors don't place much value in having their grad students take such classes, though even a short language-specific introduction class would vastly improve their students' productivity.

I recently undertook a complete rewrite of our group's analysis software that was written by our previous postdoc. It was ~30k lines of code in 2 files (one header, one source file), with pretty much every bad coding practice you can imagine. It was so complicated that that postdoc was essentially the only one who could make changes and add features.

The rewritten framework is only ~6k lines of code to replicate the exact same functionality. It's easy enough to use that just by following some examples, the grad students have been able to implement studies in a couple days that took weeks in the old framework. The holy grail is for it to be easy enough for the faculty to use, but that will probably take a dedicated tutorial.

My point is that following "best practices" may be overkill, but taking a thoughtful approach to the design of the software can vastly improve your productivity in the long run. Posts like the OP help scientists who write bad code defend poor practices. Any scientist worth his salt should support following good practices because it will always lead to better science.

8 comments

I work in R&D for a large science services company. And, I'm often responsible for turning nifty research projects into marketable products. Because of this, I often take over a lot of code from scientists and academics. And, it's usually (i.e. always) pretty bad.

'Software engineers' get a bad rap for over-engineering code. And, I understand that. But, the opposite is so, so much worse. I see what you're describing every time I take over a project.

The worst characteristic though is lack of version control. Usually these teams will have used email to exchange source files. They usually have directories full of 'version_X' sub-directories of different code. And, usually each member of the team will have different versions of the code.

The second worst characteristic I find is code that doesn't actually work unless it is placed exactly in the right directory of a now non-existent server. They send me code (in a zip file, of course), no instructions, no configuration. And, then I spend several days or even weeks just trying to get it to work the way that they said it worked back at their research 'demo' a year ago. 'It worked last year', they say. And, then imply that I'm some sort of hack because I can't understand what they're doing.
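
For what it's worth, the hard-coded-directory problem is usually cheap to avoid. Here is a minimal sketch in Python, assuming the code only needs one data directory; the environment variable `ANALYSIS_DATA_DIR`, the `data` folder name, and both function names are made up for illustration, not taken from any real project:

```python
import os
from pathlib import Path


def resolve_data_dir(env=None, base_dir=None):
    """Pick the data directory from the environment, else <base_dir>/data.

    ANALYSIS_DATA_DIR and the "data" folder name are hypothetical.
    """
    env = os.environ if env is None else env
    override = env.get("ANALYSIS_DATA_DIR")
    if override:
        return Path(override).expanduser()
    base = Path.cwd() if base_dir is None else Path(base_dir)
    return base / "data"


def load_input(name, data_dir=None):
    """Read one input file, failing with a message that says where it looked."""
    data_dir = resolve_data_dir() if data_dir is None else Path(data_dir)
    path = data_dir / name
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected input file at {path}; "
            "set ANALYSIS_DATA_DIR to point at your data"
        )
    return path.read_text()
```

The point is only that the location is looked up in one place and the error message tells the next person what to set, instead of the path being baked into the code.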

I'm a scientist who does lots of code. Most of my "projects" are 1000 lines or less (usually much less) to do a single function or calculation.

Last year I was pulled into my first larger-scale project (about 8 science coders at multiple institutions over 5 years). We were able to produce reasonable, readable code for each other on a file-by-file basis. But Version Control was the worst, worst part. Files emailed back and forth between subgroups that never made it into the tree, edits lost, we all had our own forked version at the end, essentially.

The most telling part was when I emailed both IT in my department and several professors (PIs) on the project, including those that taught "scientific programming", asking about setting up a source repository, if one of them could host one, and NONE of them had any clue what git, subversion, etc. even were, let alone where/how to set something up.

You could set up a private BitBucket repo and simply give them the link to a .zip download, while you would enter any code you receive into the repo. It might be unfair that you would have to do all the version control, but it's better than nothing...
At one company I worked at, we had EVCS. "Eliott Version Control System." Everyone emailed Eliott change sets and he put them together.
If you can say, what company? It sounds like a pretty interesting role- despite the frustration and difficulty of dealing with such code, turning that into something more generally useful/useable seems like it would be relatively fulfilling in the end.
I wouldn't say the opposite is so much worse, rather it is whichever annoyance you deal with is worse than the one you don't.

Also, there is a difference between over engineered and not engineered. It is truly the "over engineered" that has me annoyed nowadays.

>The second worst characteristic I find is code that doesn't actually work unless it is placed exactly in the right directory of a now non-existent server. They send me code (in a zip file, of course), no instructions, no configuration. And, then I spend several days or even weeks just trying to get it to work the way that they said it worked back at their research 'demo' a year ago. 'It worked last year', they say. And, then imply that I'm some sort of hack because I can't understand what they're doing.

As a graduate student who has had to deal with this kind of code, and finally joined together with another grad-student to fight back and make our software retargetable... I'm so, so sorry.

This is true in finance and in other data-heavy fields as well. I've been shocked at the kinds of Excel sheets that, with a mess of spaghetti VB code written by someone long gone, factor into trades worth millions...sure, it "works"...but besides the very minor question of code elegance, who knows what optimization of returns could be made if the code wasn't such a fright, so that a knowledgeable partner could tweak and experiment with it? Or if it was abstracted enough to be applied to the other kinds of trades that the firm is making (but hell what do I know, I'm not as rich as my hedge fund friends)?

What's particularly annoying is working with analysts who have a system of pasting SQL scripts from a (hand-labeled-versioned) text file to perform the necessary data-munging/pivoting for in-house use...their SQL work is, to be fair, so much of a leap forward from however such bulk data work was being done previously that they take offense when I offer to help them automate the work...as if their system of hand-pasting/executing scripts, then eyeballing the results for an hour to spot-check it, was inherently more reliable than a batch script with well-defined automated test parameters...What they fail to see is that it's not just about faster/better error-checking, but it's about more flexible analysis and output. Once the process has been abstracted, instead of producing one "clean" giant database that is faceted along one dimension (time, perhaps), the script can loop through and spit out a variety of useful permutations, which would be impossible/insanity if you stick with the hand-tweaked process.
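
To make that concrete, here is a rough sketch of what I mean by a batch script with an automated check that loops over dimensions. It uses Python's standard `sqlite3` module; the `trades` table, its columns (`day`, `desk`, `amount`), and the function names are all hypothetical stand-ins, not anyone's real schema:

```python
import sqlite3

# Hypothetical schema: a "trades" table with columns (day, desk, amount).
ALLOWED_DIMENSIONS = {"day", "desk"}


def pivot_totals(conn, dimension):
    """Sum amount grouped by one whitelisted dimension column."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unsupported dimension: {dimension}")
    # Safe to interpolate: dimension was checked against the whitelist above.
    rows = conn.execute(
        f"SELECT {dimension}, SUM(amount) FROM trades "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    ).fetchall()
    return dict(rows)


def check_totals(conn, totals):
    """Automated spot check: grouped sums must add up to the grand total."""
    (grand,) = conn.execute("SELECT SUM(amount) FROM trades").fetchone()
    if abs(sum(totals.values()) - grand) > 1e-9:
        raise AssertionError("grouped totals do not match grand total")


def run_all(conn, dimensions=("day", "desk")):
    """Produce one checked pivot per dimension instead of one hand-run query."""
    results = {}
    for dim in dimensions:
        totals = pivot_totals(conn, dim)
        check_totals(conn, totals)
        results[dim] = totals
    return results
```

Adding another facet is then one more entry in the whitelist rather than another hand-pasted script and another hour of eyeballing.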

That's the problem I see with the OP...A scientist can recognize when something seems to work, when it comes to the domain of programming and structure, but "what works" may simply be "what seems to work better than what I did last time"...which is not a foolproof standard of evaluation.

I'm a programmer and I've worked with scientists (planetary geology). The code is usually pretty bad, but ignoring how "pretty" or maintainable it might be, from the outside it ran way too slow, used too much memory, and botched edge conditions. On the good side, the intentions were pretty clear and the mathematics were sound. So it was pretty easy to fix things up to handle needed data volume and deal with the missed edge cases. As long as I was brought in within a certain window of time it was easy indeed.

The real issue is not best practices, per se, but what passes for them in some rather large circles. Yosefk's "DriverController, ControllerManager, DriverManager, ManagerController, controlDriver ad infinitum" is a fine warning sign. Nothing there is named after anything in the problem domain, and that's a sure sign of trouble. It's a sign that the programmer thinks the problem domain is software engineering or computer science, but that's wrong.

I've always seen becoming intimate with the problem domain as an integral part of programming in the real world. I've succeeded to the extent that I have been occasionally asked to provide help outside of software, by top people. How can anyone do a good job providing software solutions otherwise?

The question is (from direct experience): how long did it take you, and what was the cost to your career in terms of papers you didn't publish, research you didn't do, etc.?

It took me far too long to realize that there's almost no reward for code quality in academia. Code rarely gets re-used. Of the small amount that does, result consistency is a higher priority than maintainability, except for the .0001% of projects that end up being maintained by a large, collaborative team. So if you're the sucker who spends 30% of his time cleaning up the old code, you're at a 30% disadvantage to the people on the team who will quite happily use your work to publish papers, get postdocs/professorships and succeed.

I'm being a little harsh, but not by much. Unless you're tenured faculty, publishing is job one. The same rule applies to startups: code quality doesn't matter until you're successful, and once you're successful, someone else will be maintaining the code. The costs of badness are externalized to those who will voluntarily bear the burden.

I think you've hit the nail on the head. Scientists are not there to create great software. They are there to create great science. For the small amount of software that does end up in a commercial product, it will probably be rewritten anyway, and probably by somebody who wasn't doing the research in the first place.
So it "saves time for research" in the sense that scientists don't check that the code component operates correctly? In that case, why bother with code at all? Just make up plausible output and no one will look any further.
You're making an invalid assumption: "nice code" is neither a necessary nor sufficient condition for "correct code".
Where did I make that assumption? I said that scientists aren't checking that the results of the code are in fact correct.
If you're concluding that from what I said, then you're making the assumption. Bad code can still be well-tested.
Sorry, I wasn't clear: I meant that the reviewers aren't checking that the code is running correctly.

And yes, I'm sure scientists do a bang-up job testing their own code, just like they do a bang-up job validating their own experience, checking their own logic, and criticizing their own experiments.

But the whole point of science is not to trust yourself; to make reproducible what you did. To the extent that you seal off part of the process from this kind of review, you're not doing science, but something else.

I think it depends on whether it needs to be maintained over a period of time or if multiple people need to work on the codebase. If it's just being written for one paper then sure, just get it done as quickly as possible.

However, there's no reason not to follow some best practices. Using a VCS has pretty much no cost other than some initial learning curve, and the productivity benefits can be substantial. So - I think there's a balancing act in terms of optimal speed between writing good code and writing code as fast as possible.

I hate "best practices", precisely because it implies there is one (and only one) "best" way to do something, and it's usually implied that there is only one tool that does things that way. That being said, I can see why "best practices" have come into being.

Like the article author, I too have worked on code created by physicists, mathematicians and yes, even electrical engineers. The article author is lucky; "bad" coding practices I've come across include:

- create a new directory, copy the files you want to change into the directory, then make new changes - that's version control! (nb - no, they didn't name anything to indicate which was the new "version").

- constructors with (I shit you not), 29 arguments, none of them defaulted. Of course, that was because it was converted from Matlab code where the original functions had 30 arguments . . .

- etc, etc, etc
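
As a rough illustration of one alternative to the 29-argument constructor, the arguments can be grouped into a single configuration object with defaults. This is only a sketch in Python (the code I was dealing with was C++ converted from Matlab), and every name and default value below is hypothetical:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimConfig:
    # All field names and defaults here are hypothetical.
    sample_rate_hz: float = 1000.0
    duration_s: float = 1.0
    noise_level: float = 0.0
    seed: int = 0


class Simulation:
    """Takes one config object instead of a long positional argument list."""

    def __init__(self, config=SimConfig()):
        self.config = config

    def n_samples(self):
        return int(self.config.sample_rate_hz * self.config.duration_s)


# Callers override only what they care about, by name.
default_sim = Simulation()
longer_sim = Simulation(replace(SimConfig(), duration_s=2.5))
```

Call sites then name the one thing they change, and nobody has to count commas to find argument 17.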

I'll tell you what; give me your paper, and I'll implement the code from that much better than you ever could. Sure, I've had plenty of experience cleaning up other people's messes ("we've got this standalone RADAR sim written in Matlab; it should be quick and easy for you to convert to C++ and interact with two other sims!"), which is precisely why I don't do it anymore. Or at least, I'll have a look and give you a better estimate than I used to, but I'll be honest and also quote you a much shorter time to re-write it from scratch.

> "taking a thoughtful approach to the design of the software can vastly improve your productivity in the long run"

I think, taking a "thoughtful approach" is the key to a lot of different practices. "Best practice" as used by most people, in many different crafts and arts, is a method to avoid thinking on what it is you are trying to do.

The most effective kinds of "best practice" are the ones you mastered by making a lot of mistakes, not something you pulled out from a book or a class. It is naive to think you can substitute standards for personal mastery.

I've waded through a lot of legacy and current scientific code (and still do that sometimes).

The worst part (not taking into account the coding style per se) for me was the (sometimes) inability to reuse the code I've encountered or adapt it to other cases.

I think scientific advisors should make a point which goes something like "If you're serious about your work, you might find one day that someone else wants to use parts of your code, so take that into account when planning your program". In my experience, a lot of programs are written as quick-hack solutions, and then there is no time to rewrite them, they grow bigger and it just snowballs from there.

The way CS was taught to us (and we're a big university) was pretty bad. No coding style, no experience with CVS, nothing concerning planning before writing new code. In the end, a lot of people got the bare minimum amount of knowledge needed to code, and started doing research using that knowledge.

I agree of course, I just think a scientist taking a more thoughtful approach > a scientist taking a sloppy approach > a "software engineer" taking an overly thoughtful approach. Because the latter could have written ~200K LOC spread in 5 directories and you'd need a debugger to tell which piece of code calls which.
I think you're comparing apples to oranges, both here and repeatedly in your original article.

For one thing, you describe many "sins" that "software engineers" commit, but in reality code that was flawed in most of those ways would not even have passed review and made it into the VCS at a lot of software shops, nor would any serious undergrad CS or SE course advocate using those practices as indiscriminately as you seem to be suggesting.

For another thing, how many "scientists taking a sloppy approach" do you actually know who can successfully build the equivalent of a ~200K LOC project at all, even if those 200K lines were over-engineered, over-abstract code that could have been done in 50K or 100K lines by better developers? It's one thing to say a scientist writing a one-page script to run some data through an analysis library and chart the output can get by without much programming skill, but something else to suggest that the guy building the analysis library itself could.

It's not that a single scientist writes it, but rather that someone publishes a paper on something, with ugly code used to prove it, and then becomes a professor. Subsequent generations of graduate students are tasked with extending / improving this existing codebase until it is basically Cthulhu in C form. ;)

I recall reading a propulsion simulation's code developed in this way. "Written" in C++, initially by automated translation of the original Fortran code. Successive generations of graduate students had grafted on bits of stuff, but the core was basically translated Fortran, with a generous helping of cut-and-paste rather than methods for many things. (I don't mean this as an insult to Fortran: I've tremendous respect for its capabilities, and have read well-written code in that as well.)

The net result was that fixing bugs in the system was very challenging, as it was a very brittle black box. It was not Daily-WTF-worthy, but still very frightening. I'm very grateful I was not the one maintaining it. ;)

You must not have been in science or you'd have encountered the 200K LOC program, written in five programming languages (two of them obscure), which can only be compiled on the author's computer. Oh, and add 50K of C code from ancient versions of other projects (which could've been used as libraries) for undocumented reasons.

Though, I have also had colleagues who were also brilliant programmers.

This describes almost every published application I have ever tried to get running. It ends up being impossible to get the application working on anything other than the author's workstation.
I would alter your list to say that a competent software engineer working together with a scientist > a scientist taking a thoughtful approach > a sloppy scientist > someone who is neither a competent software engineer nor a thoughtful scientist.

From the article and your comment above, it sounds to me like you have had to work with a terrible programmer who ranted about best practices to cover for his incompetence. We've all worked with someone like that, even in software shops. Don't tar us all with that brush.

I think it's a pretty shoddy software engineer who writes more LOC than the scientist. Good code is concise, readable without comments, etc. A bad software engineer writing bad code is no different than a bad scientist reasoning that the sun is cold because the temperature in January is below freezing.
What's really interesting here is comparing the two lists of problems the author gives.

On one hand, the problems are either product defects (crashes, missing files, etc.) or maintainability defects (globals, bad names, obscure clever libraries, etc.).

On the other hand, the problems the author mentions are basically things anathema to snowflake programmers (files spread all over, deep hierarchies, "grep-defeating techniques", etc.)

The academic's code scales vertically, because you can always (hah!) find some really bright researcher who is smart enough to grok the code and spend all the time in valgrind and whatnot to make it work. However, God help you if you can't find (or, more appropriately given the current academic culture, force) somebody to waste many hours of their lives fixing mudball code.

The other extreme scales horizontally, right? You have these many files, and deep hierarchies, and dynamic loading, but that's how a lot of people are used to doing it and that's what the tooling is designed to support. The big accomplishment of Java and C# isn't that it lets you get a 100x return from a 50x programmer, but that it lets you scale to having 50-100 programmers in a semi-reasonable way on a project.

In an ideal world, you have a small number of academics and engineers that communicate tightly and write good, compact, and clean code; in the real world, you want to pick tools that help you deal with the fact that it is hard to scale vertically.

EDIT:

At second read-through, I think the author just needs to use better tools. A good IDE makes code discovery much easier than mere grep, and helps solve a lot of other problems.

I do not understand the insistence of academics on using unfriendly tools.

> I do not understand the insistence of academics on using unfriendly tools.

My step father teaches doctorate business students. Until VERY recently he was running Corel Wordperfect simply because it was the first word processor he had installed. Never underestimate the potential stubbornness of smart people :)