by hharrison 4598 days ago
This is so true. I'm in a Ph.D. program and everyone around me is wasting so much time by reinventing the wheel every time they need to code something. So I spend my time making libraries to help them out, but then I get scolded because that's time that's not going directly toward getting publications. And few people use my code because they don't trust software as up to the scientific standard unless (a) they spent thousands of dollars on it, a la MATLAB, or (b) they wrote it themselves and, e.g., take a mean by manually iterating over an array, "just to make sure" the mean is calculated correctly. Ugh. It doesn't matter how many tests I can point them to. I can't wait to get out of here and work somewhere where coding is appreciated, where I can actually get paid, and where I have some choice as to which state I live in.
6 comments

I hated the 'publish at all costs' attitude I felt while pursuing my PhD and within a post-doc project. IMHO that leads to the huge amount of trash articles and conferences that is now plaguing academia.
Plus it tends to reward established networks of "friends" who assign each other as coauthors on papers rather than individuals doing the hard part of the work.
I am assuming that you are doing computer science, and in the current environment, focusing on the conceptual contribution and doing the minimal amount of engineering is solid advice.

I started in physics, and there someone could make a great career corroborating or disproving conceptual contributions. This is not a track in CS and is practically career suicide.

From experience most CS research can not be trusted to be correct, and enabling people to build a career on replicating or corroborating studies would in my opinion be of great value. Even the research that is correct is often not fully implemented so you not only have to implement their approach, you also have to discover how to realize it. That work is not publishable in CS, and it is a non-trivial amount of extremely risky work.

Nope, Psychology with a focus on complex systems, statistical physics, dynamical systems, that sort of thing. Everything from time series analyses that require hundreds of thousands of data points to plain old factorial ANOVAs.

Psychology is probably one of the worst sciences for the attitude described in the article. Being in the most "mathy" corner of the field doesn't really help.

Oh, man. Don't tell them about Kahan summation--they'll freak out and go rewrite everything.
I think I'm one of the "them"--now that I know about it, it seems like a pretty important thing to know. I can't help but wonder, how many other gems like this are out there that the "them" don't know about?
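For readers who have not met Kahan summation, here is a minimal Python sketch of the idea: carry a small correction term that recovers the low-order bits lost in each floating-point addition. The function names `kahan_sum` and `kahan_mean` and the test data are illustrative assumptions, not code from anyone in this thread.

```python
import math

def kahan_sum(values):
    """Sum floats while tracking the rounding error of each addition."""
    total = 0.0
    compensation = 0.0  # rounding error carried over from earlier additions
    for x in values:
        y = x - compensation            # apply the correction from the last step
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # what was lost (zero in exact arithmetic)
        total = t
    return total

def kahan_mean(values):
    """Mean built on the compensated sum (hypothetical helper)."""
    values = list(values)
    return kahan_sum(values) / len(values)

# Illustrative data: one million copies of 0.1.
data = [0.1] * 1_000_000

# Plain left-to-right accumulation, written as an explicit loop.
naive = 0.0
for x in data:
    naive += x

reference = math.fsum(data)  # standard-library accurately rounded sum
print(abs(naive - reference), abs(kahan_sum(data) - reference))
```

In practice `math.fsum` already provides an accurately rounded sum in the standard library, which is rather the point of the parent comments: the well-tested version usually exists already.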
Just think, when you get paid your taxes will fund those people re-inventing wheels.
Hey, I have no problems with my taxes going toward science. The way it's done is far from perfect but the answer isn't to take away funding.
Far from perfect, okay. Blatant misuse of funds, not okay.
It's not really a blatant misuse of funds, though. My roommate is an intensely bright dude finishing up his math PhD studying interactions between complex systems. He writes all his code in C, and he recompiles it every time he wants to change a variable (e.g. the input file, or the number of iterations).

He's been doing it this way for years because that's what he was taught. That's the level of software engineering acumen you'll get in academia. But it "works". I've offered to help him modify the code so it will accept command line arguments, and we're going to sit down and do that so he can run several instances in parallel and utilize all of those fancypants cores on the computer I loaned him, but... he didn't know you could do that. No one told him! How would he know where to start looking that up? How reasonable is it to expect him to grok all that, when he's deep in math-land?
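As a rough illustration of what "accept command line arguments" buys you, here is a sketch in Python (the roommate's code is C, where the equivalent would read `argc`/`argv` in `main`). The argument names, the default of 1000 iterations, and the `run` placeholder are all assumptions made for the example.

```python
import argparse

def parse_args(argv=None):
    """Read the input file and iteration count at run time instead of hard-coding them."""
    parser = argparse.ArgumentParser(description="Run one simulation instance.")
    parser.add_argument("input_file", help="path to the input data file")
    parser.add_argument("--iterations", type=int, default=1000,
                        help="number of iterations to run (assumed default)")
    # argv=None makes argparse read the real command line (sys.argv[1:]).
    return parser.parse_args(argv)

def run(input_file, iterations):
    """Placeholder for the actual computation (hypothetical)."""
    return f"would run {iterations} iterations on {input_file}"

# Passing an explicit list here stands in for a real command line.
args = parse_args(["data.txt", "--iterations", "50"])
print(run(args.input_file, args.iterations))
```

Once the parameters come from the command line, the same program can be started several times with different arguments, one process per core, with no recompiling in between.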

So it was blatant to me, software developer of four years, that something was pretty wrong, but for him: he's about to finish his PhD. He's been published a couple of times. They're not running horribly inept software development, they're running mathematics the best way they know how.

Yeah, libraries aren't a good way to start because there's not enough interest in using them.

There are opportunities to build standalone tools which blow away their predecessors by multiple orders of magnitude, though; after getting enough researchers to use one such tool, you might attract sustained curiosity from a few people wondering "how the hell did s/he do that?!" and organically grow a small library with a real user base. That's one of my own long term goals, anyway.

Well I'm self-taught so I have to start somewhere. I'm not sure I could put together a stand-alone tool and still complete my Ph.D. program. Anyways I've found most stand-alone tools just aren't flexible enough and I don't feel like making something I wouldn't use myself.
Fair enough, and definitely agree with not making something you wouldn't use. (The "most [existing] stand-alone tools aren't flexible enough" problem is, however, one of the reasons why there's so much room to do better...)
True that! Okay, you've convinced me to make it a long-term goal.
How would one take a mean of n elements without visiting all n elements? Won't the memory bandwidth and big-O complexity always be the same? Genuinely curious.
The language used in MATLAB and Octave is designed for vector processing to an extent most developers haven't seen before. MATLAB doesn't mean "Math Laboratory", it means "Matrix Laboratory." Operations on row and column vectors are first-class language elements. You almost never have to manually iterate over an array to compute its statistics -- you'd just say M = mean(A [,dim]) where A is a standalone vector or a column vector of a matrix. In that example, M itself is a vector, if A was a matrix.

MATLAB syntax is ugly but the underlying principles are pretty cool. Well-written code scales automatically on newer hardware, or at least it has the potential to. That's not true in languages where higher-order vectors are built from discrete scalars.

The good stuff of Matlab must be balanced by its perverse, pathological and obscene qualities.

The most vile aspect of Matlab is the faith every researcher has that producing something in Matlab is enough, when the reality is that code coming from Matlab will never escape; it will never be as useful as napkin-style pseudocode for the creation of any larger system.

In MATLAB, R, or numpy, it's the difference between `mean(n)` and manually looping. It's not an issue of algorithmic efficiency; it's an issue of lost productivity. They don't even write a function to reuse (all they understand is scripting), so they recode the loop every single time they have to sum or take the mean of something.
Well it is because NumPy and friends do all the heavy-lifting in hand-tuned C. dis() your Python function for taking a mean and see the difference, it's huge.
The point is not computational time; the point is that one could simply call an existing library function rather than hand-coding the loop oneself and risking making an error (a fencepost error, for example).
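To make the fencepost risk concrete, here is a small Python sketch that puts a hand-written loop containing a deliberate off-by-one bug next to the standard library's `statistics.mean`. The `buggy_mean` function and the sample data are invented for illustration.

```python
import statistics

def buggy_mean(xs):
    """Hand-rolled mean with a deliberate fencepost error."""
    total = 0.0
    for i in range(1, len(xs)):  # bug: starts at index 1, so xs[0] is skipped
        total += xs[i]
    return total / len(xs)

samples = [2.0, 4.0, 6.0, 8.0]

print(buggy_mean(samples))       # 4.5, silently wrong
print(statistics.mean(samples))  # 5.0
```

The buggy version runs without complaint and returns a plausible-looking number, which is what makes recoding the loop every time riskier than calling a tested function.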
I can understand that you'd want to manually check what's happening. For example, when taking the mean over the rows of a 2D array using numpy's mean function, you might not be sure whether axis=0 or axis=1 refers to the rows.

But you'd only have to figure it out once and then learn to trust numpy, instead of rolling your own version every time.
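A one-time check of that kind can be very short. The sketch below uses a tiny array whose row and column means are easy to work out by hand; the array values and variable names are made up for the example.

```python
import numpy as np

# Two rows, three columns, chosen so the answers are obvious by inspection.
a = np.array([[1, 2, 3],
              [10, 20, 30]])

col_means = a.mean(axis=0)  # collapses axis 0 (the rows): one mean per column
row_means = a.mean(axis=1)  # collapses axis 1 (the columns): one mean per row

print(col_means)  # three values, one per column
print(row_means)  # two values, one per row
```

The result shapes settle the question: `axis` names the dimension that gets averaged away, so `axis=1` gives one mean for each row.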

You missed these key words: "manually iterating"

So looping in a high-level language rather than using vectorized functions.

It's probably more in reference to the layer the work is completed in. I haven't used MATLAB in years, but you can probably sum an array by iterating, or you can call a faster, more efficient library function. You get much greater gains when doing this in higher dimensions. If you can do your operations at a matrix level, you get an order-of-magnitude improvement in speed in most languages.
I think the concern is over the manual component of it, especially if that set of n is big by human standards. (Say, doublechecking a few hundred entries of some column entry by calculator.)