by Thriptic 2966 days ago
Software development infrastructure in a box plus strong training materials / guidance for scientists. Increasing numbers of scientists are writing code without any formal CS training, and the outputs are predictably awful and unreliable. It is very common to find no testing, no acceptance criteria, no version control, no formal planning, no code review, no style guide being employed, sparse commenting, fragmented development environments / dependency hell etc. People frequently know that what they are doing is suboptimal, but it is hard to convince them that they should put in the work to use industry best practices for a variety of reasons.

If someone could create a product (probably infrastructure plus a Python IDE) which made doing things the "right way" easy for these users, and which would provide case studies or tutorials to show them WHY doing things correctly is beneficial using analogies to good lab behavior, it would be hugely valuable.

6 comments

I'll give an example: A friend is a PhD candidate in a neuro lab. He uses MATLAB almost exclusively, and his undergrad was in neuro as well. He has never taken Calc, let alone any matrix algebra. Ask him what det() means, and he's a deer in headlights. That said, he's written about 10k lines of MATLAB code, with for() and while() loops nested 15 levels deep. He knows he needs help: he paid me to help him out for 16 hours the other month. Literally, it was impossible. I could barely wrap my head around his problem, let alone the code. Remember the saying 'Don't do data science in a GUI'? Well, his issues were a good argument for that. Getting what he was doing into a Python IDE would not be possible; it's just too complicated.
I once worked for a company where the accountants had written a massive and convoluted VBA application that had gradually become mission critical to the company.

The IT department allocated a number of people (half a dozen?) for some time (8-12 months?) trying to turn it into a maintainable software product.

They failed.

I suggest you look into the Software Carpentry foundation - they are making a great effort to improve scientific programming through short workshops: https://software-carpentry.org/
Ironic that scientists are not conducting any testing.
I routinely bring this up actually. Testing is literally equivalent to running experimental controls -_-
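To make the analogy concrete, here is a rough sketch of what a positive and a negative control might look like as tests. The normalize() function and both test names are made up for illustration, not taken from anyone's actual lab code:

```python
import statistics


def normalize(values):
    """Scale values to zero mean and unit standard deviation (hypothetical example)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # Refuse constant data rather than silently dividing by zero.
        raise ValueError("cannot normalize constant data")
    return [(v - mean) / sd for v in values]


def test_positive_control():
    # Positive control: a known input whose correct answer we can state up front,
    # like running a standard sample through the instrument.
    result = normalize([1.0, 2.0, 3.0])
    assert abs(statistics.fmean(result)) < 1e-12
    assert abs(statistics.pstdev(result) - 1.0) < 1e-12


def test_negative_control():
    # Negative control: an input that carries no signal must be rejected,
    # not turned into a plausible-looking result.
    try:
        normalize([5.0, 5.0, 5.0])
    except ValueError:
        return
    raise AssertionError("constant input should have raised ValueError")
```

A test runner such as pytest would pick up the two test_ functions if this were saved as a file; calling them by hand works as well.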
I'm doing this - and have already made good progress. Contact me at wtpayne-at-gmail.com if you want to talk about it more.
i can't see a business model - the scientist is already paying with their time to get something working that is just good enough for publication. all the problems you describe are down the road and scientists kick that to 'industry'. unless your 'right way' is also easier to boot, why would they add to their time?
I think that is indeed the perception, but that the time savings would actually be quite pronounced if people did things correctly. I've seen many projects get immediately bogged down by bugs / feature creep / lack of planning and end up taking far longer than if people had done things correctly. Also, many labs hand off code bases when post docs or students leave, creating chaos for the next person that is tasked with working on them.

As an example, I wrote a proof of concept script to show that we could automate some basic image analysis in my lab three years ago. That was immediately grabbed by an investigator and put into production without any further thought. Because it was a proof of concept script, it was of course very buggy and required substantial feature addition. This was added without any thought for design etc. Fast forward to today and this code base is a sprawling shit show which is being rewritten for the THIRD TIME. Each time has ended in failure because people failed to observe basic best practice, and this attempt will likely fail too. That is an ENORMOUS waste of investigator time. Another project I can think of involved a model which had a 10,000 line function. No one could trust what was being outputted by the thing, so they eventually abandoned it. That's hundreds of investigator hours down the drain.

I agree 100%. This is something I've taken to heart after seeing and trudging through academic code over the last few years.

In a way I also think this is a language problem. I hope that for some data-intensive projects, productive statically typed languages (e.g. Swift + TensorFlow + Python interop) can help fix this.