by bscphil 1327 days ago
> acknowledges the shitshow that is the python library/package/environment management

I'm puzzled by this and wonder if you can provide some examples. The scientists I know tend to have incredibly disorganized R code, with a bunch of hard-coded paths and a single global environment in their home directory that all their R packages get installed to. Even stuff that seems critically important like reproducible science can be much harder than you'd expect in a lot of fields because questions like "what version of the libraries did you use" have to be answered (if they can be answered at all) by looking at the references in the paper.

Whereas in Python, I don't know how things could be any simpler. Creating an individualized environment for your project is one command. Installing packages that only live inside that environment is one `pip install` away. Most scientific work is not "distributed" in the sense of having users, but if you do ship a product to users, Python gives you the option of either relying on distribution provided packages (my preferred approach most of the time) or shipping a single binary created with something like PyInstaller.
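
To make the "one command" claim concrete, here is a minimal sketch using the standard-library `venv` module, which is what `python -m venv .venv` runs under the hood. The `.venv` directory name and the `create_project_env` helper are illustrative choices, not anything prescribed by Python.

```python
import venv
from pathlib import Path


def create_project_env(project_dir: Path, with_pip: bool = False) -> Path:
    """Create an isolated environment in <project_dir>/.venv and return its path.

    Hypothetical helper for illustration. Pass with_pip=True to bootstrap pip
    inside the environment, which is what `python -m venv .venv` does by default.
    """
    env_dir = project_dir / ".venv"
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


if __name__ == "__main__":
    env = create_project_env(Path.cwd(), with_pip=True)
    print(f"Created environment at {env}")
```

After activating that environment, `pip install <package>` installs into it rather than into the system or home-directory site-packages.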

6 comments

I've also seen my fair share of garbage R code and I think Gordon Shotwell's comment that "There really are no production languages – only production engineers" speaks to this.[0] A big problem in the scientific community is that scientists aren't trained to write code like production engineers. I don't see it necessarily as being an issue that is endemic to R, though.

Packrat[1] — an RStudio package — can be used to easily avoid the library versioning issues you describe. The problem isn't that the tooling isn't there or that it isn't easy to use. It's that some folks simply don't use it and are perhaps oblivious as to /why/ they should even use it, anyway.

[0] https://shotwell.ca/posts/2019-12-30-why-i-use-r/ [1] https://rstudio.github.io/packrat/

Maybe, but I wonder if it is especially easy to produce horrific code in R. For example, I remember trying to refactor an R codebase that made ample use of `load`, leading to all these mysterious variables appearing from nowhere.
packrat is a little old. You want renv, which is the newer iteration of the same idea and which I've found works very simply and nicely.
Also R really didn't play well with conda for a while. It seems to be ironed out in recent years, but I remember the issues of previous years where trying to set up a reproducible R environment in conda was an unreliable endeavour.
I think that has more to do with the fact that most scientists are not trained programmers! Plus a lot of data analysis work doesn't lend itself to the same style of programming IMO.

While there could be more effort in getting things like library versions out there, a lot of journals don't care, so there's no pressure on scientists to provide them.
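
For what it's worth, recording library versions does not take much. Below is a rough sketch on the Python side using the standard-library `importlib.metadata`; the `installed_versions` helper name is made up for illustration, and the `name==version` output is only similar in spirit to what `pip freeze` prints.

```python
from importlib import metadata


def installed_versions() -> list[str]:
    """Return sorted 'name==version' pins for every installed distribution.

    Hypothetical helper for illustration; it only reads metadata that is
    already present in the current environment.
    """
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Print the pins so they can be saved alongside the analysis.
    print("\n".join(installed_versions()))
```

Saving that output next to the analysis code would at least answer the "what version did you use" question later.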

Most of the replies to your query have addressed the big issue, which is that "data scientists" are almost universally mediocre (at best) coders. It's endemic to that job position and is guaranteed to be language-invariant.

One factor that isn't helping generations younger than mine (mid-50's) is the continual evolution of tools that remove the user from all the underlying parts. I recently worked with someone who told me they "only know Databricks on Azure" and "don't know python." Their self-assessment was accurate, and the utility of that individual was essentially zero.

The problem with python is that people like myself - non-engineers, and mostly end users of software - spend an inordinate amount of time dealing with mismatched library dependencies, deprecated features, rolling-back python versions to get a working kernel and so on.

The fact that the business model of at least two companies (Enthought and Anaconda) is predicated on the difficulty of getting a functioning python environment to work in this day and age speaks volumes about the problem.

If we can't get past "which pip?," how can we expect the other stuff to "just work?"
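
As a rough illustration of the "which pip?" confusion, the sketch below reports which interpreter is running and which `pip` the shell would find on `PATH`; the two do not have to belong to the same installation. The `describe_python_environment` helper is a hypothetical name used only for this example.

```python
import shutil
import sys
import sysconfig


def describe_python_environment() -> dict:
    """Collect the facts that usually explain a 'which pip?' mismatch."""
    return {
        # The interpreter actually executing this script.
        "interpreter": sys.executable,
        # The first `pip` found on PATH, or None if there is none.
        "pip_on_path": shutil.which("pip"),
        # True when running inside a venv-style virtual environment.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # Where pure-Python packages are installed for this interpreter.
        "site_packages": sysconfig.get_paths()["purelib"],
    }


if __name__ == "__main__":
    for key, value in describe_python_environment().items():
        print(f"{key}: {value}")
```

Running `python -m pip install ...` instead of a bare `pip install ...` sidesteps the mismatch, since it uses the pip belonging to that exact interpreter.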

If you’re just an end user and a non-engineer, how can you (1) universally judge the level of programming of data scientists to be (2) mediocre?
The SCIENTISTS have very disorganized R code and also, in my experience, even worse Python code, where they learn inheritance and make some absolutely head-scratching choices. It makes me weep from time to time.

Here's the thing: programming is a skill. If people think it's the "not important thing" and that only the result matters (I've seen this often in some of my previous positions), you're going to get disasters, yeah.

As for package management in R you can use either Renv or conda. Been coding R for a decade and have always pinned down packages and you could do so well before tooling made it simple as pie.

> As for package management in R you can use either Renv or conda. Been coding R for a decade and have always pinned down packages and you could do so well before tooling made it simple as pie.

Right, I get that - but OP was claiming that package management in Python was a "shitshow". It's interesting that a lot of people are responding to my comment by saying "actually you can make package management in R just as easy as in Python, it's just that R programmers tend not to be professionals." Doesn't that just confirm my belief that Python's package management story is actually pretty good?

While I tend to agree with most of the arguments that DS code is usually of low quality, and that DS are not well-trained in good development practices, I wonder if making them better coders is an attainable goal, or even a proper one. My reasons for that questioning:

- Data science requires a significant stack of knowledge beyond coding - in fact, to be a useful DS in a company, you already have to learn about maths, business domains, keep up with the latest algorithms, know how to manipulate data, present, run experiments, analyse them, know deep statistics and some others I am probably forgetting. Adding the SW dev skills on top of that and expecting them to become good developers is a tall order, and only a small percentage of the DS community will achieve it. With the level of demand for ML, I don’t know if this will deliver on the market needs - it’s not that it’s not attainable, I think it’s not scalable;

- People coming from a SW dev background tend to think DS is the same, just done by people who don’t code well. That is not true: code is the final product of software development, while it is but a tool for reaching the goal of finding a good ML approach for a DS. The consequence here is that SW dev has a much stronger reason for wanting good quality, maintainable code than DS does. When researching for a solution, many iterations of code written by DS will be discarded without ever having to go to production, and I don’t know if the overhead of keeping good tests, structuring the code, making small commits, etc., is justifiable in this scenario - the goal is not to have maintainable code, it is to see if the model+features has potential for solving the problem;

- Evolution and maintenance are also a problem, because the structure that’s good for operations doesn’t help the job of research - it’s not common for a DS to work in a pipeline structure (which seems to be the emerging pattern for MLOps), and forcing them to use that structure on all iterations after the first will have significant productivity issues, to the point of putting success at risk.

I don’t have a solution for the points above, and I understand that, once a promising approach has been found, the code starts to matter much more, because Ops will require it to be automated and executed in a reliable way. For now, what I do is to do the research in a very loose way, not caring about good SW practices. When I find something good, I start refactoring the code to meet the Ops expectations. But I’m a CS major with decades of experience in coding and ML - it’s not reasonable to expect the entire DS community to develop the same skills, it takes too long.

Any ideas out there?

I have had similar experience. For me the most annoying was the work they put in making it difficult (nearly impossible) to use with conda environments.