by trevz 3045 days ago
A couple of thoughts, off the top of my head:

Programming languages:

  - python (for general purpose programming)
  - R (for statistics)
  - bash (for cleaning up files)
  - SQL (for querying databases)
Tools:

  - Pandas (for Python)
  - RStudio (for R)
  - Postgres (for SQL)
  - Excel (the format your customers will want ;-) )
Libraries:

  - SciPy (ecosystem for scientific computing)
  - NLTK (for natural language)
  - D3.js (for rendering results online)
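To make the "SQL (for querying databases)" item concrete, here is a rough sketch of an aggregate query driven from Python. It uses the standard library's sqlite3 as a stand-in for Postgres, and the `usage` table, its columns, and the `top_languages` helper are all made up for illustration:

```python
import sqlite3


def top_languages(rows, limit=2):
    """Return the `limit` languages with the most total hours.

    `rows` is a list of (language, hours) tuples; the table and column
    names are hypothetical, and SQLite stands in for Postgres here.
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE usage (language TEXT, hours REAL)")
    conn.executemany("INSERT INTO usage VALUES (?, ?)", rows)
    query = """
        SELECT language, SUM(hours) AS total
        FROM usage
        GROUP BY language
        ORDER BY total DESC
        LIMIT ?
    """
    result = conn.execute(query, (limit,)).fetchall()
    conn.close()
    return result


rows = [("python", 5.0), ("R", 3.0), ("python", 2.0), ("bash", 1.0)]
print(top_languages(rows))  # [('python', 7.0), ('R', 3.0)]
```

The same GROUP BY / ORDER BY / LIMIT query should carry over to Postgres largely unchanged, although the driver and its placeholder syntax would differ.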

I make the claim that you can go very far in the SciPy ecosystem without ever touching R.

It is worth understanding the concepts behind NumPy and pandas. Furthermore, try out IPython/Jupyter, especially for rapid publishing (people run their blogs on Jupyter notebooks).

I think which libraries you need depends very much on where you focus. Machine learning? Natural language processing? Visualization? Something in economics? Fundamental sciences? For instance, I never need NLTK in theoretical astrophysics ;-) Instead, I need powerful GPU-based visualization, which is, however, very old school: VTK and VisIt/Amira/ParaView (also very much Pythonic).

I disagree, even though Python is the language I do most of my development in. But it probably depends on the problems we imagine a data scientist solving.

If you're doing a lot of work with matrices or model fitting in production, then Python seems fine. However, a lot of data scientists I see are more like scrappy data analysis / visualization types who are churning out small dashboards. In that case R's tidyverse and Shiny are just incredibly fast to develop with.

I second that R is nice to have, but not needed. I’ve been doing science in Python for a decade without ever needing R.

For powerful GPU viz, have you considered vispy? Four authors of four independent Python science visualization libs got together to build it.

Agreed. I would drop R; Python has you mostly covered now. Julia is also worth learning.
I wouldn't recommend dropping R at all.

Very few enterprise data science teams are 100% Python (in fact, none that I've heard of). R is still very heavily used (in fact, on every data science team I've worked in, it has been the dominant technology).

There is a reason Microsoft purchased Revolution Analytics.

R, Python, and Julia are all Turing-complete languages, so of course you can drop any two and get by with just the third.

The real selection happens when you consider what's available in the open-source world. What code do you not have to write? Which high-quality libraries are available, and which ones would you have to write yourself?

On this topic, R has a vast advantage over Python in some domains, such as bioinformatics, while Python definitely shines when it comes to deep learning (and using for loops).

You can't just claim that one shouldn't look at R because you personally know one language better than the other, quite likely because R isn't used as much in your domain.

I do prefer the deep learning, NLP, and production serving story in Python, but you will have to pry dplyr+ggplot from my cold dead hands for quick analysis and charting. Not to mention that pandas's API is a clusterfuck compared to R's native data frames.

Maybe spaCy for NLP. Way more intuitive and fast too. Good list.
Most of these are conveniently packaged in:

$ docker run -it --rm -p 8888:8888 jupyter/datascience-notebook

I'd gently suggest basic CLI Perl over bash for cleaning up files, as it combines grep/sed/awk in a language that's more generally useful.
Agreed. Perl was designed for text munging, and is superior to pretty much everything for this task.
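For anyone who hasn't seen what "grep/sed/awk in one language" looks like in practice, here is a rough sketch of the idea. It is in Python rather than Perl, only to stay in the language most of this thread is about, and the `clean_lines` helper and the sample data are invented for illustration:

```python
import re


def clean_lines(lines):
    """Filter, rewrite, and split lines in one pass (grep + sed + awk in spirit)."""
    cleaned = []
    for line in lines:
        # grep-style: skip blank lines and comment lines
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        # sed-style: collapse each run of whitespace into a single space
        line = re.sub(r"\s+", " ", line.strip())
        # awk-style: split into fields and keep the first two
        fields = line.split(" ")
        cleaned.append(fields[:2])
    return cleaned


sample = ["# header", "", "alice   42  extra", "  bob\t7"]
print(clean_lines(sample))  # [['alice', '42'], ['bob', '7']]
```

The point is not the specific rules but that filtering, substitution, and field splitting live in one script instead of a pipeline of three tools; a Perl one-liner would be considerably terser.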

WRT bash, where to begin? Over the past 40 years, a better tool has come along for pretty much everything someone tries to do with bash. It lives on pretty much through inertia and pride.

The FreeBSD sh(1) (not bash(1)) man page. That's how I came to understand the shell. Nowadays I'm running Debian and my $SHELL is /bin/bash, but when I was on FreeBSD I really learnt tools like make(1) and sh(1); the man pages were pieces of art. Having read sh(1), I have a nice grasp of how the shell works in general, and I can add the higher-level goodies bash has to offer on top of that knowledge at any time (though I generally prefer keeping it POSIX, and using an actual programming language when that doesn't cut it).
Good list. I would add the tidyverse from the R ecosystem to it.
I would go as far as saying the tidyverse is an essential piece of working with R. Base R sans tidyverse is not a pleasant experience.
It's not that bad. It's inconsistent and clunky, but all of the tools are there (and tend to be faster than the tidyverse versions). Don't get me wrong, I love the tidyverse but R is very, very usable without it.