Hacker News
Ask HN: Why is Python so popular for ML/DS?
20 points by arialeks 3072 days ago
I'd like to start learning ML and it seems like Python (and R) is the default choice. But why? Why aren't languages like C#/Go/Java relevant in that area, or at least a compelling option?
12 comments

Java, R, Python, C and C++ are all massive in machine learning and data science. Python is at the top because it has the largest set of machine learning and data analysis tools and tutorials, easy setup, the best usability, and little to no boilerplate needed to get results.
Python was big in scientific computing before ML went mainstream. In college we were taught using MATLAB, Mathematica and SciPy, so there were already a lot of stats and science libraries available in Python. Academics typically don’t care about software engineering or typing, so Python is an easy, approachable choice.
Add IPython (now Jupyter) notebooks to this for easy-to-publish analyses.
Prototyping is what you usually do in ML/DS. You code incrementally and break things really fast.

The interactive modes of Python (IPython/Jupyter) and R are really indispensable for this. Without restarting or re-initializing every variable and module from the beginning, you can run a code snippet, edit part of the code, run, add some code, plot something, run, repeat...

This, plus a gazillion optimized math functions and algorithms, is available right out of the box on all platforms.
Anecdotally, I have programmed in Python, JS, Java, C#, Go, C, C++, Ruby, and PHP, all in a professional environment, and I'd have to say I prefer Python the most. It is a very logically designed language with a nice balance of abstraction and expressiveness. My general path for choosing a language is: can I do it in Python? If so, use Python. Obviously you need to drop down to lower-level languages for certain cases, but why deal with memory management in application code if you can help it? I guess my point is that if you're going to be learning ML, you should be thrilled that you have the option of using Python. That said, if other languages float your boat, I've certainly done some ML in C++ with OpenCV and it was a positive experience. Use whatever you want.
Python also lets you write the smaller, more computationally intensive parts of the code in C for speed and call that C code from Python. This allows you to dip into low-level languages like you stated but still keep all the main functionality in Python.
It's just where a lot of the tools are. And the reason all the tools are in the Python camp is probably that the academic/scientific community adopted the Python ecosystem at a high rate.
And many in scientific industries came from tools like MATLAB, backed by heavy enterprise support contracts. Python was the first real foray into open source (FOSS) for many of them.

Some adopted R as well but it just didn't take off the same way even though it really was/is a better fit for some use cases.

If the question is historical, it is because Python is easy to learn and easy to read, it interfaces easily with C/C++ so existing ML/DS code can be reused, and it was one of the first scripting languages to handle huge numbers, which is important in science.
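
To illustrate the "huge numbers" point, here is a minimal sketch, assuming it refers to Python's built-in arbitrary-precision integers (the specific values are just illustrative):

```python
import math

# Python ints grow as needed, so there is no overflow at 64 bits.
big = 2 ** 100
fact = math.factorial(30)

print(big)
print(fact)
print(big.bit_length())  # number of bits needed to represent the value
```

No special "big integer" type or library is needed; the ordinary int handles it.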

If the question is about now, Python has lots of libraries to help, in addition to the previously mentioned reasons. However, Java still has better library support for NLP overall, such as OpenNLP, Stanford CoreNLP, etc. NLP in Python is catching up, though, thanks to gensim, spaCy, etc.

The main reasons Python is so popular in science (ML and before) are:

1. It's easier to get started than Java, Go, etc.

2. It's faster to write/prototype. (The IPython REPL and Jupyter notebooks are awesome.)

3. The Python community is also very open source friendly and has significant momentum in third party packages like pandas, numpy, scipy, etc.

Check out some talks from past PyCons and you will see a very strong scientific presence, more so than in the communities of the other languages you mentioned.

Python stands on the shoulders of giants, namely MPI/MKL/BLAS/LAPACK, which are at the core of numpy, scipy and sympy. Maybe even SageMath.

If you are developing ML/DS models, a REPL environment comes in very handy, and there is IPython for that.

You'll find a large chunk of your time is spent on gathering and cleaning data, at which Python really excels.

Make a website that integrates or visualizes the data with Python? Sure.

You cannot find a second language that is this versatile and has such a well-supported ecosystem.

It's the easiest language to learn from the list you described. C#/Go/Java are statically typed and compiled, which creates a barrier to learning. R is too complicated and showing its age. Python is a sweet spot.

Data scientists aren't interested in learning industrial programming languages (C#, Java and maybe Go). They just want to do their job, and it doesn't require an industrial language.

I'd say because Python is a good prototyping language that has easy interfacing to the underlying C/C++ code used for fast numeric computation.
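
As a rough sketch of that idea, assuming NumPy is installed (the function names here are hypothetical, made up for illustration): the same computation written as a pure-Python loop and as a call that hands the loop to NumPy's compiled code.

```python
import numpy as np

def sum_of_squares_loop(values):
    """Pure-Python loop: every multiply and add goes through the interpreter."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_of_squares_numpy(values):
    """Same computation, but the loop runs inside NumPy's compiled code."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))

data = [0.5 * i for i in range(1000)]
print(sum_of_squares_loop(data), sum_of_squares_numpy(data))
```

You keep writing Python at the top level, while the heavy numeric work happens in the compiled layer underneath.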
Going from raw data to pandas-driven analysis is pretty seamless.

I can use our production code for heavy analytics or modeling, and just as easily pull the data into memory in the Python interpreter on my laptop.

To be honest, I don’t understand its popularity.

For example, I was trying to understand a batch normalization function defined as

def batchnorm_forward(x, gamma, beta, eps):

I can’t tell if gamma/beta are scalar or vector?

We have type annotations in Python now, which can be used to clarify data types in function signatures.

https://docs.python.org/3/library/typing.html
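
For example, here is a sketch of how the signature above might look with annotations. Only the function name and parameter names come from the original; the shapes in the comments and the function body (the usual batch normalization formula) are my assumptions, not the original author's code.

```python
import numpy as np

def batchnorm_forward(
    x: np.ndarray,      # assumed shape (N, D): N samples, D features
    gamma: np.ndarray,  # assumed shape (D,): per-feature scale
    beta: np.ndarray,   # assumed shape (D,): per-feature shift
    eps: float,         # small constant for numerical stability
) -> np.ndarray:
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
out = batchnorm_forward(x, gamma=np.ones(2), beta=np.zeros(2), eps=1e-5)
print(out.shape)
```

Note that np.ndarray alone says nothing about shape, so the shape still has to live in a comment or docstring.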

Wouldn't the default assumption be that you can use either a scalar or vector?
You could try np.isscalar(x) and np.asarray(x).ndim in NumPy.
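
A small sketch of that kind of check, using np.isscalar and np.ndim (the describe helper is hypothetical, just for illustration):

```python
import numpy as np

def describe(value):
    """Report whether a value is scalar-like (0-d) or an array."""
    ndim = np.ndim(value)  # works on Python numbers, lists, and arrays
    return "scalar" if ndim == 0 else f"array with ndim={ndim}"

print(np.isscalar(0.5))              # plain Python float
print(np.isscalar(np.array([0.5])))  # 1-element array is not a scalar
print(describe(0.5))
print(describe(np.ones(3)))
print(describe(np.ones((4, 3))))
```

One caveat: np.isscalar returns False for 0-d arrays, so checking np.ndim(value) == 0 is often the more convenient test.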