by axchizhov 858 days ago
So, instead of printing a pretty plot in 2 lines of code, you will be... making these 20k syscalls yourself?

You have such an unusual hobby, my friend!


It doesn’t take 20k syscalls to print a plot; the 20k syscalls are for the import call. I would hope that drawing plots takes a lot less.

To engage with your point: loading a dynamic library in a regular language takes significantly fewer than 20k syscalls. Probably 20-40 for C on Linux. Python is uniquely inefficient. On most plots comparing resource use by different languages, in order to even show Python together with regular languages like Java and C, you either use a log scale, or everything but Python is shown as a single point.
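One rough way to see where the import cost comes from is to count how many modules a single import pulls in, since each module typically costs several filesystem calls to locate and read. The sketch below is only an illustration, not a syscall counter: the helper name `modules_loaded_by` is made up for this example, and `json` and `decimal` are stand-ins for a heavier library such as seaborn. Actual syscall counts would need an external tool, for example `strace -c` on Linux.

```python
import subprocess
import sys


def modules_loaded_by(module_name):
    """Count modules newly added to sys.modules by one import.

    Runs in a fresh interpreter so modules already loaded in this
    process do not hide the cost. Hypothetical helper for illustration.
    """
    code = (
        "import sys\n"
        "before = set(sys.modules)\n"
        f"import {module_name}\n"
        "print(len(set(sys.modules) - before))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Stand-in modules from the standard library; swap in any installed package.
    for name in ("json", "decimal"):
        print(name, modules_loaded_by(name))
```

The numbers depend on the Python version and installation, so treat them as a relative indicator only.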

Of course, most people use Python to glue together stuff written in C, so it’s not that big of a deal, but it becomes a problem when people forget pure Python code is literally hundreds or thousands of times slower than a “regular” program doing the same thing.

Well, duh. Seaborn is a plotting lib for research — you will probably make a couple of hundred calls to it in a week. In a week. After that, you will save your figures for the report and forget about your code. It's plainly obvious that you don't use it in a high load production scenario.

I just don't see how a person could spend 20 years using Python and still not figure out that you shouldn't hammer nails with a microscope.

Modularity and customization come at a cost. Python is the systemd of computer languages. But it is not trying to sell itself under the KISS banner.

Oh absolutely, for some tasks Python is amazing. I use Jupyter notebooks a lot, for example, and the flexibility is an incredible feature.

It just worries me when I sometimes see those same Jupyter notebooks running in production, crunching 100s of terabytes of data. Maybe I’m wrong, but I didn’t get the impression everyone realizes exactly how wasteful that is. I guess AWS credits are easy to come by.

One thing Google did well back in the day was reporting resource costs in SWE/hours, the idea being that you can see whether you should go and rewrite something. If it cost 100 SWE/h to run, and it only took you a day to cut that in half, you should do it.

Numpy is competitive with optimized C/C++. So even if it's running in a Jupyter notebook, it's still going to be insanely fast.

Numpy is fine. But people write a lot of complicated code to pull JSON from somewhere, transform it in Python, and write it to parquet somewhere else, for example. JSON, the dict type and parquet are all implemented in C, but a comprehension on top of a Python iterable is just gonna be pure Python “bytecode”. It has been my experience that rewriting such things in C++, or even Go or Java is an easy way to quickly save truly incredible amounts of compute.
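To make the point about comprehensions concrete, here is a minimal sketch of that kind of transform step. The function name `transform` and the fields `id`, `price`, and `qty` are invented for illustration. In CPython the JSON parsing is handled by a C accelerator when one is available, while the comprehension body is ordinary bytecode that the interpreter executes once per record.

```python
import dis
import json


def transform(raw):
    # Parsing the text is delegated to the json module,
    # which uses a C accelerator in CPython when available.
    records = json.loads(raw)
    # The comprehension body is interpreted bytecode, run once per record.
    return [{"id": r["id"], "total": r["price"] * r["qty"]} for r in records]


if __name__ == "__main__":
    sample = '[{"id": 1, "price": 2.5, "qty": 4}]'
    print(transform(sample))
    # Disassemble to see the per-element instructions.
    # The exact listing varies between Python versions.
    dis.dis(transform)
```

The disassembly shows the subscript, multiply, and dict-building instructions that run for every element, which is where the time goes on large inputs.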

A team I used to work with was forced to throw away a finished Python data pipeline that took them a year to build, because it cost more to run than the combined salaries of the team. And I really think if they’d had better intuition about Python’s performance under different scenarios, they could have saved a year of effort. This is why I feel it’s worth having frank discussions about trade-offs when it comes to this language.

It’s incredibly useful, but people in the community aren’t clearly told about its limitations. (Especially wrt performance, but also maintainability.)

> It has been my experience that rewriting such things in C++, or even Go or Java is an easy way to quickly save truly incredible amounts of compute.

Sure, but there's a trade-off, no? Go is typically 3x the code of Python. And C++ is easily 10x the complexity.

Back when I stopped coding C++, it had reached the point where one coder might not understand what another C++ coder was doing because the standard was so large.

> A team I used to work with was forced to throw away a finished Python data pipeline that took them a year to build, because it cost more to run than the combined salaries of the team.

You know, I have horror stories about C++ and Java as well. Usually that kind of blame goes to management for not understanding the issues up front. Pretty soon, I'll have a slew of stories about Go misuse as well.

If they had done performance testing from the start they could have saved a year. A pipeline that has not been performance tested was in no way "finished". Performance is not something that can be tacked on later. In any language...

> Numpy is competitive with optimized C/C++

Can you cite a source/example for that? I cannot imagine an optimized C program that doesn't blow python with numpy out of the water. Even a poorly written C program is likely to be 2x faster simply because it doesn't have to round trip operations from C to python and back.

I feel like this is google-able, no?

I found some metrics after 30 seconds of googling.

> the 20k syscalls is for the import call

Yes, because you're importing a library that does a lot more than just print a plot. A purpose-built Python program that just printed the plot, nothing else, would need a lot less than 20k syscalls too.

You're missing the point - importing a module in other languages takes ~100x fewer system calls. It's a rare example of Python doing something that's mostly written in pure Python, rather than invoked via an FFI, and it shows some of the inefficiency of the language laid bare. That makes it an interesting case to study.

(Of course an import call in Python does a lot more, but the end result is roughly the same as calling `dlopen` in, e.g., Swift.)

Python will do a lot under the hood that a hand-rolled C solution wouldn’t. So I wouldn’t expect the C equivalent to make the same number of syscalls as Python.

Ha!