

by nyrikki 319 days ago

The 20-100 times slower figure is a bit cherry-picked, but use case does matter.

Typically, from a user's perspective, startup time is either manageable or, for long-running services, imperceptible, although there are other costs.

If you look at examples that make the above claim, they are almost always tiny toy programs where the cost of producing byte/machine code isn't easily amortized.

This quote from the post is an oversimplification too:

> But the program will then run into Amdahl's law, which says that the improvement for optimizing one part of the code is limited by the time spent in the now-optimized code

I am a huge fan of Amdahl's law, but also realize it is pessimistic and most realistic with parallelization.

It runs into serious issues when you are multiprocessing vs parallel processing due to preemption, etc.

Yes, you still have the costs of abstractions, etc., but in today's world, zero pages on AMD, 16k pages and a large number of mapped registers on ARM, barrel shifters, etc. make that much more complicated, especially with C being forced into trampolines, etc.

If you actually trace the CPU operations, the actual operations for 'math' are very similar.

That said modern compilers are a true wonder.

Interpreted languages are often all that is necessary and sufficient, especially when you have the Internet, a database and other parts of the system that also restrict the benefits of the speedups due to... Amdahl's law.
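To put rough numbers on that last point, here is a minimal sketch of Amdahl's law in Python. The function name `amdahl_speedup` and the 10% / 50x figures are illustrative assumptions, not measurements of any real service.

```python
def amdahl_speedup(optimized_fraction: float, local_speedup: float) -> float:
    """Overall speedup when `optimized_fraction` of the original runtime
    becomes `local_speedup` times faster and the rest is unchanged."""
    if not 0.0 <= optimized_fraction <= 1.0:
        raise ValueError("optimized_fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - optimized_fraction) + optimized_fraction / local_speedup)


# Hypothetical service: 10% of wall time is compute, 90% is network/database waits.
# Making the compute part 50x faster barely moves the total.
overall = amdahl_speedup(optimized_fraction=0.10, local_speedup=50.0)
print(f"overall speedup: {overall:.2f}x")  # prints "overall speedup: 1.11x"
```

Under those assumed numbers, rewriting the compute-bound 10% in a much faster language buys roughly an 11% improvement overall.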

1 comments

nu11ptr 319 days ago

I'm not so much cherry picking as I am specifically talking about compute (not I/O, stdlib) performance. However, when measured on general purpose tasks that involve compute plus things like I/O, stdlib performance, etc., Python on the whole is typically NOT 20-100x slower for a given task. Its I/O layer is written in C like many other languages, so the moment you are waiting on I/O you have leveled the playing field. Likewise, Python has a very fast dict implementation in C, so when doing heavy map work, you also amortize the time between the (brutally slow) compute and the very fast maps.

In summary, it depends. I am talking about compute performance, not I/O or general purpose task benchmarking. Yes, if you have a mix of compute and I/O (which admittedly is a typical use case), it isn't going to be 20-100x slower, but more likely "only" 3-20x slower. If it is nearly 100% I/O bound, it might not be any slower at all (or even faster if properly buffered). If you are doing number crunching (w/o a C lib like NumPy), your program will likely be 40-100x slower than doing it in C, and many of these aren't toy programs.
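As a back-of-the-envelope check on those ranges, here is a minimal sketch of a weighted slowdown model. The helper name `mixed_slowdown`, the 50x compute penalty and the chosen compute fractions are assumptions for illustration only; fractions are measured against the runtime of the fast (e.g. C) version.

```python
def mixed_slowdown(compute_fraction: float,
                   compute_slowdown: float,
                   other_slowdown: float = 1.0) -> float:
    """Total slowdown relative to a baseline that spends `compute_fraction`
    of its time in compute and the rest in I/O or C-backed library calls."""
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be between 0 and 1")
    return (compute_fraction * compute_slowdown
            + (1.0 - compute_fraction) * other_slowdown)


# Assumed: pure compute is 50x slower, I/O and C-backed work run at parity.
for fraction in (0.0, 0.1, 0.5, 1.0):
    print(f"compute share {fraction:.0%}: "
          f"{mixed_slowdown(fraction, 50.0):.1f}x slower")
```

With these assumed inputs, a task that is 10% compute comes out around 6x slower overall, which is in the same ballpark as the "3-20x" estimate, while a fully I/O-bound task shows no slowdown at all.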

nyrikki 319 days ago

Even with compute performance it is probably closer than you expect.

Python isn't evaluated line-by-line, even in micropython, which is about the only common implementation that doesn't work in the same way.

The CPython VM compiles the source (via an AST) into bytecode opcodes, and binary operations just end up popping operands off a stack, or you can JIT like PyPy.

How efficiently you can keep the pipeline fed is more critical than computation costs.

     int a = 5;
     int b = 10;
     int sum = a + b;

Is compiled to:

     MOV EAX, 5
     MOV EBX, 10
     ADD EAX, EBX
     MOV [sum_variable], EAX

In the PVM binary operations remove the top of the stack (TOS) and the second top-most stack item (TOS1) from the stack. They perform the operation, and put the result back on the stack.
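You can see that stack behaviour yourself with the standard library `dis` module. This is a minimal sketch; the function name `add` is just an example, and the exact opcode names depend on the CPython version (older releases show `BINARY_ADD`, 3.11 and later show a generic `BINARY_OP`).

```python
import dis


def add(a, b):
    # Both operands are pushed onto the evaluation stack, then a single
    # binary opcode pops them and pushes the result.
    return a + b


# Human-readable listing of the bytecode for `add`.
dis.dis(add)

# The same information as data: collect the opcode names in order.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)
```

The listing shows loads for `a` and `b`, one binary opcode, and a return, which is the pop, pop, push sequence described above.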

That pop, pop isn't much more expensive on modern CPUs, and some C compilers will use the stack depending on many factors. Even in C you have to use structs of arrays, etc., depending on the use case. Stalled pipelines and the cost of memory fetches are the huge difference.

It is the setup costs, GC, GIL, etc. that make Python slower in many cases.

While I am not suggesting it is as slow as Python, Java also runs as bytecode, and often its assumptions and design decisions make it even better than, or at least nearly equal to, C in the general case unless you highly optimize.

But the actual equivalent computations are almost identical; it is the optimizations the compilers make that differ.