HN Mirror

by btilly 3670 days ago

I have long offered the following advice.

If you have code that can no longer run fast enough in a scripting language, and the problem is not embarrassingly parallel, you have two choices.

1. Move to something like C++, and optimize the heck out of it. You will gain something like 1-2 orders of magnitude in performance and then hit a wall.

2. Move to a distributed architecture. You immediately lose 1-2 orders of magnitude in performance, but then can scale essentially forever.

If you expect your distributed system to need fewer than 100 machines, you should seriously consider option #1.
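
One way to read this rule of thumb, as a back-of-the-envelope sketch: if C++ buys a factor of `cpp_gain` on one machine and distribution costs a factor of `distributed_penalty` per machine, then the cluster only catches up once it has `cpp_gain * distributed_penalty` machines. This assumes perfectly linear scaling, which is optimistic. The function name and the sample factors are illustrative assumptions, not measurements.

```python
def break_even_machines(cpp_gain, distributed_penalty):
    """Machines a cluster needs to match one optimized C++ box.

    Relative to the original script on one machine:
      - C++ on one machine runs cpp_gain times faster.
      - Each distributed machine runs distributed_penalty times slower,
        so N machines give N / distributed_penalty (assuming linear scaling).
    Setting the two equal gives N = cpp_gain * distributed_penalty.
    """
    return cpp_gain * distributed_penalty


if __name__ == "__main__":
    # Illustrative factors: one and two orders of magnitude on each side.
    for factor in (10, 100):
        print(factor, break_even_machines(factor, factor))
```

With one order of magnitude on each side the break-even is 100 machines; with two on each side it is 10,000.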

chubot 3670 days ago

I would call those options #2 and #3. Option #1 is:

1. Parallelize on a single machine. Most scripting language runtimes (Python, Ruby, Node.js, etc.) have single-threaded bottlenecks, so this means using multiple processes. xargs -P goes a long way. The coding changes are essentially a subset of what you need to distribute your program anyway.
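
A minimal sketch of the multi-process approach in Python, using the standard `multiprocessing` module. `process_item` and `run_parallel` are hypothetical names, and the workload is a stand-in for whatever CPU-bound job each item really needs.

```python
import multiprocessing as mp


def process_item(n):
    # Hypothetical CPU-bound work standing in for the real per-item job.
    return sum(i * i for i in range(n))


def run_parallel(items, workers=None):
    # workers=None lets Pool use one process per CPU by default.
    # Each worker is a separate interpreter with its own GIL.
    with mp.Pool(processes=workers) as pool:
        # chunksize batches items so less time goes to inter-process traffic.
        return pool.map(process_item, items, chunksize=16)


if __name__ == "__main__":
    items = [20_000] * 256
    results = run_parallel(items)
    print(len(results))
```

The inputs and outputs are pickled to cross the process boundary, which is the same kind of serialization a distributed version would need, so the work has to be coarse-grained enough to outweigh it.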

A 32x or 64x speedup is nothing to sneeze at on a modern machine. Going from 5 hours to 5 minutes usually solves your problem, practically speaking. And this means you don't have to touch every line of code, as you would if you were doing a C++ rewrite.

But also don't forget that you can often rewrite 10% of your code in C++ and keep the other 90% in Python, and get a 10x speedup. This requires a fairly deep understanding of both your program and of the Python/C interface. It helps to adopt a data flow style so you are not crossing the boundary a lot. And make sure you release the GIL, and consider starting threads in C++ rather than in Python, etc.
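
Whether rewriting 10% of the code yields 10x depends on how much of the runtime that 10% accounts for. A rough check with Amdahl's law, using made-up fractions rather than measurements (`overall_speedup` is a hypothetical helper name):

```python
def overall_speedup(hot_fraction, hot_speedup):
    """Amdahl's law.

    hot_fraction: share of total runtime spent in the rewritten code (0..1).
    hot_speedup:  how many times faster the rewritten code runs.
    """
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)


if __name__ == "__main__":
    # Illustrative numbers only: assume the C++ version of the hot code
    # is 100x faster, and vary how much runtime the hot code covers.
    for hot_fraction in (0.50, 0.90, 0.95, 0.99):
        print(hot_fraction, round(overall_speedup(hot_fraction, 100.0), 1))
```

Under these assumptions, the rewritten code has to account for a bit more than 90% of the runtime before the whole program gets 10x faster, which is why profiling first matters.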

I've also optimized an R program in C++ and gotten a 125x speedup -- and that's single threaded C++; multithreaded would be another 2 orders of magnitude!!! But it also involved fixing a bunch of R performance errors, so don't underestimate that either.

In practice, any program which you actually care enough about to rewrite for speed usually has some application-level performance bugs -- i.e. slowness unrelated to the slowness of the underlying language platform.

People scale out to many machines because they don't want to rewrite every line of code. But the first step is to "scale out" by using all the cores on a single machine. In some sense, this is the best of both worlds, because you are incurring neither the overhead of distribution (serialization and networking) nor the complexity of distributed error handling.

btilly 3670 days ago

That is a worthwhile option, but parallelization hasn't generally offered me nearly that much of a win when I'm limited by memory access performance.

In my experience, you're best off optimizing relatively limited pieces of a system rather than big applications. And only consider the rewrite after you've looked at optimizing it in place. That way, the thing that ends up needing a rewrite is unlikely to contain any giant stupid mistakes.

For example, see http://bentilly.blogspot.com/2011/02/finding-related-items.h... for a case where I got a thousandfold speed increase by rewriting from SQL to C++. As much as was reasonably possible, I did not change the basic algorithm.

chillacy 3670 days ago

> and it is not embarrassingly parallel

Sounds like you had a problem that was "embarrassingly parallel". It's not going to be as simple if you have graph problems that require lots of IPC in Python/Ruby/V8. That's where moving to a language with threading support can help.