Hacker News new | ask | show | jobs
by mncolinlee 4824 days ago
I worked in Cray's compiler department for seven years. If we couldn't dramatically parallelize someone's code, we couldn't sell a vector supercomputer. Period.

Automatic parallelization is very possible. The problem is tends to be less efficient. A decent developer can often do a better job than the compiler by performing manual code restructuring. The compiler cannot always determine which changes are safe without pragmas to guide it. With that said, our top compiler devs did some amazing work adding automatic parallelization to some awful code.

We inevitably sold our supercomputers because we had application experts who would manually restructure the most mission-critical code to fit cache lines and fill the vectors. Most other problems would perform quite adequately with the automatically-generated code.

What this article lacks is a description of why Erlang is more uniquely suited to writing parallel code than all the other natively parallel languages like Unified Parallel C, Fortran2008, Chapel, Golang, etc. There are so many choices and many have been around for a long, long time.

5 comments

I completely agree. As someone who works on a parallel functional language, it's very hard to sell a parallel language that isn't as fast as parallel fortran or hand-tuned C code that uses pthreads and the fastest parallel implementation of BLAS and other libraries.

The people who really care about performance are using those. The ones who don't are honestly mostly still writing code that has large constant factors of _sequential_ performance available as low-hanging fruit. Sure, they'd take free performance, but the rewrite/porting/debug costs (even in automatic parallel compilers for the same language) are at least as high as just firing up a profiler.

I'm increasingly of the opinion that if you can't win a top spot in the supercomputer performance fight, you have to have a unique application domain. Erlang's seems to be reliability. I suspect that a parallel variant of javascript that runs in every browser will end up being the next compelling parallel language, as opposed to all of us who are either inventing or attempting to resurrect languages that target x86 or GPUs.

Does/can Manticore have a such unique application domain?
We believe so! Our project leader is focused on how we can parallelize general-purpose applications easily. With more and more people writing in static functional languages that have relatively poor parallel scalability without massive program transformations (Haskell, OCaml, F#, etc.), we think there is an opportunity there.

That said, we're at a "go big or go home point." We either need to ramp up from the 1.5 grad students + 2 undergrads / year significantly or wrap things up. Getting to a point where we can be used in general-purpose projects requires a lot of work, none of which results in papers.

If I had to guess, the most probable impact is what you would expect from PL research - integration of lessons learned in other systems down the road:

- CilkPlus has a nice first pass at a work stealing algorithm, but we showed how to do it without static tuning by the programmer with lower overheads, to boot.

- Vectorization to take advantage of wider vector hardware requires transformation of data structures (e.g., array of structs to struct of arrays). We showed how to do that automatically and reason about changes in program performance.

- We have boatloads of papers - at both workshops and conferences - on what has to be done to the compiler and runtime to run efficiently on NUMA multicore systems. Right now, most fp systems do not run into these problems because they cannot scale past their own parallel overheads. Once past that bottleneck, the next one will be memory traffic, at least in our experience.

I don't say any of that to fling mud at other systems; we started our project after them all, at the start of the multicore era (2006), with the goal of investigating these specific issues without carrying along the baggage of a pre-existing sequential implementation.

There's also still a lot more to learn. I personally don't buy that that deterministic and total chaos are the two only points in the design space of program reasoning. There have to be some interesting midpoints (e.g., histories that are linearizable) that are worth investigating.

As the complexity increases one thing that starts to show is not speedup in palatalization but fault tolerance.

Debugging a non-concurrent program can be difficult, now throw in threads, shared memory, pointers and it quickly becomes a nightmare. The system could be fast, but if it crashes every week, is it useful. Often the answer is yes. But in some cases the answer it no.

There is no free lunch. Shared-nothing architecture doesn't come for free. You pay a toll in _sequential_ performance. It might or not matter to you.

> Automatic parallelization is very possible.

For numerical algorithms many and for small function scope scale I can see that. Numerical code. But the problem is (and what Joe was pointing out) is that applications and algorithm design has to be build concurrently to start with.

A compile will not re-factor your code to not access a single database and acquire a lock from 100k clients into using some eventually consistent or event-sourcing data store. It is something that has to be built from ground up.

Same thing with fault tolerance, it has to built in from ground up. Adding it later is not easy.

Erlang is not designed for parallel programming; it is designed for concurrent programming. These are two very different programming domains with different problems.

Every time someone conflates parallelism with concurrency...everyone gets very confused.

Isn't it really fair to say that it's designed for both? The way it uses immutable state and something-similar-to-s-expressions to express data make it very straightforward (or even transparent) to distribute work between multiple processes and separate computers, in addition to how it makes it practical and simple to break work into small chunks that can be interleaved easily within the same thread. It's really designed for doing both very well, wouldn't you say?
Not at all. Erlang isn't useful for modern parallel computing as we know it, which is usually done as some kind of SIMD program; say MapReduce or GPGPU using something like CUDA. The benefit doesn't just come from operating on data all at once, but these systems (or the programmer) also do a lot of work to optimize the I/O and cache characteristics of the computation.

Actor architectures are only useful for task parallelism which no one really knows how to get much out of; definitely not the close-to-linear performance benefits we can get from data parallelism. Task parallelism is much better for when you have to do multiple things at once (more efficient concurrency), not for when you want to make a sequential task faster.

Maybe this will help

http://jlouisramblings.blogspot.com/2011/07/erlangs-parallel...

and

https://news.ycombinator.com/item?id=2726661

"modern parallel computing" ... well not everything that can run parallel on multi core CPU's can run very well on a GPU.

I'm using Erlang and GPU programming each for its area of expertise. FWIW I even use both together via https://github.com/tonyrog/cl

Erlang is great at asynchronous concurrency which happens to be able to run in parallel well because of how the VM is built.

GPU's solve totally different problems

Yes, erlang is great for concurrency, GPUs are great for significant scalable parallelism. They both solve different problems, I agree, and that's my point.
SIMD is a specialized form of parallelism. It is not the only definition of the term.

It should also be clear that task parallelism (or concurrency from your perspective) has not had the benefit of billions of engineer-hours focused on improving its performance. It is within recent memory that if you wanted 20+ CPUs at your disposal, you'd have to build a cluster with explicit job management, topologically-optimized communications, and a fair amount of physical redundancy.

As many of the applications requiring low-end clusters tended to involve random numbers or floating point calculations, we also had the annoyance of minor discrepancies such as clock drift affecting the final output. This would present, for example, in a proportional percentage of video frames with conspicuously different coloration.

Task parallelism was something used to work on 20 years ago when we thought it was the solution to scaling. But then we found that the supercomputer people were right all along, that the only thing that really scales very well is data parallelism. So the focus in the last 5/10 years has been finding data parallel solutions to the problems we care about (say deep neural network training), and then mapping them to either a distributed pipeline (MapReduce) or GPU solution.

> It is within recent memory that if you wanted 20+ CPUs at your disposal, you'd have to build a cluster with explicit job management, topologically-optimized communications, and a fair amount of physical redundancy.

You are still thinking about concurrency, not parallelism. Yes, the cluster people had to think this way, they were interested in performance for processing many jobs; no the HPC people who needed performance never thought like this, they were only interested in the performance of one job.

> As many of the applications requiring low-end clusters tended to involve random numbers or floating point calculations, we also had the annoyance of minor discrepancies such as clock drift affecting the final output.

Part of the problem, I think, is that we've been confused for a long time. Our PHBs saw problems (say massive video frame processing) and saw solutions that were completely inappropriate for it (cluster computing). Its only recently that we've realized there are often other/better option (like running MapReduce on that cluster).

Isn't Joe's post specifically about parallelism and how Erlang is designed for it?
The post is about concurrency, but the word parallelism is used instead. To be fair, task level parallelism makes concurrent code run faster, but doesn't really scale if done for its own sake.
I think in this case they're relatively interchangeable terms. Rather than a SIMD vectorization of a task, you are applying a MIMD solution to various parts of a task.

You can typically get more of an immediate boost with SIMD on current hardware (especially if you can effectively cast it to GPGPUs), but MIMD is more easily applied. Almost any application can be refactored to spawn lightweight threads for many calculations without any explicit knowledge of the π-calculus.

To your point and for a well-understood example, make -j doesn't always result in faster compilations. It may if you have the ability to leverage imbalances in CPU and storage hierarchies, but you can also kill your performance with context switches (including disk seeking).

> but MIMD is more easily applied. Almost any application can be refactored to spawn lightweight threads for many calculations without any explicit knowledge of the π-calculus.

MIMD hasn't been shown to scale, and its not locking that is the problem, but I/O and memory.

> To your point and for a well-understood example, make -j doesn't always result in faster compilations. It may if you have the ability to leverage imbalances in CPU and storage hierarchies, but you can also kill your performance with context switches (including disk seeking).

When Martin Odersky began pushing actors as a solution to Scala and multi-core, this was my immediate thought: the Scala compiler is slow, can this make it go faster? It was pretty obvious after some thought that the answer was no. But then we have no way yet of casting compilation as a data-parallel task (a point in favor of task parallelism, but it doesn't help us much).

"MIMD hasn't been shown to scale, and its not locking that is the problem, but I/O and memory."

We're going to have to disagree on this one, as we have some obvious examples in favor of MIMD scaling on the TOPS 500. SIMD is just a tool for a subset of parallelizable problems.

I think the post is about parallelism. Its about how Erlang naturally scales to many cores by running in parallel. If Erlang had only concurrency, as Javascript does, it would not be solving the "right problem".
Again, even if Erlang supports hardware threading properly, it doesn't magically become a good platform for parallel computing, there is no guarantee it will scale at all.

Its funny actually: a PL person thinks the key to scalable parallelism is hardware threading; a graphics or systems person thinks the key is a well planned pipeline.

http://talks.golang.org/2012/waza.slide

"Once we have the breakdown, parallelization can fall out and correctness is easy."

Joe is saying this too. And he's saying that because Erlang is a concurrent language, parallelism (he's thinking MIMD not SIMD) is easy. He says:

> Now Erlang is (in case you missed it) a concurrent language, so Erlang programs should in principle go a lot faster when run on parallel computers, the only thing that stops this is if the Erlang programs have sequential bottlenecks.

I don't think he - nor the Go chaps - conflate concurrency and parallelism.

Fortran (especially ancient, wheezy Fortran) lent itself to supervised automatic parallelization because of its lack of dynamic arrays. It was "easy" to vectorize code at compile time when you had so much information about the runtime expectations.

We can do some of this now in most languages with hot-spot profiling, basic block analysis, selective inlining, and other innovations. However, you really can't beat low-level languages that explicitly "hint" at their execution paths.

By the same token, Cray's applications were...so...slow if you were foolish enough to run them on the expensive hardware and not the FEPs.

If the code isn't efficient, you'll run into Ahmdal's Law much more quickly. In fact, I think your comment aligns with what Joe was saying: automated paralellization is not going to happen. You will have to go through and find all your contention points, just like your application experts did.

I completely agree with your last sentence. For those of us who have dived in a way, the advantages become clear, but TFA was really just preaching to the choir.